
auto-caption

Auto Caption is a cross-platform real-time caption display software.

| 简体中文 | English | 日本語 |

v1.1.0 has been released, adding the GLM-ASR cloud caption model and OpenAI compatible model translation...

📥 Download

Software Download: GitHub Releases

Vosk Model Download: Vosk Models

SOSV Model Download: Sherpa-ONNX SenseVoice Model

📚 Documentation

Auto Caption User Manual

Caption Engine Documentation

Changelog

Features

  • Generate captions from audio output or microphone input
  • Supports calling local Ollama models, cloud-based OpenAI compatible models, or cloud-based Google Translate API for translation
  • Cross-platform (Windows, macOS, Linux) and multi-language interface (Chinese, English, Japanese) support
  • Rich caption style settings (font, font size, font weight, font color, background color, etc.)
  • Flexible caption engine selection (Aliyun Gummy cloud model, GLM-ASR cloud model, local Vosk model, local SOSV model, or a model you develop yourself)
  • Multi-language recognition and translation (see "⚙️ Built-in Subtitle Engines" below)
  • Subtitle record display and export to .srt and .json formats (see the sample below)
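
For reference, the .srt export follows the standard SubRip layout: a sequence number, a time range, and the caption text. The snippet below is an illustrative example of the format, not actual output from the software:

```
1
00:00:01,000 --> 00:00:03,500
Hello, and welcome to the stream.

2
00:00:03,500 --> 00:00:06,200
Today we are testing live captions.
```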

📖 Basic Usage

⚠️ Note: Currently, only the Windows version of the software is actively maintained; the latest releases for the other platforms remain at v1.0.0.

The software has been adapted for Windows, macOS, and Linux platforms. The tested platform information is as follows:

| OS | Version | Architecture | System Audio Input | System Audio Output |
| --- | --- | --- | --- | --- |
| Windows 11 | 24H2 | x64 | ✅ | ✅ |
| macOS Sequoia | 15.5 | arm64 | ✅ | Additional config required |
| Ubuntu | 24.04.2 | x64 | ✅ | Additional config required |

Additional configuration is required to capture system audio output on macOS and Linux platforms. See Auto Caption User Manual for details.
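
If you want to check which capture devices your system actually exposes (for example, to confirm that a loopback device is visible after configuration), one quick way is to enumerate them with the sounddevice library. This is only an illustrative sketch; the software selects devices through its own settings, and the device names mentioned in the comments are just common examples:

```python
import sounddevice as sd  # pip install sounddevice

# List every audio device the OS exposes. Loopback/monitor devices
# (e.g. "BlackHole 2ch" on macOS, "Monitor of ..." on Linux) appear
# here once system-audio capture has been configured.
for idx, dev in enumerate(sd.query_devices()):
    direction = "input" if dev["max_input_channels"] > 0 else "output"
    print(f"[{idx}] ({direction}) {dev['name']}")
```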

After downloading the software, choose the caption model that fits your needs and then configure it. A comparison of the available engines:

| Engine | Accuracy | Real-time | Deployment Type | Supported Languages | Translation | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Gummy | Very good 😊 | Very good 😊 | Cloud / Alibaba Cloud | 10 languages | Built-in translation | Paid, 0.54 CNY/hour |
| glm-asr-2512 | Very good 😊 | Poor 😞 | Cloud / Zhipu AI | 4 languages | Requires additional configuration | Paid, approximately 0.72 CNY/hour |
| Vosk | Poor 😞 | Very good 😊 | Local / CPU | Over 30 languages | Requires additional configuration | Supports many languages |
| SOSV | Average 😐 | Average 😐 | Local / CPU | 5 languages | Requires additional configuration | Only one model |
| Self-developed | 🤔 | 🤔 | Custom | Custom | Custom | Develop your own in Python according to the documentation |

If you choose a model other than Gummy, you also need to configure your own translation model.

Configuring Translation Models

Note: Translation is not real-time. The translation model is only called after each sentence recognition is completed.
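
Conceptually, the flow looks like the sketch below. The function names are hypothetical, not the engine's actual API; the point is only that translation happens once per completed sentence:

```python
def translate(sentence: str) -> str:
    """Stand-in for whichever translation backend is configured
    (Ollama, an OpenAI compatible model, or Google Translate)."""
    return f"[translated] {sentence}"

def on_sentence_finalized(sentence: str) -> None:
    # Called once per *completed* sentence, so the translated line
    # always trails the live caption slightly.
    print(sentence)
    print(translate(sentence))

on_sentence_finalized("Hello, and welcome to the stream.")
```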

Ollama Local Model

Note: Models with too many parameters will cause high resource consumption and translation delays. Models with fewer than 1B parameters are recommended, such as qwen2.5:0.5b or qwen3:0.6b.

Before using this option, confirm that Ollama is installed on your machine and that the large language model you need has already been downloaded. Then enter the name of the model you want to call in the Model Name field in the settings, and make sure the Base URL field is empty.
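
Under the hood, translating with a local model amounts to a single call against Ollama's local REST API. Below is a minimal sketch of such a call, assuming Ollama's default endpoint (http://localhost:11434) and the qwen2.5:0.5b model; it is an illustration, not the software's actual implementation:

```python
import json
import urllib.request

def translate(text: str, target_lang: str = "Chinese") -> str:
    """Ask a local Ollama model to translate one recognized sentence."""
    payload = {
        "model": "qwen2.5:0.5b",  # any model already pulled with `ollama pull`
        "prompt": f"Translate the following sentence into {target_lang}. "
                  f"Reply with the translation only.\n\n{text}",
        "stream": False,  # request one complete response instead of a stream
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",  # Ollama's default local endpoint
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"].strip()

print(translate("Live captions make streams accessible."))
```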

OpenAI Compatible Model

If the translation quality of the local Ollama model is not good enough, or you don't want to install Ollama locally, you can use a cloud-based OpenAI compatible model instead.

Enter the Base URL documented by your chosen model provider, along with the model name. The API Key needs to be obtained from the corresponding model provider.
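
Any provider that exposes the OpenAI chat-completions interface is configured the same way: a Base URL, an API Key, and a model name. Below is a minimal sketch using the official openai Python package, where the Base URL, key, and model name are placeholders for whatever your provider documents:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://example-provider.com/v1",  # placeholder Base URL
    api_key="sk-********",                       # placeholder API Key
)

completion = client.chat.completions.create(
    model="provider-model-name",  # placeholder model name
    messages=[
        {"role": "system",
         "content": "Translate the user's sentence into Chinese. "
                    "Reply with the translation only."},
        {"role": "user", "content": "Live captions make streams accessible."},
    ],
)
print(completion.choices[0].message.content)
```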

Google Translate API

Note: Google Translate API is not available in some regions.

No configuration is required; it works as long as you are connected to the internet.

Using Gummy Model

The international version of Alibaba Cloud services does not seem to provide the Gummy model, so non-Chinese users may not be able to use the Gummy caption engine at present.

To use the default Gummy caption engine (cloud-based speech recognition and translation), you first need to obtain an API KEY from the Alibaba Cloud Bailian platform. Then add the API KEY to the software settings, or configure it as an environment variable (only the Windows build reads the API KEY from environment variables).
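
As a sketch, reading the key from an environment variable in Python looks like the following. DASHSCOPE_API_KEY is the variable conventionally used by Alibaba Cloud's DashScope SDK; check the User Manual for the exact name this software expects:

```python
import os

# Fall back to the environment when no key was entered in the settings
# (reading the key from the environment is supported on Windows only).
api_key = os.environ.get("DASHSCOPE_API_KEY")
if not api_key:
    raise RuntimeError("Set the API KEY in the settings or as an environment variable.")
```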

Using the GLM-ASR Model

Before using it, you need to obtain an API KEY from the Zhipu AI platform and add it to the software settings.

For API KEY acquisition, see: Quick Start.

Using Vosk Model

The recognition quality of the Vosk model is poor; use it with caution.

To use the Vosk local caption engine, first download the model you need from the Vosk Models page, unzip the model locally, and add the path of the model folder to the software settings.

Using SOSV Model

The SOSV model is used in the same way as Vosk. Download it from: https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model

⌨️ Using in Terminal

The software has a modular design with two parts: the main application and the caption engine. The main application invokes the caption engine from its graphical interface. Audio capture and speech recognition are implemented in the caption engine, which can also be used on its own without the main application.

The caption engine is written in Python and packaged into an executable via PyInstaller, so there are two ways to use it:

  1. Run the caption engine from source in a Python environment with the required libraries installed
  2. Run the packaged caption engine executable from the terminal

For runtime parameters and detailed usage instructions, please refer to the User Manual.

python main.py \
-e gummy \
-k sk-******************************** \
-a 0 \
-d 1 \
-s en \
-t zh

⚙️ Built-in Subtitle Engines

Currently the software ships with 4 caption engines, and new engines are under development. Their details are as follows.

Gummy Subtitle Engine (Cloud)

Developed based on Tongyi Lab's Gummy Speech Translation Model, using Alibaba Cloud Bailian API to call this cloud model.

Model Parameters:

  • Supported audio sample rate: 16kHz and above
  • Audio sample depth: 16bit
  • Supported audio channels: Mono
  • Recognizable languages: Chinese, English, Japanese, Korean, German, French, Russian, Italian, Spanish
  • Supported translations:
    • Chinese → English, Japanese, Korean
    • English → Chinese, Japanese, Korean
    • Japanese, Korean, German, French, Russian, Italian, Spanish → Chinese or English

Network Traffic Consumption:

The caption engine samples at the device's native sample rate (assumed here to be 48kHz), with 16bit sample depth and a mono channel, so the upload rate is approximately:

$$
48000\ \text{samples/s} \times 2\ \text{bytes/sample} \times 1\ \text{channel} = 96000\ \text{bytes/s} = 93.75\ \text{KB/s}
$$

The engine only uploads data when receiving audio streams, so the actual upload rate may be lower. The return traffic consumption of model results is small and not considered here.

GLM-ASR Subtitle Engine (Cloud)

Developed based on Zhipu AI's glm-asr-2512 model. For model details, see: https://docs.bigmodel.cn/en/guide/models/sound-and-video/glm-asr-2512

Vosk Subtitle Engine (Local)

Developed based on vosk-api. Its advantage is the wide choice of language models (over 30 languages); its disadvantages are relatively poor recognition quality and output without punctuation.
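
For reference, recognition with vosk-api is a simple feed-and-poll loop. Below is a minimal sketch, assuming a 16kHz mono 16bit WAV file and a model folder downloaded from the Vosk Models page (the caption engine itself streams live audio instead of reading a file):

```python
import json
import wave

from vosk import KaldiRecognizer, Model

model = Model("path/to/vosk-model")  # unzipped model folder
wf = wave.open("speech.wav", "rb")   # 16kHz, 16bit, mono WAV
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)
    if not data:
        break
    if rec.AcceptWaveform(data):  # True once a full utterance is recognized
        print(json.loads(rec.Result())["text"])

print(json.loads(rec.FinalResult())["text"])  # flush the final partial utterance
```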

SOSV Subtitle Engine (Local)

SOSV is an integrated package, mainly based on Sherpa-ONNX SenseVoice, with an added endpoint detection model and a punctuation restoration model. It can recognize English, Chinese, Japanese, Korean, and Cantonese.

🚀 Project Setup

Install Dependencies

npm install

Build Subtitle Engine

First enter the engine folder and execute the following commands to create a virtual environment (requires Python 3.10 or higher, with Python 3.12 recommended):

cd ./engine
# in ./engine folder
python -m venv .venv
# or
python3 -m venv .venv

Then activate the virtual environment:

# Windows
.venv/Scripts/activate
# Linux or macOS
source .venv/bin/activate

Then install the dependencies (this step may fail on macOS and Linux, usually due to build errors; resolve them based on the error messages):

pip install -r requirements.txt

Then use pyinstaller to build the project:

pyinstaller ./main.spec

Note that the path to the vosk library in main-vosk.spec might be incorrect and may need to be adjusted to your environment (it depends on the Python version in use).

# Windows
vosk_path = str(Path('./.venv/Lib/site-packages/vosk').resolve())
# Linux or macOS
vosk_path = str(Path('./.venv/lib/python3.x/site-packages/vosk').resolve())

After the build completes, you can find the executable in the engine/dist folder and then continue with the remaining setup steps below.

Run Project

npm run dev

Build Project

# For Windows
npm run build:win
# For macOS
npm run build:mac
# For Linux
npm run build:linux