auto-caption

Auto Caption is a cross-platform real-time caption display software.

| 简体中文 | English | 日本語 |

Version 1.0.0 has been released, adding the SOSV local caption model. More caption models are under development...

📥 Download

Software Download: GitHub Releases

Vosk Model Download: Vosk Models

SOSV Model Download: Sherpa-ONNX SenseVoice Model

📚 Documentation

Auto Caption User Manual

Caption Engine Documentation

Changelog

✨ Features

  • Generate captions from audio output or microphone input
  • Supports translation via local Ollama models or the cloud-based Google Translate API
  • Cross-platform (Windows, macOS, Linux) with a multi-language interface (Chinese, English, Japanese)
  • Rich caption style settings (font, font size, font weight, font color, background color, etc.)
  • Flexible caption engine selection (Alibaba Cloud Gummy cloud model, local Vosk model, local SOSV model, or a model you develop yourself)
  • Multi-language recognition and translation (see "⚙️ Built-in Subtitle Engines" below)
  • Subtitle record display and export (supports exporting to .srt and .json; see the example after this list)
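
For reference, exported .srt files follow the standard SubRip layout: a sequence number, a time range, then the caption text. A minimal sketch of what an export might look like (the timestamps and text are illustrative, not actual app output):

1
00:00:01,000 --> 00:00:03,500
Hello, and welcome to the demo.

2
00:00:03,600 --> 00:00:06,200
Each entry covers one recognized sentence.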

📖 Basic Usage

The software has been adapted for Windows, macOS, and Linux platforms. The tested platform information is as follows:

| OS | Version | Architecture | System Audio Input | System Audio Output |
| --- | --- | --- | --- | --- |
| Windows 11 | 24H2 | x64 | ✅ | ✅ |
| macOS Sequoia | 15.5 | arm64 | ✅ | ✅ (additional config required) |
| Ubuntu | 24.04.2 | x64 | ✅ | ✅ (additional config required) |

Additional configuration is required to capture system audio output on macOS and Linux platforms. See Auto Caption User Manual for details.

After downloading the software, choose the caption model that fits your needs and configure it.

| Engine | Recognition Quality | Deployment Type | Supported Languages | Translation | Notes |
| --- | --- | --- | --- | --- | --- |
| Gummy | Excellent 😊 | Alibaba Cloud | 10 languages | Built-in | Paid, 0.54 CNY/hour |
| Vosk | Poor 😞 | Local / CPU | Over 30 languages | Requires setup | Supports many languages |
| SOSV | Fair 😐 | Local / CPU | 5 languages | Requires setup | Only one model available |
| DIY | Development 🤔 | Custom | Custom | Custom | Develop your own in Python, following the documentation |

If you choose the Vosk or SOSV model, you also need to configure a translation model yourself.

Configuring Translation Models

Note: Translation is not real-time. The translation model is only called after each sentence has been recognized.

Ollama Local Model

Note: Models with too many parameters lead to high resource consumption and translation delays. Models with fewer than 1B parameters are recommended, e.g. qwen2.5:0.5b or qwen3:0.6b.

Before using this option, ensure that Ollama is installed on your machine and that the required large language model has been downloaded. Then simply add the name of the model you want to call to the Ollama field in the settings.
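
For a sense of what such a call involves, the following is a minimal Python sketch against Ollama's standard REST API on localhost:11434 (the prompt wording and model choice are illustrative; this is not necessarily how the app implements the call):

import requests

# Ask a small local model to translate one recognized sentence.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5:0.5b",  # any model already pulled with `ollama pull`
        "prompt": "Translate into Chinese: The meeting starts at three.",
        "stream": False,          # return a single JSON object instead of a stream
    },
    timeout=60,
)
print(resp.json()["response"])    # the model's translated text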

Google Translate API

Note: Google Translate API is not available in some regions.

No configuration is required; an internet connection is all you need.
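
The documentation does not specify which endpoint the software calls. Purely for illustration, here is a minimal Python sketch against the widely used unofficial gtx endpoint (an assumption, not necessarily what the app uses internally):

import requests

# sl = source language ("auto" = detect), tl = target language.
resp = requests.get(
    "https://translate.googleapis.com/translate_a/single",
    params={"client": "gtx", "sl": "auto", "tl": "en", "dt": "t", "q": "你好，世界"},
    timeout=10,
)
# The response is a nested JSON array; the translated text of the
# first segment sits at index [0][0][0].
print(resp.json()[0][0][0])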

Using Gummy Model

The international version of Alibaba Cloud services does not appear to offer the Gummy model, so non-Chinese users may not be able to use the Gummy caption engine at present.

To use the default Gummy caption engine (which uses a cloud model for speech recognition and translation), first obtain an API key from the Alibaba Cloud Bailian platform, then add the API key in the software settings or configure it as an environment variable (only the Windows build supports reading the API key from environment variables). Related tutorials:
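
For example, on Windows the key can be stored persistently as follows, assuming the engine reads DashScope's conventional variable name DASHSCOPE_API_KEY (an assumption; check the user manual for the exact name the software expects):

# Windows (the only platform where the app reads the key from environment variables)
setx DASHSCOPE_API_KEY "sk-your-api-key"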

Using Vosk Model

The recognition quality of the Vosk model is relatively poor; use it with caution.

To use the Vosk local caption engine, first download the model you need from the Vosk Models page, unzip the model locally, and add the path of the model folder to the software settings.
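
As a quick sanity check that a downloaded model works, here is a minimal sketch using the vosk Python package directly (the model folder name and WAV file are examples; the audio must be 16 kHz, 16-bit, mono PCM):

import json
import wave
from vosk import Model, KaldiRecognizer

model = Model("vosk-model-small-en-us-0.15")   # path to the unzipped model folder
rec = KaldiRecognizer(model, 16000)            # must match the audio sample rate

wf = wave.open("speech.wav", "rb")             # 16 kHz, 16-bit, mono PCM
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):               # True at an utterance boundary
        print(json.loads(rec.Result())["text"])
print(json.loads(rec.FinalResult())["text"])   # flush the final partial utterance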

Using SOSV Model

The SOSV model is used in the same way as Vosk. It can be downloaded from: https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model

⚙️ Built-in Subtitle Engines

The software currently ships with three caption engines, and new engines are under development. Their details are as follows.

Gummy Subtitle Engine (Cloud)

Developed based on Tongyi Lab's Gummy speech translation model; the cloud model is called through the Alibaba Cloud Bailian API.

Model Parameters:

  • Supported audio sample rate: 16kHz and above
  • Audio sample depth: 16bit
  • Supported audio channels: Mono
  • Recognizable languages: Chinese, English, Japanese, Korean, German, French, Russian, Italian, Spanish
  • Supported translations:
    • Chinese → English, Japanese, Korean
    • English → Chinese, Japanese, Korean
    • Japanese, Korean, German, French, Russian, Italian, Spanish → Chinese or English

Network Traffic Consumption:

The caption engine samples at the device's native sample rate (assumed here to be 48 kHz), with 16-bit sample depth and a single (mono) channel, so the upload rate is approximately:


$$
48000\ \text{samples/s} \times 2\ \text{bytes/sample} \times 1\ \text{channel} = 96000\ \text{bytes/s} = 93.75\ \text{KB/s}
$$

The engine only uploads data while it is receiving an audio stream, so the actual upload rate may be lower. The download traffic for returned results is small and is not considered here.

Vosk Subtitle Engine (Local)

Developed based on vosk-api. The advantage of this caption engine is the wide choice of language models (over 30 languages); the disadvantages are relatively poor recognition quality and generated text without punctuation.

SOSV Subtitle Engine (Local)

SOSV is an integrated package, mainly based on the Sherpa-ONNX SenseVoice model, with an added endpoint-detection model and punctuation-restoration model. It can recognize English, Chinese, Japanese, Korean, and Cantonese.
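
As a rough sketch of the recognition step at the core of such a pipeline, here is what calling a SenseVoice model through the sherpa-onnx Python package looks like (the file names are illustrative placeholders, and the endpoint-detection and punctuation-restoration stages are omitted):

import numpy as np
import sherpa_onnx

# Load a SenseVoice model; the file names are placeholders, not the
# actual layout of the SOSV bundle.
recognizer = sherpa_onnx.OfflineRecognizer.from_sense_voice(
    model="sense-voice.onnx",
    tokens="tokens.txt",
    use_itn=True,                             # apply inverse text normalization
)

samples = np.zeros(16000, dtype=np.float32)   # stand-in for 1 s of mono audio in [-1, 1]
stream = recognizer.create_stream()
stream.accept_waveform(16000, samples)
recognizer.decode_stream(stream)
print(stream.result.text)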

Planned New Subtitle Engines

The following are candidate models that will be selected based on model performance and ease of integration.

🚀 Project Setup

Install Dependencies

npm install

Build Subtitle Engine

First enter the engine folder and execute the following commands to create a virtual environment (requires Python 3.10 or higher, with Python 3.12 recommended):

cd ./engine
# in ./engine folder
python -m venv .venv
# or
python3 -m venv .venv

Then activate the virtual environment:

# Windows
.venv\Scripts\activate
# Linux or macOS
source .venv/bin/activate

Then install the dependencies (this step may fail on macOS and Linux, usually due to build errors; resolve them according to the error messages):

pip install -r requirements.txt

Then use pyinstaller to build the project:

pyinstaller ./main.spec

Note that the path to the vosk library in main-vosk.spec might be incorrect and needs to be adjusted to your environment (it depends on the Python version used by the virtual environment):

# Windows
vosk_path = str(Path('./.venv/Lib/site-packages/vosk').resolve())
# Linux or macOS (replace python3.x with the venv's Python version, e.g. python3.12)
vosk_path = str(Path('./.venv/lib/python3.x/site-packages/vosk').resolve())

After the build completes, you can find the executable in the engine/dist folder and proceed with the steps below.

Run Project

npm run dev

Build Project

# For Windows
npm run build:win
# For macOS
npm run build:mac
# For Linux
npm run build:linux