release v1.1.0

This commit is contained in:
himeditator
2026-01-10 22:50:57 +08:00
parent 086ea90a5f
commit 0dc70d491e
20 changed files with 207 additions and 114 deletions


@@ -3,7 +3,7 @@
<h1 align="center">auto-caption</h1>
<p>Auto Caption is a cross-platform real-time caption display software.</p>
<p>
<a href="https://github.com/HiMeditator/auto-caption/releases"><img src="https://img.shields.io/badge/release-1.0.0-blue"></a>
<a href="https://github.com/HiMeditator/auto-caption/releases"><img src="https://img.shields.io/badge/release-1.1.0-blue"></a>
<a href="https://github.com/HiMeditator/auto-caption/issues"><img src="https://img.shields.io/github/issues/HiMeditator/auto-caption?color=orange"></a>
<img src="https://img.shields.io/github/languages/top/HiMeditator/auto-caption?color=royalblue">
<img src="https://img.shields.io/github/repo-size/HiMeditator/auto-caption?color=green">
@@ -14,7 +14,7 @@
| <b>English</b>
| <a href="./README_ja.md">日本語</a> |
</p>
<p><i>Version 1.0.0 has been released, with the addition of the SOSV local caption model. The current features are basically complete, and there are no further development plans...</i></p>
<p><i>v1.1.0 has been released, adding the GLM-ASR cloud caption model and OpenAI compatible model translation...</i></p>
</div>
![](./assets/media/main_en.png)
@@ -38,15 +38,17 @@ SOSV Model Download: [Sherpa-ONNX SenseVoice Model](https://github.com/HiMeditat
## ✨ Features
- Generate captions from audio output or microphone input
- Supports translation by calling local Ollama models or cloud-based Google Translate API
- Supports translation via local Ollama models, cloud-based OpenAI compatible models, or the cloud-based Google Translate API
- Cross-platform (Windows, macOS, Linux) and multi-language interface (Chinese, English, Japanese) support
- Rich caption style settings (font, font size, font weight, font color, background color, etc.)
- Flexible caption engine selection (Alibaba Cloud Gummy cloud model, local Vosk model, local SOSV model, or you can develop your own model)
- Flexible caption engine selection (Alibaba Cloud Gummy cloud model, GLM-ASR cloud model, local Vosk model, local SOSV model, or you can develop your own model)
- Multi-language recognition and translation (see below "⚙️ Built-in Subtitle Engines")
- Subtitle record display and export (supports exporting `.srt` and `.json` formats)
## 📖 Basic Usage
> ⚠️ Note: Currently, only the Windows build of the software is actively maintained; the latest releases for the other platforms remain at v1.0.0.
The software has been adapted for Windows, macOS, and Linux platforms. The tested platform information is as follows:
| OS Version | Architecture | System Audio Input | System Audio Output |
@@ -60,14 +62,15 @@ Additional configuration is required to capture system audio output on macOS and
After downloading the software, you need to select the model that fits your needs and then configure it.
| | Recognition Quality | Deployment Type | Supported Languages | Translation | Notes |
| ------------------------------------------------------------ | ------------------- | ------------------ | ------------------- | ------------- | ---------------------------------------------------------- |
| [Gummy](https://help.aliyun.com/zh/model-studio/gummy-speech-recognition-translation) | Excellent 😊 | Alibaba Cloud | 10 languages | Built-in | Paid, 0.54 CNY/hour |
| [Vosk](https://alphacephei.com/vosk) | Poor 😞 | Local / CPU | Over 30 languages | Requires setup | Supports many languages |
| [SOSV](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html) | Fair 😐 | Local / CPU | 5 languages | Requires setup | Only one model available |
| DIY Development | 🤔 | Custom | Custom | Custom | Develop your own using Python according to [documentation](./docs/engine-manual/zh.md) |
| | Accuracy | Real-time Performance | Deployment Type | Supported Languages | Translation | Notes |
| ------------------------------------------------------------ | -------- | --------- | --------------- | ------------------- | ----------- | ----- |
| [Gummy](https://help.aliyun.com/zh/model-studio/gummy-speech-recognition-translation) | Very good 😊 | Very good 😊 | Cloud / Alibaba Cloud | 10 languages | Built-in translation | Paid, 0.54 CNY/hour |
| [glm-asr-2512](https://docs.bigmodel.cn/cn/guide/models/sound-and-video/glm-asr-2512) | Very good 😊 | Poor 😞 | Cloud / Zhipu AI | 4 languages | Requires additional configuration | Paid, approximately 0.72 CNY/hour |
| [Vosk](https://alphacephei.com/vosk) | Poor 😞 | Very good 😊 | Local / CPU | Over 30 languages | Requires additional configuration | Supports many languages |
| [SOSV](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html) | Average 😐 | Average 😐 | Local / CPU | 5 languages | Requires additional configuration | Only one model |
| Self-developed | 🤔 | 🤔 | Custom | Custom | Custom | Develop your own using Python according to the [documentation](./docs/engine-manual/en.md) |
If you choose to use the Vosk or SOSV model, you also need to configure your own translation model.
If you choose a model other than Gummy, you also need to configure your own translation model.
### Configuring Translation Models
@@ -79,7 +82,18 @@ If you choose to use the Vosk or SOSV model, you also need to configure your own
> Note: Models with too many parameters lead to high resource consumption and translation delays. Models with fewer than 1B parameters are recommended, such as `qwen2.5:0.5b` or `qwen3:0.6b`.
Before using this model, you need to ensure that [Ollama](https://ollama.com/) software is installed on your machine and the required large language model has been downloaded. Simply add the name of the large model you want to call to the `Ollama` field in the settings.
Before using this option, confirm that [Ollama](https://ollama.com/) is installed on your machine and that the required large language model has been downloaded. Then add the name of the model you want to call to the `Model Name` field in the settings, and make sure the `Base URL` field is empty.
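For orientation, here is a minimal sketch of what such a call looks like against Ollama's local REST API (the default endpoint is `http://localhost:11434`); the model name and prompt are illustrative, not necessarily what the app actually sends:

```python
import requests

# Minimal sketch: translate one caption line via a local Ollama instance.
# The model name and prompt wording are illustrative placeholders.
def translate(text: str, target_lang: str = "English") -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",  # Ollama's default local endpoint
        json={
            "model": "qwen2.5:0.5b",  # any small model pulled beforehand with `ollama pull`
            "prompt": f"Translate the following text into {target_lang}. "
                      f"Output only the translation:\n{text}",
            "stream": False,  # request a single complete JSON response
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()

print(translate("今日はいい天気ですね。"))
```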
#### OpenAI Compatible Model
If the translation quality of the local Ollama model is not good enough, or you do not want to install Ollama locally, you can use a cloud-based OpenAI compatible model instead.
Here are some model provider `Base URL`s:
- OpenAI: https://api.openai.com/v1
- DeepSeek: https://api.deepseek.com
- Alibaba Cloud: https://dashscope.aliyuncs.com/compatible-mode/v1
The API Key needs to be obtained from the corresponding model provider.
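For orientation, a minimal sketch of the same kind of translation call through the `openai` Python package; the base URL, API key, and model name below are placeholders to be replaced with your provider's values:

```python
from openai import OpenAI

# Minimal sketch: a translation call against an OpenAI compatible endpoint.
# base_url, api_key, and model are placeholders for your provider's values.
client = OpenAI(
    base_url="https://api.deepseek.com",  # one of the Base URLs listed above
    api_key="YOUR_API_KEY",               # obtained from the model provider
)

completion = client.chat.completions.create(
    model="deepseek-chat",  # a model name offered by the chosen provider
    messages=[
        {"role": "system",
         "content": "Translate the user's text into English. Output only the translation."},
        {"role": "user", "content": "実時間字幕の翻訳をテストしています。"},
    ],
)
print(completion.choices[0].message.content)
```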
#### Google Translate API
@@ -96,6 +110,12 @@ To use the default Gummy caption engine (using cloud models for speech recogniti
- [Get API KEY](https://help.aliyun.com/zh/model-studio/get-api-key)
- [Configure API Key through Environment Variables](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables)
### Using the GLM-ASR Model
Before using it, you need to obtain an API KEY from the Zhipu AI platform and add it to the software settings.
To obtain an API KEY, see [Quick Start](https://docs.bigmodel.cn/en/guide/start/quick-start).
### Using Vosk Model
> The recognition quality of the Vosk model is poor; use it with caution.
@@ -133,7 +153,7 @@ python main.py \
## ⚙️ Built-in Subtitle Engines
Currently, the software comes with 3 caption engines, with new engines under development. Their detailed information is as follows.
Currently, the software comes with 4 caption engines, with new engines under development. Their detailed information is as follows.
### Gummy Subtitle Engine (Cloud)
@@ -160,6 +180,10 @@ $$
The engine only uploads data while it is receiving an audio stream, so the actual upload rate may be lower. The download traffic for the returned model results is small and is not considered here.
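As a back-of-envelope check, assuming the engine uploads 16 kHz, 16-bit, mono PCM audio (an assumption about the capture format, not a documented value), the worst-case upload rate is:

$$
16000 \,\text{samples/s} \times 2 \,\text{bytes/sample} \times 1 \,\text{channel} = 32\,\text{kB/s} \approx 115\,\text{MB/hour}
$$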
### GLM-ASR Caption Engine (Cloud)
Model documentation: [glm-asr-2512](https://docs.bigmodel.cn/en/guide/models/sound-and-video/glm-asr-2512)
### Vosk Subtitle Engine (Local)
Developed based on [vosk-api](https://github.com/alphacep/vosk-api). The advantage of this caption engine is the wide choice of language models (over 30 languages); the disadvantages are that recognition quality is relatively poor and the generated text has no punctuation.
@@ -168,16 +192,6 @@ Developed based on [vosk-api](https://github.com/alphacep/vosk-api). The advanta
[SOSV](https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model) is an integrated package, mainly based on [Sherpa-ONNX SenseVoice](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html), with an added endpoint detection model and punctuation restoration model. The languages this model can recognize are English, Chinese, Japanese, Korean, and Cantonese.
### Planned New Subtitle Engines
The following are candidate models that will be selected based on model performance and ease of integration.
- [faster-whisper](https://github.com/SYSTRAN/faster-whisper)
- [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx)
- [SenseVoice](https://github.com/FunAudioLLM/SenseVoice)
- [FunASR](https://github.com/modelscope/FunASR)
- [WhisperLiveKit](https://github.com/QuentinFuxa/WhisperLiveKit)
## 🚀 Project Setup
![](./assets/media/structure_en.png)