diff --git a/README.md b/README.md index c8802a3..962b6cf 100644 --- a/README.md +++ b/README.md @@ -3,7 +3,7 @@
Auto Caption 是一个跨平台的实时字幕显示软件。
-
+
@@ -14,14 +14,18 @@
| English
| 日本語 |
v0.7.0 版本已经发布,优化了软件界面,添加了日志记录显示。本地的字幕引擎正在尝试开发中,预计以 Python 代码的形式进行发布...
+v1.0.0 版本已经发布,新增 SOSV 本地字幕模型。更多的字幕模型正在尝试开发中...
 ## 📥 下载 -[GitHub Releases](https://github.com/HiMeditator/auto-caption/releases) +软件下载:[GitHub Releases](https://github.com/HiMeditator/auto-caption/releases) + +Vosk 模型下载:[Vosk Models](https://alphacephei.com/vosk/models) + +SOSV 模型下载:[ Shepra-ONNX SenseVoice Model](https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model) ## 📚 相关文档 @@ -29,51 +33,83 @@ [字幕引擎说明文档](./docs/engine-manual/zh.md) -[项目 API 文档](./docs/api-docs/) - [更新日志](./docs/CHANGELOG.md) ## ✨ 特性 - 生成音频输出或麦克风输入的字幕 +- 支持调用本地 Ollama 模型或云端 Google 翻译 API 进行翻译 - 跨平台(Windows、macOS、Linux)、多界面语言(中文、英语、日语)支持 - 丰富的字幕样式设置(字体、字体大小、字体粗细、字体颜色、背景颜色等) -- 灵活的字幕引擎选择(阿里云 Gummy 云端模型、本地 Vosk 模型、自己开发的模型) +- 灵活的字幕引擎选择(阿里云 Gummy 云端模型、本地 Vosk 模型、本地 SOSV 模型、还可以自己开发模型) - 多语言识别与翻译(见下文“⚙️ 自带字幕引擎说明”) - 字幕记录展示与导出(支持导出 `.srt` 和 `.json` 格式) ## 📖 基本使用 -软件已经适配了 Windows、macOS 和 Linux 平台。测试过的平台信息如下: +软件已经适配了 Windows、macOS 和 Linux 平台。测试过的主流平台信息如下: | 操作系统版本 | 处理器架构 | 获取系统音频输入 | 获取系统音频输出 | | ------------------ | ---------- | ---------------- | ---------------- | | Windows 11 24H2 | x64 | ✅ | ✅ | | macOS Sequoia 15.5 | arm64 | ✅ [需要额外配置](./docs/user-manual/zh.md#macos-获取系统音频输出) | ✅ | | Ubuntu 24.04.2 | x64 | ✅ | ✅ | -| Kali Linux 2022.3 | x64 | ✅ | ✅ | -| Kylin Server V10 SP3 | x64 | ✅ | ✅ | -macOS 平台和 Linux 平台获取系统音频输出需要进行额外设置,详见[Auto Caption 用户手册](./docs/user-manual/zh.md)。 +macOS 平台和 Linux 平台获取系统音频输出需要进行额外设置,详见 [Auto Caption 用户手册](./docs/user-manual/zh.md)。 -> 国际版的阿里云服务并没有提供 Gummy 模型,因此目前非中国用户无法使用 Gummy 字幕引擎。 +下载软件后,需要根据自己的需求选择对应的模型,然后配置模型。 + +| | 识别效果 | 部署类型 | 支持语言 | 翻译 | 备注 | +| ------------------------------------------------------------ | -------- | ------------- | ---------- | ---------- | ---------------------------------------------------------- | +| [Gummy](https://help.aliyun.com/zh/model-studio/gummy-speech-recognition-translation) | 很好😊 | 云端 / 阿里云 | 10 种 | 自带翻译 | 收费,0.54CNY / 小时 | +| [Vosk](https://alphacephei.com/vosk) | 较差😞 | 本地 / CPU | 超过 30 种 | 需额外配置 | 支持的语言非常多 | +| [SOSV](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html) | 一般😐 | 本地 / CPU | 5 种 | 需额外配置 | 仅有一个模型 | +| 自己开发 | 🤔 | 自定义 | 自定义 | 自定义 | 根据[文档](./docs/engine-manual/zh.md)使用 Python 自己开发 | + +如果你选择使用 Vosk 或 SOSV 模型,你还需要配置自己的翻译模型。 + +### 配置翻译模型 + + + +> 注意:翻译不是实时的,翻译模型只会在每句话识别完成后再调用。 + +#### Ollama 本地模型 + +> 注意:使用参数量过大的模型会导致资源消耗和翻译延迟较大。建议使用参数量小于 1B 的模型,比如: `qwen2.5:0.5b`, `qwen3:0.6b`。 + +使用该模型之前你需要确定本机安装了 [Ollama](https://ollama.com/) 软件,并已经下载了需要的大语言模型。只需要将需要调用的大模型名称添加到设置中的 `Ollama` 字段中。 + +#### Google 翻译 API + +> 注意:Google 翻译 API 在部分地区无法使用。 + +无需任何配置,联网即可使用。 + +### 使用 Gummy 模型 + +> 国际版的阿里云服务似乎并没有提供 Gummy 模型,因此目前非中国用户可能无法使用 Gummy 字幕引擎。 如果要使用默认的 Gummy 字幕引擎(使用云端模型进行语音识别和翻译),首先需要获取阿里云百炼平台的 API KEY,然后将 API KEY 添加到软件设置中或者配置到环境变量中(仅 Windows 平台支持读取环境变量中的 API KEY),这样才能正常使用该模型。相关教程: - [获取 API KEY](https://help.aliyun.com/zh/model-studio/get-api-key) - [将 API Key 配置到环境变量](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables) +### 使用 Vosk 模型 + > Vosk 模型的识别效果较差,请谨慎使用。 -如果要使用 Vosk 本地字幕引擎,首先需要在 [Vosk Models](https://alphacephei.com/vosk/models) 页面下载你需要的模型,并将模型解压到本地,并将模型文件夹的路径添加到软件的设置中。目前 Vosk 字幕引擎还不支持翻译字幕内容。 +如果要使用 Vosk 本地字幕引擎,首先需要在 [Vosk Models](https://alphacephei.com/vosk/models) 页面下载你需要的模型,并将模型解压到本地,并将模型文件夹的路径添加到软件的设置中。 - + -**如果你觉得上述字幕引擎不能满足你的需求,而且你会 Python,那么你可以考虑开发自己的字幕引擎。详细说明请参考[字幕引擎说明文档](./docs/engine-manual/zh.md)。** +### 使用 SOSV 模型 + +使用 SOSV 模型的方式和 Vosk 一样,下载地址如下:https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model ## ⚙️ 自带字幕引擎说明 -目前软件自带 2 个字幕引擎,正在规划新的引擎。它们的详细信息如下。 +目前软件自带 3 个字幕引擎,正在规划新的引擎。它们的详细信息如下。 ### Gummy 字幕引擎(云端) @@ 
-102,7 +138,12 @@ $$ ### Vosk 字幕引擎(本地) -基于 [vosk-api](https://github.com/alphacep/vosk-api) 开发。目前只支持生成音频对应的原文,不支持生成翻译内容。 +基于 [vosk-api](https://github.com/alphacep/vosk-api) 开发。该字幕引擎的优点是可选的语言模型非常多(超过 30 种),缺点是识别效果比较差,且生成内容没有标点符号。 + + +### SOSV 字幕引擎(本地) + +[SOSV](https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model) 是一个整合包,该整合包主要基于 [Shepra-ONNX SenseVoice](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html),并添加了端点检测模型和标点恢复模型。该模型支持识别的语言有:英语、中文、日语、韩语、粤语。 ### 新规划字幕引擎 @@ -112,6 +153,7 @@ $$ - [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx) - [SenseVoice](https://github.com/FunAudioLLM/SenseVoice) - [FunASR](https://github.com/modelscope/FunASR) +- [WhisperLiveKit](https://github.com/QuentinFuxa/WhisperLiveKit) ## 🚀 项目运行 @@ -128,6 +170,7 @@ npm install 首先进入 `engine` 文件夹,执行如下指令创建虚拟环境(需要使用大于等于 Python 3.10 的 Python 运行环境,建议使用 Python 3.12): ```bash +cd ./engine # in ./engine folder python -m venv .venv # or @@ -149,12 +192,6 @@ source .venv/bin/activate pip install -r requirements.txt ``` -如果在 Linux 系统上安装 `samplerate` 模块报错,可以尝试使用以下命令单独安装: - -```bash -pip install samplerate --only-binary=:all: -``` - 然后使用 `pyinstaller` 构建项目: ```bash diff --git a/README_en.md b/README_en.md index 38386e3..db74918 100644 --- a/README_en.md +++ b/README_en.md @@ -3,7 +3,7 @@Auto Caption is a cross-platform real-time caption display software.
-
+
@@ -14,14 +14,18 @@
| English
| 日本語 |
Version 0.7.0 has been released, improving the software interface and adding software log display. The local caption engine is under development and is expected to be released in the form of Python code...
+Version 1.0.0 has been released, with the addition of the SOSV local caption model. More caption models are being developed...
 ## 📥 Download -[GitHub Releases](https://github.com/HiMeditator/auto-caption/releases) +Software Download: [GitHub Releases](https://github.com/HiMeditator/auto-caption/releases) + +Vosk Model Download: [Vosk Models](https://alphacephei.com/vosk/models) + +SOSV Model Download: [Shepra-ONNX SenseVoice Model](https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model) ## 📚 Documentation @@ -29,16 +33,15 @@ [Caption Engine Documentation](./docs/engine-manual/en.md) -[Project API Documentation (Chinese)](./docs/api-docs/) - [Changelog](./docs/CHANGELOG.md) ## ✨ Features - Generate captions from audio output or microphone input +- Supports translation by calling local Ollama models or cloud-based Google Translate API - Cross-platform (Windows, macOS, Linux) and multi-language interface (Chinese, English, Japanese) support - Rich caption style settings (font, font size, font weight, font color, background color, etc.) -- Flexible caption engine selection (Alibaba Cloud Gummy cloud model, local Vosk model, self-developed model) +- Flexible caption engine selection (Alibaba Cloud Gummy cloud model, local Vosk model, local SOSV model, or you can develop your own model) - Multi-language recognition and translation (see below "⚙️ Built-in Subtitle Engines") - Subtitle record display and export (supports exporting `.srt` and `.json` formats) @@ -51,29 +54,63 @@ The software has been adapted for Windows, macOS, and Linux platforms. The teste | Windows 11 24H2 | x64 | ✅ | ✅ | | macOS Sequoia 15.5 | arm64 | ✅ [Additional config required](./docs/user-manual/en.md#capturing-system-audio-output-on-macos) | ✅ | | Ubuntu 24.04.2 | x64 | ✅ | ✅ | -| Kali Linux 2022.3 | x64 | ✅ | ✅ | -| Kylin Server V10 SP3 | x64 | ✅ | ✅ | Additional configuration is required to capture system audio output on macOS and Linux platforms. See [Auto Caption User Manual](./docs/user-manual/en.md) for details. -> The international version of Alibaba Cloud services does not provide the Gummy model, so non-Chinese users currently cannot use the Gummy caption engine. -To use the default Gummy caption engine (which uses cloud-based models for speech recognition and translation), you first need to obtain an API KEY from the Alibaba Cloud Bailian platform. Then add the API KEY to the software settings or configure it in environment variables (only Windows platform supports reading API KEY from environment variables) to properly use this model. Related tutorials: +After downloading the software, you need to select the corresponding model according to your needs and then configure the model. 
-- [Obtaining API KEY (Chinese)](https://help.aliyun.com/zh/model-studio/get-api-key) -- [Configuring API Key through Environment Variables (Chinese)](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables) +| | Recognition Quality | Deployment Type | Supported Languages | Translation | Notes | +| ------------------------------------------------------------ | ------------------- | ------------------ | ------------------- | ------------- | ---------------------------------------------------------- | +| [Gummy](https://help.aliyun.com/zh/model-studio/gummy-speech-recognition-translation) | Excellent 😊 | Alibaba Cloud | 10 languages | Built-in | Paid, 0.54 CNY/hour | +| [Vosk](https://alphacephei.com/vosk) | Poor 😞 | Local / CPU | Over 30 languages | Requires setup | Supports many languages | +| [SOSV](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html) | Fair 😐 | Local / CPU | 5 languages | Requires setup | Only one model available | +| DIY Development | 🤔 | Custom | Custom | Custom | Develop your own using Python according to [documentation](./docs/engine-manual/zh.md) | -> The recognition performance of Vosk models is suboptimal, please use with caution. +If you choose to use the Vosk or SOSV model, you also need to configure your own translation model. -To use the Vosk local caption engine, first download your required model from [Vosk Models](https://alphacephei.com/vosk/models) page, extract the model locally, and add the model folder path to the software settings. Currently, the Vosk caption engine does not support translated captions. +### Configuring Translation Models - + -**If you find the above caption engines don't meet your needs and you know Python, you may consider developing your own caption engine. For detailed instructions, please refer to the [Caption Engine Documentation](./docs/engine-manual/en.md).** +> Note: Translation is not real-time. The translation model is only called after each sentence recognition is completed. + +#### Ollama Local Model + +> Note: Using models with too many parameters will lead to high resource consumption and translation delays. It is recommended to use models with less than 1B parameters, such as: `qwen2.5:0.5b`, `qwen3:0.6b`. + +Before using this model, you need to ensure that [Ollama](https://ollama.com/) software is installed on your machine and the required large language model has been downloaded. Simply add the name of the large model you want to call to the `Ollama` field in the settings. + +#### Google Translate API + +> Note: Google Translate API is not available in some regions. + +No configuration required, just connect to the internet to use. + +### Using Gummy Model + +> The international version of Alibaba Cloud services does not seem to provide the Gummy model, so non-Chinese users may not be able to use the Gummy caption engine at present. + +To use the default Gummy caption engine (using cloud models for speech recognition and translation), you first need to obtain an API KEY from Alibaba Cloud Bailian platform, then add the API KEY to the software settings or configure it in the environment variables (only Windows platform supports reading API KEY from environment variables), so that the model can be used normally. 
Related tutorials:
+
+- [Get API KEY](https://help.aliyun.com/zh/model-studio/get-api-key)
+- [Configure API Key through Environment Variables](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables)
+
+### Using Vosk Model
+
+> The recognition effect of the Vosk model is poor, please use it with caution.
+
+To use the Vosk local caption engine, first download the model you need from the [Vosk Models](https://alphacephei.com/vosk/models) page, unzip the model locally, and add the path of the model folder to the software settings.
+
+
+
+### Using SOSV Model
+
+The way to use the SOSV model is the same as Vosk. The download address is as follows: https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model
## ⚙️ Built-in Subtitle Engines
-Currently, the software comes with 2 subtitle engines, with new engines under development. Their detailed information is as follows.
+Currently, the software comes with 3 caption engines, with new engines under development. Their detailed information is as follows.
### Gummy Subtitle Engine (Cloud)
@@ -92,7 +129,7 @@ Developed based on Tongyi Lab's [Gummy Speech Translation Model](https://help.al
**Network Traffic Consumption:**
-The subtitle engine uses native sample rate (assumed to be 48kHz) for sampling, with 16bit sample depth and mono channel, so the upload rate is approximately:
+The caption engine uses native sample rate (assumed to be 48kHz) for sampling, with 16bit sample depth and mono channel, so the upload rate is approximately:
$$
48000\ \text{samples/second} \times 2\ \text{bytes/sample} \times 1\ \text{channel} = 93.75\ \text{KB/s}
$$
@@ -102,7 +139,11 @@ The engine only uploads data when receiving audio streams, so the actual upload
### Vosk Subtitle Engine (Local)
-Developed based on [vosk-api](https://github.com/alphacep/vosk-api). Currently only supports generating original text from audio, does not support translation content.
+Developed based on [vosk-api](https://github.com/alphacep/vosk-api). The advantage of this caption engine is that there are many optional language models (over 30 languages), but the disadvantage is that the recognition effect is relatively poor, and the generated content has no punctuation.
+
+### SOSV Subtitle Engine (Local)
+
+[SOSV](https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model) is an integrated package, mainly based on [Sherpa-ONNX SenseVoice](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html), with an added endpoint detection model and a punctuation restoration model. The languages supported by this model for recognition are: English, Chinese, Japanese, Korean, and Cantonese.
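+
+Both local engines work on the same kind of input: 16 kHz, 16-bit mono PCM chunks, from which text is emitted once an utterance is finalized. The following is a minimal, generic [vosk-api](https://github.com/alphacep/vosk-api) sketch of that pattern (not the project's built-in `vosk.py` engine; the model path is a placeholder):
+
+```python
+import json
+from vosk import Model, KaldiRecognizer  # pip install vosk
+
+# Point this at a model folder downloaded and unzipped from the Vosk Models page
+model = Model("path/to/unzipped-vosk-model")
+recognizer = KaldiRecognizer(model, 16000)  # expects 16 kHz, 16-bit mono PCM
+
+def feed_chunk(chunk: bytes) -> str | None:
+    """Return the finalized text when an utterance ends, otherwise None."""
+    if recognizer.AcceptWaveform(chunk):
+        return json.loads(recognizer.Result()).get("text", "")
+    return None  # still inside a partial utterance
+```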
### Planned New Subtitle Engines @@ -112,6 +153,7 @@ The following are candidate models that will be selected based on model performa - [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx) - [SenseVoice](https://github.com/FunAudioLLM/SenseVoice) - [FunASR](https://github.com/modelscope/FunASR) +- [WhisperLiveKit](https://github.com/QuentinFuxa/WhisperLiveKit) ## 🚀 Project Setup @@ -128,6 +170,7 @@ npm install First enter the `engine` folder and execute the following commands to create a virtual environment (requires Python 3.10 or higher, with Python 3.12 recommended): ```bash +cd ./engine # in ./engine folder python -m venv .venv # or @@ -149,12 +192,6 @@ Then install dependencies (this step might result in errors on macOS and Linux, pip install -r requirements.txt ``` -If you encounter errors when installing the `samplerate` module on Linux systems, you can try installing it separately with this command: - -```bash -pip install samplerate --only-binary=:all: -``` - Then use `pyinstaller` to build the project: ```bash diff --git a/README_ja.md b/README_ja.md index ef68759..9996fe6 100644 --- a/README_ja.md +++ b/README_ja.md @@ -3,7 +3,7 @@Auto Caption はクロスプラットフォームのリアルタイム字幕表示ソフトウェアです。
-
+
@@ -14,14 +14,18 @@
| English
| 日本語 |
バージョン 0.7.0 がリリースされ、ソフトウェアインターフェースが最適化され、ログ記録表示機能が追加されました。ローカルの字幕エンジンは現在開発中であり、Pythonコードの形式でリリースされる予定です...
+v1.0.0 バージョンがリリースされ、SOSV ローカル字幕モデルが追加されました。より多くの字幕モデルが開発中です...
 ## 📥 ダウンロード -[GitHub Releases](https://github.com/HiMeditator/auto-caption/releases) +ソフトウェアダウンロード: [GitHub Releases](https://github.com/HiMeditator/auto-caption/releases) + +Vosk モデルダウンロード: [Vosk Models](https://alphacephei.com/vosk/models) + +SOSV モデルダウンロード: [Shepra-ONNX SenseVoice Model](https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model) ## 📚 関連ドキュメント @@ -29,16 +33,15 @@ [字幕エンジン説明ドキュメント](./docs/engine-manual/ja.md) -[プロジェクト API ドキュメント(中国語)](./docs/api-docs/) - [更新履歴](./docs/CHANGELOG.md) ## ✨ 特徴 - 音声出力またはマイク入力からの字幕生成 +- ローカルのOllamaモデルまたはクラウドベースのGoogle翻訳APIを呼び出して翻訳をサポート - クロスプラットフォーム(Windows、macOS、Linux)、多言語インターフェース(中国語、英語、日本語)対応 - 豊富な字幕スタイル設定(フォント、フォントサイズ、フォント太さ、フォント色、背景色など) -- 柔軟な字幕エンジン選択(阿里雲 Gummy クラウドモデル、ローカル Vosk モデル、独自開発モデル) +- 柔軟な字幕エンジン選択(阿里云Gummyクラウドモデル、ローカルVoskモデル、ローカルSOSVモデル、または独自にモデルを開発可能) - 多言語認識と翻訳(下記「⚙️ 字幕エンジン説明」参照) - 字幕記録表示とエクスポート(`.srt` および `.json` 形式のエクスポートに対応) @@ -56,24 +59,59 @@ macOS および Linux プラットフォームでシステムオーディオ出力を取得するには追加設定が必要です。詳細は[Auto Captionユーザーマニュアル](./docs/user-manual/ja.md)をご覧ください。 -> 阿里雲の国際版サービスでは Gummy モデルを提供していないため、現在中国以外のユーザーは Gummy 字幕エンジンを使用できません。 +ソフトウェアをダウンロードした後、自分のニーズに応じて対応するモデルを選択し、モデルを設定する必要があります。 -デフォルトの Gummy 字幕エンジン(クラウドベースのモデルを使用した音声認識と翻訳)を使用するには、まず阿里雲百煉プラットフォームから API KEY を取得する必要があります。その後、API KEY をソフトウェア設定に追加するか、環境変数に設定します(Windows プラットフォームのみ環境変数からの API KEY 読み取りをサポート)。関連チュートリアル: +| | 認識効果 | デプロイタイプ | 対応言語 | 翻訳 | 備考 | +| ------------------------------------------------------------ | -------- | ----------------- | ---------- | ---------- | ---------------------------------------------------------- | +| [Gummy](https://help.aliyun.com/zh/model-studio/gummy-speech-recognition-translation) | 良好😊 | クラウド / 阿里云 | 10種 | 内蔵翻訳 | 有料、0.54CNY / 時間 | +| [Vosk](https://alphacephei.com/vosk) | 不良😞 | ローカル / CPU | 30種以上 | 追加設定必要 | 対応言語が非常に多い | +| [SOSV](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html) | 一般😐 | ローカル / CPU | 5種 | 追加設定必要 | モデルは一つのみ | +| 自前開発 | 🤔 | カスタム | カスタム | カスタム | [ドキュメント](./docs/engine-manual/zh.md)に従ってPythonで自前開発 | -- [API KEY の取得(中国語)](https://help.aliyun.com/zh/model-studio/get-api-key) -- [環境変数を通じて API Key を設定(中国語)](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables) +VoskまたはSOSVモデルを使用する場合、独自の翻訳モデルも設定する必要があります。 -> Vosk モデルの認識精度は低いため、注意してご使用ください。 +### 翻訳モデルの設定 -Vosk ローカル字幕エンジンを使用するには、まず [Vosk Models](https://alphacephei.com/vosk/models) ページから必要なモデルをダウンロードし、ローカルに解凍した後、モデルフォルダのパスをソフトウェア設定に追加してください。現在、Vosk 字幕エンジンは字幕の翻訳をサポートしていません。 + - +> 注意:翻訳はリアルタイムではありません。翻訳モデルは各文の認識が完了した後にのみ呼び出されます。 -**上記の字幕エンジンがご要望を満たさず、かつ Python の知識をお持ちの場合、独自の字幕エンジンを開発することも可能です。詳細な説明は[字幕エンジン説明書](./docs/engine-manual/ja.md)をご参照ください。** +#### Ollama ローカルモデル + +> 注意:パラメータ数が多すぎるモデルを使用すると、リソース消費と翻訳遅延が大きくなります。1B未満のパラメータ数のモデルを使用することを推奨します。例:`qwen2.5:0.5b`、`qwen3:0.6b`。 + +このモデルを使用する前に、ローカルマシンに[Ollama](https://ollama.com/)ソフトウェアがインストールされ、必要な大規模言語モデルがダウンロードされていることを確認してください。必要な大規模モデル名を設定の`Ollama`フィールドに追加するだけでOKです。 + +#### Google翻訳API + +> 注意:Google翻訳APIは一部の地域では使用できません。 + +設定不要で、ネット接続があれば使用できます。 + +### Gummyモデルの使用 + +> 阿里云の国際版サービスにはGummyモデルが提供されていないため、現在中国以外のユーザーはGummy字幕エンジンを使用できない可能性があります。 + +デフォルトのGummy字幕エンジン(クラウドモデルを使用した音声認識と翻訳)を使用するには、まず阿里云百煉プラットフォームのAPI KEYを取得し、API KEYをソフトウェア設定に追加するか環境変数に設定する必要があります(Windowsプラットフォームのみ環境変数からのAPI KEY読み取りをサポート)。関連チュートリアル: + +- [API KEYの取得](https://help.aliyun.com/zh/model-studio/get-api-key) +- [環境変数へのAPI Keyの設定](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables) + +### Voskモデルの使用 + +> Voskモデルの認識効果は不良のため、注意して使用してください。 + +Voskローカル字幕エンジンを使用するには、まず[Vosk 
Models](https://alphacephei.com/vosk/models)ページから必要なモデルをダウンロードし、ローカルにモデルを解凍し、モデルフォルダのパスをソフトウェア設定に追加してください。 + + + +### SOSVモデルの使用 + +SOSVモデルの使用方法はVoskと同じで、ダウンロードアドレスは以下の通りです:https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model ## ⚙️ 字幕エンジン説明 -現在、ソフトウェアには2つの字幕エンジンが搭載されており、新しいエンジンが計画されています。それらの詳細情報は以下の通りです。 +現在、ソフトウェアには3つの字幕エンジンが搭載されており、新しいエンジンが計画されています。それらの詳細情報は以下の通りです。 ### Gummy 字幕エンジン(クラウド) @@ -102,7 +140,11 @@ $$ ### Vosk字幕エンジン(ローカル) -[vosk-api](https://github.com/alphacep/vosk-api) をベースに開発されています。現在は音声に対応する原文の生成のみをサポートしており、翻訳コンテンツはサポートしていません。 +[vosk-api](https://github.com/alphacep/vosk-api)をベースに開発。この字幕エンジンの利点は選択可能な言語モデルが非常に多く(30言語以上)、欠点は認識効果が比較的悪く、生成内容に句読点がないことです。 + +### SOSV 字幕エンジン(ローカル) + +[SOSV](https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model)は統合パッケージで、主に[Shepra-ONNX SenseVoice](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html)をベースにし、エンドポイント検出モデルと句読点復元モデルを追加しています。このモデルが認識をサポートする言語は:英語、中国語、日本語、韓国語、広東語です。 ### 新規計画字幕エンジン @@ -112,6 +154,7 @@ $$ - [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx) - [SenseVoice](https://github.com/FunAudioLLM/SenseVoice) - [FunASR](https://github.com/modelscope/FunASR) +- [WhisperLiveKit](https://github.com/QuentinFuxa/WhisperLiveKit) ## 🚀 プロジェクト実行 @@ -128,6 +171,7 @@ npm install まず `engine` フォルダに入り、以下のコマンドを実行して仮想環境を作成します(Python 3.10 以上が必要で、Python 3.12 が推奨されます): ```bash +cd ./engine # ./engine フォルダ内 python -m venv .venv # または @@ -149,12 +193,6 @@ source .venv/bin/activate pip install -r requirements.txt ``` -Linux システムで `samplerate` モジュールのインストールに問題が発生した場合、以下のコマンドで個別にインストールを試すことができます: - -```bash -pip install samplerate --only-binary=:all: -``` - その後、`pyinstaller` を使用してプロジェクトをビルドします: ```bash diff --git a/assets/media/config_en.png b/assets/media/config_en.png new file mode 100644 index 0000000..37ba249 Binary files /dev/null and b/assets/media/config_en.png differ diff --git a/assets/media/config_ja.png b/assets/media/config_ja.png new file mode 100644 index 0000000..d8ea35c Binary files /dev/null and b/assets/media/config_ja.png differ diff --git a/assets/media/config_zh.png b/assets/media/config_zh.png new file mode 100644 index 0000000..2692085 Binary files /dev/null and b/assets/media/config_zh.png differ diff --git a/assets/media/engine_en.png b/assets/media/engine_en.png new file mode 100644 index 0000000..a16db66 Binary files /dev/null and b/assets/media/engine_en.png differ diff --git a/assets/media/engine_ja.png b/assets/media/engine_ja.png new file mode 100644 index 0000000..c43e4d4 Binary files /dev/null and b/assets/media/engine_ja.png differ diff --git a/assets/media/engine_zh.png b/assets/media/engine_zh.png new file mode 100644 index 0000000..b733477 Binary files /dev/null and b/assets/media/engine_zh.png differ diff --git a/assets/media/main_en.png b/assets/media/main_en.png index 662aa39..ab3e85c 100644 Binary files a/assets/media/main_en.png and b/assets/media/main_en.png differ diff --git a/assets/media/main_ja.png b/assets/media/main_ja.png index 9bb1fcf..accaa95 100644 Binary files a/assets/media/main_ja.png and b/assets/media/main_ja.png differ diff --git a/assets/media/main_zh.png b/assets/media/main_zh.png index 27d96bb..1f986f2 100644 Binary files a/assets/media/main_zh.png and b/assets/media/main_zh.png differ diff --git a/assets/media/vosk_en.png b/assets/media/vosk_en.png deleted file mode 100644 index ea991b9..0000000 Binary files a/assets/media/vosk_en.png and /dev/null differ diff --git a/assets/media/vosk_ja.png b/assets/media/vosk_ja.png deleted file mode 100644 index 
e09051c..0000000 Binary files a/assets/media/vosk_ja.png and /dev/null differ diff --git a/assets/media/vosk_zh.png b/assets/media/vosk_zh.png deleted file mode 100644 index 72e2111..0000000 Binary files a/assets/media/vosk_zh.png and /dev/null differ diff --git a/docs/CHANGELOG.md b/docs/CHANGELOG.md index 09f264c..c34075e 100644 --- a/docs/CHANGELOG.md +++ b/docs/CHANGELOG.md @@ -156,15 +156,20 @@ - 更清晰的日志输出 -## v0.8.0 +## v1.0.0 -2025-09-?? +2025-09-08 ### 新增功能 -- 字幕引擎添加超时关闭功能:如果在规定时间字幕引擎没有启动成功会自动关闭、在字幕引擎启动过程中也可选择关闭字幕引擎 -- 添加非实时翻译功能:支持调用 Ollama 本地模型进行翻译、支持调用 Google 翻译 API 进行翻译 +- 字幕引擎添加超时关闭功能:如果在规定时间字幕引擎没有启动成功会自动关闭;在字幕引擎启动过程中可选择关闭字幕引擎 +- 添加非实时翻译功能:支持调用 Ollama 本地模型进行翻译;支持调用 Google 翻译 API 进行翻译 +- 添加新的翻译模型:添加 SOSV 模型,支持识别英语、中文、日语、韩语、粤语 +- 添加录音功能:可以将字幕引擎识别的音频保存为 .wav 文件 +- 添加多行字幕功能,用户可以设置字幕窗口显示的字幕的行数 ### 优化体验 +- 优化部分提示信息显示位置 +- 替换重采样模型,提高音频重采样质量 - 带有额外信息的标签颜色改为与主题色一致 \ No newline at end of file diff --git a/docs/TODO.md b/docs/TODO.md index df0bb9e..978a984 100644 --- a/docs/TODO.md +++ b/docs/TODO.md @@ -21,6 +21,8 @@ - [x] 复制字幕记录可选择只复制最近的字幕记录 *2025/08/18* - [x] 添加颜色主题设置 *2025/08/18* - [x] 前端页面添加日志内容展示 *2025/08/19* +- [x] 添加 Ollama 模型用于本地字幕引擎的翻译 *2025/09/04* +- [x] 验证 / 添加基于 sherpa-onnx 的字幕引擎 *2025/09/06* ## 待完成 @@ -29,7 +31,6 @@ ## 后续计划 -- [ ] 添加 Ollama 模型用于本地字幕引擎的翻译 - [ ] 验证 / 添加基于 FunASR 的字幕引擎 - [ ] 减小软件不必要的体积 diff --git a/docs/engine-manual/en.md b/docs/engine-manual/en.md index 226c6d0..f03ca46 100644 --- a/docs/engine-manual/en.md +++ b/docs/engine-manual/en.md @@ -1,175 +1,207 @@ -# Caption Engine Documentation +# Caption Engine Documentation -Corresponding Version: v0.6.0 +Corresponding version: v1.0.0 - + -## Introduction to the Caption Engine +## Introduction to the Caption Engine -The so-called caption engine is essentially a subprogram that continuously captures real-time streaming data from the system's audio input (microphone) or output (speakers) and invokes an audio-to-text model to generate corresponding captions for the audio. The generated captions are converted into JSON-formatted string data and passed to the main program via standard output (ensuring the string can be correctly interpreted as a JSON object by the main program). The main program reads and interprets the caption data, processes it, and displays it in the window. +The so-called caption engine is actually a subprocess that captures streaming data from system audio input (microphone) or output (speaker) in real-time, and invokes an audio-to-text model to generate captions for the corresponding audio. The generated captions are converted into JSON-formatted string data and transmitted to the main program through standard output (ensuring that the string received by the main program can be correctly interpreted as a JSON object). The main program reads and interprets the caption data, processes it, and displays it in the window. -**The communication standard between the caption engine process and the Electron main process is: [caption engine api-doc](../api-docs/caption-engine.md).** +**Communication between the caption engine process and Electron main process follows the standard: [caption engine api-doc](../api-docs/caption-engine.md).** -## Workflow +## Execution Flow -The communication flow between the main process and the caption engine: +Process of communication between main process and caption engine: -### Starting the Engine +### Starting the Engine -- **Main Process**: Uses `child_process.spawn()` to launch the caption engine process. 
-- **Caption Engine Process**: Creates a TCP Socket server thread. After creation, it outputs a JSON object string via standard output, containing a `command` field with the value `connect`. -- **Main Process**: Monitors the standard output of the caption engine process, attempts to split it line by line, parses it into a JSON object, and checks if the `command` field value is `connect`. If so, it connects to the TCP Socket server. +- Electron main process: Use `child_process.spawn()` to start the caption engine process +- Caption engine process: Create a TCP Socket server thread, after creation output a JSON object converted to string via standard output, containing the `command` field with value `connect` +- Main process: Listen to the caption engine process's standard output, try to split the standard output by lines, parse it into a JSON object, and check if the object's `command` field value is `connect`. If so, connect to the TCP Socket server -### Caption Recognition +### Caption Recognition -- **Caption Engine Process**: The main thread monitors system audio output, sends audio data chunks to the caption engine for parsing, and outputs the parsed caption data object strings via standard output. -- **Main Process**: Continues to monitor the standard output of the caption engine and performs different operations based on the `command` field of the parsed object. +- Caption engine process: Create a new thread to monitor system audio output, put acquired audio data chunks into a shared queue (`shared_data.chunk_queue`). The caption engine continuously reads audio data chunks from the shared queue and parses them. The caption engine may also create a new thread to perform translation operations. Finally, the caption engine sends parsed caption data object strings through standard output +- Electron main process: Continuously listen to the caption engine's standard output and take different actions based on the parsed object's `command` field -### Closing the Engine +### Stopping the Engine -- **Main Process**: When the user closes the caption engine via the frontend, the main process sends a JSON object string with the `command` field set to `stop` to the caption engine process via Socket communication. -- **Caption Engine Process**: Receives the object string, parses it, and if the `command` field is `stop`, sets the global variable `thread_data.status` to `stop`. -- **Caption Engine Process**: The main thread's loop for monitoring system audio output ends when `thread_data.status` is not `running`, releases resources, and terminates. -- **Main Process**: Detects the termination of the caption engine process, performs corresponding cleanup, and provides feedback to the frontend. +- Electron main process: When the user operates to close the caption engine in the frontend, the main process sends an object string with `command` field set to `stop` to the caption engine process through Socket communication +- Caption engine process: Receive the caption data object string sent by the main engine process, parse the string into an object. 
If the object's `command` field is `stop`, set the value of global variable `shared_data.status` to `stop`
+- Caption engine process: Main thread continuously monitors system audio output, when `shared_data.status` value is not `running`, end the loop, release resources, and terminate execution
+- Electron main process: If the caption engine process termination is detected, perform corresponding processing and provide feedback to the frontend
-## Implemented Features
+## Implemented Features
-The following features are already implemented and can be reused directly.
+The following features have been implemented and can be directly reused.
-### Standard Output
+### Standard Output
-Supports printing general information, commands, and error messages.
+Can output regular information, commands, and error messages.
-Example:
+Examples:
-```python
-from utils import stdout, stdout_cmd, stdout_obj, stderr
-stdout("Hello") # {"command": "print", "content": "Hello"}\n
-stdout_cmd("connect", "8080") # {"command": "connect", "content": "8080"}\n
-stdout_obj({"command": "print", "content": "Hello"})
-stderr("Error Info")
-```
+```python
+from utils import stdout, stdout_cmd, stdout_obj, stderr
+# {"command": "print", "content": "Hello"}\n
+stdout("Hello")
+# {"command": "connect", "content": "8080"}\n
+stdout_cmd("connect", "8080")
+# {"command": "print", "content": "print"}\n
+stdout_obj({"command": "print", "content": "print"})
+# sys.stderr.write("Error Info" + "\n")
+stderr("Error Info")
+```
-### Creating a Socket Service
+### Creating Socket Service
-This Socket service listens on a specified port, parses content sent by the Electron main program, and may modify the value of `thread_data.status`.
+This Socket service listens on a specified port, parses content sent by the Electron main program, and may change the value of `shared_data.status`.
-Example:
+Example:
-```python
-from utils import start_server
-from utils import thread_data
-port = 8080
-start_server(port)
-while thread_data == 'running':
-    # do something
-    pass
-```
+```python
+from utils import start_server
+from utils import shared_data
+port = 8080
+start_server(port)
+while shared_data.status == 'running':
+    # do something
+    pass
+```
-### Audio Capture
+### Audio Acquisition
-The `AudioStream` class captures audio data and is cross-platform, supporting Windows, Linux, and macOS. Its initialization includes two parameters:
+The `AudioStream` class is used to acquire audio data, with cross-platform implementation supporting Windows, Linux, and macOS. The class initialization includes two parameters:
-- `audio_type`: The type of audio to capture. `0` for system output audio (speakers), `1` for system input audio (microphone).
-- `chunk_rate`: The frequency of audio data capture, i.e., the number of audio chunks captured per second.
+- `audio_type`: Audio acquisition type, 0 for system output audio (speaker), 1 for system input audio (microphone)
+- `chunk_rate`: Audio data acquisition frequency, number of audio chunks acquired per second, default is 10
-The class includes three methods:
+The class contains four methods:
-- `open_stream()`: Starts audio capture.
-- `read_chunk() -> bytes`: Reads an audio chunk.
-- `close_stream()`: Stops audio capture.
+- `open_stream()`: Start audio acquisition
+- `read_chunk() -> bytes`: Read an audio chunk
+- `close_stream()`: Close audio acquisition
+- `close_stream_signal()`: Thread-safe closing of system audio input stream
-Example:
+Example:
-```python
-from sysaudio import AudioStream
-audio_type = 0
-chunk_rate = 20
-stream = AudioStream(audio_type, chunk_rate)
-stream.open_stream()
-while True:
-    data = stream.read_chunk()
-    # do something with data
-    pass
-stream.close_stream()
-```
+```python
+from sysaudio import AudioStream
+audio_type = 0
+chunk_rate = 20
+stream = AudioStream(audio_type, chunk_rate)
+stream.open_stream()
+while True:
+    data = stream.read_chunk()
+    # do something with data
+    pass
+stream.close_stream()
+```
-### Audio Processing
+### Audio Processing
-The captured audio stream may require preprocessing before conversion to text. Typically, multi-channel audio needs to be converted to mono, and resampling may be necessary. This project provides three audio processing functions:
+Before converting audio streams to text, preprocessing may be required. Usually, multi-channel audio needs to be converted to single-channel audio, and resampling may also be needed. This project provides two audio processing functions:
-- `merge_chunk_channels(chunk: bytes, channels: int) -> bytes`: Converts a multi-channel audio chunk to mono.
-- `resample_chunk_mono(chunk: bytes, channels: int, orig_sr: int, target_sr: int, mode="sinc_best") -> bytes`: Converts a multi-channel audio chunk to mono and resamples it.
-- `resample_mono_chunk(chunk: bytes, orig_sr: int, target_sr: int, mode="sinc_best") -> bytes`: Resamples a mono audio chunk.
+- `merge_chunk_channels(chunk: bytes, channels: int) -> bytes`: Convert multi-channel audio chunks to single-channel audio chunks
+- `resample_chunk_mono(chunk: bytes, channels: int, orig_sr: int, target_sr: int) -> bytes`: Convert current multi-channel audio data chunks to single-channel audio data chunks, then perform resampling
-## Features to Be Implemented in the Caption Engine
+Example:
-### Audio-to-Text Conversion
+```python
+from sysaudio import AudioStream
+from utils import resample_chunk_mono
+stream = AudioStream(1)
+while True:
+    raw_chunk = stream.read_chunk()
+    chunk = resample_chunk_mono(raw_chunk, stream.CHANNELS, stream.RATE, 16000)
+    # do something with chunk
+```
-After obtaining a suitable audio stream, it needs to be converted to text. Typically, various models (cloud-based or local) are used for this purpose. Choose the appropriate model based on requirements.
+## Features to be Implemented by the Caption Engine
-This part is recommended to be encapsulated as a class with three methods:
+### Audio to Text Conversion
-- `start(self)`: Starts the model.
-- `send_audio_frame(self, data: bytes)`: Processes the current audio chunk data. **The generated caption data is sent to the Electron main process via standard output.**
-- `stop(self)`: Stops the model.
+After obtaining suitable audio streams, the audio stream needs to be converted to text. Generally, various models (cloud or local) are used to implement audio-to-text conversion. Appropriate models should be selected according to requirements.
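+
+The capture and preprocessing helpers above are typically combined into a producer loop that fills the queue the engine consumes. A minimal sketch, assuming `shared_data.chunk_queue` is a standard `queue.Queue` and that `resample_chunk_mono` is importable from `utils` as in the example above (the built-in engines also handle errors and thread shutdown):
+
+```python
+import threading
+from sysaudio import AudioStream
+from utils import shared_data, resample_chunk_mono
+
+def capture_loop(audio_type: int = 0, chunk_rate: int = 10):
+    """Producer thread: capture system audio and queue 16 kHz mono chunks."""
+    stream = AudioStream(audio_type, chunk_rate)
+    stream.open_stream()
+    while shared_data.status == 'running':
+        raw_chunk = stream.read_chunk()
+        chunk = resample_chunk_mono(raw_chunk, stream.CHANNELS, stream.RATE, 16000)
+        shared_data.chunk_queue.put(chunk)
+    stream.close_stream()
+
+threading.Thread(target=capture_loop, daemon=True).start()
+```
+
+The engine class described next drains this queue in its `translate` method and feeds each chunk to the speech model.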
-Complete caption engine examples: +It is recommended to encapsulate this as a class, implementing four methods: -- [gummy.py](../../engine/audio2text/gummy.py) -- [vosk.py](../../engine/audio2text/vosk.py) +- `start(self)`: Start the model +- `send_audio_frame(self, data: bytes)`: Process current audio chunk data, **generated caption data is sent to Electron main process through standard output** +- `translate(self)`: Continuously retrieve data chunks from `shared_data.chunk_queue` and call `send_audio_frame` method to process data chunks +- `stop(self)`: Stop the model -### Caption Translation +Complete caption engine examples: -Some speech-to-text models do not provide translation. If needed, a translation module must be added. +- [gummy.py](../../engine/audio2text/gummy.py) +- [vosk.py](../../engine/audio2text/vosk.py) +- [sosv.py](../../engine/audio2text/sosv.py) -### Sending Caption Data +### Caption Translation -After obtaining the text for the current audio stream, it must be sent to the main program. The caption engine process passes caption data to the Electron main process via standard output. +Some speech-to-text models do not provide translation. If needed, an additional translation module needs to be added, or built-in translation modules can be used. -The content must be a JSON string, with the JSON object including the following parameters: +Example: -```typescript -export interface CaptionItem { - command: "caption", - index: number, // Caption sequence number - time_s: string, // Start time of the current caption - time_t: string, // End time of the current caption - text: string, // Caption content - translation: string // Caption translation -} -``` +```python +from utils import google_translate, ollama_translate +text = "This is a translation test." +google_translate("", "en", text, "time_s") +ollama_translate("qwen3:0.6b", "en", text, "time_s") +``` -**Note: Ensure the buffer is flushed after each JSON output to guarantee the Electron main process receives a string that can be parsed as a JSON object.** +### Caption Data Transmission -It is recommended to use the project's `stdout_obj` function for sending. +After obtaining the text from the current audio stream, the text needs to be sent to the main program. The caption engine process transmits caption data to the Electron main process through standard output. -### Command-Line Parameter Specification +The transmitted content must be a JSON string, where the JSON object needs to contain the following parameters: -Custom caption engine settings are provided via command-line arguments. 
The current project uses the following parameters: +```typescript +export interface CaptionItem { + command: "caption", + index: number, // Caption sequence number + time_s: string, // Current caption start time + time_t: string, // Current caption end time + text: string, // Caption content + translation: string // Caption translation +} +``` -```python -import argparse -if __name__ == "__main__": - parser = argparse.ArgumentParser(description='Convert system audio stream to text') - # Common parameters - parser.add_argument('-e', '--caption_engine', default='gummy', help='Caption engine: gummy or vosk') - parser.add_argument('-a', '--audio_type', default=0, help='Audio stream source: 0 for output, 1 for input') - parser.add_argument('-c', '--chunk_rate', default=10, help='Number of audio stream chunks collected per second') - parser.add_argument('-p', '--port', default=8080, help='The port to run the server on, 0 for no server') - # Gummy-specific parameters - parser.add_argument('-s', '--source_language', default='en', help='Source language code') - parser.add_argument('-t', '--target_language', default='zh', help='Target language code') - parser.add_argument('-k', '--api_key', default='', help='API KEY for Gummy model') - # Vosk-specific parameters - parser.add_argument('-m', '--model_path', default='', help='The path to the vosk model.') -``` +**Note that the buffer must be flushed after each caption JSON data output to ensure that the Electron main process receives strings that can be interpreted as JSON objects each time.** It is recommended to use the project's existing `stdout_obj` function for transmission. -For example, to use the Gummy model with Japanese as the source language, Chinese as the target language, and system audio output captions with 0.1s audio chunks, the command-line arguments would be: +### Command Line Parameter Specification -```bash -python main.py -e gummy -s ja -t zh -a 0 -c 10 -k{{ $t('status.about.desc') }}