release v0.4.0

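- Update the README and user manual with usage instructions for the Vosk engine
- Modify the build configuration to support packaging the Vosk engine
- Bump the version number to 0.4.0 in preparation for releasing the new features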
2026-02-04 04:14:42 +08:00 · 2025-07-11 01:33:04 +08:00
parent d354a6fefa
commit 0b8b823b2e
33 changed files with 283 additions and 93 deletions
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
@@ -72,3 +72,18 @@
 ### 修复bug

 - 修复使用系统主题时暗色系统载入为亮色的问题
+
+## v0.4.0
+
+2025-07-11
+
+添加了 Vosk 本地字幕引擎，更新了项目文档，继续优化使用体验。
+
+### 新增功能
+
+- 添加了基于 Vosk 的字幕引擎， **当前 Vosk 字幕引擎暂不支持翻译**
+- 更新用户界面，增加 Vosk 引擎选项和模型路径设置
+
+### 优化体验
+
+- 字幕窗口右上角图标的颜色改为和字幕原文字体颜色一致
--- a/docs/TODO.md
+++ b/docs/TODO.md
@@ -14,7 +14,6 @@
 ## 待完成

 - [ ] 添加 Ollama 模型用于本地字幕引擎的翻译
- [ ] 更新 README 和用户手册（字幕引擎构建、Vosk 模型获取和使用）
 - [ ] 添加本地字幕引擎
  - [ ] 验证 / 添加基于 FunASR 的字幕引擎
 - [ ] 减小软件不必要的体积
--- a/docs/engine-manual/en.md
+++ b/docs/engine-manual/en.md
@@ -1,6 +1,6 @@
 # Caption Engine Documentation

-Corresponding Version: v0.3.0
+Corresponding Version: v0.4.0

 ![](../../assets/media/structure_en.png)

@@ -80,6 +80,10 @@ def convert_audio_to_text(s_lang, t_lang, audio_type, chunk_rate, api_key):
            break
 ```

+### Caption Translation
+
+Some speech-to-text models don't provide translation functionality, requiring an additional translation module. This part can use either cloud-based translation APIs or local translation models.
+
 ### Data Transmission

 After obtaining the text of the current audio stream, it needs to be transmitted to the main program. The caption engine process passes the caption data to the Electron main process through standard output.
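
For illustration only (this is not part of the commit): a minimal Python sketch of the standard-output hand-off described in the Data Transmission section, assuming one JSON object per line. The field names `index`, `text`, and `translation` and both helper functions are hypothetical and are not taken from the project's `main-gummy.py`.

```python
import json
import sys


def format_caption(index: int, text: str, translation: str = "") -> str:
    """Serialize one caption as a single-line JSON string.

    The field names are hypothetical, chosen only for this illustration.
    """
    payload = {"index": index, "text": text, "translation": translation}
    # ensure_ascii=False keeps non-ASCII captions readable in the output.
    return json.dumps(payload, ensure_ascii=False)


def emit_caption(index: int, text: str, translation: str = "", stream=None) -> None:
    """Write one caption per line and flush so the reader sees it at once."""
    stream = sys.stdout if stream is None else stream
    stream.write(format_caption(index, text, translation) + "\n")
    stream.flush()


if __name__ == "__main__":
    emit_caption(1, "hello world")
    emit_caption(2, "你好", "hello")
```

Flushing after every line matters because a piped standard output is block-buffered by default in Python, so captions could otherwise reach the reader late. Enabling line buffering with `sys.stdout.reconfigure(line_buffering=True)` is another way to get the same effect.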
@@ -149,4 +153,4 @@ Data receiver code is as follows:

 ## Reference Code

-The `main-gummy.py` file under the `caption-engine` folder in this project serves as the entry point for the default caption engine. The `src\main\utils\engine.ts` file contains the server-side code for acquiring and processing data from the caption engine. You can read and understand the implementation details and the complete execution process of the caption engine as needed.
+The `main-gummy.py` file under the `caption-engine` folder in this project serves as the entry point for the default caption engine. The `src\main\utils\engine.ts` file contains the server-side code for acquiring and processing data from the caption engine. You can read and understand the implementation details and the complete execution process of the caption engine as needed.
--- a/docs/engine-manual/ja.md
+++ b/docs/engine-manual/ja.md
@@ -1,6 +1,6 @@
 # 字幕エンジンの説明文書

-対応バージョン：v0.3.0
+対応バージョン：v0.4.0

 この文書は大規模モデルを使用して翻訳されていますので、内容に正確でない部分があるかもしれません。

@@ -82,6 +82,10 @@ def convert_audio_to_text(s_lang, t_lang, audio_type, chunk_rate, api_key):
            break
 ```

+### 字幕翻訳
+
+音声認識モデルによっては翻訳機能を提供していないため、別途翻訳モジュールを追加する必要があります。この部分にはクラウドベースの翻訳APIを使用することも、ローカルの翻訳モデルを使用することも可能です。
+
 ### データの伝送

 現在の音声ストリームのテキストを得たら、それをメインプログラムに渡す必要があります。字幕エンジンプロセスは標準出力を通じて Electron のメインプロセスに字幕データを渡します。
@@ -121,4 +125,4 @@ sys.stdout.reconfigure(line_buffering=True)
 ...
 ```

-データ受信側のコードは
+データ受信側のコードは
--- a/docs/engine-manual/zh.md
+++ b/docs/engine-manual/zh.md
@@ -1,6 +1,6 @@
 # 字幕引擎说明文档

-对应版本：v0.3.0
+对应版本：v0.4.0

 ![](../../assets/media/structure_zh.png)

@@ -80,6 +80,10 @@ def convert_audio_to_text(s_lang, t_lang, audio_type, chunk_rate, api_key):
            break
 ```

+### 字幕翻译
+
+有的语音转文字模型并不提供翻译，需要再添加一个翻译模块。这部分可以使用云端翻译 API 也可以使用本地翻译模型。
+
 ### 数据传递

 在获取到当前音频流的文字后，需要将文字传递给主程序。字幕引擎进程通过标准输出将字幕数据传递给 electron 主进程。
--- a/docs/img/02_en.png
+++ b/docs/img/02_en.png
--- a/docs/img/02_ja.png
+++ b/docs/img/02_ja.png
--- a/docs/img/02_zh.png
+++ b/docs/img/02_zh.png
--- a/docs/user-manual/en.md
+++ b/docs/user-manual/en.md
@@ -1,6 +1,6 @@
 # Auto Caption User Manual

-Corresponding Version: v0.3.0
+Corresponding Version: v0.4.0

 ## Software Introduction

@@ -14,26 +14,30 @@ On Linux platforms, it can only generate captions for audio input (microphone),

 ### Software Limitations

-To use the default caption service, you need to obtain an API KEY from Alibaba Cloud.
+To use the Gummy caption engine, you need to obtain an API KEY from Alibaba Cloud.

 Additional configuration is required to capture audio output on macOS platform.

 The software is built using Electron, so the software size is inevitably large.

-## Software Usage
+## Preparation for Using Gummy Engine

-### Preparing the Alibaba Cloud Model Studio API KEY
+To use the default caption engine provided by the software (Alibaba Cloud Gummy), you need to obtain an API KEY from the Alibaba Cloud Bailian platform. Then add the API KEY to the software settings or configure it as an environment variable (reading the API KEY from environment variables is supported only on Windows).

-To use the default caption engine (Alibaba Cloud Gummy), you need to obtain an API KEY from the Alibaba Cloud Model Studio and configure it in your local environment variables.
+**The international version of Alibaba Cloud services does not provide the Gummy model, so non-Chinese users currently cannot use the default caption engine.**

-**The international version of Alibaba Cloud does not provide the Gummy model, so non-Chinese users currently cannot use the default caption engine. I am trying to develop a new local caption engine to ensure that all users have access to a default caption engine.**
+Alibaba Cloud provides detailed tutorials for these steps:

-Alibaba Cloud provides detailed tutorials for this:
+- [Obtaining API KEY (Chinese)](https://help.aliyun.com/zh/model-studio/get-api-key)
+- [Configuring API Key through Environment Variables (Chinese)](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables)

- [Obtain API KEY (Chinese)](https://help.aliyun.com/zh/model-studio/get-api-key)
- [Configure API Key in Environment Variables (Chinese)](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables)
+## Preparation for Using Vosk Engine

-### Capturing System Audio Output on macOS
+To use the Vosk local caption engine, first download the model you need from the [Vosk Models](https://alphacephei.com/vosk/models) page. Then extract the downloaded model package locally and add the path of the extracted model folder to the software settings. Currently, the Vosk caption engine does not support caption translation.
+
+![](../../assets/media/vosk_en.png)
+
+## Capturing System Audio Output on macOS

 > Based on the [Setup Multi-Output Device](https://github.com/ExistentialAudio/BlackHole/wiki/Multi-Output-Device) tutorial

@@ -57,6 +61,8 @@ Once BlackHole is confirmed installed, in the `Audio MIDI Setup` page, click the

 Now the caption engine can capture system audio output and generate captions.

+## Software Usage
+
 ### Modifying Settings

 Caption settings can be divided into three categories: general settings, caption engine settings, and caption style settings. Note that changes to general settings take effect immediately. For the other two categories, after making changes, you need to click the "Apply" option in the upper right corner of the corresponding settings module for the changes to take effect. If you click "Cancel Changes," the current modifications will not be saved and will revert to the previous state.
@@ -77,9 +83,9 @@ In the caption control window, you can see the records of all collected captions

 ## Caption Engine

-The so-called caption engine is actually a subprocess that real-time captures system audio input (recording) or output (playback) streaming data and uses an audio-to-text model to generate captions for the corresponding audio. The generated captions are output as JSON data converted to strings and returned to the main program. The main program reads the caption data, processes it, and displays it in the window.
+The so-called caption engine is essentially a subprogram that captures real-time streaming data from system audio input (recording) or output (playback), and invokes speech-to-text models to generate corresponding captions. The generated captions are converted into JSON-formatted strings and passed to the main program through standard output. The main program reads the caption data, processes it, and displays it in the window.

-The software provides a default caption engine. If you need other caption engines, you can call them by enabling the custom engine option (other engines need to be developed specifically for this software). The engine path is the path to the custom caption engine on your computer, and the engine command is the runtime parameters for the custom caption engine, which need to be filled out according to the rules of the specific caption engine.
+The software provides two default caption engines. If you need other caption engines, you can invoke them by enabling the custom engine option (other engines need to be specifically developed for this software). The engine path refers to the location of the custom caption engine on your computer, while the engine command represents the runtime parameters of the custom caption engine, which should be configured according to the rules of that particular caption engine.

 ![](../img/02_en.png)
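
For illustration only (not part of the commit): a rough Python sketch of how a host program could launch a custom caption engine from an engine path plus runtime parameters and read its JSON lines from standard output. The project's real receiver is the TypeScript code in `src\main\utils\engine.ts`; the function `read_captions` and the stand-in engine below are assumptions made for this example.

```python
import json
import subprocess
import sys


def read_captions(engine_path: str, engine_args: list[str]) -> list[dict]:
    """Run an engine process and collect every JSON line it prints."""
    captions = []
    with subprocess.Popen(
        [engine_path, *engine_args],
        stdout=subprocess.PIPE,
        text=True,
        encoding="utf-8",
    ) as proc:
        for line in proc.stdout:
            line = line.strip()
            if not line:
                continue
            try:
                captions.append(json.loads(line))
            except json.JSONDecodeError:
                # Skip anything that is not a JSON line (e.g. stray log output).
                continue
    return captions


if __name__ == "__main__":
    # Stand-in "engine": the current Python interpreter printing one JSON line.
    demo = 'import json; print(json.dumps({"text": "hello"}))'
    print(read_captions(sys.executable, ["-c", demo]))
```

This sketch collects all captions and returns them after the engine exits; a real host would more likely handle each line as it arrives so captions can be displayed in real time.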

--- a/docs/user-manual/ja.md
+++ b/docs/user-manual/ja.md
@@ -1,6 +1,6 @@
 # Auto Caption ユーザーマニュアル

-対応バージョン：v0.3.0
+対応バージョン：v0.4.0

 この文書は大規模モデルを使用して翻訳されていますので、内容に正確でない部分があるかもしれません。

@@ -16,26 +16,30 @@ Linux プラットフォームでは、オーディオ入力（マイク）か

 ### ソフトウェアの欠点

-デフォルトの字幕サービスを使用するには、アリババクラウドの API KEY を取得する必要があります。
+Gummy 字幕エンジンを使用するには、アリババクラウドの API KEY を取得する必要があります。

 macOS プラットフォームでオーディオ出力を取得するには追加の設定が必要です。

 ソフトウェアは Electron で構築されているため、そのサイズは避けられないほど大きいです。

-## ソフトウェアの使用方法
+## Gummyエンジン使用前の準備

-### 百炼プラットフォームの API KEY の準備
+ソフトウェアが提供するデフォルトの字幕エンジン（Alibaba Cloud Gummy）を使用するには、Alibaba Cloud百煉プラットフォームからAPI KEYを取得する必要があります。その後、API KEYをソフトウェア設定に追加するか、環境変数に設定します（Windowsプラットフォームのみ環境変数からのAPI KEY読み取りをサポート）。

-ソフトウェアが提供するデフォルトの字幕エンジン（アリババクラウド Gummy）を使用するには、アリババクラウド百炼プラットフォームから API KEY を取得し、ローカル環境変数に設定する必要があります。
+**Alibaba Cloudの国際版サービスではGummyモデルを提供していないため、現在中国以外のユーザーはデフォルトの字幕エンジンを使用できません。**

-**アリババクラウドの国際版には Gummy モデルが提供されていないため、中国以外のユーザーは現在、デフォルトの字幕エンジンを使用できません。すべてのユーザーが利用できるように、新しいローカルの字幕エンジンを開発中です。**
-
-アリババクラウドは詳細なチュートリアルを提供していますので、以下のリンクを参照してください：
+この部分についてAlibaba Cloudは詳細なチュートリアルを提供しており、以下を参照できます：

 - [API KEY の取得（中国語）](https://help.aliyun.com/zh/model-studio/get-api-key)
- [環境変数を通じて API Key を設定する（中国語）](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables)
+- [環境変数を通じて API Key を設定（中国語）](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables)

-### macOS でのシステムオーディオ出力の取得方法
+## Voskエンジン使用前の準備
+
+Voskローカル字幕エンジンを使用するには、まず[Vosk Models](https://alphacephei.com/vosk/models)ページから必要なモデルをダウンロードしてください。その後、ダウンロードしたモデルパッケージをローカルに解凍し、対応するモデルフォルダのパスをソフトウェア設定に追加します。現在、Vosk字幕エンジンは字幕の翻訳をサポートしていません。
+
+![](../../assets/media/vosk_ja.png)
+
+## macOS でのシステムオーディオ出力の取得方法

 > [マルチ出力デバイスの設定](https://github.com/ExistentialAudio/BlackHole/wiki/Multi-Output-Device) チュートリアルに基づいて作成

@@ -60,6 +64,8 @@ BlackHoleのインストールが確認できたら、`オーディオ MIDI 設

 これで字幕エンジンがシステムオーディオ出力をキャプチャし、字幕を生成できるようになります。

+## ソフトウェアの使い方
+
 ### 設定の変更

 字幕の設定は3つのカテゴリーに分かれます：一般的な設定、字幕エンジンの設定、字幕スタイルの設定。注意すべき点として、一般的な設定の変更は即座に適用されます。しかし、他の2つの設定については、変更後に該当する設定モジュール右上の「適用」オプションをクリックすることで初めて変更が有効になります。「変更を取り消す」を選択すると、現在の変更は保存されず、前回の状態に戻ります。
@@ -80,9 +86,9 @@ BlackHoleのインストールが確認できたら、`オーディオ MIDI 設

 ## 字幕エンジン

-字幕エンジンとは、実際にはサブプログラムであり、システムの音声入力（録音）または出力（音声再生）のストリーミングデータをリアルタイムで取得し、音声からテキストに変換するモデルを利用して対応する音声の字幕を生成します。生成された字幕はIPC経由で文字列に変換されたJSONデータとして出力され、メインプログラムに返されます。メインプログラムは字幕データを読み取り、処理してウィンドウ上に表示します。
+字幕エンジンとは、システムのオーディオ入力（録音）または出力（再生音）のストリーミングデータをリアルタイムで取得し、音声テキスト変換モデルを呼び出して対応する字幕を生成するサブプログラムです。生成された字幕は JSON 形式の文字列に変換され、標準出力を通じてメインプログラムに渡されます。メインプログラムは字幕データを読み取り、処理した後、ウィンドウに表示します。

-ソフトウェアはデフォルトの字幕エンジンを提供しており、他の字幕エンジンが必要な場合は、カスタムエンジンオプションを開いて他の字幕エンジンを呼び出すことができます（他のエンジンはこのソフトウェアに対して開発する必要があります）。エンジンパスは、あなたのコンピュータ上のカスタム字幕エンジンのパスであり、エンジンコマンドはカスタム字幕エンジンの実行パラメータです。これらの部分は、その字幕エンジンの規則に従って記入する必要があります。
+ソフトウェアには2つのデフォルトの字幕エンジンが用意されています。他の字幕エンジンが必要な場合、カスタムエンジンオプションを有効にすることで呼び出すことができます（他のエンジンはこのソフトウェア向けに特別に開発する必要があります）。エンジンパスはコンピュータ上のカスタム字幕エンジンの場所を指し、エンジンコマンドはカスタム字幕エンジンの実行パラメータを表します。これらは該当する字幕エンジンの規則に従って設定する必要があります。

 ![](../img/02_ja.png)

--- a/docs/user-manual/zh.md
+++ b/docs/user-manual/zh.md
@@ -1,6 +1,6 @@
 # Auto Caption 用户手册

-对应版本：v0.3.0
+对应版本：v0.4.0

 ## 软件简介

@@ -14,21 +14,17 @@ Auto Caption 是一个跨平台的字幕显示软件，能够实时获取系统

 ### 软件缺点

-要使用默认字幕服务需要获取阿里云的 API KEY。
+要使用默认的 Gummy 字幕引擎需要获取阿里云的 API KEY。

 在 macOS 平台获取音频输出需要额外配置。

 软件使用 Electron 构建，因此软件体积不可避免的较大。

-## 软件使用
-
-### 准备阿里云百炼平台 API KEY
+## Gummy 引擎使用前准备

 要使用软件提供的默认字幕引擎（阿里云 Gummy），需要从阿里云百炼平台获取 API KEY，然后将 API KEY 添加到软件设置中或者配置到环境变量中（仅 Windows 平台支持读取环境变量中的 API KEY）。

-![](../../assets/media/api_zh.png)
-
-**国际版的阿里云服务并没有提供 Gummy 模型，因此目前非中国用户无法使用默认字幕引擎。我正在开发新的本地字幕引擎，以确保所有用户都有默认字幕引擎可以使用。**
+**国际版的阿里云服务并没有提供 Gummy 模型，因此目前非中国用户无法使用默认字幕引擎。**

 这部分阿里云提供了详细的教程，可参考：

@@ -36,7 +32,13 @@ Auto Caption 是一个跨平台的字幕显示软件，能够实时获取系统

 - [将 API Key 配置到环境变量](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables)

-### macOS 获取系统音频输出
+## Vosk 引擎使用前准备
+
+如果要使用 Vosk 本地字幕引擎，首先需要在 [Vosk Models](https://alphacephei.com/vosk/models) 页面下载你需要的模型。然后将下载的模型安装包解压到本地，并将对应的模型文件夹的路径添加到软件的设置中。目前 Vosk 字幕引擎还不支持翻译字幕内容。
+
+![](../../assets/media/vosk_zh.png)
+
+## macOS 获取系统音频输出

 > 基于 [Setup Multi-Output Device](https://github.com/ExistentialAudio/BlackHole/wiki/Multi-Output-Device) 教程编写

@@ -60,6 +62,8 @@ brew install blackhole-64ch

 现在字幕引擎就能捕获系统的音频输出并生成字幕了。

+## 软件使用
+
 ### 修改设置

 字幕设置可以分为三类：通用设置、字幕引擎设置、字幕样式设置。需要注意的是，修改通用设置是立即生效的。但是对于其他两类设置，修改后需要点击对应设置模块右上角的“应用”选项，更改才会真正生效。如果点击“取消更改”那么当前修改将不会被保存，而是回退到上次修改的状态。
@@ -80,9 +84,9 @@ brew install blackhole-64ch

 ## 字幕引擎

-所谓的字幕引擎实际上是一个子程序，它会实时获取系统音频输入（录音）或输出（播放声音）的流式数据，并调用音频转文字的模型生成对应音频的字幕。生成的字幕通过 IPC 输出为转换为字符串的 JSON 数据，并返回给主程序。主程序读取字幕数据，处理后显示在窗口上。
+所谓的字幕引擎实际上是一个子程序，它会实时获取系统音频输入（录音）或输出（播放声音）的流式数据，并调用音频转文字的模型生成对应音频的字幕。生成的字幕会转换为 JSON 格式的字符串，并通过标准输出传递给主程序。主程序读取字幕数据，处理后显示在窗口上。

-软件提供了一个默认的字幕引擎，如果你需要其他的字幕引擎，可以通过打开自定义引擎选项来调用其他字幕引擎（其他引擎需要针对该软件进行开发）。其中引擎路径是自定义字幕引擎在你的电脑上的路径，引擎指令是自定义字幕引擎的运行参数，这部分需要按该字幕引擎的规则进行填写。
+软件提供了两个默认的字幕引擎，如果你需要其他的字幕引擎，可以通过打开自定义引擎选项来调用其他字幕引擎（其他引擎需要针对该软件进行开发）。其中引擎路径是自定义字幕引擎在你的电脑上的路径，引擎指令是自定义字幕引擎的运行参数，这部分需要按该字幕引擎的规则进行填写。

 ![](../img/02_zh.png)