docs(readme): update description and add terminal usage guide

2026-02-04 04:06:09 +08:00 · 2025-11-02 20:53:56 +08:00
parent e6a65f8362
commit 383e582a2d
8 changed files with 422 additions and 5 deletions
--- a/README.md
+++ b/README.md
@@ -14,7 +14,7 @@
        | <a href="./README_en.md">English</a>
        | <a href="./README_ja.md">日本語</a> |
    </p>
-    <p><i>v1.0.0 版本已经发布，新增 SOSV 本地字幕模型。更多的字幕模型正在尝试开发中...</i></p>
+    <p><i>v1.0.0 版本已经发布，新增 SOSV 本地字幕模型。当前功能已经基本完整，暂无继续开发计划...</i></p>
 </div>

 ![](./assets/media/main_zh.png)
@@ -107,6 +107,29 @@ macOS 平台和 Linux 平台获取系统音频输出需要进行额外设置，

 使用 SOSV 模型的方式和 Vosk 一样，下载地址如下：https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model

+## ⌨️ 在终端中使用
+
+软件采用模块化设计，可以分为软件主体和字幕引擎两部分，软件主体通过图形界面调用字幕引擎。核心的音频获取和音频识别功能都在字幕引擎中实现，而字幕引擎是可以脱离软件主体单独使用的。
+
+字幕引擎使用 Python 开发，通过 PyInstaller 打包为可执行文件。因此字幕引擎有两种使用方式：
+
+1. 使用项目字幕引擎部分的源代码，使用安装了对应库的 Python 环境进行运行
+2. 使用打包好的字幕引擎的可执行文件，通过终端运行
+
+运行参数和详细使用介绍请参考[用户手册](./docs/user-manual/zh.md#单独使用字幕引擎)。
+
+```bash
+python main.py \
+-e gummy \
+-k sk-******************************** \
+-a 0 \
+-d 1 \
+-s en \
+-t zh
+```
+
+![](./docs/img/07.png)
+
 ## ⚙️ 自带字幕引擎说明

 目前软件自带 3 个字幕引擎，正在规划新的引擎。它们的详细信息如下。
--- a/README_en.md
+++ b/README_en.md
@@ -14,7 +14,7 @@
        | <b>English</b>
        | <a href="./README_ja.md">日本語</a> |
    </p>
-    <p><i>Version 1.0.0 has been released, with the addition of the SOSV local caption model. More caption models are being developed...</i></p>
+    <p><i>Version 1.0.0 has been released, with the addition of the SOSV local caption model. The feature set is now largely complete, and no further development is currently planned...</i></p>
 </div>

 ![](./assets/media/main_en.png)
@@ -108,6 +108,29 @@ To use the Vosk local caption engine, first download the model you need from the

 The way to use the SOSV model is the same as Vosk. The download address is as follows: https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model

+## ⌨️ Using in Terminal
+
+The software is modular and consists of two parts: the main application and the caption engine. The main application calls the caption engine through a graphical interface. The core audio capture and speech recognition functions are implemented in the caption engine, which can also be used on its own, without the main application.
+
+The caption engine is written in Python and packaged as an executable with PyInstaller. There are therefore two ways to use it:
+
+1. Run the caption engine from its source code in a Python environment that has the required libraries installed
+2. Run the packaged caption engine executable from a terminal
+
+For runtime parameters and detailed usage instructions, please refer to the [User Manual](./docs/user-manual/en.md#using-caption-engine-standalone).
+
+```bash
+python main.py \
+-e gummy \
+-k sk-******************************** \
+-a 0 \
+-d 1 \
+-s en \
+-t zh
+```
+
+![](./docs/img/07.png)
+
 ## ⚙️ Built-in Subtitle Engines

 Currently, the software comes with 3 caption engines, with new engines under development. Their detailed information is as follows.
--- a/README_ja.md
+++ b/README_ja.md
@@ -14,7 +14,7 @@
        | <a href="./README_en.md">English</a>
        | <b>日本語</b> |
    </p>
-    <p><i>v1.0.0 バージョンがリリースされ、SOSV ローカル字幕モデルが追加されました。より多くの字幕モデルが開発中です...</i></p>
+    <p><i>v1.0.0 バージョンがリリースされ、SOSV ローカル字幕モデルが追加されました。現在の機能は基本的に完了しており、今後の開発計画はありません...</i></p>
 </div>

 ![](./assets/media/main_ja.png)
@@ -109,6 +109,29 @@ Voskローカル字幕エンジンを使用するには、まず[Vosk Models](ht

 SOSVモデルの使用方法はVoskと同じで、ダウンロードアドレスは以下の通りです：https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model

+## ⌨️ ターミナルでの使用
+
+ソフトウェアはモジュール化設計を採用しており、ソフトウェア本体と字幕エンジンの2つの部分に分けることができます。ソフトウェア本体はグラフィカルインターフェースを通じて字幕エンジンを呼び出します。コアとなる音声取得および音声認識機能はすべて字幕エンジンに実装されており、字幕エンジンはソフトウェア本体から独立して単独で使用できます。
+
+字幕エンジンはPythonを使用して開発され、PyInstallerによって実行可能ファイルとしてパッケージ化されます。したがって、字幕エンジンの使用方法は以下の2つがあります：
+
+1. プロジェクトの字幕エンジン部分のソースコードを使用し、必要なライブラリがインストールされたPython環境で実行する
+2. パッケージ化された字幕エンジンの実行可能ファイルをターミナルから実行する
+
+実行引数および詳細な使用方法については、[User Manual](./docs/user-manual/en.md#using-caption-engine-standalone)をご参照ください。
+
+```bash
+python main.py \
+-e gummy \
+-k sk-******************************** \
+-a 0 \
+-d 1 \
+-s en \
+-t zh
+```
+
+![](./docs/img/07.png)
+
 ## ⚙️ 字幕エンジン説明

 現在、ソフトウェアには3つの字幕エンジンが搭載されており、新しいエンジンが計画されています。それらの詳細情報は以下の通りです。
--- a/docs/img/06.png
+++ b/docs/img/06.png
--- a/docs/img/07.png
+++ b/docs/img/07.png
--- a/docs/user-manual/en.md
+++ b/docs/user-manual/en.md
@@ -130,3 +130,175 @@ The software provides two default caption engines. If you need other caption eng
 Note that when using a custom caption engine, all previous caption engine settings will be ineffective, and the configuration of the custom caption engine is entirely done through the engine command.

 If you are a developer and want to develop a custom caption engine, please refer to the [Caption Engine Explanation Document](../engine-manual/en.md).
+
+## Using Caption Engine Standalone
+
+### Runtime Parameter Description
+
+> The following content assumes users have some knowledge of running programs via terminal.
+
+The complete set of runtime parameters available for the caption engine is shown below:
+
+![](../img/06.png)
+
+However, when used standalone, some parameters may not need to be used or should not be modified.
+
+The following parameter descriptions only include necessary parameters.
+
+#### `-e, --caption_engine`
+
+The caption engine model to use. Three options are currently available: `gummy`, `vosk`, and `sosv`.
+
+The default value is `gummy`.
+
+This applies to all models.
+
+#### `-a, --audio_type`
+
+The audio type to recognize, where `0` represents system audio output and `1` represents microphone audio input.
+
+The default value is `0`.
+
+This applies to all models.
+
+#### `-d, --display_caption`
+
+Whether to display captions in the console: `0` means do not display, `1` means display.
+
+The default value is `0`, but it's recommended to choose `1` when using only the caption engine.
+
+This applies to all models.
+
+#### `-t, --target_language`
+
+> Note that Vosk and SOSV models have poor sentence segmentation, which can make translated content difficult to understand. It's not recommended to use translation with these two models.
+
+Target language for translation. All models support the following translation languages:
+
+- `none` No translation
+- `zh` Simplified Chinese
+- `en` English
+- `ja` Japanese
+- `ko` Korean
+
+Additionally, `vosk` and `sosv` models also support the following translations:
+
+- `de` German
+- `fr` French
+- `ru` Russian
+- `es` Spanish
+- `it` Italian
+
+The default value is `none`.
+
+This applies to all models.
+
+#### `-s, --source_language`
+
+Source language for recognition. Default value is `auto`, meaning no specific source language.
+
+Specifying the source language can improve recognition accuracy to some extent. You can specify the source language using the language codes above.
+
+This only applies to Gummy and SOSV models.
+
+The Gummy model can use all the languages mentioned above, plus Cantonese (`yue`).
+
+The SOSV model supports specifying the following languages: English, Chinese, Japanese, Korean, and Cantonese.
+
+#### `-k, --api_key`
+
+Specify the Alibaba Cloud API KEY required for the `Gummy` model.
+
+Default value is empty.
+
+This only applies to the Gummy model.
+
+#### `-tm, --translation_model`
+
+Specify the translation method for Vosk and SOSV models. Default is `ollama`.
+
+Supported values are:
+
+- `ollama` Use local Ollama model for translation. Users need to install Ollama software and corresponding models
+- `google` Use Google Translate API for translation. No additional configuration needed, but requires network access to Google
+
+This only applies to Vosk and SOSV models.
+
+#### `-omn, --ollama_name`
+
+Specify the Ollama model to call for translation. Default value is empty.
+
+It's recommended to use models with fewer than 1B parameters, such as `qwen2.5:0.5b` or `qwen3:0.6b`.
+
+Users need to download the corresponding model in Ollama to use it properly.
+
+This only applies to Vosk and SOSV models.
+
+#### `-vosk, --vosk_model`
+
+Specify the path to the local folder of the Vosk model to call. Default value is empty.
+
+This only applies to the Vosk model.
+
+#### `-sosv, --sosv_model`
+
+Specify the path to the local folder of the SOSV model to call. Default value is empty.
+
+This only applies to the SOSV model.
+
+### Running Caption Engine Using Source Code
+
+> The following content assumes users who use this method have knowledge of Python environment configuration and usage.
+
+First, download the project source code locally. The caption engine source code is located in the `engine` directory of the project. Then set up a Python environment; the project dependencies are listed in the `requirements.txt` file in the `engine` directory.
+
+After configuration, enter the `engine` directory and execute commands to run the caption engine.
+
+For example, to use the Gummy model with system audio output as the audio type, English as the source language, and Chinese as the target language, run the following command:
+
+> Note: For better visualization, the commands below are written on multiple lines. If execution fails, try removing backslashes and executing as a single line command.
+
+```bash
+python main.py \
+-e gummy \
+-k sk-******************************** \
+-a 0 \
+-d 1 \
+-s en \
+-t zh
+```
+
+To use the Vosk model with system audio output, translating into English with the Ollama `qwen3:0.6b` model:
+
+```bash
+python main.py \
+-e vosk \
+-vosk D:\\Projects\\auto-caption\\engine\\models\\vosk-model-small-cn-0.22 \
+-a 0 \
+-d 1 \
+-t en \
+-tm ollama \
+-omn qwen3:0.6b
+```
+
+To use the SOSV model with microphone input, automatic source language selection, and no translation:
+
+```bash
+python main.py \
+-e sosv \
+-sosv D:\\Projects\\auto-caption\\engine\\models\\sosv-int8 \
+-a 1 \
+-d 1 \
+-s auto \
+-t none
+```
+
+The output when running the Gummy model is shown below:
+
+![](../img/07.png)
+
+### Running the Caption Engine Executable
+
+First, download the executable file for your platform from [GitHub Releases](https://github.com/HiMeditator/auto-caption/releases/tag/engine) (executables are currently provided only for Windows and Linux).
+
+Then open a terminal in the directory containing the caption engine executable file and execute commands to run the caption engine.
+
+Simply replace `python main.py` in the above commands with the executable file name (for example: `engine-win.exe`).
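
The per-engine parameter rules in the manual above can be summarized in a small helper. This is a minimal sketch and not part of the project: `build_engine_command` and `ENGINE_ONLY_FLAGS` are hypothetical names, and it uses only the flags and applicability notes documented above. It assembles the argument list and rejects flags that the manual marks as not applying to the selected engine.

```python
import shlex

# Flags the manual marks as applying only to certain engines.
# Flags not listed here (-a, -d, -t) apply to all engines.
ENGINE_ONLY_FLAGS = {
    "-s": {"gummy", "sosv"},
    "-k": {"gummy"},
    "-tm": {"vosk", "sosv"},
    "-omn": {"vosk", "sosv"},
    "-vosk": {"vosk"},
    "-sosv": {"sosv"},
}


def build_engine_command(engine, options, launcher=("python", "main.py")):
    """Return the argument list for one caption engine run.

    `options` maps short flags (e.g. "-a") to values. `launcher` can be
    swapped for the packaged executable, e.g. ("engine-win.exe",).
    """
    if engine not in {"gummy", "vosk", "sosv"}:
        raise ValueError(f"unknown caption engine: {engine}")
    args = [*launcher, "-e", engine]
    for flag, value in options.items():
        allowed = ENGINE_ONLY_FLAGS.get(flag)
        if allowed is not None and engine not in allowed:
            raise ValueError(f"{flag} does not apply to the {engine} engine")
        args += [flag, str(value)]
    return args


if __name__ == "__main__":
    cmd = build_engine_command(
        "vosk",
        {"-a": 0, "-d": 1, "-t": "en", "-tm": "ollama", "-omn": "qwen3:0.6b"},
    )
    # Print a shell-quoted, single-line form of the command.
    print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run`, which avoids the multi-line backslash issue mentioned in the note above.
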
--- a/docs/user-manual/zh.md
+++ b/docs/user-manual/zh.md
@@ -128,3 +128,175 @@ sudo yum install pulseaudio pavucontrol
 注意使用自定义字幕引擎时，前面的字幕引擎的设置将全部不起作用，自定义字幕引擎的配置完全通过引擎指令进行配置。

 如果你是开发者，想开发自定义字幕引擎，请查看[字幕引擎说明文档](../engine-manual/zh.md)。
+
+## 单独使用字幕引擎
+
+### 运行参数说明
+
+> 以下内容默认用户对使用终端运行程序有一定了解。
+
+字幕引擎可以使用的完整运行参数如下：
+
+![](../img/06.png)
+
+而在单独使用时其中某些参数并不需要使用，或者不适合进行修改。
+
+下面的运行参数说明仅包含必要的参数。
+
+#### `-e, --caption_engine`
+
+需要选择的字幕引擎模型，目前有三个可用，分别为：`gummy, vosk, sosv`。
+
+该项的默认值为 `gummy`。
+
+该项适用于所有模型。
+
+#### `-a, --audio_type`
+
+需要识别的音频类型，其中 `0` 表示系统音频输出，`1` 表示麦克风音频输入。
+
+该项的默认值为 `0`。
+
+该项适用于所有模型。
+
+#### `-d, --display_caption`
+
+是否在控制台显示字幕，`0` 表示不显示，`1` 表示显示。
+
+该项默认值为 `0`，只使用字幕引擎的话建议选 `1`。
+
+该项适用于所有模型。
+
+#### `-t, --target_language`
+
+> 其中 Vosk 和 SOSV 模型分句效果较差，会导致翻译内容难以理解，不太建议这两个模型使用翻译。
+
+需要翻译成的目标语言，所有模型都支持的翻译语言如下：
+
+- `none` 不进行翻译
+- `zh` 简体中文
+- `en` 英语
+- `ja` 日语
+- `ko` 韩语
+
+除此之外 `vosk` 和 `sosv` 模型还支持如下翻译：
+
+- `de` 德语
+- `fr` 法语
+- `ru` 俄语
+- `es` 西班牙语
+- `it` 意大利语
+
+该项的默认值为 `none`。
+
+该项适用于所有模型。
+
+#### `-s, --source_language`
+
+需要识别的语言的源语言，默认值为 `auto`，表示不指定源语言。
+
+但是指定源语言能在一定程度上提高识别准确率，可以使用上面的语言代码指定源语言。
+
+该项仅适用于 Gummy 和 SOSV 模型。
+
+其中 Gummy 模型可以使用上述全部的语言，再加上粤语（`yue`）。
+
+而 SOSV 模型支持指定的语言有：英语、中文、日语、韩语、粤语。
+
+#### `-k, --api_key`
+
+指定 `Gummy` 模型需要使用的阿里云 API KEY。
+
+该项默认值为空。
+
+该项仅适用于 Gummy 模型。
+
+#### `-tm, --translation_model`
+
+指定 Vosk 和 SOSV 模型的翻译方式，默认为 `ollama`。
+
+该项支持的值有：
+
+- `ollama` 使用本地 Ollama 模型进行翻译，需要用户安装 Ollama 软件和对应的模型
+- `google` 使用 Google 翻译 API 进行翻译，无需额外配置，但是需要有能访问 Google 的网络
+
+该项仅适用于 Vosk 和 SOSV 模型。
+
+#### `-omn, --ollama_name`
+
+指定需要调用进行翻译的 Ollama 模型。该项默认值为空。
+
+建议使用参数量小于 1B 的模型，比如： `qwen2.5:0.5b`, `qwen3:0.6b`。
+
+用户需要在 Ollama 中下载了对应的模型才能正常使用。
+
+该项仅适用于 Vosk 和 SOSV 模型。
+
+#### `-vosk, --vosk_model`
+
+指定需要调用的 Vosk 模型的本地文件夹的路径。该项默认值为空。
+
+该项仅适用于 Vosk 模型。
+
+#### `-sosv, --sosv_model`
+
+指定需要调用的 SOSV 模型的本地文件夹的路径。该项默认值为空。
+
+该项仅适用于 SOSV 模型。
+
+### 使用源代码运行字幕引擎
+
+> 以下内容默认使用该方式的用户对 Python 环境配置和使用有所了解。
+
+首先下载项目源代码到本地，其中字幕引擎源代码在项目的 `engine` 目录下。然后配置 Python 环境，其中项目依赖的 Python 包在 `engine` 目录下 `requirements.txt` 文件中。
+
+配置好后进入 `engine` 目录，执行命令进行运行字幕引擎。
+
+比如要使用 Gummy 模型，指定音频类型为系统音频输出，源语言为英语，翻译语言为中文，执行的命令如下：
+
+> 注意：为了更直观，下面的命令写在了多行，如果执行失败，尝试去掉反斜杠，并改换单行命令执行。
+
+```bash
+python main.py \
+-e gummy \
+-k sk-******************************** \
+-a 0 \
+-d 1 \
+-s en \
+-t zh
+```
+
+指定 Vosk 模型，指定音频类型为系统音频输出，翻译语言为英语，使用 Ollama `qwen3:0.6b` 模型进行翻译：
+
+```bash
+python main.py \
+-e vosk \
+-vosk D:\\Projects\\auto-caption\\engine\\models\\vosk-model-small-cn-0.22 \
+-a 0 \
+-d 1 \
+-t en \
+-tm ollama \
+-omn qwen3:0.6b
+```
+
+指定 SOSV 模型，指定音频类型为麦克风，自动选择源语言，不翻译，执行的命令如下：
+
+```bash
+python main.py \
+-e sosv \
+-sosv D:\\Projects\\auto-caption\\engine\\models\\sosv-int8 \
+-a 1 \
+-d 1 \
+-s auto \
+-t none
+```
+
+使用 Gummy 模型的运行效果如下：
+
+![](../img/07.png)
+
+### 运行字幕引擎可执行文件
+
+首先在 [GitHub Release](https://github.com/HiMeditator/auto-caption/releases/tag/engine) 中下载对应平台的可执行文件（目前仅提供 Windows 和 Linux 平台的字幕引擎可执行文件）。
+
+然后在字幕引擎可执行文件所在目录打开终端，执行命令运行字幕引擎。
+
+只需要将上述指令中的 `python main.py` 替换为可执行文件名称即可（比如：`engine-win.exe`）。
--- a/engine/utils/sysout.py
+++ b/engine/utils/sysout.py
@@ -29,14 +29,18 @@ def caption_display(obj):
    if caption_index >=0 and caption_index != int(obj['index']):
        display.finalize_current_sentence()
    caption_index = int(obj['index'])
-    full_text = f"{obj['text']} {obj['translation']}"
+    full_text = f"{obj['text']}\n{obj['translation']}"
+    if obj['translation']:
+        full_text += "\n"
    display.update_text(full_text)
    display.display()

 def translation_display(obj):
    global original_caption
    global display
-    full_text = f"{obj['text']} {obj['translation']}"
+    full_text = f"{obj['text']}\n{obj['translation']}"
+    if obj['translation']:
+        full_text += "\n"
    display.update_text(full_text)
    display.display()
    display.finalize_current_sentence()
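
The effect of the `sysout.py` change can be checked in isolation. The sketch below is not part of the commit: `format_caption` is a hypothetical helper that reproduces only the new string-building logic shared by `caption_display` and `translation_display`, assuming `obj` carries `text` and `translation` string keys as in the diff.

```python
def format_caption(obj):
    """Build the console text the way the updated sysout.py does."""
    # Recognized text and translation now sit on separate lines
    # instead of being joined by a single space.
    full_text = f"{obj['text']}\n{obj['translation']}"
    # A trailing newline is appended only when a translation is present.
    if obj['translation']:
        full_text += "\n"
    return full_text


if __name__ == "__main__":
    # repr() makes the newline placement visible.
    print(repr(format_caption({"text": "hello", "translation": "bonjour"})))
    print(repr(format_caption({"text": "hello", "translation": ""})))
```

With an empty translation the result still ends in one newline, which comes from the f-string separator rather than from the conditional.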