docs(readme): update documentation and add terminal usage guide

2026-02-15 04:14:46 +08:00 · 2025-11-02 20:53:56 +08:00
parent e6a65f8362
commit 383e582a2d
8 changed files with 422 additions and 5 deletions
--- a/docs/img/06.png
+++ b/docs/img/06.png
--- a/docs/img/07.png
+++ b/docs/img/07.png
--- a/docs/user-manual/en.md
+++ b/docs/user-manual/en.md
@@ -130,3 +130,175 @@ The software provides two default caption engines. If you need other caption eng
 Note that when using a custom caption engine, all previous caption engine settings will be ineffective, and the configuration of the custom caption engine is entirely done through the engine command.

 If you are a developer and want to develop a custom caption engine, please refer to the [Caption Engine Explanation Document](../engine-manual/en.md).
+
+## Using Caption Engine Standalone
+
+### Runtime Parameter Description
+
+> The following content assumes users have some knowledge of running programs via terminal.
+
+The complete set of runtime parameters available for the caption engine is shown below:
+
+![](../img/06.png)
+
+However, when used standalone, some parameters may not need to be used or should not be modified.
+
+The following parameter descriptions only include necessary parameters.
+
+#### `-e, --caption_engine`
+
+The caption engine model to use. Three options are currently available: `gummy`, `vosk`, and `sosv`.
+
+The default value is `gummy`.
+
+This applies to all models.
+
+#### `-a, --audio_type`
+
+The audio type to recognize, where `0` represents system audio output and `1` represents microphone audio input.
+
+The default value is `0`.
+
+This applies to all models.
+
+#### `-d, --display_caption`
+
+Whether to display captions in the console: `0` hides them and `1` displays them.
+
+The default value is `0`, but `1` is recommended when running the caption engine standalone.
+
+This applies to all models.
+
+#### `-t, --target_language`
+
+> Note that Vosk and SOSV models have poor sentence segmentation, which can make translated content difficult to understand. It's not recommended to use translation with these two models.
+
+Target language for translation. All models support the following translation languages:
+
+- `none` No translation
+- `zh` Simplified Chinese
+- `en` English
+- `ja` Japanese
+- `ko` Korean
+
+Additionally, `vosk` and `sosv` models also support the following translations:
+
+- `de` German
+- `fr` French
+- `ru` Russian
+- `es` Spanish
+- `it` Italian
+
+The default value is `none`.
+
+This applies to all models.
+
+#### `-s, --source_language`
+
+Source language for recognition. Default value is `auto`, meaning no specific source language.
+
+Specifying the source language can improve recognition accuracy to some extent. You can specify the source language using the language codes above.
+
+This only applies to Gummy and SOSV models.
+
+The Gummy model can use all the languages mentioned above, plus Cantonese (`yue`).
+
+The SOSV model supports specifying the following languages: English, Chinese, Japanese, Korean, and Cantonese.
+
+#### `-k, --api_key`
+
+Specify the Alibaba Cloud API KEY required for the `Gummy` model.
+
+Default value is empty.
+
+This only applies to the Gummy model.
+
+#### `-tm, --translation_model`
+
+Specify the translation method for Vosk and SOSV models. Default is `ollama`.
+
+Supported values are:
+
+- `ollama` Use local Ollama model for translation. Users need to install Ollama software and corresponding models
+- `google` Use Google Translate API for translation. No additional configuration needed, but requires network access to Google
+
+This only applies to Vosk and SOSV models.
+
+#### `-omn, --ollama_name`
+
+Specify the Ollama model to call for translation. Default value is empty.
+
+It's recommended to use models with fewer than 1B parameters, such as `qwen2.5:0.5b` or `qwen3:0.6b`.
+
+Users need to download the corresponding model in Ollama to use it properly.
+
+This only applies to Vosk and SOSV models.
+
+#### `-vosk, --vosk_model`
+
+Specify the path to the local folder of the Vosk model to call. Default value is empty.
+
+This only applies to the Vosk model.
+
+#### `-sosv, --sosv_model`
+
+Specify the path to the local folder of the SOSV model to call. Default value is empty.
+
+This only applies to the SOSV model.
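
As a compact summary of the "applies to" notes in the parameter descriptions above, the sketch below maps each engine to the short flags the manual lists for it. The function name `applicable_flags` is hypothetical and not part of the project; it only restates what the documentation says.

```bash
# Hypothetical helper (not part of auto-caption): prints the short flags that
# the manual marks as applicable to the given caption engine.
applicable_flags() {
  common="-e -a -d -t"   # documented as applying to all models
  case "$1" in
    gummy) echo "$common -s -k" ;;
    vosk)  echo "$common -tm -omn -vosk" ;;
    sosv)  echo "$common -s -tm -omn -sosv" ;;
    *)     echo "unknown engine: $1" >&2; return 1 ;;
  esac
}

applicable_flags sosv
```

A flag outside an engine's list (for example `-k` with `vosk`) is described by the manual as not applying to that engine.
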
+
+### Running Caption Engine Using Source Code
+
+> The following content assumes users who use this method have knowledge of Python environment configuration and usage.
+
+First, download the project source code locally. The caption engine source code is located in the `engine` directory of the project. Then configure the Python environment; the project dependencies are listed in the `requirements.txt` file in the `engine` directory.
+
+After configuration, enter the `engine` directory and execute commands to run the caption engine.
+
+For example, to use the Gummy model, specify audio type as system audio output, source language as English, and target language as Chinese, execute the following command:
+
+> Note: For better visualization, the commands below are written on multiple lines. If execution fails, try removing backslashes and executing as a single line command.
+
+```bash
+python main.py \
+-e gummy \
+-k sk-******************************** \
+-a 0 \
+-d 1 \
+-s en \
+-t zh
+```
+
+To specify the Vosk model, audio type as system audio output, translate to English, and use Ollama `qwen3:0.6b` model for translation:
+
+```bash
+python main.py \
+-e vosk \
+-vosk D:\\Projects\\auto-caption\\engine\\models\\vosk-model-small-cn-0.22 \
+-a 0 \
+-d 1 \
+-t en \
+-tm ollama \
+-omn qwen3:0.6b
+```
+
+To specify the SOSV model, audio type as microphone, automatically select source language, and no translation:
+
+```bash
+python main.py \
+-e sosv \
+-sosv D:\\Projects\\auto-caption\\engine\\models\\sosv-int8 \
+-a 1 \
+-d 1 \
+-s auto \
+-t none
+```
+
+The output when running the Gummy model is shown below:
+
+![](../img/07.png)
+
+### Running Caption Engine Executable File
+
+First, download the executable file for your platform from [GitHub Releases](https://github.com/HiMeditator/auto-caption/releases/tag/engine) (currently only Windows and Linux platform executable files are provided).
+
+Then open a terminal in the directory containing the caption engine executable file and execute commands to run the caption engine.
+
+Simply replace `python main.py` in the above commands with the executable file name (for example: `engine-win.exe`).
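
The substitution described above can be sketched as a small dry-run helper that prints the command instead of running it. `ENGINE_LAUNCHER`, `sosv_command`, and the model path are hypothetical names chosen for illustration; the flags are taken from the SOSV example in this manual.

```bash
# ENGINE_LAUNCHER is a hypothetical variable: "python main.py" for a source
# checkout, or the executable name (for example engine-win.exe).
ENGINE_LAUNCHER="${ENGINE_LAUNCHER:-python main.py}"

# Print (do not execute) the SOSV command for the model folder given as $1.
sosv_command() {
  echo "$ENGINE_LAUNCHER -e sosv -sosv $1 -a 1 -d 1 -s auto -t none"
}

sosv_command "models/sosv-int8"   # placeholder model path
```

Setting `ENGINE_LAUNCHER=engine-win.exe` before calling `sosv_command` prints the same flags with the executable name in place of `python main.py`.
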
--- a/docs/user-manual/zh.md
+++ b/docs/user-manual/zh.md
@@ -128,3 +128,175 @@ sudo yum install pulseaudio pavucontrol
 注意使用自定义字幕引擎时，前面的字幕引擎的设置将全部不起作用，自定义字幕引擎的配置完全通过引擎指令进行配置。

 如果你是开发者，想开发自定义字幕引擎，请查看[字幕引擎说明文档](../engine-manual/zh.md)。
+
+## 单独使用字幕引擎
+
+### 运行参数说明
+
+> 以下内容默认用户对使用终端运行程序有一定了解。
+
+字幕引擎可以使用的完整运行参数如下：
+
+![](../img/06.png)
+
+而在单独使用时其中某些参数并不需要使用，或者不适合进行修改。
+
+下面的运行参数说明仅包含必要的参数。
+
+#### `-e, --caption_engine`
+
+需要选择的字幕引擎模型，目前有三个可用，分别为：`gummy, vosk, sosv`。
+
+该项的默认值为 `gummy`。
+
+该项适用于所有模型。
+
+#### `-a, --audio_type`
+
+需要识别的音频类型，其中 `0` 表示系统音频输出，`1` 表示麦克风音频输入。
+
+该项的默认值为 `0`。
+
+该项适用于所有模型。
+
+#### `-d, --display_caption`
+
+是否在控制台显示字幕，`0` 表示不显示，`1` 表示显示。
+
+该项默认值为 `0`，只使用字幕引擎的话建议选 `1`。
+
+该项适用于所有模型。
+
+#### `-t, --target_language`
+
+> 其中 Vosk 和 SOSV 模型分句效果较差，会导致翻译内容难以理解，不太建议这两个模型使用翻译。
+
+需要翻译成的目标语言，所有模型都支持的翻译语言如下：
+
+- `none` 不进行翻译
+- `zh` 简体中文
+- `en` 英语
+- `ja` 日语
+- `ko` 韩语
+
+除此之外 `vosk` 和 `sosv` 模型还支持如下翻译：
+
+- `de` 德语
+- `fr` 法语
+- `ru` 俄语
+- `es` 西班牙语
+- `it` 意大利语
+
+该项的默认值为 `none`。
+
+该项适用于所有模型。
+
+#### `-s, --source_language`
+
+需要识别的语言的源语言，默认值为 `auto`，表示不指定源语言。
+
+但是指定源语言能在一定程度上提高识别准确率，可以使用上面的语言代码指定源语言。
+
+该项仅适用于 Gummy 和 SOSV 模型。
+
+其中 Gummy 模型可以使用上述全部语言，再加上粤语（`yue`）。
+
+而 SOSV 模型支持指定的语言有：英语、中文、日语、韩语、粤语。
+
+#### `-k, --api_key`
+
+指定 `Gummy` 模型需要使用的阿里云 API KEY。
+
+该项默认值为空。
+
+该项仅适用于 Gummy 模型。
+
+#### `-tm, --translation_model`
+
+指定 Vosk 和 SOSV 模型的翻译方式，默认为 `ollama`。
+
+该项支持的值有：
+
+- `ollama` 使用本地 Ollama 模型进行翻译，需要用户安装 Ollama 软件和对应的模型
+- `google` 使用 Google 翻译 API 进行翻译，无需额外配置，但是需要有能访问 Google 的网络
+
+该项仅适用于 Vosk 和 SOSV 模型。
+
+#### `-omn, --ollama_name`
+
+指定需要调用进行翻译的 Ollama 模型。该项默认值为空。
+
+建议使用参数量小于 1B 的模型，比如： `qwen2.5:0.5b`, `qwen3:0.6b`。
+
+用户需要在 Ollama 中下载了对应的模型才能正常使用。
+
+该项仅适用于 Vosk 和 SOSV 模型。
+
+#### `-vosk, --vosk_model`
+
+指定需要调用的 Vosk 模型的本地文件夹的路径。该项默认值为空。
+
+该项仅适用于 Vosk 模型。
+
+#### `-sosv, --sosv_model`
+
+指定需要调用的 SOSV 模型的本地文件夹的路径。该项默认值为空。
+
+该项仅适用于 SOSV 模型。
+
+### 使用源代码运行字幕引擎
+
+> 以下内容默认使用该方式的用户对 Python 环境配置和使用有所了解。
+
+首先下载项目源代码到本地，其中字幕引擎源代码在项目的 `engine` 目录下。然后配置 Python 环境，其中项目依赖的 Python 包在 `engine` 目录下 `requirements.txt` 文件中。
+
+配置好后进入 `engine` 目录，执行命令进行运行字幕引擎。
+
+比如要使用 Gummy 模型，指定音频类型为系统音频输出，源语言为英语，翻译语言为中文，执行的命令如下：
+
+> 注意：为了更直观，下面的命令写在了多行，如果执行失败，尝试去掉反斜杠，并改换单行命令执行。
+
+```bash
+python main.py \
+-e gummy \
+-k sk-******************************** \
+-a 0 \
+-d 1 \
+-s en \
+-t zh
+```
+
+指定 Vosk 模型，指定音频类型为系统音频输出，翻译语言为英语，使用 Ollama `qwen3:0.6b` 模型进行翻译：
+
+```bash
+python main.py \
+-e vosk \
+-vosk D:\\Projects\\auto-caption\\engine\\models\\vosk-model-small-cn-0.22 \
+-a 0 \
+-d 1 \
+-t en \
+-tm ollama \
+-omn qwen3:0.6b
+```
+
+指定 SOSV 模型，指定音频类型为麦克风，自动选择源语言，不翻译，执行的命令如下：
+
+```bash
+python main.py \
+-e sosv \
+-sosv D:\\Projects\\auto-caption\\engine\\models\\sosv-int8 \
+-a 1 \
+-d 1 \
+-s auto \
+-t none
+```
+
+使用 Gummy 模型的运行效果如下：
+
+![](../img/07.png)
+
+### 运行字幕引擎可执行文件
+
+首先在 [GitHub Release](https://github.com/HiMeditator/auto-caption/releases/tag/engine) 中下载对应平台的可执行文件（目前仅提供 Windows 和 Linux 平台的字幕引擎可执行文件）。
+
+然后在字幕引擎可执行文件所在目录打开终端，执行命令运行字幕引擎。
+
+只需要将上述指令中的 `python main.py` 替换为可执行文件名称即可（比如：`engine-win.exe`）。