12 Commits

Author SHA1 Message Date
himeditator
36636d0caa feat(engine): add caption window width memory and improve caption engine shutdown logic
- Add a captionWindowWidth property to persist the caption window width
- Rework the stop and kill methods of CaptionEngine to improve engine shutdown logic
- Update the README with the list of candidate models
2025-08-02 15:40:13 +08:00
himeditator mac
a7a60da260 fix(engine): adapt the caption engine startup path and the audio resampling function 2025-07-30 00:16:54 +08:00
himeditator
1b7ff33656 feat(docs): update project documentation and images 2025-07-29 23:20:15 +08:00
himeditator mac
d5d692188e feat(engine): improve the caption engine and program robustness
- Improve the server startup flow and add exception handling
- Generate the WebSocket port numbers of the main program and the caption engine randomly
2025-07-29 19:37:03 +08:00
himeditator
e4f937e6b6 feat(engine): improve caption engine communication and control logic, improve window information display
- Improve error handling and engine restart logic
- Add forced termination of the caption engine
- Adjust the display position of notifications and error messages
- Increase logging precision to the millisecond level
2025-07-28 21:44:49 +08:00
himeditator
cd9f3a847d feat(engine): refactor the caption engine and implement WebSocket communication
- Refactor the Gummy and Vosk caption engine code to improve extensibility and readability
- Merge the Gummy and Vosk engines into a single executable
- Implement WebSocket communication between the caption engine and the main program to avoid orphan processes
2025-07-28 15:49:52 +08:00
himeditator
b658ef5440 feat(engine): improve the caption engine output format and prepare to merge the two caption engines
- Refactor caption-engine-related code
- Prepare to merge the two caption engines
2025-07-27 17:15:12 +08:00
himeditator
3792eb88b6 refactor(engine): refactor the caption engine
- Update the GummyTranslator class and improve caption generation logic
- Remove the audioprcs module and move audio processing into the utils module
- Refactor the sysaudio module for more flexible and stable audio stream management
- Update TODO.md: sorting caption records in descending time order is done
- Update the docs to note that the English and Japanese documents will no longer be maintained due to limited resources
2025-07-26 23:37:24 +08:00
himeditator
8e575a9ba3 refactor(engine): rename the caption engine folder, add descending sort for caption records
- The caption record table can be sorted in descending time order
- Rename caption-engine to engine
- Update the paths of related files and folders
- Update the relevant content in the README and TODO documents
- Update the Electron build configuration
2025-07-26 21:29:16 +08:00
himeditator
697488ce84 docs: update README, add TODO 2025-07-20 00:32:57 +08:00
himeditator
f7d2df938d fix(engine): fix issues with custom caption engines 2025-07-17 20:52:27 +08:00
himeditator
5513c7e84c docs(compatibility): add Kylin OS support and update documentation 2025-07-16 20:55:03 +08:00
72 changed files with 1522 additions and 2352 deletions

8
.gitignore vendored
View File

@@ -5,8 +5,8 @@ out
.eslintcache
*.log*
__pycache__
subenv
caption-engine/build
caption-engine/models
output.wav
.venv
subenv
engine/build
engine/models
engine/notebook

View File

@@ -9,6 +9,6 @@
"editor.defaultFormatter": "esbenp.prettier-vscode"
},
"python.analysis.extraPaths": [
"./caption-engine"
"./engine"
]
}

View File

@@ -3,12 +3,8 @@
<h1 align="center">auto-caption</h1>
<p>Auto Caption 是一个跨平台的实时字幕显示软件。</p>
<p>
<a href="https://github.com/HiMeditator/auto-caption/releases">
<img src="https://img.shields.io/badge/release-0.5.0-blue">
</a>
<a href="https://github.com/HiMeditator/auto-caption/issues">
<img src="https://img.shields.io/github/issues/HiMeditator/auto-caption?color=orange">
</a>
<a href="https://github.com/HiMeditator/auto-caption/releases"><img src="https://img.shields.io/badge/release-0.6.0-blue"></a>
<a href="https://github.com/HiMeditator/auto-caption/issues"><img src="https://img.shields.io/github/issues/HiMeditator/auto-caption?color=orange"></a>
<img src="https://img.shields.io/github/languages/top/HiMeditator/auto-caption?color=royalblue">
<img src="https://img.shields.io/github/repo-size/HiMeditator/auto-caption?color=green">
<img src="https://img.shields.io/github/stars/HiMeditator/auto-caption?style=social">
@@ -18,7 +14,7 @@
| <a href="./README_en.md">English</a>
| <a href="./README_ja.md">日本語</a> |
</p>
<p><i>v0.5.0 版本已经发布。<b>目前 Vosk 本地字幕引擎效果较差,且不含翻译</b>,更优秀的字幕引擎正在尝试开发中...</i></p>
<p><i>v0.6.0 版本已经发布,对字幕引擎代码进行了大重构,提升了代码的可扩展性。更多的字幕引擎正在尝试开发中...</i></p>
</div>
![](./assets/media/main_zh.png)
@@ -33,7 +29,9 @@
[字幕引擎说明文档](./docs/engine-manual/zh.md)
[项目 API 文档](./docs/api-docs/electron-ipc.md)
[项目 API 文档](./docs/api-docs/)
[更新日志](./docs/CHANGELOG.md)
## 📖 基本使用
@@ -45,6 +43,7 @@
| macOS Sequoia 15.5 | arm64 | ✅需要额外配置 | ✅ |
| Ubuntu 24.04.2 | x64 | ✅ | ✅ |
| Kali Linux 2022.3 | x64 | ✅ | ✅ |
| Kylin Server V10 SP3 | x64 | ✅ | ✅ |
macOS 平台和 Linux 平台获取系统音频输出需要进行额外设置,详见[Auto Caption 用户手册](./docs/user-manual/zh.md)。
@@ -74,7 +73,7 @@ macOS 平台和 Linux 平台获取系统音频输出需要进行额外设置,
## ⚙️ 自带字幕引擎说明
目前软件自带 2 个字幕引擎,正在规划 1 个新的引擎。它们的详细信息如下。
目前软件自带 2 个字幕引擎,正在规划新的引擎。它们的详细信息如下。
### Gummy 字幕引擎(云端)
@@ -105,9 +104,15 @@ $$
基于 [vosk-api](https://github.com/alphacep/vosk-api) 开发。目前只支持生成音频对应的原文,不支持生成翻译内容。
### FunASR 字幕引擎(本地)
### 新规划字幕引擎
以下为备选模型,将根据模型效果和集成难易程度选择。
- [faster-whisper](https://github.com/SYSTRAN/faster-whisper)
- [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx)
- [SenseVoice](https://github.com/FunAudioLLM/SenseVoice)
- [FunASR](https://github.com/modelscope/FunASR)
如果可行,将基于 [FunASR](https://github.com/modelscope/FunASR) 进行开发。还未进行调研和可行性验证。
## 🚀 项目运行
@@ -121,10 +126,10 @@ npm install
### 构建字幕引擎
首先进入 `caption-engine` 文件夹,执行如下指令创建虚拟环境:
首先进入 `engine` 文件夹,执行如下指令创建虚拟环境(需要使用大于等于 Python 3.10 的 Python 运行环境,建议使用 Python 3.12):
```bash
# in ./caption-engine folder
# in ./engine folder
python -m venv subenv
# or
python3 -m venv subenv
@@ -139,7 +144,7 @@ subenv/Scripts/activate
source subenv/bin/activate
```
然后安装依赖(这一步可能会报错,一般是因为构建失败,需要根据报错信息安装对应的工具包):
然后安装依赖(这一步在 macOS 和 Linux 可能会报错,一般是因为构建失败,需要根据报错信息进行处理):
```bash
# Windows
@@ -150,7 +155,7 @@ pip install -r requirements_darwin.txt
pip install -r requirements_linux.txt
```
如果在 Linux 系统上安装 samplerate 模块报错,可以尝试使用以下命令单独安装:
如果在 Linux 系统上安装 `samplerate` 模块报错,可以尝试使用以下命令单独安装:
```bash
pip install samplerate --only-binary=:all:
@@ -159,11 +164,10 @@ pip install samplerate --only-binary=:all:
然后使用 `pyinstaller` 构建项目:
```bash
pyinstaller ./main-gummy.spec
pyinstaller ./main-vosk.spec
pyinstaller ./main.spec
```
注意 `main-vosk.spec` 文件中 `vosk` 库的路径可能不正确,需要根据实际状况配置。
注意 `main.spec` 文件中 `vosk` 库的路径可能不正确,需要根据实际状况配置(与 Python 环境的版本相关)
```
# Windows
@@ -172,7 +176,7 @@ vosk_path = str(Path('./subenv/Lib/site-packages/vosk').resolve())
vosk_path = str(Path('./subenv/lib/python3.x/site-packages/vosk').resolve())
```
此时项目构建完成,在进入 `caption-engine/dist` 文件夹可见对应的可执行文件。即可进行后续操作。
此时项目构建完成,在进入 `engine/dist` 文件夹可见对应的可执行文件。即可进行后续操作。
### 运行项目
@@ -182,8 +186,6 @@ npm run dev
### 构建项目
注意目前软件只在 Windows 和 macOS 平台上进行了构建和测试,无法保证软件在 Linux 平台下的正确性。
```bash
# For windows
npm run build:win
@@ -198,13 +200,9 @@ npm run build:linux
```yml
extraResources:
# For Windows
- from: ./caption-engine/dist/main-gummy.exe
to: ./caption-engine/main-gummy.exe
- from: ./caption-engine/dist/main-vosk.exe
to: ./caption-engine/main-vosk.exe
- from: ./engine/dist/main.exe
to: ./engine/main.exe
# For macOS and Linux
# - from: ./caption-engine/dist/main-gummy
# to: ./caption-engine/main-gummy
# - from: ./caption-engine/dist/main-vosk
# to: ./caption-engine/main-vosk
# - from: ./engine/dist/main
# to: ./engine/main
```

View File

@@ -3,12 +3,8 @@
<h1 align="center">auto-caption</h1>
<p>Auto Caption is a cross-platform real-time caption display software.</p>
<p>
<a href="https://github.com/HiMeditator/auto-caption/releases">
<img src="https://img.shields.io/badge/release-0.5.0-blue">
</a>
<a href="https://github.com/HiMeditator/auto-caption/issues">
<img src="https://img.shields.io/github/issues/HiMeditator/auto-caption?color=orange">
</a>
<a href="https://github.com/HiMeditator/auto-caption/releases"><img src="https://img.shields.io/badge/release-0.6.0-blue"></a>
<a href="https://github.com/HiMeditator/auto-caption/issues"><img src="https://img.shields.io/github/issues/HiMeditator/auto-caption?color=orange"></a>
<img src="https://img.shields.io/github/languages/top/HiMeditator/auto-caption?color=royalblue">
<img src="https://img.shields.io/github/repo-size/HiMeditator/auto-caption?color=green">
<img src="https://img.shields.io/github/stars/HiMeditator/auto-caption?style=social">
@@ -18,7 +14,7 @@
| <b>English</b>
| <a href="./README_ja.md">日本語</a> |
</p>
<p><i>Version v0.5.0 has been released. <b>The current Vosk local caption engine performs poorly and does not include translation</b>. A better caption engine is under development...</i></p>
<p><i>Version 0.6.0 has been released, featuring a major refactor of the subtitle engine code to improve code extensibility. More subtitle engines are being developed...</i></p>
</div>
![](./assets/media/main_en.png)
@@ -33,7 +29,9 @@
[Caption Engine Documentation](./docs/engine-manual/en.md)
[Project API Documentation (Chinese)](./docs/api-docs/electron-ipc.md)
[Project API Documentation (Chinese)](./docs/api-docs/)
[Changelog](./docs/CHANGELOG.md)
## 📖 Basic Usage
@@ -45,6 +43,7 @@ The software has been adapted for Windows, macOS, and Linux platforms. The teste
| macOS Sequoia 15.5 | arm64 | ✅ Additional config required | ✅ |
| Ubuntu 24.04.2 | x64 | ✅ | ✅ |
| Kali Linux 2022.3 | x64 | ✅ | ✅ |
| Kylin Server V10 SP3 | x64 | ✅ | ✅ |
Additional configuration is required to capture system audio output on macOS and Linux platforms. See [Auto Caption User Manual](./docs/user-manual/en.md) for details.
@@ -74,7 +73,7 @@ To use the Vosk local caption engine, first download your required model from [V
## ⚙️ Built-in Subtitle Engines
Currently, the software comes with 2 subtitle engines, with 1 new engine planned. Details are as follows.
Currently, the software comes with 2 subtitle engines, with new engines under development. Their detailed information is as follows.
### Gummy Subtitle Engine (Cloud)
@@ -105,9 +104,14 @@ The engine only uploads data when receiving audio streams, so the actual upload
Developed based on [vosk-api](https://github.com/alphacep/vosk-api). Currently only supports generating original text from audio, does not support translation content.
### FunASR Subtitle Engine (Local)
### Planned New Subtitle Engines
If feasible, will be developed based on [FunASR](https://github.com/modelscope/FunASR). Not yet researched or verified for feasibility.
The following are candidate models that will be selected based on model performance and ease of integration.
- [faster-whisper](https://github.com/SYSTRAN/faster-whisper)
- [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx)
- [SenseVoice](https://github.com/FunAudioLLM/SenseVoice)
- [FunASR](https://github.com/modelscope/FunASR)
## 🚀 Project Setup
@@ -121,10 +125,10 @@ npm install
### Build Subtitle Engine
First enter the `caption-engine` folder and execute the following commands to create a virtual environment:
First enter the `engine` folder and execute the following commands to create a virtual environment (requires Python 3.10 or higher, with Python 3.12 recommended):
```bash
# in ./caption-engine folder
# in ./engine folder
python -m venv subenv
# or
python3 -m venv subenv
@@ -139,7 +143,7 @@ subenv/Scripts/activate
source subenv/bin/activate
```
Then install dependencies (this step may fail, usually due to build failures - you'll need to install the corresponding tool packages based on the error messages):
Then install dependencies (this step might result in errors on macOS and Linux, usually due to build failures, and you need to handle them based on the error messages):
```bash
# Windows
@@ -159,11 +163,10 @@ pip install samplerate --only-binary=:all:
Then use `pyinstaller` to build the project:
```bash
pyinstaller ./main-gummy.spec
pyinstaller ./main-vosk.spec
pyinstaller ./main.spec
```
Note that the path to the `vosk` library in `main-vosk.spec` might be incorrect and needs to be configured according to the actual situation.
Note that the path to the `vosk` library in `main.spec` might be incorrect and needs to be configured according to the actual situation (it depends on the version of the Python environment).
```
# Windows
@@ -172,7 +175,7 @@ vosk_path = str(Path('./subenv/Lib/site-packages/vosk').resolve())
vosk_path = str(Path('./subenv/lib/python3.x/site-packages/vosk').resolve())
```
After the build completes, you can find the executable file in the `caption-engine/dist` folder. Then proceed with subsequent operations.
After the build completes, you can find the executable file in the `engine/dist` folder. Then proceed with subsequent operations.
### Run Project
@@ -182,8 +185,6 @@ npm run dev
### Build Project
Note: Currently the software has only been built and tested on Windows and macOS platforms. Correct operation on Linux platform is not guaranteed.
```bash
# For windows
npm run build:win
@@ -198,13 +199,9 @@ Note: You need to modify the configuration content in the `electron-builder.yml`
```yml
extraResources:
# For Windows
- from: ./caption-engine/dist/main-gummy.exe
to: ./caption-engine/main-gummy.exe
- from: ./caption-engine/dist/main-vosk.exe
to: ./caption-engine/main-vosk.exe
- from: ./engine/dist/main.exe
to: ./engine/main.exe
# For macOS and Linux
# - from: ./caption-engine/dist/main-gummy
# to: ./caption-engine/main-gummy
# - from: ./caption-engine/dist/main-vosk
# to: ./caption-engine/main-vosk
```
# - from: ./engine/dist/main
# to: ./engine/main
```

View File

@@ -3,12 +3,8 @@
<h1 align="center">auto-caption</h1>
<p>Auto Caption はクロスプラットフォームのリアルタイム字幕表示ソフトウェアです。</p>
<p>
<a href="https://github.com/HiMeditator/auto-caption/releases">
<img src="https://img.shields.io/badge/release-0.5.0-blue">
</a>
<a href="https://github.com/HiMeditator/auto-caption/issues">
<img src="https://img.shields.io/github/issues/HiMeditator/auto-caption?color=orange">
</a>
<a href="https://github.com/HiMeditator/auto-caption/releases"><img src="https://img.shields.io/badge/release-0.6.0-blue"></a>
<a href="https://github.com/HiMeditator/auto-caption/issues"><img src="https://img.shields.io/github/issues/HiMeditator/auto-caption?color=orange"></a>
<img src="https://img.shields.io/github/languages/top/HiMeditator/auto-caption?color=royalblue">
<img src="https://img.shields.io/github/repo-size/HiMeditator/auto-caption?color=green">
<img src="https://img.shields.io/github/stars/HiMeditator/auto-caption?style=social">
@@ -18,7 +14,7 @@
| <a href="./README_en.md">English</a>
| <b>日本語</b> |
</p>
<p><i>バージョン v0.5.0 がリリースされました。<b>現在の Vosk ローカル字幕エンジンは性能が低く、翻訳機能も含まれていません</b>。より優れた字幕エンジン開発中です...</i></p>
<p><i>v0.6.0 バージョンがリリースされ、字幕エンジンコードが大規模にリファクタリングされ、コードの拡張性が向上しました。より多くの字幕エンジンの開発が試みられています...</i></p>
</div>
![](./assets/media/main_ja.png)
@@ -33,7 +29,9 @@
[字幕エンジン説明ドキュメント](./docs/engine-manual/ja.md)
[プロジェクト API ドキュメント(中国語)](./docs/api-docs/electron-ipc.md)
[プロジェクト API ドキュメント(中国語)](./docs/api-docs/)
[更新履歴](./docs/CHANGELOG.md)
## 📖 基本使い方
@@ -45,6 +43,7 @@
| macOS Sequoia 15.5 | arm64 | ✅ 追加設定が必要 | ✅ |
| Ubuntu 24.04.2 | x64 | ✅ | ✅ |
| Kali Linux 2022.3 | x64 | ✅ | ✅ |
| Kylin Server V10 SP3 | x64 | ✅ | ✅ |
macOSおよびLinuxプラットフォームでシステムオーディオ出力を取得するには追加設定が必要です。詳細は[Auto Captionユーザーマニュアル](./docs/user-manual/ja.md)をご覧ください。
@@ -74,7 +73,7 @@ Vosk ローカル字幕エンジンを使用するには、まず [Vosk Models](
## ⚙️ 字幕エンジン説明
現在ソフトウェアには2つの字幕エンジンが組み込まれており、1つの新しいエンジン計画中です。詳細は以下の通りです。
現在ソフトウェアには2つの字幕エンジンが搭載されており、新しいエンジンが計画されています。それらの詳細情報は以下の通りです。
### Gummy 字幕エンジン(クラウド)
@@ -105,9 +104,14 @@ $$
[vosk-api](https://github.com/alphacep/vosk-api) をベースに開発されています。現在は音声に対応する原文の生成のみをサポートしており、翻訳コンテンツはサポートしていません。
### FunASR字幕エンジン(ローカル)
### 新規計画字幕エンジン
可能であれば、[FunASR](https://github.com/modelscope/FunASR) をベースに開発予定です。まだ調査と実現可能性の検証を行っていません
以下は候補モデルであり、モデルの性能と統合の容易さに基づいて選択されます
- [faster-whisper](https://github.com/SYSTRAN/faster-whisper)
- [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx)
- [SenseVoice](https://github.com/FunAudioLLM/SenseVoice)
- [FunASR](https://github.com/modelscope/FunASR)
## 🚀 プロジェクト実行
@@ -121,10 +125,10 @@ npm install
### 字幕エンジンの構築
まず `caption-engine` フォルダに入り、以下のコマンドを実行して仮想環境を作成します:
まず `engine` フォルダに入り、以下のコマンドを実行して仮想環境を作成します(Python 3.10 以上が必要で、Python 3.12 が推奨されます):
```bash
# ./caption-engine フォルダ内
# ./engine フォルダ内
python -m venv subenv
# または
python3 -m venv subenv
@@ -139,7 +143,7 @@ subenv/Scripts/activate
source subenv/bin/activate
```
次に依存関係をインストールします(このステップは失敗する可能性があります通常はビルド失敗が原因です - エラーメッセージに基づいて対応するツールパッケージをインストールする必要があります):
次に依存関係をインストールします(このステップでは macOS と Linux でエラーが発生する可能性があります。通常はビルド失敗によるもので、エラーメッセージに基づいて対処する必要があります):
```bash
# Windows
@@ -150,7 +154,7 @@ pip install -r requirements_darwin.txt
pip install -r requirements_linux.txt
```
Linuxシステムで`samplerate`モジュールのインストールに問題が発生した場合、以下のコマンドで個別にインストールを試すことができます:
Linux システムで `samplerate` モジュールのインストールに問題が発生した場合、以下のコマンドで個別にインストールを試すことができます:
```bash
pip install samplerate --only-binary=:all:
@@ -159,20 +163,19 @@ pip install samplerate --only-binary=:all:
その後、`pyinstaller` を使用してプロジェクトをビルドします:
```bash
pyinstaller ./main-gummy.spec
pyinstaller ./main-vosk.spec
pyinstaller ./main.spec
```
`main-vosk.spec` ファイル内の `vosk` ライブラリのパスが正しくない可能性があるため、実際の状況に応じて設定する必要があります。
`main.spec` ファイル内の `vosk` ライブラリのパスが正しくない可能性があるため、実際の状況(Python 環境のバージョンに関連)に応じて設定する必要があります。
```
# Windows
vosk_path = str(Path('./subenv/Lib/site-packages/vosk').resolve())
# LinuxまたはmacOS
# Linux または macOS
vosk_path = str(Path('./subenv/lib/python3.x/site-packages/vosk').resolve())
```
これでプロジェクトのビルドが完了し、`caption-engine/dist` フォルダ内に対応する実行可能ファイルが確認できます。その後、次の操作に進むことができます。
これでプロジェクトのビルドが完了し、`engine/dist` フォルダ内に対応する実行可能ファイルが確認できます。その後、次の操作に進むことができます。
### プロジェクト実行
@@ -182,8 +185,6 @@ npm run dev
### プロジェクト構築
現在、ソフトウェアは Windows と macOS プラットフォームでのみ構築とテストが行われており、Linux プラットフォームでの正しい動作は保証できません。
```bash
# Windows 用
npm run build:win
@@ -197,14 +198,10 @@ npm run build:linux
```yml
extraResources:
# Windows用
- from: ./caption-engine/dist/main-gummy.exe
to: ./caption-engine/main-gummy.exe
- from: ./caption-engine/dist/main-vosk.exe
to: ./caption-engine/main-vosk.exe
# macOSとLinux用
# - from: ./caption-engine/dist/main-gummy
# to: ./caption-engine/main-gummy
# - from: ./caption-engine/dist/main-vosk
# to: ./caption-engine/main-vosk
# Windows
- from: ./engine/dist/main.exe
to: ./engine/main.exe
# macOS と Linux 用
# - from: ./engine/dist/main
# to: ./engine/main
```

Binary file not shown. Before: 367 KiB | After: 370 KiB

Binary file not shown. Before: 382 KiB | After: 387 KiB

Binary file not shown. Before: 372 KiB | After: 396 KiB

Binary file not shown. Before: 321 KiB | After: 323 KiB

Binary file not shown. Before: 324 KiB | After: 324 KiB

Binary file not shown. Before: 323 KiB | After: 324 KiB

Binary file not shown. Before: 73 KiB | After: 68 KiB

Binary file not shown. Before: 76 KiB | After: 76 KiB

Binary file not shown. Before: 79 KiB | After: 74 KiB

Binary file not shown.

View File

@@ -1,2 +0,0 @@
from dashscope.common.error import InvalidParameter
from .gummy import GummyTranslator

View File

@@ -1 +0,0 @@
from .process import mergeChunkChannels, resampleRawChunk, resampleMonoChunk

View File

@@ -1,58 +0,0 @@
import sys
import argparse

if sys.platform == 'win32':
    from sysaudio.win import AudioStream
elif sys.platform == 'darwin':
    from sysaudio.darwin import AudioStream
elif sys.platform == 'linux':
    from sysaudio.linux import AudioStream
else:
    raise NotImplementedError(f"Unsupported platform: {sys.platform}")

from audioprcs import mergeChunkChannels
from audio2text import InvalidParameter, GummyTranslator

def convert_audio_to_text(s_lang, t_lang, audio_type, chunk_rate, api_key):
    sys.stdout.reconfigure(line_buffering=True) # type: ignore
    stream = AudioStream(audio_type, chunk_rate)
    if t_lang == 'none':
        gummy = GummyTranslator(stream.RATE, s_lang, None, api_key)
    else:
        gummy = GummyTranslator(stream.RATE, s_lang, t_lang, api_key)
    stream.openStream()
    gummy.start()
    while True:
        try:
            chunk = stream.read_chunk()
            chunk_mono = mergeChunkChannels(chunk, stream.CHANNELS)
            try:
                gummy.send_audio_frame(chunk_mono)
            except InvalidParameter:
                gummy.start()
                gummy.send_audio_frame(chunk_mono)
        except KeyboardInterrupt:
            stream.closeStream()
            gummy.stop()
            break

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Convert system audio stream to text')
    parser.add_argument('-s', '--source_language', default='en', help='Source language code')
    parser.add_argument('-t', '--target_language', default='zh', help='Target language code')
    parser.add_argument('-a', '--audio_type', default=0, help='Audio stream source: 0 for output audio stream, 1 for input audio stream')
    parser.add_argument('-c', '--chunk_rate', default=20, help='The number of audio stream chunks collected per second.')
    parser.add_argument('-k', '--api_key', default='', help='API KEY for Gummy model')
    args = parser.parse_args()
    convert_audio_to_text(
        args.source_language,
        args.target_language,
        int(args.audio_type),
        int(args.chunk_rate),
        args.api_key
    )

View File

@@ -1,38 +0,0 @@
# -*- mode: python ; coding: utf-8 -*-

a = Analysis(
    ['main-gummy.py'],
    pathex=[],
    binaries=[],
    datas=[],
    hiddenimports=[],
    hookspath=[],
    hooksconfig={},
    runtime_hooks=[],
    excludes=[],
    noarchive=False,
    optimize=0,
)
pyz = PYZ(a.pure)

exe = EXE(
    pyz,
    a.scripts,
    a.binaries,
    a.datas,
    [],
    name='main-gummy',
    debug=False,
    bootloader_ignore_signals=False,
    strip=False,
    upx=True,
    upx_exclude=[],
    runtime_tmpdir=None,
    console=True,
    disable_windowed_traceback=False,
    argv_emulation=False,
    target_arch=None,
    codesign_identity=None,
    entitlements_file=None,
)

View File

@@ -1,83 +0,0 @@
import sys
import json
import argparse
from datetime import datetime
import numpy.core.multiarray

if sys.platform == 'win32':
    from sysaudio.win import AudioStream
elif sys.platform == 'darwin':
    from sysaudio.darwin import AudioStream
elif sys.platform == 'linux':
    from sysaudio.linux import AudioStream
else:
    raise NotImplementedError(f"Unsupported platform: {sys.platform}")

from vosk import Model, KaldiRecognizer, SetLogLevel
from audioprcs import resampleRawChunk

SetLogLevel(-1)

def convert_audio_to_text(audio_type, chunk_rate, model_path):
    sys.stdout.reconfigure(line_buffering=True) # type: ignore
    if model_path.startswith('"'):
        model_path = model_path[1:]
    if model_path.endswith('"'):
        model_path = model_path[:-1]
    model = Model(model_path)
    recognizer = KaldiRecognizer(model, 16000)
    stream = AudioStream(audio_type, chunk_rate)
    stream.openStream()
    time_str = ''
    cur_id = 0
    prev_content = ''
    while True:
        chunk = stream.read_chunk()
        chunk_mono = resampleRawChunk(chunk, stream.CHANNELS, stream.RATE, 16000)
        caption = {}
        if recognizer.AcceptWaveform(chunk_mono):
            content = json.loads(recognizer.Result()).get('text', '')
            caption['index'] = cur_id
            caption['text'] = content
            caption['time_s'] = time_str
            caption['time_t'] = datetime.now().strftime('%H:%M:%S.%f')[:-3]
            caption['translation'] = ''
            prev_content = ''
            cur_id += 1
        else:
            content = json.loads(recognizer.PartialResult()).get('partial', '')
            if content == '' or content == prev_content:
                continue
            if prev_content == '':
                time_str = datetime.now().strftime('%H:%M:%S.%f')[:-3]
            caption['index'] = cur_id
            caption['text'] = content
            caption['time_s'] = time_str
            caption['time_t'] = datetime.now().strftime('%H:%M:%S.%f')[:-3]
            caption['translation'] = ''
            prev_content = content
        try:
            json_str = json.dumps(caption) + '\n'
            sys.stdout.write(json_str)
            sys.stdout.flush()
        except Exception as e:
            print(e)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Convert system audio stream to text')
    parser.add_argument('-a', '--audio_type', default=0, help='Audio stream source: 0 for output audio stream, 1 for input audio stream')
    parser.add_argument('-c', '--chunk_rate', default=20, help='The number of audio stream chunks collected per second.')
    parser.add_argument('-m', '--model_path', default='', help='The path to the vosk model.')
    args = parser.parse_args()
    convert_audio_to_text(
        int(args.audio_type),
        int(args.chunk_rate),
        args.model_path
    )

View File

@@ -105,3 +105,43 @@
- 调整字幕窗口右上角图标为竖向排布
- 过滤 Gummy 字幕引擎输出的不完整字幕
## v0.5.1
2025-07-17
### 修复 bug
- 修复无法调用自定义字幕引擎的 bug
- 修复自定义字幕引擎的参数失效 bug
## v0.6.0
2025-07-29
### 新增功能
- 新增字幕记录排序功能,可选择字幕记录正序或倒序显示
### 优化体验
- 减小了软件安装包的体积
- 微调字幕引擎设置界面布局
- 交换窗口界面信息弹窗和错误弹窗的位置,防止提示信息挡住操作
- 提高程序健壮性,完全避免字幕引擎进程成为孤儿进程
- 修改字幕引擎文档,添加更详细的开发说明
### 项目优化
- 重构字幕引擎,提升字幕引擎代码的可扩展性和可读性
- 合并 Gummy 和 Vosk 引擎为单个可执行文件
- 字幕引擎和主程序添加 Socket 通信,完全避免字幕引擎成为孤儿进程
## v0.7.0
2025-08-xx
### 新增功能
- 添加字幕窗口宽度记忆,重新打开时与上次字幕窗口宽度一致
- 在尝试关闭字幕引擎 4s 后字幕引擎仍未关闭,则强制关闭字幕引擎

View File

@@ -15,10 +15,13 @@
- [x] 可以调整字幕时间轴 *2025/07/14*
- [x] 可以导出 srt 格式的字幕记录 *2025/07/14*
- [x] 可以获取字幕引擎的系统资源消耗情况 *2025/07/15*
- [x] 添加字幕记录按时间降序排列选择 *2025/07/26*
- [x] 重构字幕引擎 *2025/07/28*
- [x] 优化前端界面提示消息 *2025/07/29*
## 待完成
- [ ] 探索更多的语音转文字模型
- [ ] 验证 / 添加基于 sherpa-onnx 的字幕引擎
## 后续计划

View File

@@ -0,0 +1,109 @@
# caption engine api-doc
本文档主要介绍字幕引擎和 Electron 主进程之间的通信约定。
## 原理说明
本项目的 Python 进程通过标准输出向 Electron 主进程发送数据。Python 进程标准输出 (`sys.stdout`) 的内容一定为一行一行的字符串。且每行字符串均可以解释为一个 JSON 对象。每个 JSON 对象一定有 `command` 参数。
Electron 主进程通过 TCP Socket 向 Python 进程发送数据。发送的数据均是转化为字符串的对象,对象格式一定为:
```js
{
command: string,
content: string
}
```
## 标准输出约定
> 数据传递方向:字幕引擎进程 => Electron 主进程
当 JSON 对象的 `command` 参数为下列值时,表示的对应的含义:
### `connect`
```js
{
command: "connect",
content: ""
}
```
字幕引擎 TCP Socket 服务已经准备好,命令 Electron 主进程连接字幕引擎 Socket 服务
### `kill`
```js
{
command: "connect",
content: ""
}
```
命令 Electron 主进程强制结束字幕引擎进程。
### `caption`
```js
{
command: "caption",
index: number,
time_s: string,
time_t: string,
text: string,
translation: string
}
```
Python 端监听到的音频流转换为的字幕数据。
### `print`
```js
{
command: "print",
content: string
}
```
输出 Python 端打印的内容。
### `info`
```js
{
command: "info",
content: string
}
```
Python 端打印的提示信息,比起 `print`,该信息更需要 Electron 端关注。
### `usage`
```js
{
command: "usage",
content: string
}
```
Gummy 字幕引擎结束时打印计费消耗信息。
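下面给出一个简化示意(非项目源码,函数名仅作演示,作用与项目 `utils` 中的 `stdout_obj` 类似),展示字幕引擎如何按上述约定向标准输出写入一行 JSON 并立即刷新缓冲区:

```python
import sys
import json

def emit(obj: dict) -> None:
    """向标准输出写入一行 JSON 字符串并立即刷新缓冲区,保证主进程每次读到完整对象"""
    sys.stdout.write(json.dumps(obj) + '\n')
    sys.stdout.flush()

# 按上文 `caption` 约定输出一条字幕数据(字段值仅为示例)
emit({
    "command": "caption",
    "index": 0,
    "time_s": "12:00:00.000",
    "time_t": "12:00:01.200",
    "text": "hello world",
    "translation": "你好,世界"
})
```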
## TCP Socket
> 数据传递方向:Electron 主进程 => 字幕引擎进程
当 JSON 对象的 `command` 参数为下列值时,表示的对应的含义:
### `stop`
```js
{
command: "stop",
content: ""
}
```
命令当前字幕引擎停止监听并结束任务。
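下面是字幕引擎侧处理 `stop` 命令的一个简化示意(非项目源码,假设每次 `recv` 恰好收到一条完整消息,端口与状态变量均为演示用途):

```python
import json
import socket
import threading

status = {"value": "running"}  # 演示用状态标志,对应文中提到的 thread_data.status

def serve(port: int) -> None:
    """监听指定端口,解析 Electron 主进程发来的命令对象"""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", port))
    server.listen(1)
    conn, _ = server.accept()
    while status["value"] == "running":
        data = conn.recv(4096)
        if not data:
            break
        msg = json.loads(data.decode("utf-8"))
        if msg.get("command") == "stop":
            status["value"] = "stop"  # 主循环检测到该状态后停止监听、释放资源并退出
    conn.close()
    server.close()

threading.Thread(target=serve, args=(8080,), daemon=True).start()
```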

View File

@@ -236,13 +236,13 @@
### `control.engine.started`
**介绍:** 引擎启动成功
**介绍:** 引擎启动成功,参数为引擎的进程 ID
**发起方:** 后端
**接收方:** 前端控制窗口
**数据类型:** 无数据
**数据类型:** `number`
### `control.engine.stopped`

View File

@@ -1,201 +1,199 @@
# Caption Engine Documentation
# Caption Engine Documentation
Corresponding Version: v0.5.0
Corresponding Version: v0.6.0
![](../../assets/media/structure_en.png)
![](../../assets/media/structure_en.png)
## Introduction to the Caption Engine
## Introduction to the Caption Engine
The so-called caption engine is actually a subprogram that captures real-time streaming data from the system's audio input (recording) or output (playing sound) and calls an audio-to-text model to generate captions for the corresponding audio. The generated captions are converted into a JSON-formatted string and passed to the main program through standard output (it must be ensured that the string read by the main program can be correctly interpreted as a JSON object). The main program reads and interprets the caption data, processes it, and then displays it on the window.
The so-called caption engine is essentially a subprogram that continuously captures real-time streaming data from the system's audio input (microphone) or output (speakers) and invokes an audio-to-text model to generate corresponding captions for the audio. The generated captions are converted into JSON-formatted string data and passed to the main program via standard output (ensuring the string can be correctly interpreted as a JSON object by the main program). The main program reads and interprets the caption data, processes it, and displays it in the window.
## Functions Required by the Caption Engine
**The communication standard between the caption engine process and the Electron main process is: [caption engine api-doc](../api-docs/caption-engine.md).**
### Audio Acquisition
## Workflow
First, your caption engine needs to capture streaming data from the system's audio input (recording) or output (playing sound). If using Python for development, you can use the PyAudio library to obtain microphone audio input data (cross-platform). Use the PyAudioWPatch library to get system audio output (Windows platform only).
The communication flow between the main process and the caption engine:
Generally, the captured audio stream data consists of short audio chunks, and the size of these chunks should be adjusted according to the model. For example, Alibaba Cloud's Gummy model performs better with 0.05-second audio chunks compared to 0.2-second ones.
### Starting the Engine
### Audio Processing
- **Main Process**: Uses `child_process.spawn()` to launch the caption engine process.
- **Caption Engine Process**: Creates a TCP Socket server thread. After creation, it outputs a JSON object string via standard output, containing a `command` field with the value `connect`.
- **Main Process**: Monitors the standard output of the caption engine process, attempts to split it line by line, parses it into a JSON object, and checks if the `command` field value is `connect`. If so, it connects to the TCP Socket server.
The acquired audio stream may need preprocessing before being converted to text. For instance, Alibaba Cloud's Gummy model can only recognize single-channel audio streams, while the collected audio streams are typically dual-channel, thus requiring conversion from dual-channel to single-channel. Channel conversion can be achieved using methods in the NumPy library.
### Caption Recognition
You can directly use the audio acquisition (`caption-engine/sysaudio`) and audio processing (`caption-engine/audioprcs`) modules I have developed.
- **Caption Engine Process**: The main thread monitors system audio output, sends audio data chunks to the caption engine for parsing, and outputs the parsed caption data object strings via standard output.
- **Main Process**: Continues to monitor the standard output of the caption engine and performs different operations based on the `command` field of the parsed object.
### Audio to Text Conversion
### Closing the Engine
After obtaining the appropriate audio stream, you can convert it into text. This is generally done using various models based on your requirements.
- **Main Process**: When the user closes the caption engine via the frontend, the main process sends a JSON object string with the `command` field set to `stop` to the caption engine process via Socket communication.
- **Caption Engine Process**: Receives the object string, parses it, and if the `command` field is `stop`, sets the global variable `thread_data.status` to `stop`.
- **Caption Engine Process**: The main thread's loop for monitoring system audio output ends when `thread_data.status` is not `running`, releases resources, and terminates.
- **Main Process**: Detects the termination of the caption engine process, performs corresponding cleanup, and provides feedback to the frontend.
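A minimal sketch of the engine-side half of this flow is shown below. It is not the project's actual `main.py`; it assumes the `start_server`, `thread_data`, and `AudioStream` helpers described later in this document and uses the `thread_data.status` flag from the workflow above:

```python
from utils import start_server, thread_data  # helpers described in this document
from sysaudio import AudioStream

def run(port: int, audio_type: int, chunk_rate: int) -> None:
    start_server(port)                    # TCP server thread; a "stop" command sets thread_data.status
    stream = AudioStream(audio_type, chunk_rate)
    stream.open_stream()
    while thread_data.status == 'running':
        chunk = stream.read_chunk()
        # ... feed the chunk to the speech-to-text model and emit captions on stdout ...
    stream.close_stream()                 # loop ended: release resources, process exits
```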
A nearly complete implementation of a caption engine is as follows:
## Implemented Features
```python
import sys
import argparse
The following features are already implemented and can be reused directly.
# Import system audio acquisition module
if sys.platform == 'win32':
from sysaudio.win import AudioStream
elif sys.platform == 'darwin':
from sysaudio.darwin import AudioStream
elif sys.platform == 'linux':
from sysaudio.linux import AudioStream
else:
raise NotImplementedError(f"Unsupported platform: {sys.platform}")
### Standard Output
# Import audio processing functions
from audioprcs import mergeChunkChannels
# Import audio-to-text module
from audio2text import InvalidParameter, GummyTranslator
Supports printing general information, commands, and error messages.
Example:
def convert_audio_to_text(s_lang, t_lang, audio_type, chunk_rate, api_key):
# Set standard output to line buffering
sys.stdout.reconfigure(line_buffering=True) # type: ignore
```python
from utils import stdout, stdout_cmd, stdout_obj, stderr
stdout("Hello") # {"command": "print", "content": "Hello"}\n
stdout_cmd("connect", "8080") # {"command": "connect", "content": "8080"}\n
stdout_obj({"command": "print", "content": "Hello"})
stderr("Error Info")
```
# Create instances for audio acquisition and speech-to-text
stream = AudioStream(audio_type, chunk_rate)
if t_lang == 'none':
gummy = GummyTranslator(stream.RATE, s_lang, None, api_key)
else:
gummy = GummyTranslator(stream.RATE, s_lang, t_lang, api_key)
### Creating a Socket Service
# Start instances
stream.openStream()
gummy.start()
This Socket service listens on a specified port, parses content sent by the Electron main program, and may modify the value of `thread_data.status`.
while True:
try:
# Read audio stream data
chunk = stream.read_chunk()
chunk_mono = mergeChunkChannels(chunk, stream.CHANNELS)
try:
# Call the model for translation
gummy.send_audio_frame(chunk_mono)
except InvalidParameter:
gummy.start()
gummy.send_audio_frame(chunk_mono)
except KeyboardInterrupt:
stream.closeStream()
gummy.stop()
break
```
Example:
### Caption Translation
```python
from utils import start_server
from utils import thread_data
port = 8080
start_server(port)
while thread_data == 'running':
    # do something
    pass
```
Some speech-to-text models don't provide translation functionality, requiring an additional translation module. This part can use either cloud-based translation APIs or local translation models.
### Audio Capture
### Data Transmission
The `AudioStream` class captures audio data and is cross-platform, supporting Windows, Linux, and macOS. Its initialization includes two parameters:
After obtaining the text of the current audio stream, it needs to be transmitted to the main program. The caption engine process passes the caption data to the Electron main process through standard output.
- `audio_type`: The type of audio to capture. `0` for system output audio (speakers), `1` for system input audio (microphone).
- `chunk_rate`: The frequency of audio data capture, i.e., the number of audio chunks captured per second.
The content transmitted must be a JSON string, where the JSON object must contain the following parameters:
The class includes three methods:
```typescript
export interface CaptionItem {
index: number, // Caption sequence number
time_s: string, // Caption start time
time_t: string, // Caption end time
text: string, // Caption content
translation: string // Caption translation
}
```
- `open_stream()`: Starts audio capture.
- `read_chunk() -> bytes`: Reads an audio chunk.
- `close_stream()`: Stops audio capture.
**It is essential to ensure that each time we output caption JSON data, the buffer is flushed, ensuring that the string received by the Electron main process can always be interpreted as a JSON object.**
Example:
If using Python, you can refer to the following method to pass data to the main program:
```python
from sysaudio import AudioStream
audio_type = 0
chunk_rate = 20
stream = AudioStream(audio_type, chunk_rate)
stream.open_stream()
while True:
    data = stream.read_chunk()
    # do something with data
    pass
stream.close_stream()
```
```python
# caption-engine\main-gummy.py
sys.stdout.reconfigure(line_buffering=True)
### Audio Processing
# caption-engine\audio2text\gummy.py
...
def send_to_node(self, data):
"""
Send data to the Node.js process
"""
try:
json_data = json.dumps(data) + '\n'
sys.stdout.write(json_data)
sys.stdout.flush()
except Exception as e:
print(f"Error sending data to Node.js: {e}", file=sys.stderr)
...
```
The captured audio stream may require preprocessing before conversion to text. Typically, multi-channel audio needs to be converted to mono, and resampling may be necessary. This project provides three audio processing functions:
Data receiver code is as follows:
- `merge_chunk_channels(chunk: bytes, channels: int) -> bytes`: Converts a multi-channel audio chunk to mono.
- `resample_chunk_mono(chunk: bytes, channels: int, orig_sr: int, target_sr: int, mode="sinc_best") -> bytes`: Converts a multi-channel audio chunk to mono and resamples it.
- `resample_mono_chunk(chunk: bytes, orig_sr: int, target_sr: int, mode="sinc_best") -> bytes`: Resamples a mono audio chunk.
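A minimal usage sketch of the helpers listed above (hedged: it assumes they are importable from the engine's `utils` module, per the refactor notes in this changeset, and that the stream exposes `CHANNELS` and `RATE` as in the earlier engine code):

```python
from sysaudio import AudioStream
from utils import merge_chunk_channels, resample_chunk_mono

stream = AudioStream(0, 20)   # system output audio, 20 chunks per second
stream.open_stream()
chunk = stream.read_chunk()

# multi-channel -> mono, keeping the capture sample rate
mono = merge_chunk_channels(chunk, stream.CHANNELS)

# multi-channel -> mono plus resampling, e.g. down to a 16 kHz model such as Vosk
mono_16k = resample_chunk_mono(chunk, stream.CHANNELS, stream.RATE, 16000)

stream.close_stream()
```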
## Features to Be Implemented in the Caption Engine
```typescript
// src\main\utils\engine.ts
...
this.process.stdout.on('data', (data) => {
const lines = data.toString().split('\n');
lines.forEach((line: string) => {
if (line.trim()) {
try {
const caption = JSON.parse(line);
addCaptionLog(caption);
} catch (e) {
controlWindow.sendErrorMessage('Unable to parse the output from the caption engine as a JSON object: ' + e)
console.error('[ERROR] Error parsing JSON:', e);
}
}
});
});
### Audio-to-Text Conversion
this.process.stderr.on('data', (data) => {
controlWindow.sendErrorMessage('Caption engine error: ' + data)
console.error(`[ERROR] Subprocess Error: ${data}`);
});
...
```
After obtaining a suitable audio stream, it needs to be converted to text. Typically, various models (cloud-based or local) are used for this purpose. Choose the appropriate model based on requirements.
## Usage of Caption Engine
This part is recommended to be encapsulated as a class with three methods:
### Command Line Parameter Specification
- `start(self)`: Starts the model.
- `send_audio_frame(self, data: bytes)`: Processes the current audio chunk data. **The generated caption data is sent to the Electron main process via standard output.**
- `stop(self)`: Stops the model.
The custom caption engine settings are specified via command line parameters. Common required parameters are as follows:
Complete caption engine examples:
```python
import argparse
- [gummy.py](../../engine/audio2text/gummy.py)
- [vosk.py](../../engine/audio2text/vosk.py)
...
### Caption Translation
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='Convert system audio stream to text')
parser.add_argument('-s', '--source_language', default='en', help='Source language code')
parser.add_argument('-t', '--target_language', default='zh', help='Target language code')
parser.add_argument('-a', '--audio_type', default=0, help='Audio stream source: 0 for output audio stream, 1 for input audio stream')
parser.add_argument('-c', '--chunk_rate', default=20, help='The number of audio stream chunks collected per second.')
parser.add_argument('-k', '--api_key', default='', help='API KEY for Gummy model')
args = parser.parse_args()
convert_audio_to_text(
args.source_language,
args.target_language,
int(args.audio_type),
int(args.chunk_rate),
args.api_key
)
```
Some speech-to-text models do not provide translation. If needed, a translation module must be added.
For example, to specify Japanese as source language, Chinese as target language, capture system audio output, and collect 0.1s audio chunks, use the following command:
### Sending Caption Data
```bash
python main-gummy.py -s ja -t zh -a 0 -c 10 -k <your-api-key>
```
After obtaining the text for the current audio stream, it must be sent to the main program. The caption engine process passes caption data to the Electron main process via standard output.
### Packaging
The content must be a JSON string, with the JSON object including the following parameters:
After development and testing, package the caption engine into an executable file using `pyinstaller`. If errors occur, check for missing dependencies.
```typescript
export interface CaptionItem {
command: "caption",
index: number, // Caption sequence number
time_s: string, // Start time of the current caption
time_t: string, // End time of the current caption
text: string, // Caption content
translation: string // Caption translation
}
```
### Execution
**Note: Ensure the buffer is flushed after each JSON output to guarantee the Electron main process receives a string that can be parsed as a JSON object.**
With a working caption engine, specify its path and runtime parameters in the caption software window to launch it.
It is recommended to use the project's `stdout_obj` function for sending.
![](../img/02_en.png)
### Command-Line Parameter Specification
Custom caption engine settings are provided via command-line arguments. The current project uses the following parameters:
## Reference Code
```python
import argparse
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Convert system audio stream to text')
    # Common parameters
    parser.add_argument('-e', '--caption_engine', default='gummy', help='Caption engine: gummy or vosk')
    parser.add_argument('-a', '--audio_type', default=0, help='Audio stream source: 0 for output, 1 for input')
    parser.add_argument('-c', '--chunk_rate', default=20, help='Number of audio stream chunks collected per second')
    parser.add_argument('-p', '--port', default=8080, help='The port to run the server on, 0 for no server')
    # Gummy-specific parameters
    parser.add_argument('-s', '--source_language', default='en', help='Source language code')
    parser.add_argument('-t', '--target_language', default='zh', help='Target language code')
    parser.add_argument('-k', '--api_key', default='', help='API KEY for Gummy model')
    # Vosk-specific parameters
    parser.add_argument('-m', '--model_path', default='', help='The path to the vosk model.')
```
The `main-gummy.py` file under the `caption-engine` folder in this project serves as the entry point for the default caption engine. The `src\main\utils\engine.ts` file contains the server-side code for acquiring and processing data from the caption engine. You can read and understand the implementation details and the complete execution process of the caption engine as needed.
For example, to use the Gummy model with Japanese as the source language, Chinese as the target language, and system audio output captions with 0.1s audio chunks, the command-line arguments would be:
```bash
python main.py -e gummy -s ja -t zh -a 0 -c 10 -k <dashscope-api-key>
```
## Additional Notes
### Communication Standards
[caption engine api-doc](../api-docs/caption-engine.md)
### Program Entry
[main.py](../../engine/main.py)
### Development Recommendations
Apart from audio-to-text conversion, it is recommended to reuse the existing code. In this case, the following additions are needed:
- `engine/audio2text/`: Add a new audio-to-text class (file-level).
- `engine/main.py`: Add new parameter settings and workflow functions (refer to `main_gummy` and `main_vosk` functions).
### Packaging
After development and testing, the caption engine must be packaged into an executable. Typically, `pyinstaller` is used. If the packaged executable reports errors, check for missing dependencies.
### Execution
With a functional caption engine, it can be launched in the caption software window by specifying the engine's path and runtime arguments.
![](../img/02_en.png)

View File

@@ -1,201 +1,201 @@
# 字幕エンジン説明文書
# 字幕エンジン説明ドキュメント
対応バージョン:v0.5.0
対応バージョン:v0.6.0
この文書は大規模モデルを使用して翻訳されていますので、内容に正確でない部分があるかもしれません。
![](../../assets/media/structure_ja.png)
## 字幕エンジン紹介
## 字幕エンジン紹介
所謂字幕エンジンは実際にはサブプログラムであり、システムの音声入力(録音)または出力(音声再生)のストリーミングデータをリアルタイムで取得し、音声からテキストへの変換モデルを使って対応する音声の字幕を生成します。生成された字幕はJSON形式の文字列データに変換され、標準出力を通じてメインプログラムに渡されます(メインプログラムが読み取った文字列が正しJSONオブジェクトとして解釈されることが保証される必要があります)。メインプログラムは字幕データを読み取り、解釈して処理し、ウィンドウに表示します。
字幕エンジンとは、システムのオーディオ入力(マイク)または出力(スピーカー)のストリーミングデータをリアルタイムで取得し、音声を文字に変換するモデルを呼び出して対応する字幕を生成するサブプログラムです。生成された字幕はJSON形式の文字列データに変換され、標準出力を介してメインプログラムに渡されます(メインプログラムが受け取る文字列が正しくJSONオブジェクトとして解釈できる必要があります)。メインプログラムは字幕データを読み取り、解釈して処理した後、ウィンドウに表示します。
## 字幕エンジンが必要な機能
**字幕エンジンプロセスとElectronメインプロセス間の通信は、[caption engine api-doc](../api-docs/caption-engine.md)に準拠しています。**
### 音声の取得
## 実行フロー
まず、あなたの字幕エンジンはシステムの音声入力録音または出力音声再生のストリーミングデータを取得する必要があります。Pythonを使用して開発する場合、PyAudioライブラリを使ってマイクからの音声入力データを取得できます全プラットフォーム共通。また、WindowsプラットフォームではPyAudioWPatchライブラリを使ってシステムの音声出力を取得することもできます。
メインプロセスと字幕エンジンの通信フロー:
一般的に取得される音声ストリームデータは、比較的短い時間間隔の音声ブロックで構成されています。モデルに合わせて音声ブロックのサイズを調整する必要があります。例えば、アリババクラウドのGummyモデルでは、0.05秒の音声ブロックを使用した認識結果の方が0.2秒の音声ブロックよりも優れています。
### エンジンの起動
### 音声の処理
- メインプロセス:`child_process.spawn()`を使用して字幕エンジンプロセスを起動
- 字幕エンジンプロセス:TCP Socketサーバースレッドを作成し、作成後に標準出力にJSONオブジェクトを文字列化して出力。このオブジェクトには`command`フィールドが含まれ、値は`connect`
- メインプロセス:字幕エンジンプロセスの標準出力を監視し、標準出力を行ごとに分割してJSONオブジェクトとして解析し、オブジェクトの`command`フィールドの値が`connect`かどうかを判断。`connect`の場合はTCP Socketサーバーに接続
取得した音声ストリームは、テキストに変換する前に前処理が必要な場合があります。例えば、アリババクラウドのGummyモデルは単一チャンネルの音声ストリームしか認識できませんが、収集された音声ストリームは通常二重チャンネルであるため、二重チャンネルの音声ストリームを単一チャンネルに変換する必要があります。チャンネル数の変換はNumPyライブラリのメソッドを使って行うことができます。
### 字幕認識
あなたは私によって開発された音声の取得(`caption-engine/sysaudio`)と音声の処理(`caption-engine/audioprcs`)モジュールを直接使用することができます。
- 字幕エンジンプロセス:メインスレッドでシステムオーディオ出力を監視し、オーディオデータブロックを字幕エンジンに送信して解析。字幕エンジンはオーディオデータブロックを解析し、標準出力を介して解析された字幕データオブジェクト文字列を送信
- メインプロセス:字幕エンジンの標準出力を引き続き監視し、解析されたオブジェクトの`command`フィールドに基づいて異なる操作を実行
### 音声からテキストへの変換
### エンジンの停止
適切な音声ストリームを得た後、それをテキストに変換することができます。通常、様々なモデルを使って音声ストリームをテキストに変換します。必要に応じてモデルを選択することができます。
- メインプロセス:ユーザーがフロントエンドで字幕エンジンを停止する操作を実行すると、メインプロセスはSocket通信を介して字幕エンジンプロセスに`command`フィールドが`stop`のオブジェクト文字列を送信
- 字幕エンジンプロセス:メインプロセスから送信されたオブジェクト文字列を受信し、文字列をオブジェクトとして解析。オブジェクトの`command`フィールドが`stop`の場合、グローバル変数`thread_data.status`の値を`stop`に設定
- 字幕エンジンプロセス:メインスレッドでシステムオーディオ出力をループ監視し、`thread_data.status`の値が`running`でない場合、ループを終了し、リソースを解放して実行を終了
- メインプロセス:字幕エンジンプロセスの終了を検出した場合、対応する処理を実行し、フロントエンドにフィードバック
ほぼ完全な字幕エンジンの実装例:
## プロジェクトで実装済みの機能
以下の機能はすでに実装されており、直接再利用できます。
### 標準出力
通常情報、コマンド、エラー情報を出力できます。
サンプル:
```python
import sys
import argparse
# システム音声の取得に関する設定
if sys.platform == 'win32':
from sysaudio.win import AudioStream
elif sys.platform == 'darwin':
from sysaudio.darwin import AudioStream
elif sys.platform == 'linux':
from sysaudio.linux import AudioStream
else:
raise NotImplementedError(f"Unsupported platform: {sys.platform}")
# 音声処理関数のインポート
from audioprcs import mergeChunkChannels
# 音声からテキストへの変換モジュールのインポート
from audio2text import InvalidParameter, GummyTranslator
def convert_audio_to_text(s_lang, t_lang, audio_type, chunk_rate, api_key):
# 標準出力をラインバッファリングに設定
sys.stdout.reconfigure(line_buffering=True) # type: ignore
# 音声の取得と音声からテキストへの変換のインスタンスを作成
stream = AudioStream(audio_type, chunk_rate)
if t_lang == 'none':
gummy = GummyTranslator(stream.RATE, s_lang, None, api_key)
else:
gummy = GummyTranslator(stream.RATE, s_lang, t_lang, api_key)
# インスタンスを開始
stream.openStream()
gummy.start()
while True:
try:
# 音声ストリームデータを読み込む
chunk = stream.read_chunk()
chunk_mono = mergeChunkChannels(chunk, stream.CHANNELS)
try:
# モデルを使って翻訳を行う
gummy.send_audio_frame(chunk_mono)
except InvalidParameter:
gummy.start()
gummy.send_audio_frame(chunk_mono)
except KeyboardInterrupt:
stream.closeStream()
gummy.stop()
break
from utils import stdout, stdout_cmd, stdout_obj, stderr
stdout("Hello") # {"command": "print", "content": "Hello"}\n
stdout_cmd("connect", "8080") # {"command": "connect", "content": "8080"}\n
stdout_obj({"command": "print", "content": "Hello"})
stderr("Error Info")
```
### Socketサービスの作成
このSocketサービスは指定されたポートを監視し、Electronメインプログラムから送信された内容を解析し、`thread_data.status`の値を変更する可能性があります。
サンプル:
```python
from utils import start_server
from utils import thread_data
port = 8080
start_server(port)
while thread_data == 'running':
    # 何か処理
    pass
```
### オーディオ取得
`AudioStream`クラスはオーディオデータを取得するために使用され、Windows、Linux、macOSでクロスプラットフォームで実装されています。このクラスの初期化には2つのパラメータが含まれます
- `audio_type`:取得するオーディオのタイプ。0はシステム出力オーディオ(スピーカー)、1はシステム入力オーディオ(マイク)
- `chunk_rate`:オーディオデータの取得頻度。1秒あたりに取得するオーディオブロックの数
このクラスには3つのメソッドがあります
- `open_stream()`:オーディオ取得を開始
- `read_chunk() -> bytes`1つのオーディオブロックを読み取り
- `close_stream()`:オーディオ取得を閉じる
サンプル:
```python
from sysaudio import AudioStream
audio_type = 0
chunk_rate = 20
stream = AudioStream(audio_type, chunk_rate)
stream.open_stream()
while True:
    data = stream.read_chunk()
    # データで何か処理
    pass
stream.close_stream()
```
### オーディオ処理
取得したオーディオストリームは、文字に変換する前に前処理が必要な場合があります。一般的に、マルチチャンネルオーディオをシングルチャンネルオーディオに変換し、リサンプリングが必要な場合もあります。このプロジェクトでは、3つのオーディオ処理関数を提供しています
- `merge_chunk_channels(chunk: bytes, channels: int) -> bytes`:マルチチャンネルオーディオブロックをシングルチャンネルオーディオブロックに変換
- `resample_chunk_mono(chunk: bytes, channels: int, orig_sr: int, target_sr: int, mode="sinc_best") -> bytes`:現在のマルチチャンネルオーディオデータブロックをシングルチャンネルオーディオデータブロックに変換し、リサンプリングを実行
- `resample_mono_chunk(chunk: bytes, orig_sr: int, target_sr: int, mode="sinc_best") -> bytes`:現在のシングルチャンネルオーディオブロックをリサンプリング
## 字幕エンジンで実装が必要な機能
### オーディオから文字への変換
適切なオーディオストリームを取得した後、オーディオストリームを文字に変換する必要があります。一般的に、さまざまなモデル(クラウドまたはローカル)を使用してオーディオストリームを文字に変換します。要件に応じて適切なモデルを選択する必要があります。
この部分はクラスとしてカプセル化することをお勧めします。以下の3つのメソッドを実装する必要があります
- `start(self)`:モデルを起動
- `send_audio_frame(self, data: bytes)`:現在のオーディオブロックデータを処理し、**生成された字幕データを標準出力を介してElectronメインプロセスに送信**
- `stop(self)`:モデルを停止
完全な字幕エンジンの実例:
- [gummy.py](../../engine/audio2text/gummy.py)
- [vosk.py](../../engine/audio2text/vosk.py)
### 字幕翻訳
音声認識モデルによっては翻訳機能を提供していないため、別途翻訳モジュールを追加する必要があります。この部分にはクラウドベースの翻訳APIを使用することも、ローカルの翻訳モデルを使用することも可能です。
一部の音声文字変換モデルは翻訳を提供していません。必要がある場合、翻訳モジュールを追加する必要があります。
### データの
### 字幕データの送信
現在の音声ストリームのテキストを得たら、それをメインプログラムに渡す必要があります。字幕エンジンプロセスは標準出力を通じて電子メール主プロセスに字幕データを渡します。
現在のオーディオストリームのテキストを取得した後、そのテキストをメインプログラムに送信する必要があります。字幕エンジンプロセスは標準出力を介して字幕データをElectronメインプロセスに渡します。
渡す内容はJSON文字列でなければなりません。JSONオブジェクトには以下のパラメータを含める必要があります
送信する内容はJSON文字列でなければなりません。JSONオブジェクトには以下のパラメータを含める必要があります
```typescript
export interface CaptionItem {
index: number, // 字幕番号
time_s: string, // 現在の字幕開始時間
time_t: string, // 現在の字幕終了時間
text: string, // 字幕内容
translation: string // 字幕翻訳
command: "caption",
index: number, // 字幕のシーケンス番号
time_s: string, // 現在の字幕の開始時間
time_t: string, // 現在の字幕の終了時間
text: string, // 字幕の内容
translation: string // 字幕の翻訳
}
```
**必ず、字幕JSONデータを出力するたびにバッファをフラッシュし、electronプロセスが受け取る文字列が常にJSONオブジェクトとして解釈できるようにする必要があります。**
**JSONデータを出力するたびにバッファをフラッシュし、electronメインプロセスが受信する文字列が常にJSONオブジェクトとして解釈できるようにする必要があります。**
Python言語を使用する場合、以下の方法でデータをメインプログラムに渡すことができます
プロジェクトで既に実装されている`stdout_obj`関数を使用して送信することをお勧めします
```python
# caption-engine\main-gummy.py
sys.stdout.reconfigure(line_buffering=True)
### コマンドラインパラメータの指定
# caption-engine\audio2text\gummy.py
...
def send_to_node(self, data):
"""
Node.jsプロセスにデータを送信する
"""
try:
json_data = json.dumps(data) + '\n'
sys.stdout.write(json_data)
sys.stdout.flush()
except Exception as e:
print(f"Error sending data to Node.js: {e}", file=sys.stderr)
...
```
データ受信側のコード
```typescript
// src\main\utils\engine.ts
...
this.process.stdout.on('data', (data) => {
const lines = data.toString().split('\n');
lines.forEach((line: string) => {
if (line.trim()) {
try {
const caption = JSON.parse(line);
addCaptionLog(caption);
} catch (e) {
controlWindow.sendErrorMessage('字幕エンジンの出力をJSONオブジェクトとして解析できません:' + e)
console.error('[ERROR] JSON解析エラー:', e);
}
}
});
});
this.process.stderr.on('data', (data) => {
controlWindow.sendErrorMessage('字幕エンジンエラー:' + data)
console.error(`[ERROR] サブプロセスエラー: ${data}`);
});
...
```
## 字幕エンジンの使用方法
### コマンドライン引数の指定
カスタム字幕エンジンの設定はコマンドライン引数で指定します。主な必要なパラメータは以下の通りです:
カスタム字幕エンジンの設定はコマンドラインパラメータで指定するため、字幕エンジンのパラメータを設定する必要があります。このプロジェクトで現在使用されているパラメータは以下のとおりです:
```python
import argparse
...
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='システムオーディオストリームをテキストに変換')
parser = argparse.ArgumentParser(description='システムオーディオストリームをテキストに変換')
# 共通
parser.add_argument('-e', '--caption_engine', default='gummy', help='字幕エンジン: gummyまたはvosk')
parser.add_argument('-a', '--audio_type', default=0, help='オーディオストリームソース: 0は出力、1は入力')
parser.add_argument('-c', '--chunk_rate', default=20, help='1秒あたりに収集するオーディオストリームブロックの数')
parser.add_argument('-p', '--port', default=8080, help='サーバーを実行するポート、0はサーバーなし')
# gummy専用
parser.add_argument('-s', '--source_language', default='en', help='ソース言語コード')
parser.add_argument('-t', '--target_language', default='zh', help='ターゲット言語コード')
parser.add_argument('-a', '--audio_type', default=0, help='オーディオストリームソース: 0は出力音声、1は入力音声')
parser.add_argument('-c', '--chunk_rate', default=20, help='1秒間に収集するオーディオチャンク数')
parser.add_argument('-k', '--api_key', default='', help='GummyモデルのAPIキー')
args = parser.parse_args()
convert_audio_to_text(
args.source_language,
args.target_language,
int(args.audio_type),
int(args.chunk_rate),
args.api_key
)
parser.add_argument('-k', '--api_key', default='', help='GummyモデルのAPI KEY')
# vosk専用
parser.add_argument('-m', '--model_path', default='', help='voskモデルのパス')
```
原文を日本語、翻訳を中国語に指定し、システム音声出力を取得、0.1秒のオーディオデータを収集する場合
たとえば、このプロジェクトの字幕エンジンでGummyモデルを使用し、原文を日本語、翻訳を中国語に指定し、システムオーディオ出力の字幕を取得し、毎回0.1秒のオーディオデータをキャプチャする場合、コマンドラインパラメータは以下のようになります
```bash
python main-gummy.py -s ja -t zh -a 0 -c 10 -k <your-api-key>
python main.py -e gummy -s ja -t zh -a 0 -c 10 -k <dashscope-api-key>
```
## その他
### 通信規格
[caption engine api-doc](../api-docs/caption-engine.md)
### プログラムエントリ
[main.py](../../engine/main.py)
### 開発の推奨事項
オーディオから文字への変換以外は、このプロジェクトのコードを直接再利用することをお勧めします。その場合、追加する必要がある内容は:
- `engine/audio2text/`:新しいオーディオから文字への変換クラスを追加(ファイルレベル)
- `engine/main.py`:新しいパラメータ設定とプロセス関数を追加(`main_gummy`関数と`main_vosk`関数を参照)
### パッケージ化
開発とテスト完了後、`pyinstaller`を使用して実行可能ファイルにパッケージ化します。エラーが発生した場合、依存ライブラリの不足を確認してください。
字幕エンジンの開発とテストが完了した後、字幕エンジンを実行可能ファイルにパッケージ化する必要があります。一般的に`pyinstaller`を使用してパッケージ化します。パッケージ化された字幕エンジンファイルの実行でエラーが発生した場合、依存ライブラリが不足している可能性があります。不足している依存ライブラリを確認してください。
### 実行
用可能な字幕エンジンが準備できたら、字幕ソフトウェアウィンドウでエンジンのパスと実行パラメータを指定して起動します。
使用可能な字幕エンジンを取得したら、字幕ソフトウェアウィンドウで字幕エンジンのパスと字幕エンジンの実行コマンド(パラメータ)を指定して字幕エンジンを起動できます。
![](../img/02_ja.png)
## 参考コード
本プロジェクトの`caption-engine`フォルダにある`main-gummy.py`ファイルはデフォルトの字幕エンジンのエントリーコードです。`src\main\utils\engine.ts`はサーバー側で字幕エンジンのデータを取得・処理するコードです。必要に応じて字幕エンジンの実装詳細と完全な実行プロセスを理解するために参照してください。
![](../img/02_ja.png)

View File

@@ -1,97 +1,138 @@
# 字幕引擎说明文档
对应版本:v0.5.0
对应版本:v0.6.0
![](../../assets/media/structure_zh.png)
## 字幕引擎介绍
所谓的字幕引擎实际上是一个子程序,它会实时获取系统音频输入(录音)或输出(播放声音)的流式数据,并调用音频转文字的模型生成对应音频的字幕。生成的字幕转换为 JSON 格式的字符串数据,并通过标准输出传递给主程序(需要保证主程序读取到的字符串可以被正确解释为 JSON 对象)。主程序读取并解释字幕数据,处理后显示在窗口上。
所谓的字幕引擎实际上是一个子程序,它会实时获取系统音频输入(麦克风)或输出(扬声器)的流式数据,并调用音频转文字的模型生成对应音频的字幕。生成的字幕转换为 JSON 格式的字符串数据,并通过标准输出传递给主程序(需要保证主程序读取到的字符串可以被正确解释为 JSON 对象)。主程序读取并解释字幕数据,处理后显示在窗口上。
## 字幕引擎需要实现的功能
**字幕引擎进程和 Electron 主进程之间的通信遵循的标准为:[caption engine api-doc](../api-docs/caption-engine.md)。**
## 运行流程
主进程和字幕引擎通信的流程:
### 启动引擎
- 主进程:使用 `child_process.spawn()` 启动字幕引擎进程
- 字幕引擎进程:创建 TCP Socket 服务器线程,创建后在标准输出中输出转化为字符串的 JSON 对象,该对象中包含 `command` 字段,值为 `connect`
- 主进程:监听字幕引擎进程的标准输出,尝试将标准输出按行分割,解析为 JSON 对象,并判断对象的 `command` 字段值是否为 `connect`,如果是则连接 TCP Socket 服务器
### 字幕识别
- 字幕引擎进程:在主线程监听系统音频输出,并将音频数据块发送给字幕引擎解析,字幕引擎解析音频数据块,通过标准输出发送解析的字幕数据对象字符串
- 主进程:继续监听字幕引擎的标准输出,并根据解析的对象的 `command` 字段采取不同的操作
### 关闭引擎
- 主进程:当用户在前端操作关闭字幕引擎时,主进程通过 Socket 通信给字幕引擎进程发送 `command` 字段为 `stop` 的对象字符串
- 字幕引擎进程:接收主进程发送的对象字符串,将字符串解析为对象,如果对象的 `command` 字段为 `stop`,则将全局变量 `thread_data.status` 的值设置为 `stop`
- 字幕引擎进程:主线程循环监听系统音频输出,当 `thread_data.status` 的值不为 `running` 时,则结束循环,释放资源,结束运行
- 主进程:如果检测到字幕引擎进程结束,进行相应处理,并向前端反馈
## 项目已经实现的功能
以下功能已经实现,可以直接复用。
### 标准输出
可以输出普通信息、命令和错误信息。
样例:
```python
from utils import stdout, stdout_cmd, stdout_obj, stderr
stdout("Hello") # {"command": "print", "content": "Hello"}\n
stdout_cmd("connect", "8080") # {"command": "connect", "content": "8080"}\n
stdout_obj({"command": "print", "content": "Hello"})
stderr("Error Info")
```
### 创建 Socket 服务
该 Socket 服务会监听指定端口,会解析 Electron 主程序发送的内容,并可能改变 `thread_data.status` 的值。
样例:
```python
from utils import start_server
from utils import thread_data
port = 8080
start_server(port)
while thread_data == 'running':
    # do something
    pass
```
### 音频获取
首先,你的字幕引擎需要获取系统音频输入(录音)或输出(播放声音)的流式数据。如果使用 Python 开发,可以使用 PyAudio 库获取麦克风音频输入数据(全平台通用)。使用 PyAudioWPatch 库获取系统音频输出(仅适用于 Windows 平台)。
`AudioStream` 类用于获取音频数据,实现是跨平台的,支持 Windows、Linux 和 macOS。该类初始化包含两个参数
一般获取音频流数据实际上是一个一个的时间比较短的音频块,需要根据模型调整音频块的大小。比如阿里云的 Gummy 模型使用 0.05 秒大小的音频块识别效果优于使用 0.2 秒大小的音频块。
- `audio_type`: 获取音频的类型,0 表示系统输出音频(扬声器),1 表示系统输入音频(麦克风)
- `chunk_rate`: 音频数据获取频率,即每秒获取的音频块的数量
该类包含三个方法:
- `open_stream()`: 开启音频获取
- `read_chunk() -> bytes`: 读取一个音频块
- `close_stream()`: 关闭音频获取
样例:
```python
from sysaudio import AudioStream
audio_type = 0
chunk_rate = 20
stream = AudioStream(audio_type, chunk_rate)
stream.open_stream()
while True:
    data = stream.read_chunk()
    # do something with data
    pass
stream.close_stream()
```
### 音频处理
获取到的音频流在转文字之前可能需要进行预处理。比如阿里云的 Gummy 模型只能识别单通道的音频流,而收集的音频流一般是双通道的,因此要将通道音频转换为单通道。通道数的转换可以使用 NumPy 库中的方法实现。
获取到的音频流在转文字之前可能需要进行预处理。一般需要将多通道音频转换为单通道音频,还可能需要进行重采样。本项目提供了三个音频处理函数:
你可以直接使用我开发好的音频获取(`caption-engine/sysaudio`)和音频处理(`caption-engine/audioprcs`)模块。
- `merge_chunk_channels(chunk: bytes, channels: int) -> bytes`:将多通道音频块转换为单通道音频块
- `resample_chunk_mono(chunk: bytes, channels: int, orig_sr: int, target_sr: int, mode="sinc_best") -> bytes`:将当前多通道音频数据块转换成单通道音频数据块,然后进行重采样
- `resample_mono_chunk(chunk: bytes, orig_sr: int, target_sr: int, mode="sinc_best") -> bytes`:将当前单通道音频块进行重采样
## 字幕引擎需要实现的功能
### 音频转文字
在得到了合适的音频流后,就需要将音频流转换为文字。一般使用各种模型(云端或本地)来实现音频流转文字,需根据需求选择合适的模型。
这部分建议封装为一个类,需要实现三个方法(最小骨架示例见下方):
- `start(self)`:启动模型
- `send_audio_frame(self, data: bytes)`:处理当前音频块数据,**生成的字幕数据通过标准输出发送给 Electron 主进程**
- `stop(self)`:停止模型
完整的字幕引擎实现示例如下:
- [gummy.py](../../engine/audio2text/gummy.py)
- [vosk.py](../../engine/audio2text/vosk.py)
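下面是这类封装的一个最小骨架(`MyRecognizer` 为示意名称,识别逻辑需替换为实际的模型调用,字幕输出复用本项目的 `stdout_cmd` / `stdout_obj` 工具函数):
```python
from datetime import datetime
from utils import stdout_cmd, stdout_obj

class MyRecognizer:
    """示意用的音频转文字封装,识别部分需替换为实际模型调用"""
    def __init__(self):
        self.index = 0

    def start(self):
        stdout_cmd('info', 'My recognizer started.')

    def send_audio_frame(self, data: bytes):
        text = ''  # TODO: 将 data 交给实际的语音识别模型,得到当前句子文本
        if not text:
            return
        now = datetime.now().strftime('%H:%M:%S.%f')[:-3]
        stdout_obj({
            'command': 'caption',
            'index': self.index,   # 同一句话更新时应保持同一序号
            'time_s': now,         # 实际实现中应记录该句开始的时间
            'time_t': now,
            'text': text,
            'translation': ''
        })

    def stop(self):
        stdout_cmd('info', 'My recognizer closed.')
```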
### 字幕翻译
有的语音转文字模型并不提供翻译,如果有翻译需求,需要再添加一个翻译模块。这部分可以使用云端翻译 API,也可以使用本地翻译模型。
### 字幕数据发送
在获取到当前音频流对应的文字后,需要将文字发送给主程序。字幕引擎进程通过标准输出将字幕数据传递给 Electron 主进程。
传递的内容必须是 JSON 字符串,其中 JSON 对象需要包含的参数如下:
```typescript
export interface CaptionItem {
command: "caption",
index: number, // 字幕序号
time_s: string, // 当前字幕开始时间
time_t: string, // 当前字幕结束时间
  text: string,        // 字幕原文
  translation: string  // 字幕译文
}
```
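例如,一条字幕数据写入标准输出时形如(字段值仅为示意):`{"command": "caption", "index": 3, "time_s": "10:21:09.100", "time_t": "10:21:10.412", "text": "Hello world", "translation": "你好,世界"}`,且每条数据独占一行。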
**注意:必须确保每输出一条字幕 JSON 数据后立即刷新标准输出缓冲区,使 Electron 主进程接收到的每一行字符串都可以被解析为 JSON 对象。**
如果使用 Python 语言,可以参考以下方式将数据传递给主程序:
```python
# 将标准输出设置为行缓冲,保证每行 JSON 数据及时发送
sys.stdout.reconfigure(line_buffering=True)

...
def send_to_node(self, data):
    """
    将数据发送到 Node.js 进程
    """
    try:
        json_data = json.dumps(data) + '\n'
        sys.stdout.write(json_data)
        sys.stdout.flush()
    except Exception as e:
        print(f"Error sending data to Node.js: {e}", file=sys.stderr)
...
```
数据接收端代码如下:
```typescript
// src\main\utils\engine.ts
...
this.process.stdout.on('data', (data) => {
    const lines = data.toString().split('\n');
    lines.forEach((line: string) => {
        if (line.trim()) {
            try {
                const caption = JSON.parse(line);
                addCaptionLog(caption);
            } catch (e) {
                controlWindow.sendErrorMessage('字幕引擎输出内容无法解析为 JSON 对象:' + e)
                console.error('[ERROR] Error parsing JSON:', e);
            }
        }
    });
});

this.process.stderr.on('data', (data) => {
    controlWindow.sendErrorMessage('字幕引擎错误:' + data)
    console.error(`[ERROR] Subprocess Error: ${data}`);
});
...
```
建议直接使用本项目已经实现的 `stdout_obj` 函数来发送字幕数据。
## 字幕引擎的使用
### 命令行参数的指定
自定义字幕引擎的设置通过命令行参数指定,因此需要设计好字幕引擎的参数。本项目目前用到的参数如下:
```python
import argparse
...
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Convert system audio stream to text')
    # both
    parser.add_argument('-e', '--caption_engine', default='gummy', help='Caption engine: gummy or vosk')
    parser.add_argument('-a', '--audio_type', default=0, help='Audio stream source: 0 for output, 1 for input')
    parser.add_argument('-c', '--chunk_rate', default=20, help='Number of audio stream chunks collected per second')
    parser.add_argument('-p', '--port', default=8080, help='The port to run the server on, 0 for no server')
    # gummy only
    parser.add_argument('-s', '--source_language', default='en', help='Source language code')
    parser.add_argument('-t', '--target_language', default='zh', help='Target language code')
    parser.add_argument('-k', '--api_key', default='', help='API KEY for Gummy model')
    # vosk only
    parser.add_argument('-m', '--model_path', default='', help='The path to the vosk model.')
    args = parser.parse_args()
```
比如对于本项目的字幕引擎,我想使用 Gummy 模型,指定原文为日语,翻译为中文,获取系统音频输出的字幕,每次截取 0.1 秒的音频数据,那么命令行参数如下:
```bash
python main.py -e gummy -s ja -t zh -a 0 -c 10 -k <dashscope-api-key>
```
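若要使用 Vosk 本地模型,可将参数换为类似 `python main.py -e vosk -a 0 -c 20 -m <vosk-model-path>` 的形式(模型路径仅为示例,需替换为实际下载的模型目录)。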
## 其他
### 通信规范
[caption engine api-doc](../api-docs/caption-engine.md)
### 程序入口
[main.py](../../engine/main.py)
### 开发建议
除音频转文字模块外,其余部分建议直接复用本项目代码。在这种情况下,需要添加的内容为(示意代码见下方):
- `engine/audio2text/`:添加新的音频转文字类(文件级别)
- `engine/main.py`:添加新参数设置、流程函数(参考 `main_gummy` 函数和 `main_vosk` 函数)
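例如,为新引擎添加流程函数时,可以参照如下骨架(`MyRecognizer`、`main_custom` 均为示意名称,写法参考 `main_gummy` 与 `main_vosk`):
```python
from utils import thread_data, merge_chunk_channels
from sysaudio import AudioStream
from audio2text import MyRecognizer  # 假设你在 audio2text 中新增了该类

def main_custom(audio_type: int, chunk_rate: int):
    stream = AudioStream(audio_type, chunk_rate)
    engine = MyRecognizer()
    stream.open_stream()
    engine.start()
    while thread_data.status == "running":
        try:
            chunk = stream.read_chunk()
            if chunk is None:
                continue
            engine.send_audio_frame(merge_chunk_channels(chunk, stream.CHANNELS))
        except KeyboardInterrupt:
            break
    stream.close_stream()
    engine.stop()
```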
### 打包
在完成字幕引擎的开发和测试后,需要将字幕引擎打包成可执行文件,一般使用 `pyinstaller` 进行打包。如果打包好的可执行文件运行报错,很可能是打包时遗漏了某些依赖库或资源文件,请逐一检查。
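例如,一个常见的打包命令形如 `pyinstaller --onefile main.py`(参数仅为常见用法示意);若运行时提示缺少 vosk 等库的资源文件,可参考本项目的 spec 文件,通过 `datas` 把对应目录一并打包。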
有了可以使用的字幕引擎,就可以在字幕软件窗口中通过指定字幕引擎的路径和字幕引擎的运行指令(参数)来启动字幕引擎了。
![](../img/02_zh.png)
## 参考代码
本项目 `engine` 文件夹下的 `main.py` 文件为默认字幕引擎的入口代码,`src\main\utils\engine.ts` 为主进程接收并处理字幕引擎数据的代码。可以根据需要阅读,了解字幕引擎的实现细节和完整运行过程。

Binary file not shown.


View File

@@ -1,6 +1,8 @@
# Auto Caption User Manual
Corresponding Version: v0.5.0
Corresponding Version: v0.6.0
**Note: Due to limited personal resources, the English and Japanese documentation files for this project (except for the README document) will no longer be maintained. The content of this document may not be consistent with the latest version of the project. If you are willing to help with translation, please submit relevant Pull Requests.**
## Software Introduction
@@ -16,6 +18,7 @@ The following operating system versions have been tested and confirmed to work p
| macOS Sequoia 15.5 | arm64 | ✅ Additional config required | ✅ |
| Ubuntu 24.04.2 | x64 | ✅ | ✅ |
| Kali Linux 2022.3 | x64 | ✅ | ✅ |
| Kylin Server V10 SP3 | x64 | ✅ | ✅ |
![](../../assets/media/main_en.png)

View File

@@ -1,9 +1,11 @@
# Auto Caption ユーザーマニュアル
対応バージョンv0.5.0
対応バージョンv0.6.0
この文書は大規模モデルを使用して翻訳されていますので、内容に正確でない部分があるかもしれません。
**注意個人のリソースが限られているため、このプロジェクトの英語および日本語のドキュメントREADME ドキュメントを除く)のメンテナンスは行われません。このドキュメントの内容は最新版のプロジェクトと一致しない場合があります。翻訳のお手伝いをしていただける場合は、関連するプルリクエストを提出してください。**
## ソフトウェアの概要
Auto Caption は、クロスプラットフォームの字幕表示ソフトウェアで、システムの音声入力(録音)または出力(音声再生)のストリーミングデータをリアルタイムで取得し、音声からテキストに変換するモデルを利用して対応する音声の字幕を生成します。このソフトウェアが提供するデフォルトの字幕エンジン(アリババクラウド Gummy モデルを使用は、9つの言語中国語、英語、日本語、韓国語、ドイツ語、フランス語、ロシア語、スペイン語、イタリア語の認識と翻訳をサポートしています。
@@ -18,6 +20,7 @@ Auto Caption は、クロスプラットフォームの字幕表示ソフトウ
| macOS Sequoia 15.5 | arm64 | ✅ 追加設定が必要 | ✅ |
| Ubuntu 24.04.2 | x64 | ✅ | ✅ |
| Kali Linux 2022.3 | x64 | ✅ | ✅ |
| Kylin Server V10 SP3 | x64 | ✅ | ✅ |
![](../../assets/media/main_ja.png)

View File

@@ -1,6 +1,6 @@
# Auto Caption 用户手册
对应版本v0.5.0
对应版本v0.6.0
## 软件简介
@@ -16,6 +16,7 @@ Auto Caption 是一个跨平台的字幕显示软件,能够实时获取系统
| macOS Sequoia 15.5 | arm64 | ✅需要额外配置 | ✅ |
| Ubuntu 24.04.2 | x64 | ✅ | ✅ |
| Kali Linux 2022.3 | x64 | ✅ | ✅ |
| Kylin Server V10 SP3 | x64 | ✅ | ✅ |
![](../../assets/media/main_zh.png)

View File

@@ -10,21 +10,16 @@ files:
- '!{LICENSE,README.md,README_en.md,README_ja.md}'
- '!{.env,.env.*,.npmrc,pnpm-lock.yaml}'
- '!{tsconfig.json,tsconfig.node.json,tsconfig.web.json}'
- '!caption-engine/*'
- '!engine-test/*'
- '!engine/*'
- '!docs/*'
- '!assets/*'
extraResources:
# For Windows
- from: ./caption-engine/dist/main-gummy.exe
to: ./caption-engine/main-gummy.exe
- from: ./caption-engine/dist/main-vosk.exe
to: ./caption-engine/main-vosk.exe
- from: ./engine/dist/main.exe
to: ./engine/main.exe
# For macOS and Linux
# - from: ./caption-engine/dist/main-gummy
# to: ./caption-engine/main-gummy
# - from: ./caption-engine/dist/main-vosk
# to: ./caption-engine/main-vosk
# - from: ./engine/dist/main
# to: ./engine/main
win:
executableName: auto-caption
icon: build/icon.png

View File

@@ -1,221 +0,0 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from dashscope.audio.asr import * # type: ignore\n",
"import pyaudiowpatch as pyaudio\n",
"import numpy as np\n",
"\n",
"\n",
"def getDefaultSpeakers(mic: pyaudio.PyAudio, info = True):\n",
" \"\"\"\n",
" 获取默认的系统音频输出的回环设备\n",
" Args:\n",
" mic (pyaudio.PyAudio): pyaudio对象\n",
" info (bool, optional): 是否打印设备信息. Defaults to True.\n",
"\n",
" Returns:\n",
" dict: 统音频输出的回环设备\n",
" \"\"\"\n",
" try:\n",
" WASAPI_info = mic.get_host_api_info_by_type(pyaudio.paWASAPI)\n",
" except OSError:\n",
" print(\"Looks like WASAPI is not available on the system. Exiting...\")\n",
" exit()\n",
"\n",
" default_speaker = mic.get_device_info_by_index(WASAPI_info[\"defaultOutputDevice\"])\n",
" if(info): print(\"wasapi_info:\\n\", WASAPI_info, \"\\n\")\n",
" if(info): print(\"default_speaker:\\n\", default_speaker, \"\\n\")\n",
"\n",
" if not default_speaker[\"isLoopbackDevice\"]:\n",
" for loopback in mic.get_loopback_device_info_generator():\n",
" if default_speaker[\"name\"] in loopback[\"name\"]:\n",
" default_speaker = loopback\n",
" if(info): print(\"Using loopback device:\\n\", default_speaker, \"\\n\")\n",
" break\n",
" else:\n",
" print(\"Default loopback output device not found.\")\n",
" print(\"Run `python -m pyaudiowpatch` to check available devices.\")\n",
" print(\"Exiting...\")\n",
" exit()\n",
" \n",
" if(info): print(f\"Recording Device: #{default_speaker['index']} {default_speaker['name']}\")\n",
" return default_speaker\n",
"\n",
"\n",
"class Callback(TranslationRecognizerCallback):\n",
" \"\"\"\n",
" 语音大模型流式传输回调对象\n",
" \"\"\"\n",
" def __init__(self):\n",
" super().__init__()\n",
" self.usage = 0\n",
" self.sentences = []\n",
" self.translations = []\n",
" \n",
" def on_open(self) -> None:\n",
" print(\"\\n流式翻译开始...\\n\")\n",
"\n",
" def on_close(self) -> None:\n",
" print(f\"\\nTokens消耗{self.usage}\")\n",
" print(f\"流式翻译结束...\\n\")\n",
" for i in range(len(self.sentences)):\n",
" print(f\"\\n{self.sentences[i]}\\n{self.translations[i]}\\n\")\n",
"\n",
" def on_event(\n",
" self,\n",
" request_id,\n",
" transcription_result: TranscriptionResult,\n",
" translation_result: TranslationResult,\n",
" usage\n",
" ) -> None:\n",
" if transcription_result is not None:\n",
" id = transcription_result.sentence_id\n",
" text = transcription_result.text\n",
" if transcription_result.stash is not None:\n",
" stash = transcription_result.stash.text\n",
" else:\n",
" stash = \"\"\n",
" print(f\"#{id}: {text}{stash}\")\n",
" if usage: self.sentences.append(text)\n",
" \n",
" if translation_result is not None:\n",
" lang = translation_result.get_language_list()[0]\n",
" text = translation_result.get_translation(lang).text\n",
" if translation_result.get_translation(lang).stash is not None:\n",
" stash = translation_result.get_translation(lang).stash.text\n",
" else:\n",
" stash = \"\"\n",
" print(f\"#{lang}: {text}{stash}\")\n",
" if usage: self.translations.append(text)\n",
" \n",
" if usage: self.usage += usage['duration']"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"采样输入设备:\n",
" - 序号26\n",
" - 名称:耳机 (HUAWEI FreeLace 活力版) [Loopback]\n",
" - 最大输入通道数2\n",
" - 默认低输入延迟0.003s\n",
" - 默认高输入延迟0.01s\n",
" - 默认采样率48000.0Hz\n",
" - 是否回环设备True\n",
"\n",
"音频样本块大小4800\n",
"样本位宽2\n",
"音频数据格式8\n",
"音频通道数2\n",
"音频采样率48000\n",
"\n"
]
}
],
"source": [
"mic = pyaudio.PyAudio()\n",
"default_speaker = getDefaultSpeakers(mic, False)\n",
"\n",
"SAMP_WIDTH = pyaudio.get_sample_size(pyaudio.paInt16)\n",
"FORMAT = pyaudio.paInt16\n",
"CHANNELS = default_speaker[\"maxInputChannels\"]\n",
"RATE = int(default_speaker[\"defaultSampleRate\"])\n",
"CHUNK = RATE // 10\n",
"INDEX = default_speaker[\"index\"]\n",
"\n",
"dev_info = f\"\"\"\n",
"采样输入设备:\n",
" - 序号:{default_speaker['index']}\n",
" - 名称:{default_speaker['name']}\n",
" - 最大输入通道数:{default_speaker['maxInputChannels']}\n",
" - 默认低输入延迟:{default_speaker['defaultLowInputLatency']}s\n",
" - 默认高输入延迟:{default_speaker['defaultHighInputLatency']}s\n",
" - 默认采样率:{default_speaker['defaultSampleRate']}Hz\n",
" - 是否回环设备:{default_speaker['isLoopbackDevice']}\n",
"\n",
"音频样本块大小:{CHUNK}\n",
"样本位宽:{SAMP_WIDTH}\n",
"音频数据格式:{FORMAT}\n",
"音频通道数:{CHANNELS}\n",
"音频采样率:{RATE}\n",
"\"\"\"\n",
"print(dev_info)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"RECORD_SECONDS = 20 # 监听时长(s)\n",
"\n",
"stream = mic.open(\n",
" format = FORMAT,\n",
" channels = CHANNELS,\n",
" rate = RATE,\n",
" input = True,\n",
" input_device_index = INDEX\n",
")\n",
"translator = TranslationRecognizerRealtime(\n",
" model = \"gummy-realtime-v1\",\n",
" format = \"pcm\",\n",
" sample_rate = RATE,\n",
" transcription_enabled = True,\n",
" translation_enabled = True,\n",
" source_language = \"ja\",\n",
" translation_target_languages = [\"zh\"],\n",
" callback = Callback()\n",
")\n",
"translator.start()\n",
"\n",
"for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):\n",
" data = stream.read(CHUNK)\n",
" data_np = np.frombuffer(data, dtype=np.int16)\n",
" data_np_r = data_np.reshape(-1, CHANNELS)\n",
" print(data_np_r.shape)\n",
" mono_data = np.mean(data_np_r.astype(np.float32), axis=1)\n",
" mono_data = mono_data.astype(np.int16)\n",
" mono_data_bytes = mono_data.tobytes()\n",
" translator.send_audio_frame(mono_data_bytes)\n",
"\n",
"translator.stop()\n",
"stream.stop_stream()\n",
"stream.close()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "mystd",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@@ -1,189 +0,0 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 7,
"id": "1e12f3ef",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" 采样输入设备:\n",
" - 设备类型:音频输出\n",
" - 序号0\n",
" - 名称BlackHole 2ch\n",
" - 最大输入通道数2\n",
" - 默认低输入延迟0.01s\n",
" - 默认高输入延迟0.1s\n",
" - 默认采样率48000.0Hz\n",
"\n",
" 音频样本块大小2400\n",
" 样本位宽2\n",
" 采样格式8\n",
" 音频通道数2\n",
" 音频采样率48000\n",
" \n"
]
}
],
"source": [
"import sys\n",
"import os\n",
"import wave\n",
"\n",
"current_dir = os.getcwd() \n",
"sys.path.append(os.path.join(current_dir, '../caption-engine'))\n",
"\n",
"from sysaudio.darwin import AudioStream\n",
"from audioprcs import resampleRawChunk, mergeChunkChannels\n",
"\n",
"stream = AudioStream(0)\n",
"stream.printInfo()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "a72914f4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Recording...\n",
"Done\n"
]
}
],
"source": [
"\"\"\"获取系统音频输出5秒然后保存为wav文件\"\"\"\n",
"\n",
"with wave.open('output.wav', 'wb') as wf:\n",
" wf.setnchannels(stream.CHANNELS)\n",
" wf.setsampwidth(stream.SAMP_WIDTH)\n",
" wf.setframerate(stream.RATE)\n",
" stream.openStream()\n",
"\n",
" print('Recording...')\n",
"\n",
" for _ in range(0, 100):\n",
" chunk = stream.read_chunk()\n",
" if isinstance(chunk, bytes):\n",
" wf.writeframes(chunk)\n",
" else:\n",
" raise Exception('Error: chunk is not bytes')\n",
" \n",
" stream.closeStream() \n",
" print('Done')"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "a6e8a098",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Recording...\n",
"Done\n"
]
}
],
"source": [
"\"\"\"获取系统音频输入转换为单通道音频持续5秒然后保存为wav文件\"\"\"\n",
"\n",
"with wave.open('output.wav', 'wb') as wf:\n",
" wf.setnchannels(1)\n",
" wf.setsampwidth(stream.SAMP_WIDTH)\n",
" wf.setframerate(stream.RATE)\n",
" stream.openStream()\n",
"\n",
" print('Recording...')\n",
"\n",
" for _ in range(0, 100):\n",
" chunk = mergeChunkChannels(\n",
" stream.read_chunk(),\n",
" stream.CHANNELS\n",
" )\n",
" if isinstance(chunk, bytes):\n",
" wf.writeframes(chunk)\n",
" else:\n",
" raise Exception('Error: chunk is not bytes')\n",
" \n",
" stream.closeStream() \n",
" print('Done')"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "aaca1465",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Recording...\n",
"Done\n"
]
}
],
"source": [
"\"\"\"获取系统音频输入转换为单通道音频并重采样到16000Hz持续5秒然后保存为wav文件\"\"\"\n",
"\n",
"with wave.open('output.wav', 'wb') as wf:\n",
" wf.setnchannels(1)\n",
" wf.setsampwidth(stream.SAMP_WIDTH)\n",
" wf.setframerate(16000)\n",
" stream.openStream()\n",
"\n",
" print('Recording...')\n",
"\n",
" for _ in range(0, 100):\n",
" chunk = resampleRawChunk(\n",
" stream.read_chunk(),\n",
" stream.CHANNELS,\n",
" stream.RATE,\n",
" 16000,\n",
" mode=\"sinc_best\"\n",
" )\n",
" if isinstance(chunk, bytes):\n",
" wf.writeframes(chunk)\n",
" else:\n",
" raise Exception('Error: chunk is not bytes')\n",
" \n",
" stream.closeStream() \n",
" print('Done')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -1,124 +0,0 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "6fb12704",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"d:\\Projects\\auto-caption\\caption-engine\\subenv\\Lib\\site-packages\\vosk\\__init__.py\n"
]
}
],
"source": [
"import vosk\n",
"print(vosk.__file__)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "63a06f5c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" 采样设备:\n",
" - 设备类型:音频输入\n",
" - 序号1\n",
" - 名称:麦克风阵列 (Realtek(R) Audio)\n",
" - 最大输入通道数2\n",
" - 默认低输入延迟0.09s\n",
" - 默认高输入延迟0.18s\n",
" - 默认采样率44100.0Hz\n",
" - 是否回环设备False\n",
"\n",
" 音频样本块大小2205\n",
" 样本位宽2\n",
" 采样格式8\n",
" 音频通道数2\n",
" 音频采样率44100\n",
" \n"
]
}
],
"source": [
"import sys\n",
"import os\n",
"import json\n",
"from vosk import Model, KaldiRecognizer\n",
"\n",
"current_dir = os.getcwd() \n",
"sys.path.append(os.path.join(current_dir, '../caption-engine'))\n",
"\n",
"from sysaudio.win import AudioStream\n",
"from audioprcs import resampleRawChunk, mergeChunkChannels\n",
"\n",
"stream = AudioStream(1)\n",
"stream.printInfo()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "5d5a0afa",
"metadata": {},
"outputs": [],
"source": [
"model = Model(os.path.join(\n",
" current_dir,\n",
" '../caption-engine/models/vosk-model-small-cn-0.22'\n",
"))\n",
"recognizer = KaldiRecognizer(model, 16000)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7e9d1530",
"metadata": {},
"outputs": [],
"source": [
"stream.openStream()\n",
"\n",
"for i in range(200):\n",
" chunk = stream.read_chunk()\n",
" chunk_mono = resampleRawChunk(chunk, stream.CHANNELS, stream.RATE, 16000)\n",
" if recognizer.AcceptWaveform(chunk_mono):\n",
" result = json.loads(recognizer.Result())\n",
" print(\"acc:\", result.get(\"text\", \"\"))\n",
" else:\n",
" partial = json.loads(recognizer.PartialResult())\n",
" print(\"else:\", partial.get(\"partial\", \"\"))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "subenv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,3 @@
from dashscope.common.error import InvalidParameter
from .gummy import GummyRecognizer
from .vosk import VoskRecognizer

View File

@@ -6,8 +6,8 @@ from dashscope.audio.asr import (
)
import dashscope
from datetime import datetime
import json
import sys
from utils import stdout_cmd, stdout_obj, stderr
class Callback(TranslationRecognizerCallback):
"""
@@ -15,17 +15,20 @@ class Callback(TranslationRecognizerCallback):
"""
def __init__(self):
super().__init__()
self.index = 0
self.usage = 0
self.cur_id = -1
self.time_str = ''
def on_open(self) -> None:
# print("on_open")
pass
self.usage = 0
self.cur_id = -1
self.time_str = ''
stdout_cmd('info', 'Gummy translator started.')
def on_close(self) -> None:
# print("on_close")
pass
stdout_cmd('info', 'Gummy translator closed.')
stdout_cmd('usage', str(self.usage))
def on_event(
self,
@@ -35,17 +38,17 @@ class Callback(TranslationRecognizerCallback):
usage
) -> None:
caption = {}
if transcription_result is not None:
caption['index'] = transcription_result.sentence_id
caption['text'] = transcription_result.text
if caption['index'] != self.cur_id:
self.cur_id = caption['index']
cur_time = datetime.now().strftime('%H:%M:%S.%f')[:-3]
caption['time_s'] = cur_time
self.time_str = cur_time
else:
caption['time_s'] = self.time_str
if self.cur_id != transcription_result.sentence_id:
self.time_str = datetime.now().strftime('%H:%M:%S.%f')[:-3]
self.cur_id = transcription_result.sentence_id
self.index += 1
caption['command'] = 'caption'
caption['index'] = self.index
caption['time_s'] = self.time_str
caption['time_t'] = datetime.now().strftime('%H:%M:%S.%f')[:-3]
caption['text'] = transcription_result.text
caption['translation'] = ""
if translation_result is not None:
@@ -55,21 +58,11 @@ class Callback(TranslationRecognizerCallback):
if usage:
self.usage += usage['duration']
# print(caption)
self.send_to_node(caption)
if 'text' in caption:
stdout_obj(caption)
def send_to_node(self, data):
"""
将数据发送到 Node.js 进程
"""
try:
json_data = json.dumps(data) + '\n'
sys.stdout.write(json_data)
sys.stdout.flush()
except Exception as e:
print(f"Error sending data to Node.js: {e}", file=sys.stderr)
class GummyTranslator:
class GummyRecognizer:
"""
使用 Gummy 引擎流式处理的音频数据并在标准输出中输出与 Auto Caption 软件可读取的 JSON 字符串数据
@@ -77,8 +70,9 @@ class GummyTranslator:
rate: 音频采样率
source: 源语言代码字符串zh, en, ja
target: 目标语言代码字符串zh, en, ja
api_key: 阿里云百炼平台 API KEY
"""
def __init__(self, rate, source, target, api_key):
def __init__(self, rate: int, source: str, target: str | None, api_key: str | None):
if api_key:
dashscope.api_key = api_key
self.translator = TranslationRecognizerRealtime(
@@ -97,9 +91,12 @@ class GummyTranslator:
self.translator.start()
def send_audio_frame(self, data):
"""发送音频帧"""
"""发送音频帧,擎将自动识别并将识别结果输出到标准输出中"""
self.translator.send_audio_frame(data)
def stop(self):
"""停止 Gummy 引擎"""
self.translator.stop()
try:
self.translator.stop()
except Exception:
return

68
engine/audio2text/vosk.py Normal file
View File

@@ -0,0 +1,68 @@
import json
from datetime import datetime
from vosk import Model, KaldiRecognizer, SetLogLevel
from utils import stdout_cmd, stdout_obj
class VoskRecognizer:
"""
使用 Vosk 引擎流式处理的音频数据,并在标准输出中输出与 Auto Caption 软件可读取的 JSON 字符串数据
初始化参数:
model_path: Vosk 识别模型路径
"""
def __init__(self, model_path: str):
SetLogLevel(-1)
if model_path.startswith('"'):
model_path = model_path[1:]
if model_path.endswith('"'):
model_path = model_path[:-1]
self.model_path = model_path
self.time_str = ''
self.cur_id = 0
self.prev_content = ''
self.model = Model(self.model_path)
self.recognizer = KaldiRecognizer(self.model, 16000)
def start(self):
"""启动 Vosk 引擎"""
stdout_cmd('info', 'Vosk recognizer started.')
def send_audio_frame(self, data: bytes):
"""
发送音频帧给 Vosk 引擎,引擎将自动识别并将识别结果输出到标准输出中
Args:
data: 音频帧数据,采样率必须为 16000Hz
"""
caption = {}
caption['command'] = 'caption'
caption['translation'] = ''
if self.recognizer.AcceptWaveform(data):
content = json.loads(self.recognizer.Result()).get('text', '')
caption['index'] = self.cur_id
caption['text'] = content
caption['time_s'] = self.time_str
caption['time_t'] = datetime.now().strftime('%H:%M:%S.%f')[:-3]
self.prev_content = ''
self.cur_id += 1
else:
content = json.loads(self.recognizer.PartialResult()).get('partial', '')
if content == '' or content == self.prev_content:
return
if self.prev_content == '':
self.time_str = datetime.now().strftime('%H:%M:%S.%f')[:-3]
caption['index'] = self.cur_id
caption['text'] = content
caption['time_s'] = self.time_str
caption['time_t'] = datetime.now().strftime('%H:%M:%S.%f')[:-3]
self.prev_content = content
stdout_obj(caption)
def stop(self):
"""停止 Vosk 引擎"""
stdout_cmd('info', 'Vosk recognizer closed.')

103
engine/main.py Normal file
View File

@@ -0,0 +1,103 @@
import argparse
from utils import stdout_cmd, stderr
from utils import thread_data, start_server
from utils import merge_chunk_channels, resample_chunk_mono
from audio2text import InvalidParameter, GummyRecognizer
from audio2text import VoskRecognizer
from sysaudio import AudioStream
def main_gummy(s: str, t: str, a: int, c: int, k: str):
global thread_data
stream = AudioStream(a, c)
if t == 'none':
engine = GummyRecognizer(stream.RATE, s, None, k)
else:
engine = GummyRecognizer(stream.RATE, s, t, k)
stream.open_stream()
engine.start()
restart_count = 0
while thread_data.status == "running":
try:
chunk = stream.read_chunk()
if chunk is None: continue
chunk_mono = merge_chunk_channels(chunk, stream.CHANNELS)
try:
engine.send_audio_frame(chunk_mono)
except InvalidParameter as e:
restart_count += 1
if restart_count > 8:
stderr(str(e))
thread_data.status = "kill"
break
else:
stdout_cmd('info', f'Gummy engine stopped, trying to restart #{restart_count}')
except KeyboardInterrupt:
break
stream.close_stream()
engine.stop()
def main_vosk(a: int, c: int, m: str):
global thread_data
stream = AudioStream(a, c)
engine = VoskRecognizer(m)
stream.open_stream()
engine.start()
while thread_data.status == "running":
try:
chunk = stream.read_chunk()
if chunk is None: continue
chunk_mono = resample_chunk_mono(chunk, stream.CHANNELS, stream.RATE, 16000)
engine.send_audio_frame(chunk_mono)
except KeyboardInterrupt:
break
stream.close_stream()
engine.stop()
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='Convert system audio stream to text')
# both
parser.add_argument('-e', '--caption_engine', default='gummy', help='Caption engine: gummy or vosk')
parser.add_argument('-a', '--audio_type', default=0, help='Audio stream source: 0 for output, 1 for input')
parser.add_argument('-c', '--chunk_rate', default=20, help='Number of audio stream chunks collected per second')
parser.add_argument('-p', '--port', default=8080, help='The port to run the server on, 0 for no server')
# gummy only
parser.add_argument('-s', '--source_language', default='en', help='Source language code')
parser.add_argument('-t', '--target_language', default='zh', help='Target language code')
parser.add_argument('-k', '--api_key', default='', help='API KEY for Gummy model')
# vosk only
parser.add_argument('-m', '--model_path', default='', help='The path to the vosk model.')
args = parser.parse_args()
if int(args.port) == 0:
thread_data.status = "running"
else:
start_server(int(args.port))
if args.caption_engine == 'gummy':
main_gummy(
args.source_language,
args.target_language,
int(args.audio_type),
int(args.chunk_rate),
args.api_key
)
elif args.caption_engine == 'vosk':
main_vosk(
int(args.audio_type),
int(args.chunk_rate),
args.model_path
)
else:
raise ValueError('Invalid caption engine specified.')
if thread_data.status == "kill":
stdout_cmd('kill')

View File

@@ -9,7 +9,7 @@ else:
vosk_path = str(Path('./subenv/lib/python3.12/site-packages/vosk').resolve())
a = Analysis(
['main-vosk.py'],
['main.py'],
pathex=[],
binaries=[],
datas=[(vosk_path, 'vosk')],
@@ -30,7 +30,7 @@ exe = EXE(
a.binaries,
a.datas,
[],
name='main-vosk',
name='main',
debug=False,
bootloader_ignore_signals=False,
strip=False,
@@ -43,4 +43,5 @@ exe = EXE(
target_arch=None,
codesign_identity=None,
entitlements_file=None,
onefile=True,
)

View File

@@ -1,7 +1,6 @@
dashscope
numpy
samplerate
PyAudio
PyAudioWPatch
vosk
pyinstaller

View File

@@ -0,0 +1,10 @@
import sys
if sys.platform == "win32":
from .win import AudioStream
elif sys.platform == "darwin":
from .darwin import AudioStream
elif sys.platform == "linux":
from .linux import AudioStream
else:
raise NotImplementedError(f"Unsupported platform: {sys.platform}")

View File

@@ -1,11 +1,24 @@
"""获取 MacOS 系统音频输入/输出流"""
import pyaudio
from textwrap import dedent
def get_blackhole_device(mic: pyaudio.PyAudio):
"""
获取 BlackHole 设备
"""
device_count = mic.get_device_count()
for i in range(device_count):
dev_info = mic.get_device_info_by_index(i)
if 'blackhole' in str(dev_info["name"]).lower():
return dev_info
raise Exception("The device containing BlackHole was not found.")
class AudioStream:
"""
获取系统音频流支持 BlackHole 作为系统音频输出捕获
获取系统音频流如果要捕获输出音频支持 BlackHole 作为系统音频输出捕获
初始化参数
audio_type: 0-系统音频输出流需配合 BlackHole1-系统音频输入流
@@ -15,46 +28,40 @@ class AudioStream:
self.audio_type = audio_type
self.mic = pyaudio.PyAudio()
if self.audio_type == 0:
self.device = self.getOutputDeviceInfo()
self.device = get_blackhole_device(self.mic)
else:
self.device = self.mic.get_default_input_device_info()
self.stop_signal = False
self.stream = None
self.SAMP_WIDTH = pyaudio.get_sample_size(pyaudio.paInt16)
self.INDEX = self.device["index"]
self.FORMAT = pyaudio.paInt16
self.CHANNELS = self.device["maxInputChannels"]
self.SAMP_WIDTH = pyaudio.get_sample_size(self.FORMAT)
self.CHANNELS = int(self.device["maxInputChannels"])
self.RATE = int(self.device["defaultSampleRate"])
self.CHUNK = self.RATE // chunk_rate
self.INDEX = self.device["index"]
def getOutputDeviceInfo(self):
"""查找指定关键词的输入设备"""
device_count = self.mic.get_device_count()
for i in range(device_count):
dev_info = self.mic.get_device_info_by_index(i)
if 'blackhole' in dev_info["name"].lower():
return dev_info
raise Exception("The device containing BlackHole was not found.")
def printInfo(self):
def get_info(self):
dev_info = f"""
采样输入设备
采样设备
- 设备类型{ "音频输出" if self.audio_type == 0 else "音频输入" }
- 序号{self.device['index']}
- 名称{self.device['name']}
- 设备序号{self.device['index']}
- 设备名称{self.device['name']}
- 最大输入通道数{self.device['maxInputChannels']}
- 默认低输入延迟{self.device['defaultLowInputLatency']}s
- 默认高输入延迟{self.device['defaultHighInputLatency']}s
- 默认采样率{self.device['defaultSampleRate']}Hz
- 是否回环设备{self.device['isLoopbackDevice']}
音频样本块大小{self.CHUNK}
设备序号{self.INDEX}
样本格式{self.FORMAT}
样本位宽{self.SAMP_WIDTH}
采样格式{self.FORMAT}
音频通道数{self.CHANNELS}
音频采样率{self.RATE}
样本通道数{self.CHANNELS}
样本采样率{self.RATE}
样本块大小{self.CHUNK}
"""
print(dev_info)
return dedent(dev_info).strip()
def openStream(self):
def open_stream(self):
"""
打开并返回系统音频输出流
"""
@@ -72,14 +79,24 @@ class AudioStream:
"""
读取音频数据
"""
if self.stop_signal:
self.close_stream()
return None
if not self.stream: return None
return self.stream.read(self.CHUNK, exception_on_overflow=False)
def closeStream(self):
def close_stream_signal(self):
"""
关闭系统音频输
线程安全的关闭系统音频输不一定会立即关闭
"""
if self.stream is None: return
self.stream.stop_stream()
self.stream.close()
self.stream = None
self.stop_signal = True
def close_stream(self):
"""
立即关闭系统音频输入流
"""
if self.stream is not None:
self.stream.stop_stream()
self.stream.close()
self.stream = None
self.stop_signal = False

View File

@@ -1,8 +1,10 @@
"""获取 Linux 系统音频输入流"""
import subprocess
from textwrap import dedent
def findMonitorSource():
def find_monitor_source():
result = subprocess.run(
["pactl", "list", "short", "sources"],
stdout=subprocess.PIPE, text=True
@@ -16,7 +18,8 @@ def findMonitorSource():
raise RuntimeError("System output monitor device not found")
def findInputSource():
def find_input_source():
result = subprocess.run(
["pactl", "list", "short", "sources"],
stdout=subprocess.PIPE, text=True
@@ -28,8 +31,10 @@ def findInputSource():
name = parts[1]
if ".monitor" not in name:
return name
raise RuntimeError("Microphone input device not found")
class AudioStream:
"""
获取系统音频流
@@ -42,34 +47,33 @@ class AudioStream:
self.audio_type = audio_type
if self.audio_type == 0:
self.source = findMonitorSource()
self.source = find_monitor_source()
else:
self.source = findInputSource()
self.source = find_input_source()
self.stop_signal = False
self.process = None
self.SAMP_WIDTH = 2
self.FORMAT = 16
self.SAMP_WIDTH = 2
self.CHANNELS = 2
self.RATE = 48000
self.CHUNK = self.RATE // chunk_rate
def printInfo(self):
def get_info(self):
dev_info = f"""
音频捕获进程
- 捕获类型{"音频输出" if self.audio_type == 0 else "音频输入"}
- 设备源{self.source}
- 捕获进程PID{self.process.pid if self.process else "None"}
- 捕获进程 PID{self.process.pid if self.process else "None"}
音频样本块大小{self.CHUNK}
样本格式{self.FORMAT}
样本位宽{self.SAMP_WIDTH}
采样格式{self.FORMAT}
音频通道数{self.CHANNELS}
音频采样率{self.RATE}
样本通道数{self.CHANNELS}
样本采样率{self.RATE}
样本块大小{self.CHUNK}
"""
print(dev_info)
def openStream(self):
def open_stream(self):
"""
启动音频捕获进程
"""
@@ -82,13 +86,23 @@ class AudioStream:
"""
读取音频数据
"""
if self.process:
if self.stop_signal:
self.close_stream()
return None
if self.process and self.process.stdout:
return self.process.stdout.read(self.CHUNK)
return None
def closeStream(self):
def close_stream_signal(self):
"""
线程安全的关闭系统音频输入流不一定会立即关闭
"""
self.stop_signal = True
def close_stream(self):
"""
关闭系统音频捕获进程
"""
if self.process:
self.process.terminate()
self.stop_signal = False

View File

@@ -1,14 +1,15 @@
"""获取 Windows 系统音频输入/输出流"""
import pyaudiowpatch as pyaudio
from textwrap import dedent
def getDefaultLoopbackDevice(mic: pyaudio.PyAudio, info = True)->dict:
def get_default_loopback_device(mic: pyaudio.PyAudio, info = True)->dict:
"""
获取默认的系统音频输出的回环设备
Args:
mic (pyaudio.PyAudio): pyaudio对象
info (bool, optional): 是否打印设备信息
mic: pyaudio对象
info: 是否打印设备信息
Returns:
dict: 系统音频输出的回环设备
@@ -51,38 +52,40 @@ class AudioStream:
self.audio_type = audio_type
self.mic = pyaudio.PyAudio()
if self.audio_type == 0:
self.device = getDefaultLoopbackDevice(self.mic, False)
self.device = get_default_loopback_device(self.mic, False)
else:
self.device = self.mic.get_default_input_device_info()
self.stop_signal = False
self.stream = None
self.SAMP_WIDTH = pyaudio.get_sample_size(pyaudio.paInt16)
self.INDEX = self.device["index"]
self.FORMAT = pyaudio.paInt16
self.SAMP_WIDTH = pyaudio.get_sample_size(self.FORMAT)
self.CHANNELS = int(self.device["maxInputChannels"])
self.RATE = int(self.device["defaultSampleRate"])
self.CHUNK = self.RATE // chunk_rate
self.INDEX = self.device["index"]
def printInfo(self):
def get_info(self):
dev_info = f"""
采样设备
- 设备类型{ "音频输出" if self.audio_type == 0 else "音频输入" }
- 序号{self.device['index']}
- 名称{self.device['name']}
- 设备序号{self.device['index']}
- 设备名称{self.device['name']}
- 最大输入通道数{self.device['maxInputChannels']}
- 默认低输入延迟{self.device['defaultLowInputLatency']}s
- 默认高输入延迟{self.device['defaultHighInputLatency']}s
- 默认采样率{self.device['defaultSampleRate']}Hz
- 是否回环设备{self.device['isLoopbackDevice']}
音频样本块大小{self.CHUNK}
设备序号{self.INDEX}
样本格式{self.FORMAT}
样本位宽{self.SAMP_WIDTH}
采样格式{self.FORMAT}
音频通道数{self.CHANNELS}
音频采样率{self.RATE}
样本通道数{self.CHANNELS}
样本采样率{self.RATE}
样本块大小{self.CHUNK}
"""
print(dev_info)
return dedent(dev_info).strip()
def openStream(self):
def open_stream(self):
"""
打开并返回系统音频输出流
"""
@@ -96,18 +99,28 @@ class AudioStream:
)
return self.stream
def read_chunk(self):
def read_chunk(self) -> bytes | None:
"""
读取音频数据
"""
if self.stop_signal:
self.close_stream()
return None
if not self.stream: return None
return self.stream.read(self.CHUNK, exception_on_overflow=False)
def closeStream(self):
def close_stream_signal(self):
"""
关闭系统音频输
线程安全的关闭系统音频输不一定会立即关闭
"""
if self.stream is None: return
self.stream.stop_stream()
self.stream.close()
self.stream = None
self.stop_signal = True
def close_stream(self):
"""
关闭系统音频输入流
"""
if self.stream is not None:
self.stream.stop_stream()
self.stream.close()
self.stream = None
self.stop_signal = False

4
engine/utils/__init__.py Normal file
View File

@@ -0,0 +1,4 @@
from .audioprcs import merge_chunk_channels, resample_chunk_mono, resample_mono_chunk
from .sysout import stdout, stdout_cmd, stdout_obj, stderr
from .thdata import thread_data
from .server import start_server

View File

@@ -1,17 +1,19 @@
import samplerate
import numpy as np
import numpy.core.multiarray # do not remove
def mergeChunkChannels(chunk, channels):
def merge_chunk_channels(chunk: bytes, channels: int) -> bytes:
"""
将当前多通道音频数据块转换为单通道音频数据块
Args:
chunk: (bytes)多通道音频数据块
chunk: 多通道音频数据块
channels: 通道数
Returns:
(bytes)单通道音频数据块
单通道音频数据块
"""
if channels == 1: return chunk
# (length * channels,)
chunk_np = np.frombuffer(chunk, dtype=np.int16)
# (length, channels)
@@ -22,46 +24,52 @@ def mergeChunkChannels(chunk, channels):
return chunk_mono.tobytes()
def resampleRawChunk(chunk, channels, orig_sr, target_sr, mode="sinc_best"):
def resample_chunk_mono(chunk: bytes, channels: int, orig_sr: int, target_sr: int, mode="sinc_best") -> bytes:
"""
将当前多通道音频数据块转换成单通道音频数据块然后进行重采样
Args:
chunk: (bytes)多通道音频数据块
chunk: 多通道音频数据块
channels: 通道数
orig_sr: 原始采样率
target_sr: 目标采样率
mode: 重采样模式可选'sinc_best' | 'sinc_medium' | 'sinc_fastest' | 'zero_order_hold' | 'linear'
Return:
(bytes)单通道音频数据块
单通道音频数据块
"""
# (length * channels,)
chunk_np = np.frombuffer(chunk, dtype=np.int16)
# (length, channels)
chunk_np = chunk_np.reshape(-1, channels)
# (length,)
chunk_mono_f = np.mean(chunk_np.astype(np.float32), axis=1)
chunk_mono = chunk_mono_f.astype(np.int16)
if channels == 1:
chunk_mono = np.frombuffer(chunk, dtype=np.int16)
chunk_mono = chunk_mono.astype(np.float32)
else:
# (length * channels,)
chunk_np = np.frombuffer(chunk, dtype=np.int16)
# (length, channels)
chunk_np = chunk_np.reshape(-1, channels)
# (length,)
chunk_mono = np.mean(chunk_np.astype(np.float32), axis=1)
ratio = target_sr / orig_sr
chunk_mono_r = samplerate.resample(chunk_mono, ratio, converter_type=mode)
chunk_mono_r = samplerate.resample(chunk_mono, ratio, converter_type=mode)
chunk_mono_r = np.round(chunk_mono_r).astype(np.int16)
return chunk_mono_r.tobytes()
def resampleMonoChunk(chunk, orig_sr, target_sr, mode="sinc_best"):
def resample_mono_chunk(chunk: bytes, orig_sr: int, target_sr: int, mode="sinc_best") -> bytes:
"""
将当前单通道音频块进行重采样
Args:
chunk: (bytes)单通道音频数据块
chunk: 单通道音频数据块
orig_sr: 原始采样率
target_sr: 目标采样率
mode: 重采样模式可选'sinc_best' | 'sinc_medium' | 'sinc_fastest' | 'zero_order_hold' | 'linear'
Return:
(bytes)单通道音频数据块
单通道音频数据块
"""
chunk_np = np.frombuffer(chunk, dtype=np.int16)
chunk_np = chunk_np.astype(np.float32)
ratio = target_sr / orig_sr
chunk_r = samplerate.resample(chunk_np, ratio, converter_type=mode)
chunk_r = np.round(chunk_r).astype(np.int16)

41
engine/utils/server.py Normal file
View File

@@ -0,0 +1,41 @@
import socket
import threading
import json
from utils import thread_data, stdout_cmd, stderr
def handle_client(client_socket):
global thread_data
while thread_data.status == 'running':
try:
data = client_socket.recv(4096).decode('utf-8')
if not data:
break
data = json.loads(data)
if data['command'] == 'stop':
thread_data.status = 'stop'
break
except Exception as e:
stderr(f'Communication error: {e}')
break
thread_data.status = 'stop'
client_socket.close()
def start_server(port: int):
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
server.bind(('localhost', port))
server.listen(1)
except Exception as e:
stderr(str(e))
stdout_cmd('kill')
return
stdout_cmd('connect')
client, addr = server.accept()
client_handler = threading.Thread(target=handle_client, args=(client,))
client_handler.daemon = True
client_handler.start()

18
engine/utils/sysout.py Normal file
View File

@@ -0,0 +1,18 @@
import sys
import json
def stdout(text: str):
stdout_cmd("print", text)
def stdout_cmd(command: str, content = ""):
msg = { "command": command, "content": content }
sys.stdout.write(json.dumps(msg) + "\n")
sys.stdout.flush()
def stdout_obj(obj):
sys.stdout.write(json.dumps(obj) + "\n")
sys.stdout.flush()
def stderr(text: str):
sys.stderr.write(text + "\n")
sys.stderr.flush()

5
engine/utils/thdata.py Normal file
View File

@@ -0,0 +1,5 @@
class ThreadData:
def __init__(self):
self.status = "running"
thread_data = ThreadData()

879
package-lock.json generated

File diff suppressed because it is too large Load Diff

View File

@@ -1,7 +1,7 @@
{
"name": "auto-caption",
"productName": "Auto Caption",
"version": "0.5.0",
"version": "0.6.0",
"description": "A cross-platform subtitle display software.",
"main": "./out/main/index.js",
"author": "himeditator",
@@ -35,6 +35,7 @@
"@electron-toolkit/eslint-config-ts": "^3.0.0",
"@electron-toolkit/tsconfig": "^1.0.1",
"@types/node": "^22.14.1",
"@types/pidusage": "^2.0.5",
"@vitejs/plugin-vue": "^5.2.3",
"electron": "^35.1.5",
"electron-builder": "^25.1.8",

View File

@@ -3,6 +3,7 @@ import path from 'path'
import { is } from '@electron-toolkit/utils'
import icon from '../../build/icon.png?asset'
import { controlWindow } from './ControlWindow'
import { allConfig } from './utils/AllConfig'
class CaptionWindow {
window: BrowserWindow | undefined;
@@ -10,7 +11,7 @@ class CaptionWindow {
public createWindow(): void {
this.window = new BrowserWindow({
icon: icon,
width: 900,
width: allConfig.captionWindowWidth,
height: 100,
minWidth: 480,
show: false,
@@ -30,6 +31,12 @@ class CaptionWindow {
this.window?.show()
})
this.window.on('close', () => {
if(this.window) {
allConfig.captionWindowWidth = this.window?.getBounds().width;
}
})
this.window.on('closed', () => {
this.window = undefined
})

View File

@@ -85,12 +85,13 @@ class ControlWindow {
ipcMain.handle('control.engine.info', async () => {
const info: EngineInfo = {
pid: 0, ppid: 0, cpu: 0, mem: 0, elapsed: 0
pid: 0, ppid: 0, port: 0, cpu: 0, mem: 0, elapsed: 0
}
if(captionEngine.processStatus !== 'running') return info
if(captionEngine.status !== 'running') return info
const stats = await pidusage(captionEngine.process.pid)
info.pid = stats.pid
info.ppid = stats.ppid
info.port = captionEngine.port
info.cpu = stats.cpu
info.mem = stats.memory
info.elapsed = stats.elapsed

View File

@@ -25,7 +25,7 @@ app.whenReady().then(() => {
})
app.on('will-quit', async () => {
captionEngine.stop()
captionEngine.kill()
allConfig.writeConfig()
});

View File

@@ -58,6 +58,7 @@ export interface FullConfig {
export interface EngineInfo {
pid: number,
ppid: number,
port:number,
cpu: number,
mem: number,
elapsed: number

View File

@@ -2,6 +2,7 @@ import {
UILanguage, UITheme, Styles, Controls,
CaptionItem, FullConfig
} from '../types'
import { Log } from './Log'
import { app, BrowserWindow } from 'electron'
import * as path from 'path'
import * as fs from 'fs'
@@ -43,11 +44,15 @@ const defaultControls: Controls = {
class AllConfig {
captionWindowWidth: number = 900;
uiLanguage: UILanguage = 'zh';
leftBarWidth: number = 8;
uiTheme: UITheme = 'system';
styles: Styles = {...defaultStyles};
controls: Controls = {...defaultControls};
lastLogIndex: number = -1;
captionLog: CaptionItem[] = [];
constructor() {}
@@ -56,17 +61,19 @@ class AllConfig {
const configPath = path.join(app.getPath('userData'), 'config.json')
if(fs.existsSync(configPath)){
const config = JSON.parse(fs.readFileSync(configPath, 'utf-8'))
if(config.captionWindowWidth) this.captionWindowWidth = config.captionWindowWidth
if(config.uiLanguage) this.uiLanguage = config.uiLanguage
if(config.uiTheme) this.uiTheme = config.uiTheme
if(config.leftBarWidth) this.leftBarWidth = config.leftBarWidth
if(config.styles) this.setStyles(config.styles)
if(config.controls) this.setControls(config.controls)
console.log('[INFO] Read Config from:', configPath)
Log.info('Read Config from:', configPath)
}
}
public writeConfig() {
const config = {
captionWindowWidth: this.captionWindowWidth,
uiLanguage: this.uiLanguage,
uiTheme: this.uiTheme,
leftBarWidth: this.leftBarWidth,
@@ -75,7 +82,7 @@ class AllConfig {
}
const configPath = path.join(app.getPath('userData'), 'config.json')
fs.writeFileSync(configPath, JSON.stringify(config, null, 2))
console.log('[INFO] Write Config to:', configPath)
Log.info('Write Config to:', configPath)
}
public getFullConfig(): FullConfig {
@@ -96,7 +103,7 @@ class AllConfig {
this.styles[key] = args[key]
}
}
console.log('[INFO] Set Styles:', this.styles)
Log.info('Set Styles:', this.styles)
}
public resetStyles() {
@@ -105,7 +112,7 @@ class AllConfig {
public sendStyles(window: BrowserWindow) {
window.webContents.send('both.styles.set', this.styles)
console.log(`[INFO] Send Styles to #${window.id}:`, this.styles)
Log.info(`Send Styles to #${window.id}:`, this.styles)
}
public setControls(args: Object) {
@@ -116,27 +123,28 @@ class AllConfig {
}
}
this.controls.engineEnabled = engineEnabled
console.log('[INFO] Set Controls:', this.controls)
Log.info('Set Controls:', this.controls)
}
public sendControls(window: BrowserWindow) {
public sendControls(window: BrowserWindow, info = true) {
window.webContents.send('control.controls.set', this.controls)
console.log(`[INFO] Send Controls to #${window.id}:`, this.controls)
if(info) Log.info(`Send Controls to #${window.id}:`, this.controls)
}
public updateCaptionLog(log: CaptionItem) {
let command: 'add' | 'upd' = 'add'
if(
this.captionLog.length &&
this.captionLog[this.captionLog.length - 1].index === log.index &&
this.captionLog[this.captionLog.length - 1].time_s === log.time_s
this.lastLogIndex === log.index
) {
this.captionLog.splice(this.captionLog.length - 1, 1, log)
command = 'upd'
}
else {
this.captionLog.push(log)
this.lastLogIndex = log.index
}
this.captionLog[this.captionLog.length - 1].index = this.captionLog.length
for(const window of BrowserWindow.getAllWindows()){
this.sendCaptionLog(window, command)
}

View File

@@ -1,172 +1,235 @@
import { spawn, exec } from 'child_process'
import { exec, spawn } from 'child_process'
import { app } from 'electron'
import { is } from '@electron-toolkit/utils'
import path from 'path'
import net from 'net'
import { controlWindow } from '../ControlWindow'
import { allConfig } from './AllConfig'
import { i18n } from '../i18n'
import { Log } from './Log'
export class CaptionEngine {
appPath: string = ''
command: string[] = []
process: any | undefined
processStatus: 'running' | 'stopping' | 'stopped' = 'stopped'
client: net.Socket | undefined
port: number = 8080
status: 'running' | 'starting' | 'stopping' | 'stopped' = 'stopped'
timerID: NodeJS.Timeout | undefined
private getApp(): boolean {
allConfig.controls.customized = false
if (allConfig.controls.customized && allConfig.controls.customizedApp) {
if (allConfig.controls.customized) {
Log.info('Using customized caption engine')
this.appPath = allConfig.controls.customizedApp
this.command = [allConfig.controls.customizedCommand]
allConfig.controls.customized = true
this.command = allConfig.controls.customizedCommand.split(' ')
this.port = Math.floor(Math.random() * (65535 - 1024 + 1)) + 1024
this.command.push('-p', this.port.toString())
}
else if (allConfig.controls.engine === 'gummy') {
if(!allConfig.controls.API_KEY && !process.env.DASHSCOPE_API_KEY) {
else {
if(allConfig.controls.engine === 'gummy' &&
!allConfig.controls.API_KEY && !process.env.DASHSCOPE_API_KEY
) {
controlWindow.sendErrorMessage(i18n('gummy.key.missing'))
return false
}
let gummyName = 'main-gummy'
if (process.platform === 'win32') {
gummyName += '.exe'
}
this.command = []
if (is.dev) {
this.appPath = path.join(
app.getAppPath(),
'caption-engine', 'dist', gummyName
)
if(process.platform === "win32") {
this.appPath = path.join(
app.getAppPath(), 'engine',
'subenv', 'Scripts', 'python.exe'
)
this.command.push(path.join(
app.getAppPath(), 'engine', 'main.py'
))
// this.appPath = path.join(app.getAppPath(), 'engine', 'dist', 'main.exe')
}
else {
this.appPath = path.join(
app.getAppPath(), 'engine',
'subenv', 'bin', 'python3'
)
this.command.push(path.join(
app.getAppPath(), 'engine', 'main.py'
))
}
}
else {
this.appPath = path.join(
process.resourcesPath, 'caption-engine', gummyName
)
if(process.platform === 'win32') {
this.appPath = path.join(process.resourcesPath, 'engine', 'main.exe')
}
else {
this.appPath = path.join(process.resourcesPath, 'engine', 'main')
}
}
this.command = []
this.command.push('-s', allConfig.controls.sourceLang)
this.command.push(
'-t', allConfig.controls.translation ?
allConfig.controls.targetLang : 'none'
)
this.command.push('-a', allConfig.controls.audio ? '1' : '0')
if(allConfig.controls.API_KEY) {
this.command.push('-k', allConfig.controls.API_KEY)
this.port = Math.floor(Math.random() * (65535 - 1024 + 1)) + 1024
this.command.push('-p', this.port.toString())
if(allConfig.controls.engine === 'gummy') {
this.command.push('-e', 'gummy')
this.command.push('-s', allConfig.controls.sourceLang)
this.command.push(
'-t', allConfig.controls.translation ?
allConfig.controls.targetLang : 'none'
)
if(allConfig.controls.API_KEY) {
this.command.push('-k', allConfig.controls.API_KEY)
}
}
else if(allConfig.controls.engine === 'vosk'){
this.command.push('-e', 'vosk')
this.command.push('-m', `"${allConfig.controls.modelPath}"`)
}
}
else if(allConfig.controls.engine === 'vosk'){
let voskName = 'main-vosk'
if (process.platform === 'win32') {
voskName += '.exe'
}
if (is.dev) {
this.appPath = path.join(
app.getAppPath(),
'caption-engine', 'dist', voskName
)
}
else {
this.appPath = path.join(
process.resourcesPath, 'caption-engine', voskName
)
}
this.command = []
this.command.push('-a', allConfig.controls.audio ? '1' : '0')
this.command.push('-m', `"${allConfig.controls.modelPath}"`)
}
console.log('[INFO] Engine Path:', this.appPath)
console.log('[INFO] Engine Command:', this.command)
Log.info('Engine Path:', this.appPath)
Log.info('Engine Command:', this.command)
return true
}
public start() {
if (this.processStatus !== 'stopped') {
return
}
if(!this.getApp()){ return }
try {
this.process = spawn(this.appPath, this.command)
}
catch (e) {
controlWindow.sendErrorMessage(i18n('engine.start.error') + e)
console.error('[ERROR] Error starting subprocess:', e)
return
}
this.processStatus = 'running'
console.log('[INFO] Caption Engine Started, PID:', this.process.pid)
public connect() {
Log.info('Connecting to caption engine server...')
if(this.client) { Log.warn('Client already exists, ignoring...') }
this.client = net.createConnection({ port: this.port }, () => {
Log.info('Connected to caption engine server');
});
this.status = 'running'
allConfig.controls.engineEnabled = true
if(controlWindow.window){
allConfig.sendControls(controlWindow.window)
allConfig.sendControls(controlWindow.window, false)
controlWindow.window.webContents.send(
'control.engine.started',
this.process.pid
)
}
}
public sendCommand(command: string, content: string = "") {
if(this.client === undefined) {
Log.error('Client not initialized yet')
return
}
const data = JSON.stringify({command, content})
this.client.write(data);
Log.info(`Send data to python server: ${data}`);
}
public start() {
if (this.status !== 'stopped') {
Log.warn('Caption engine is not stopped, current status:', this.status)
return
}
if(!this.getApp()){ return }
this.process = spawn(this.appPath, this.command)
this.status = 'starting'
Log.info('Caption Engine Starting, PID:', this.process.pid)
this.process.stdout.on('data', (data: any) => {
const lines = data.toString().split('\n');
const lines = data.toString().split('\n')
lines.forEach((line: string) => {
if (line.trim()) {
try {
const caption = JSON.parse(line);
if(caption.index === undefined) {
console.log('[INFO] Engine Bad Output:', caption);
}
else allConfig.updateCaptionLog(caption);
const data_obj = JSON.parse(line)
handleEngineData(data_obj)
} catch (e) {
controlWindow.sendErrorMessage(i18n('engine.output.parse.error') + e)
console.error('[ERROR] Error parsing JSON:', e);
Log.error('Error parsing JSON:', e)
}
}
});
});
this.process.stderr.on('data', (data) => {
if(this.processStatus === 'stopping') return
controlWindow.sendErrorMessage(i18n('engine.error') + data)
console.error(`[ERROR] Subprocess Error: ${data}`);
this.process.stderr.on('data', (data: any) => {
const lines = data.toString().split('\n')
lines.forEach((line: string) => {
if(line.trim()){
controlWindow.sendErrorMessage(/*i18n('engine.error') +*/ line)
console.error(line)
}
})
});
this.process.on('close', (code: any) => {
console.log(`[INFO] Subprocess exited with code ${code}`);
this.process = undefined;
this.client = undefined
allConfig.controls.engineEnabled = false
if(controlWindow.window){
allConfig.sendControls(controlWindow.window)
allConfig.sendControls(controlWindow.window, false)
controlWindow.window.webContents.send('control.engine.stopped')
}
this.processStatus = 'stopped'
console.log('[INFO] Caption engine process stopped')
this.status = 'stopped'
clearInterval(this.timerID)
Log.info(`Engine exited with code ${code}`)
});
}
public stop() {
if(this.processStatus !== 'running') return
if(this.status !== 'running'){
Log.warn('Trying to stop engine which is not running, current status:', this.status)
return
}
this.sendCommand('stop')
if(this.client){
this.client.destroy()
this.client = undefined
}
this.status = 'stopping'
Log.info('Caption engine process stopping...')
this.timerID = setTimeout(() => {
if(this.status !== 'stopping') return
Log.warn('Engine process still not stopped, trying to kill...')
this.kill()
}, 4000);
}
public kill(){
if(!this.process || !this.process.pid) return
if(this.status !== 'running'){
Log.warn('Trying to kill engine which is not running, current status:', this.status)
}
Log.warn('Trying to kill engine process, PID:', this.process.pid)
if(this.client){
this.client.destroy()
this.client = undefined
}
if (this.process.pid) {
console.log('[INFO] Trying to stop process, PID:', this.process.pid)
let cmd = `kill ${this.process.pid}`;
if (process.platform === "win32") {
cmd = `taskkill /pid ${this.process.pid} /t /f`
}
exec(cmd, (error) => {
if (error) {
controlWindow.sendErrorMessage(i18n('engine.shutdown.error') + error)
console.error(`[ERROR] Failed to kill process: ${error}`)
}
})
exec(cmd)
}
else {
this.process = undefined;
allConfig.controls.engineEnabled = false
if(controlWindow.window){
allConfig.sendControls(controlWindow.window)
controlWindow.window.webContents.send('control.engine.stopped')
}
this.processStatus = 'stopped'
console.log('[INFO] Process PID undefined, caption engine process stopped')
return
this.status = 'stopping'
}
}
function handleEngineData(data: any) {
if(data.command === 'connect'){
captionEngine.connect()
}
else if(data.command === 'kill') {
if(captionEngine.status !== 'stopped') {
Log.warn('Error occurred, trying to kill caption engine...')
captionEngine.kill()
}
this.processStatus = 'stopping'
console.log('[INFO] Caption engine process stopping')
}
else if(data.command === 'caption') {
allConfig.updateCaptionLog(data);
}
else if(data.command === 'print') {
Log.info('Engine Print:', data.content)
}
else if(data.command === 'info') {
Log.info('Engine Info:', data.content)
}
else if(data.command === 'usage') {
Log.info('Engine Usage: ', data.content)
}
else {
Log.warn('Unknown command:', data)
}
}

22
src/main/utils/Log.ts Normal file
View File

@@ -0,0 +1,22 @@
function getTimeString() {
const now = new Date()
const HH = String(now.getHours()).padStart(2, '0')
const MM = String(now.getMinutes()).padStart(2, '0')
const SS = String(now.getSeconds()).padStart(2, '0')
const MS = String(now.getMilliseconds()).padStart(3, '0')
return `${HH}:${MM}:${SS}.${MS}`
}
export class Log {
static info(...msg: any[]){
console.log(`[INFO ${getTimeString()}]`, ...msg)
}
static warn(...msg: any[]){
console.warn(`[WARN ${getTimeString()}]`, ...msg)
}
static error(...msg: any[]){
console.error(`[ERROR ${getTimeString()}]`, ...msg)
}
}

View File

@@ -136,6 +136,7 @@ import { useCaptionLogStore } from '@renderer/stores/captionLog'
import { message } from 'ant-design-vue'
import { useI18n } from 'vue-i18n'
import * as tc from '../utils/timeCalc'
import { CaptionItem } from '../types'
const { t } = useI18n()
@@ -154,10 +155,9 @@ const baseMS = ref<number>(0)
const pagination = ref({
current: 1,
pageSize: 10,
pageSize: 20,
showSizeChanger: true,
pageSizeOptions: ['10', '20', '50'],
showTotal: (total: number) => `Total: ${total}`,
pageSizeOptions: ['10', '20', '50', '100'],
onChange: (page: number, pageSize: number) => {
pagination.value.current = page
pagination.value.pageSize = pageSize
@@ -174,12 +174,23 @@ const columns = [
dataIndex: 'index',
key: 'index',
width: 80,
sorter: (a: CaptionItem, b: CaptionItem) => {
if(a.index <= b.index) return -1
return 1
},
sortDirections: ['descend'],
defaultSortOrder: 'descend',
},
{
title: 'time',
dataIndex: 'time',
key: 'time',
width: 160,
sorter: (a: CaptionItem, b: CaptionItem) => {
if(a.time_s <= b.time_s) return -1
return 1
},
sortDirections: ['descend', 'ascend'],
},
{
title: 'content',

View File

@@ -37,7 +37,7 @@
<a-input
class="input-area"
type="range"
min="0" max="64"
min="0" max="72"
v-model:value="currentFontSize"
/>
<div class="input-item-value">{{ currentFontSize }}px</div>
@@ -76,12 +76,12 @@
<div class="input-item">
<span class="input-label">{{ $t('style.preview') }}</span>
<a-switch v-model:checked="currentPreview" />
<span style="display:inline-block;width:20px;"></span>
<span style="display:inline-block;width:10px;"></span>
<div style="display: inline-block;">
<span class="switch-label">{{ $t('style.translation') }}</span>
<a-switch v-model:checked="currentTransDisplay" />
</div>
<span style="display:inline-block;width:20px;"></span>
<span style="display:inline-block;width:10px;"></span>
<div style="display: inline-block;">
<span class="switch-label">{{ $t('style.textShadow') }}</span>
<a-switch v-model:checked="currentTextShadow" />
@@ -114,7 +114,7 @@
<a-input
class="input-area"
type="range"
min="0" max="64"
min="0" max="72"
v-model:value="currentTransFontSize"
/>
<div class="input-item-value">{{ currentTransFontSize }}px</div>
@@ -159,7 +159,7 @@
<a-input
class="input-area"
type="range"
min="0" max="10"
min="0" max="12"
v-model:value="currentBlur"
/>
<div class="input-item-value">{{ currentBlur }}px</div>
@@ -282,7 +282,8 @@ function applyStyle(){
captionStyle.sendStylesChange();
notification.open({
notification.open({
placement: 'topLeft',
message: t('noti.styleChange'),
description: t('noti.styleInfo')
});

View File

@@ -41,14 +41,44 @@
<div class="input-item">
<span class="input-label">{{ $t('engine.enableTranslation') }}</span>
<a-switch v-model:checked="currentTranslation" />
<span style="display:inline-block;width:20px;"></span>
<span style="display:inline-block;width:10px;"></span>
<div style="display: inline-block;">
<span class="switch-label">{{ $t('engine.customEngine') }}</span>
<a-switch v-model:checked="currentCustomized" />
</div>
<span style="display:inline-block;width:10px;"></span>
<div style="display: inline-block;">
<span class="switch-label">{{ $t('engine.showMore') }}</span>
<a-switch v-model:checked="showMore" />
</div>
</div>
<a-card size="small" :title="$t('engine.showMore')" v-show="showMore">
<a-card size="small" :title="$t('engine.custom.title')" v-show="currentCustomized">
<template #extra>
<a-popover>
<template #content>
<p class="customize-note">{{ $t('engine.custom.note') }}</p>
</template>
<a><InfoCircleOutlined />{{ $t('engine.custom.attention') }}</a>
</a-popover>
</template>
<div class="input-item">
<span class="input-label">{{ $t('engine.custom.app') }}</span>
<a-input
class="input-area"
v-model:value="currentCustomizedApp"
></a-input>
</div>
<div class="input-item">
<span class="input-label">{{ $t('engine.custom.command') }}</span>
<a-input
class="input-area"
v-model:value="currentCustomizedCommand"
></a-input>
</div>
</a-card>
<a-card size="small" :title="$t('engine.showMore')" v-show="showMore" style="margin-top:10px;">
<div class="input-item">
<a-popover>
<template #content>
@@ -79,36 +109,6 @@
v-model:value="currentModelPath"
/>
</div>
<div class="input-item">
<span style="margin-right:5px;">{{ $t('engine.customEngine') }}</span>
<a-switch v-model:checked="currentCustomized" />
</div>
<div v-show="currentCustomized">
<a-card size="small" :title="$t('engine.custom.title')">
<template #extra>
<a-popover>
<template #content>
<p class="customize-note">{{ $t('engine.custom.note') }}</p>
</template>
<a><InfoCircleOutlined />{{ $t('engine.custom.attention') }}</a>
</a-popover>
</template>
<div class="input-item">
<span class="input-label">{{ $t('engine.custom.app') }}</span>
<a-input
class="input-area"
v-model:value="currentCustomizedApp"
></a-input>
</div>
<div class="input-item">
<span class="input-label">{{ $t('engine.custom.command') }}</span>
<a-input
class="input-area"
v-model:value="currentCustomizedCommand"
></a-input>
</div>
</a-card>
</div>
</a-card>
</a-card>
<div style="height: 20px;"></div>
@@ -164,6 +164,7 @@ function applyChange(){
engineControl.sendControlsChange()
notification.open({
placement: 'topLeft',
message: t('noti.engineChange'),
description: t('noti.changeInfo')
});
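The custom-engine card in this file only collects an executable path (`customizedApp`) and an argument string (`customizedCommand`); how the main process consumes them is not part of this hunk. One hedged sketch, assuming Node's child_process and a naive whitespace split of the command string (the helper name and import path are illustrative):

```ts
import { spawn } from 'child_process'
import { Log } from './utils/Log' // logger added in this changeset; path is illustrative

// Sketch: launch the user-provided engine. Splitting on whitespace is a
// simplification; quoted arguments would need real shell-style parsing.
function startCustomEngine(app: string, command: string) {
  const args = command.trim() === '' ? [] : command.trim().split(/\s+/)
  const proc = spawn(app, args)
  proc.on('error', (err) => Log.error('Custom engine failed to start:', err))
  return proc
}
```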

View File

@@ -4,10 +4,10 @@
<a-col :span="6">
<a-statistic
:title="$t('status.engine')"
:value="(customized && customizedApp)?$t('status.customized'):engine"
:value="customized?$t('status.customized'):engine"
/>
</a-col>
<a-popover :title="$t('status.engineStatus')">
<a-popover :title="$t('status.engineStatus')">
<template #content>
<a-row class="engine-status">
<a-col :flex="1" :title="$t('status.pid')" style="cursor:pointer;">
@@ -18,6 +18,10 @@
<div class="engine-status-title">ppid</div>
<div>{{ ppid }}</div>
</a-col>
<a-col :flex="1" :title="$t('status.port')" style="cursor:pointer;">
<div class="engine-status-title">port</div>
<div>{{ port }}</div>
</a-col>
<a-col :flex="1" :title="$t('status.cpu')" style="cursor:pointer;">
<div class="engine-status-title">cpu</div>
<div>{{ cpu.toFixed(1) }}%</div>
@@ -41,8 +45,8 @@
<InfoCircleOutlined style="font-size:18px;color:#1677ff"/>
</template>
</a-statistic>
</a-col>
</a-popover>
</a-col>
</a-popover>
<a-col :span="6">
<a-statistic :title="$t('status.logNumber')" :value="captionData.length" />
</a-col>
@@ -61,12 +65,14 @@
>{{ $t('status.openCaption') }}</a-button>
<a-button
class="control-button"
:disabled="engineEnabled"
:loading="pending && !engineEnabled"
:disabled="pending || engineEnabled"
@click="startEngine"
>{{ $t('status.startEngine') }}</a-button>
<a-button
danger class="control-button"
:disabled="!engineEnabled"
:loading="pending && engineEnabled"
:disabled="pending || !engineEnabled"
@click="stopEngine"
>{{ $t('status.stopEngine') }}</a-button>
</div>
@@ -77,7 +83,7 @@
<p class="about-desc">{{ $t('status.about.desc') }}</p>
<a-divider />
<div class="about-info">
<p><b>{{ $t('status.about.version') }}</b><a-tag color="green">v0.5.0</a-tag></p>
<p><b>{{ $t('status.about.version') }}</b><a-tag color="green">v0.6.0</a-tag></p>
<p>
<b>{{ $t('status.about.author') }}</b>
<a
@@ -119,21 +125,23 @@
<script setup lang="ts">
import { EngineInfo } from '@renderer/types'
import { ref } from 'vue'
import { ref, watch } from 'vue'
import { storeToRefs } from 'pinia'
import { useCaptionLogStore } from '@renderer/stores/captionLog'
import { useEngineControlStore } from '@renderer/stores/engineControl'
import { GithubOutlined, InfoCircleOutlined } from '@ant-design/icons-vue';
const showAbout = ref(false)
const pending = ref(false)
const captionLog = useCaptionLogStore()
const { captionData } = storeToRefs(captionLog)
const engineControl = useEngineControlStore()
const { engineEnabled, engine, customized, customizedApp } = storeToRefs(engineControl)
const { engineEnabled, engine, customized, errorSignal } = storeToRefs(engineControl)
const pid = ref(0)
const ppid = ref(0)
const port = ref(0)
const cpu = ref(0)
const mem = ref(0)
const elapsed = ref(0)
@@ -143,6 +151,7 @@ function openCaptionWindow() {
}
function startEngine() {
pending.value = true
if(engineControl.engine === 'vosk' && engineControl.modelPath.trim() === '') {
engineControl.emptyModelPathErr()
return
@@ -151,6 +160,7 @@ function startEngine() {
}
function stopEngine() {
pending.value = true
window.electron.ipcRenderer.send('control.engine.stop')
}
@@ -158,12 +168,21 @@ function getEngineInfo() {
window.electron.ipcRenderer.invoke('control.engine.info').then((data: EngineInfo) => {
pid.value = data.pid
ppid.value = data.ppid
port.value = data.port
cpu.value = data.cpu
mem.value = data.mem
elapsed.value = data.elapsed
})
}
watch(engineEnabled, () => {
pending.value = false
})
watch(errorSignal, () => {
pending.value = false
errorSignal.value = false
})
</script>
<style scoped>

View File

@@ -22,6 +22,8 @@ export default {
"stopped": "Caption Engine Stopped",
"stoppedInfo": "The caption engine has stopped. You can click the 'Start Caption Engine' button to restart it.",
"error": "An error occurred",
"engineError": "The subtitle engine encountered an error and requested a forced exit.",
"socketError": "The Socket connection between the main program and the caption engine failed",
"engineChange": "Cpation Engine Configuration Changed",
"changeInfo": "If the caption engine is already running, you need to restart it for the changes to take effect.",
"styleChange": "Caption Style Changed",
@@ -93,8 +95,9 @@ export default {
"engine": "Caption Engine",
"engineStatus": "Caption Engine Status",
"pid": "Process ID",
"ppid": "Parent Process ID",
"ppid": "Parent Process ID",
"cpu": "CPU Usage",
"port": "Socket Port Number",
"mem": "Memory Usage",
"elapsed": "Running Time",
"customized": "Customized",
@@ -116,7 +119,7 @@ export default {
"projLink": "Project Link",
"manual": "User Manual",
"engineDoc": "Caption Engine Manual",
"date": "July 15, 2025"
"date": "July 30, 2025"
}
},
log: {

View File

@@ -22,6 +22,8 @@ export default {
"stopped": "字幕エンジンが停止しました",
"stoppedInfo": "字幕エンジンが停止しました。再起動するには「字幕エンジンを開始」ボタンをクリックしてください。",
"error": "エラーが発生しました",
"engineError": "字幕エンジンにエラーが発生し、強制終了が要求されました。",
"socketError": "メインプログラムと字幕エンジンの Socket 接続に失敗しました",
"engineChange": "字幕エンジンの設定が変更されました",
"changeInfo": "字幕エンジンがすでに起動している場合、変更を有効にするには再起動が必要です。",
"styleChange": "字幕のスタイルが変更されました",
@@ -94,7 +96,8 @@ export default {
"engineStatus": "字幕エンジンの状態",
"pid": "プロセス ID",
"ppid": "親プロセス ID",
"cpu": "CPU 使用率",
"port": "Socket ポート番号",
"cpu": "CPU 使用率",
"mem": "メモリ使用量",
"elapsed": "稼働時間",
"customized": "カスタマイズ済み",
@@ -116,7 +119,7 @@ export default {
"projLink": "プロジェクトリンク",
"manual": "ユーザーマニュアル",
"engineDoc": "字幕エンジンマニュアル",
"date": "2025 年 7 月 15 日"
"date": "2025 年 7 月 30 日"
}
},
log: {

View File

@@ -22,6 +22,8 @@ export default {
"stopped": "字幕引擎停止",
"stoppedInfo": "字幕引擎已经停止,可点击“启动字幕引擎”按钮重新启动",
"error": "发生错误",
"engineError": "字幕引擎发生错误并请求强制退出",
"socketError": "主程序与字幕引擎的 Socket 连接未成功",
"engineChange": "字幕引擎配置已更改",
"changeInfo": "如果字幕引擎已经启动,需要重启字幕引擎修改才会生效",
"styleChange": "字幕样式已修改",
@@ -94,6 +96,7 @@ export default {
"engineStatus": "字幕引擎状态",
"pid": "进程ID",
"ppid": "父进程ID",
"port": "Socket 端口号",
"cpu": "CPU使用率",
"mem": "内存使用量",
"elapsed": "运行时间",
@@ -116,7 +119,7 @@ export default {
"projLink": "项目链接",
"manual": "用户手册",
"engineDoc": "字幕引擎手册",
"date": "2025 年 7 月 15 日"
"date": "2025 年 7 月 30 日"
}
},
log: {

View File

@@ -29,6 +29,7 @@ export const useEngineControlStore = defineStore('engineControl', () => {
const customizedCommand = ref<string>('')
const changeSignal = ref<boolean>(false)
const errorSignal = ref<boolean>(false)
function sendControlsChange() {
const controls: Controls = {
@@ -47,7 +48,22 @@ export const useEngineControlStore = defineStore('engineControl', () => {
window.electron.ipcRenderer.send('control.controls.change', controls)
}
function setControls(controls: Controls) {
function setControls(controls: Controls, set = false) {
if(set && !engineEnabled.value && !controls.engineEnabled) {
errorSignal.value = true
notification.open({
message: t('noti.error'),
description: t("noti.engineError"),
duration: null,
icon: () => h(ExclamationCircleOutlined, { style: 'color: #ff4d4f' })
});
notification.open({
message: t('noti.error'),
description: t("noti.socketError"),
duration: null,
icon: () => h(ExclamationCircleOutlined, { style: 'color: #ff4d4f' })
});
}
sourceLang.value = controls.sourceLang
targetLang.value = controls.targetLang
engine.value = controls.engine
@@ -64,13 +80,14 @@ export const useEngineControlStore = defineStore('engineControl', () => {
function emptyModelPathErr() {
notification.open({
placement: 'topLeft',
message: t('noti.empty'),
description: t('noti.emptyInfo')
});
}
window.electron.ipcRenderer.on('control.controls.set', (_, controls: Controls) => {
setControls(controls)
setControls(controls, true)
})
window.electron.ipcRenderer.on('control.engine.started', (_, args) => {
@@ -80,15 +97,17 @@ export const useEngineControlStore = defineStore('engineControl', () => {
(translation.value ? `${t('noti.tLang')}${targetLang.value}` : '');
const str1 = `${t('noti.custom')}${customizedApp.value}${t('noti.args')}${customizedCommand.value}`;
notification.open({
placement: 'topLeft',
message: t('noti.started'),
description:
((customized.value && customizedApp.value) ? str1 : str0) +
(customized.value ? str1 : str0) +
`${t('noti.pidInfo')}${args}`
});
})
window.electron.ipcRenderer.on('control.engine.stopped', () => {
notification.open({
placement: 'topLeft',
message: t('noti.stopped'),
description: t('noti.stoppedInfo')
});
@@ -99,7 +118,6 @@ export const useEngineControlStore = defineStore('engineControl', () => {
message: t('noti.error'),
description: message,
duration: null,
placement: 'topLeft',
icon: () => h(ExclamationCircleOutlined, { style: 'color: #ff4d4f' })
});
})
@@ -123,5 +141,6 @@ export const useEngineControlStore = defineStore('engineControl', () => {
sendControlsChange, // 发送最新控制消息到后端
emptyModelPathErr, // 模型路径为空时显示警告
changeSignal, // 配置改变信号
errorSignal, // 错误信号
}
})

View File

@@ -58,6 +58,7 @@ export interface FullConfig {
export interface EngineInfo {
pid: number,
ppid: number,
port: number,
cpu: number,
mem: number,
elapsed: number

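`EngineInfo` gains a `port` field here; the renderer fetches it via `control.engine.info`, as shown earlier in this section. A hedged sketch of what the main-process side might look like, assuming the `pidusage` package is used to sample the process — that choice, and the placeholder values, are assumptions:

```ts
import { ipcMain } from 'electron'
import pidusage from 'pidusage'

// Placeholders: in the real code these would come from the spawned engine
// process and the WebSocket server it connects to.
const enginePid = 12345
const enginePort = 54321

ipcMain.handle('control.engine.info', async () => {
  const stats = await pidusage(enginePid)
  return {
    pid: enginePid,
    ppid: stats.ppid,
    port: enginePort,
    cpu: stats.cpu,        // percentage, matching the cpu.toFixed(1) display above
    mem: stats.memory,     // bytes
    elapsed: stats.elapsed // milliseconds since the process started
  }
})
```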
View File

@@ -36,7 +36,6 @@ const { leftBarWidth, antdTheme } = storeToRefs(generalSettingStore)
background-color: var(--control-background);
}
.caption-control {
height: 100vh;
border-right: 1px solid var(--tag-color);