feat(docs): update documentation, add macOS platform adaptation guide

2026-02-15 20:34:47 +08:00 · 2025-07-08 22:44:11 +08:00
parent cbbaaa95a3
commit 3c9138f115
15 changed files with 463 additions and 244 deletions
--- a/docs/engine-manual/en.md
+++ b/docs/engine-manual/en.md
@@ -1,67 +1,102 @@
 # Caption Engine Documentation

+Corresponding Version: v0.3.0
+
 ![](../../assets/media/structure_en.png)

 ## Introduction to the Caption Engine

-The so-called caption engine is actually a subprocess that fetches real-time streaming audio data from system audio input (recording) or output (playing sound) and calls an audio-to-text model to generate captions for the corresponding audio. The generated captions are converted into JSON formatted string data and passed to the main program via standard output (it must be ensured that the string read by the main program can be correctly interpreted as a JSON object). The main program reads and interprets the caption data, processes it, and displays it on the window.
+The so-called caption engine is actually a subprogram that captures real-time streaming data from the system's audio input (recording) or output (playing sound) and calls an audio-to-text model to generate captions for the corresponding audio. The generated captions are converted into a JSON-formatted string and passed to the main program through standard output (it must be ensured that the string read by the main program can be correctly interpreted as a JSON object). The main program reads and interprets the caption data, processes it, and then displays it on the window.

-## Features the Caption Engine Needs to Implement
+## Functions Required by the Caption Engine

 ### Audio Acquisition

-First, your caption engine needs to acquire streaming audio data from system audio input (recording) or output (playing sound). If developing with Python, you can use the PyAudio library to get microphone audio input data (cross-platform). Use the PyAudioWPatch library to get system audio output (only applicable to Windows platform).
+First, your caption engine needs to capture streaming data from the system's audio input (recording) or output (playing sound). If using Python for development, you can use the PyAudio library to obtain microphone audio input data (cross-platform). Use the PyAudioWPatch library to get system audio output (Windows platform only).

-The acquired audio stream data is usually in the form of short audio chunks, and the size of these chunks should be adjusted according to the model. For example, Alibaba Cloud's Gummy model performs better with 0.05-second audio chunks than with 0.2-second audio chunks.
+Generally, the captured audio stream data consists of short audio chunks, and the size of these chunks should be adjusted according to the model. For example, Alibaba Cloud's Gummy model performs better with 0.05-second audio chunks compared to 0.2-second ones.

 ### Audio Processing

-The acquired audio stream may need preprocessing before being converted to text. For instance, Alibaba Cloud's Gummy model can only recognize single-channel audio streams, while the collected audio streams are generally dual-channel, so you need to convert the dual-channel audio stream to a single channel. The conversion of channels can be achieved using methods from the NumPy library.
+The acquired audio stream may need preprocessing before being converted to text. For instance, Alibaba Cloud's Gummy model can only recognize single-channel audio streams, while the collected audio streams are typically dual-channel, thus requiring conversion from dual-channel to single-channel. Channel conversion can be achieved using methods in the NumPy library.

-You can directly use the audio acquisition and processing modules I've developed (path: `caption-engine/sysaudio`):
-
-```python
-if sys.platform == 'win32':
-    from sysaudio.win import AudioStream, mergeStreamChannels
-elif sys.platform == 'linux':
-    from sysaudio.linux import AudioStream, mergeStreamChannels
-else:
-    raise NotImplementedError(f"Unsupported platform: {sys.platform}")
-
-# Create an instance of the audio stream object
-stream = AudioStream(audio_type)
-# Open the audio stream
-stream.openStream()
-while True:  # Loop to read audio data
-    # Read audio data
-    data = stream.stream.read(stream.CHUNK)
-    # Convert dual-channel audio data to single-channel
-    data = mergeStreamChannels(data, stream.CHANNELS)
-    # Call the audio-to-text model
-    # ... ...
-```
+You can directly use the audio acquisition (`caption-engine/sysaudio`) and audio processing (`caption-engine/audioprcs`) modules I have developed.

 ### Audio to Text Conversion

-Once you have the appropriate audio stream, you can convert it to text. Various models are typically used to achieve this. You can choose the model based on your requirements.
+After obtaining the appropriate audio stream, you can convert it into text. This is generally done with an audio-to-text model, which you can choose based on your requirements.
+
+A nearly complete implementation of a caption engine is as follows:
+
+```python
+import sys
+import argparse
+
+# Import system audio acquisition module
+if sys.platform == 'win32':
+    from sysaudio.win import AudioStream
+elif sys.platform == 'darwin':
+    from sysaudio.darwin import AudioStream
+elif sys.platform == 'linux':
+    from sysaudio.linux import AudioStream
+else:
+    raise NotImplementedError(f"Unsupported platform: {sys.platform}")
+
+# Import audio processing functions
+from audioprcs import mergeChunkChannels
+# Import audio-to-text module
+from audio2text import InvalidParameter, GummyTranslator
+
+
+def convert_audio_to_text(s_lang, t_lang, audio_type, chunk_rate, api_key):
+    # Set standard output to line buffering
+    sys.stdout.reconfigure(line_buffering=True) # type: ignore
+
+    # Create instances for audio acquisition and speech-to-text
+    stream = AudioStream(audio_type, chunk_rate)
+    if t_lang == 'none':
+        gummy = GummyTranslator(stream.RATE, s_lang, None, api_key)
+    else:
+        gummy = GummyTranslator(stream.RATE, s_lang, t_lang, api_key)
+
+    # Start instances
+    stream.openStream()
+    gummy.start()
+
+    while True:
+        try:
+            # Read audio stream data
+            chunk = stream.read_chunk()
+            chunk_mono = mergeChunkChannels(chunk, stream.CHANNELS)
+            try:
+                # Call the model for translation
+                gummy.send_audio_frame(chunk_mono)
+            except InvalidParameter:
+                gummy.start()
+                gummy.send_audio_frame(chunk_mono)
+        except KeyboardInterrupt:
+            stream.closeStream()
+            gummy.stop()
+            break
+```

 ### Data Transmission

-After obtaining the text for the current audio stream, you need to pass the text to the main program. The caption engine process passes the caption data to the Electron main process through standard output.
+After obtaining the text of the current audio stream, it needs to be transmitted to the main program. The caption engine process passes the caption data to the Electron main process through standard output.

-The content transmitted must be a JSON string, where the JSON object should include the following parameters:
+The content transmitted must be a JSON string, where the JSON object must contain the following parameters:

 ```typescript
 export interface CaptionItem {
  index: number, // Caption sequence number
-  time_s: string, // Start time of the current caption
-  time_t: string, // End time of the current caption
+  time_s: string, // Caption start time
+  time_t: string, // Caption end time
  text: string, // Caption content
  translation: string // Caption translation
 }
 ```

-**It is essential to ensure that every time a caption JSON data is output, the buffer is flushed, ensuring that the string received by the Electron main process each time can be interpreted as a JSON object.**
+**It is essential that the buffer is flushed every time caption JSON data is output, so that each string received by the Electron main process can be interpreted as a JSON object.**

 If using Python, you can refer to the following method to pass data to the main program:

@@ -84,7 +119,8 @@ sys.stdout.reconfigure(line_buffering=True)
 ...
 ```

-The code for the data receiving end is as follows:
+The data receiver code is as follows:
+

 ```typescript
 // src\main\utils\engine.ts
@@ -97,7 +133,7 @@ The code for the data receiving end is as follows:
            const caption = JSON.parse(line);
            addCaptionLog(caption);
          } catch (e) {
-            controlWindow.sendErrorMessage('Cannot parse caption engine output as JSON object: ' + e)
+            controlWindow.sendErrorMessage('Unable to parse the output from the caption engine as a JSON object: ' + e)
            console.error('[ERROR] Error parsing JSON:', e);
          }
        }
@@ -111,6 +147,6 @@ The code for the data receiving end is as follows:
 ...
 ```

-## Code Reference
+## Reference Code

-The default caption engine entry point code is located in the `main-gummy.py` file under the `caption-engine` folder of this project. The `src\main\utils\engine.ts` file contains the server-side code for acquiring and processing caption engine data. You can read and understand the implementation details and the complete runtime process of the caption engine as needed.
+The `main-gummy.py` file under the `caption-engine` folder in this project serves as the entry point for the default caption engine. The `src\main\utils\engine.ts` file contains the server-side code for acquiring and processing data from the caption engine. You can read and understand the implementation details and the complete execution process of the caption engine as needed.
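The output contract described in the Data Transmission section of `en.md` (one JSON object per line, flushed after every write) can be sketched in Python as below. This is only an illustration: the helper name `emit_caption`, its `out` parameter, and the sample time strings are assumptions and are not taken from `main-gummy.py`; only the field names come from the `CaptionItem` interface.

```python
import json
import sys


def emit_caption(index, time_s, time_t, text, translation="", out=None):
    """Write one caption as a single JSON line and flush immediately.

    Hypothetical helper for illustration; the keys mirror CaptionItem.
    """
    out = sys.stdout if out is None else out
    item = {
        "index": index,              # caption sequence number
        "time_s": time_s,            # caption start time (string)
        "time_t": time_t,            # caption end time (string)
        "text": text,                # caption content
        "translation": translation,  # caption translation
    }
    # json.dumps escapes newlines inside strings, so the serialized
    # caption always occupies exactly one line of output.
    line = json.dumps(item)
    out.write(line + "\n")
    # Flush so the reader sees the whole line without waiting for the buffer.
    out.flush()
    return line


if __name__ == "__main__":
    # Sample values are made up for demonstration.
    emit_caption(1, "00:00:01.000", "00:00:02.500", "hello", "bonjour")
```

Because each caption is terminated by exactly one newline, a reader can split the stream on `\n` and parse every non-empty line, which is what the receiver in `src\main\utils\engine.ts` does.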
--- a/docs/engine-manual/ja.md
+++ b/docs/engine-manual/ja.md
@@ -1,71 +1,106 @@
-# キャプションエンジンの説明文書
+# 字幕エンジンの説明文書

-![](../../assets/media/structure_ja.png)
+対応バージョン：v0.3.0

 この文書は大規模モデルを使用して翻訳されていますので、内容に正確でない部分があるかもしれません。

-## キャプションエンジンの紹介
+![](../../assets/media/structure_ja.png)

-キャプションエンジンとは、実際にはサブプログラムであり、システムの音声入力（録音）または出力（音声再生）のストリーミングデータをリアルタイムで取得し、音声をテキストに変換するモデルを呼び出して対応する音声のキャプションを生成します。生成されたキャプションはJSON形式の文字列データに変換され、標準出力を通じてメインプログラムに渡されます（メインプログラムが読み取った文字列がJSONオブジェクトとして正しく解釈できるようにする必要があります）。メインプログラムはキャプションデータを読み取り、解釈し、処理してウィンドウ上に表示します。
+## 字幕エンジンの紹介

-## キャプションエンジンが必要とする機能
+所謂字幕エンジンは実際にはサブプログラムであり、システムの音声入力（録音）または出力（音声再生）のストリーミングデータをリアルタイムで取得し、音声からテキストへの変換モデルを使って対応する音声の字幕を生成します。生成された字幕はJSON形式の文字列データに変換され、標準出力を通じてメインプログラムに渡されます（メインプログラムが読み取った文字列が正しいJSONオブジェクトとして解釈されることが保証される必要があります）。メインプログラムは字幕データを読み取り、解釈して処理し、ウィンドウ上に表示します。
+
+## 字幕エンジンに必要な機能

 ### 音声の取得

-まず、あなたのキャプションエンジンはシステムの音声入力（録音）または出力（音声再生）のストリーミングデータを取得する必要があります。Pythonを使用して開発する場合、PyAudioライブラリを使用してマイクからの音声入力データを取得できます（全プラットフォーム対応）。PyAudioWPatchライブラリを使用してシステムの音声出力を取得することができます（Windowsプラットフォームのみ対応）。
+まず、あなたの字幕エンジンはシステムの音声入力（録音）または出力（音声再生）のストリーミングデータを取得する必要があります。Pythonを使用して開発する場合、PyAudioライブラリを使ってマイクからの音声入力データを取得できます（全プラットフォーム共通）。また、WindowsプラットフォームではPyAudioWPatchライブラリを使ってシステムの音声出力を取得することもできます。

-一般的に取得される音声ストリームデータは、比較的短い時間の音声ブロックで構成されています。モデルに合わせて音声ブロックのサイズを調整する必要があります。例えば、アリババクラウドのGummyモデルでは、0.05秒の音声ブロックを使用した認識精度が0.2秒の音声ブロックよりも優れています。
+一般的に取得される音声ストリームデータは、比較的短い時間間隔の音声ブロックで構成されています。モデルに合わせて音声ブロックのサイズを調整する必要があります。例えば、アリババクラウドのGummyモデルでは、0.05秒の音声ブロックを使用した認識結果の方が0.2秒の音声ブロックよりも優れています。

 ### 音声の処理

-取得した音声ストリームは、テキストに変換する前に前処理を行う必要があるかもしれません。例えば、アリババクラウドのGummyモデルは単一チャンネルの音声ストリームしか認識できませんが、収集された音声ストリームは通常二重チャンネルです。そのため、二重チャンネルの音声ストリームを単一チャンネルに変換する必要があります。チャンネル数の変換はNumPyライブラリのメソッドを使用して行うことができます。
+取得した音声ストリームは、テキストに変換する前に前処理が必要な場合があります。例えば、アリババクラウドのGummyモデルは単一チャンネルの音声ストリームしか認識できませんが、収集された音声ストリームは通常二重チャンネルであるため、二重チャンネルの音声ストリームを単一チャンネルに変換する必要があります。チャンネル数の変換はNumPyライブラリのメソッドを使って行うことができます。

-既に開発済みの音声取得と音声処理モジュール（パス：`caption-engine/sysaudio`）を使用することもできます：
-
-```python
-if sys.platform == 'win32':
-    from sysaudio.win import AudioStream, mergeStreamChannels
-elif sys.platform == 'linux':
-    from sysaudio.linux import AudioStream, mergeStreamChannels
-else:
-    raise NotImplementedError(f"サポートされていないプラットフォーム: {sys.platform}")
-
-# 音声ストリームオブジェクトのインスタンスを作成
-stream = AudioStream(audio_type)
-# 音声ストリームを開く
-stream.openStream()
-while True:  # 音声データを繰り返し読み込む
-    # 音声データを読み込む
-    data = stream.stream.read(stream.CHUNK)
-    # 二重チャンネルの音声データを単一チャンネルに変換
-    data = mergeStreamChannels(data, stream.CHANNELS)
-    # 音声をテキストに変換するモデルを呼び出す
-    # ... ...
-```
+私が開発した音声取得（`caption-engine/sysaudio`）および音声処理（`caption-engine/audioprcs`）モジュールをそのまま使用できます。

 ### 音声からテキストへの変換

-適切な音声ストリームを得た後、それをテキストに変換することができます。通常、様々なモデルを使用してこの変換を行います。必要に応じてモデルを選択してください。
+適切な音声ストリームを得た後、それをテキストに変換することができます。通常、様々なモデルを使って音声ストリームをテキストに変換します。必要に応じてモデルを選択することができます。
+
+ほぼ完全な字幕エンジンの実装例：
+
+```python
+import sys
+import argparse
+
+# システム音声取得モジュールのインポート
+if sys.platform == 'win32':
+    from sysaudio.win import AudioStream
+elif sys.platform == 'darwin':
+    from sysaudio.darwin import AudioStream
+elif sys.platform == 'linux':
+    from sysaudio.linux import AudioStream
+else:
+    raise NotImplementedError(f"Unsupported platform: {sys.platform}")
+
+# 音声処理関数のインポート
+from audioprcs import mergeChunkChannels
+# 音声からテキストへの変換モジュールのインポート
+from audio2text import InvalidParameter, GummyTranslator
+
+
+def convert_audio_to_text(s_lang, t_lang, audio_type, chunk_rate, api_key):
+    # 標準出力をラインバッファリングに設定
+    sys.stdout.reconfigure(line_buffering=True) # type: ignore
+
+    # 音声の取得と音声からテキストへの変換のインスタンスを作成
+    stream = AudioStream(audio_type, chunk_rate)
+    if t_lang == 'none':
+        gummy = GummyTranslator(stream.RATE, s_lang, None, api_key)
+    else:
+        gummy = GummyTranslator(stream.RATE, s_lang, t_lang, api_key)
+
+    # インスタンスを開始
+    stream.openStream()
+    gummy.start()
+
+    while True:
+        try:
+            # 音声ストリームデータを読み込む
+            chunk = stream.read_chunk()
+            chunk_mono = mergeChunkChannels(chunk, stream.CHANNELS)
+            try:
+                # モデルを使って翻訳を行う
+                gummy.send_audio_frame(chunk_mono)
+            except InvalidParameter:
+                gummy.start()
+                gummy.send_audio_frame(chunk_mono)
+        except KeyboardInterrupt:
+            stream.closeStream()
+            gummy.stop()
+            break
+```

 ### データの伝送

-現在の音声ストリームのテキストを取得したら、それをメインプログラムに伝送する必要があります。キャプションエンジンプロセスは標準出力を通じてキャプションデータをElectronのメインプロセスに伝送します。
+現在の音声ストリームのテキストを得たら、それをメインプログラムに渡す必要があります。字幕エンジンプロセスは標準出力を通じてElectronのメインプロセスに字幕データを渡します。

-伝送する内容はJSON文字列でなければならず、JSONオブジェクトには以下のパラメータを含める必要があります：
+渡す内容はJSON文字列でなければなりません。JSONオブジェクトには以下のパラメータを含める必要があります：

 ```typescript
 export interface CaptionItem {
-  index: number, // キャプション番号
-  time_s: string, // 現在のキャプションの開始時間
-  time_t: string, // 現在のキャプションの終了時間
-  text: string, // キャプションの内容
-  translation: string // キャプションの翻訳
+  index: number, // 字幕番号
+  time_s: string, // 現在の字幕開始時間
+  time_t: string, // 現在の字幕終了時間
+  text: string, // 字幕内容
+  translation: string // 字幕翻訳
 }
 ```

-**注意：キャプションJSONデータを出力するたびに必ずバッファをフラッシュし、Electronのメインプロセスが受け取る文字列が常にJSONオブジェクトとして解釈できるようにする必要があります。**
+**字幕JSONデータを出力するたびに必ずバッファをフラッシュし、Electronのメインプロセスが受け取る文字列が常にJSONオブジェクトとして解釈できるようにする必要があります。**

-Pythonを使用する場合、以下のようにデータをメインプログラムに伝送できます：
+Python言語を使用する場合、以下の方法でデータをメインプログラムに渡すことができます：

 ```python
 # caption-engine\main-gummy.py
@@ -75,44 +110,15 @@ sys.stdout.reconfigure(line_buffering=True)
 ...
    def send_to_node(self, data):
        """
-        データをNode.jsプロセスに送信
+        Node.jsプロセスにデータを送信する
        """
        try:
            json_data = json.dumps(data) + '\n'
            sys.stdout.write(json_data)
            sys.stdout.flush()
        except Exception as e:
-            print(f"Node.jsへのデータ送信エラー: {e}", file=sys.stderr)
+            print(f"Error sending data to Node.js: {e}", file=sys.stderr)
 ...
 ```

-データ受信側のコードは以下の通りです：
-
-```typescript
-// src\main\utils\engine.ts
-...
-    this.process.stdout.on('data', (data) => {
-      const lines = data.toString().split('\n');
-      lines.forEach((line: string) => {
-        if (line.trim()) {
-          try {
-            const caption = JSON.parse(line);
-            addCaptionLog(caption);
-          } catch (e) {
-            controlWindow.sendErrorMessage('キャプションエンジンの出力内容がJSONオブジェクトとして解析できません: ' + e)
-            console.error('[ERROR] JSON解析エラー:', e);
-          }
-        }
-      });
-    });
-
-    this.process.stderr.on('data', (data) => {
-      controlWindow.sendErrorMessage('キャプションエンジンエラー: ' + data)
-      console.error(`[ERROR] サブプロセスエラー: ${data}`);
-    });
-...
-```
-
-## 参考コード
-
-本プロジェクトの `caption-engine` フォルダにある `main-gummy.py` ファイルは、デフォルトのキャプションエンジンのエントリポイントコードです。`src\main\utils\engine.ts` はサーバーサイドでキャプションエンジンのデータを取得および処理するためのコードです。必要に応じて、キャプションエンジンの実装詳細と完全な実行プロセスを理解するために読み込むことができます。
+データ受信側のコードは
--- a/docs/engine-manual/zh.md
+++ b/docs/engine-manual/zh.md
@@ -1,5 +1,7 @@
 # 字幕引擎说明文档

+对应版本：v0.3.0
+
 ![](../../assets/media/structure_zh.png)

 ## 字幕引擎介绍
@@ -18,33 +20,66 @@

 获取到的音频流在转文字之前可能需要进行预处理。比如阿里云的 Gummy 模型只能识别单通道的音频流，而收集的音频流一般是双通道的，因此要将双通道音频流转换为单通道。通道数的转换可以使用 NumPy 库中的方法实现。

-你可以直接使用我开发好的音频获取和音频处理模块（路径：`caption-engine/sysaudio`）：
-
-```python
-if sys.platform == 'win32':
-    from sysaudio.win import AudioStream, mergeStreamChannels
-elif sys.platform == 'linux':
-    from sysaudio.linux import AudioStream, mergeStreamChannels
-else:
-    raise NotImplementedError(f"Unsupported platform: {sys.platform}")
-
-# 创建音频流对象实例
-stream = AudioStream(audio_type)
-# 打开音频流
-stream.openStream()
-while True: # 循环读取音频数据
-    # 读取音频数据
-    data = stream.stream.read(stream.CHUNK)
-    # 将双通道音频数据转换为单通道
-    data = mergeStreamChannels(data, stream.CHANNELS)
-    # 调用音频转文字模型
-    # ... ...
-```
+你可以直接使用我开发好的音频获取（`caption-engine/sysaudio`）和音频处理（`caption-engine/audioprcs`）模块。

 ### 音频转文字

 在得到了合适的音频流后，就可以将音频流转换为文字了。一般使用各种模型来实现音频流转文字。可根据需求自行选择模型。

+一个接近完整的字幕引擎实例如下：
+
+```python
+import sys
+import argparse
+
+# 引入系统音频获取类
+if sys.platform == 'win32':
+    from sysaudio.win import AudioStream
+elif sys.platform == 'darwin':
+    from sysaudio.darwin import AudioStream
+elif sys.platform == 'linux':
+    from sysaudio.linux import AudioStream
+else:
+    raise NotImplementedError(f"Unsupported platform: {sys.platform}")
+
+# 引入音频处理函数
+from audioprcs import mergeChunkChannels
+# 引入音频转文本模块
+from audio2text import InvalidParameter, GummyTranslator
+
+
+def convert_audio_to_text(s_lang, t_lang, audio_type, chunk_rate, api_key):
+    # 设置标准输出为行缓冲
+    sys.stdout.reconfigure(line_buffering=True) # type: ignore
+
+    # 创建音频获取和语音转文字实例
+    stream = AudioStream(audio_type, chunk_rate)
+    if t_lang == 'none':
+        gummy = GummyTranslator(stream.RATE, s_lang, None, api_key)
+    else:
+        gummy = GummyTranslator(stream.RATE, s_lang, t_lang, api_key)
+
+    # 启动实例
+    stream.openStream()
+    gummy.start()
+
+    while True:
+        try:
+            # 读取音频流数据
+            chunk = stream.read_chunk()
+            chunk_mono = mergeChunkChannels(chunk, stream.CHANNELS)
+            try:
+                # 调用模型进行翻译
+                gummy.send_audio_frame(chunk_mono)
+            except InvalidParameter:
+                gummy.start()
+                gummy.send_audio_frame(chunk_mono)
+        except KeyboardInterrupt:
+            stream.closeStream()
+            gummy.stop()
+            break
+```
+
 ### 数据传递

 在获取到当前音频流的文字后，需要将文字传递给主程序。字幕引擎进程通过标准输出将字幕数据传递给 electron 主进程。