mirror of
https://github.com/HiMeditator/auto-caption.git
synced 2026-02-14 11:34:43 +08:00
release v1.0.0
@@ -156,15 +156,20 @@
|
||||
- Clearer log output
|
||||
|
||||
|
||||
## v0.8.0
|
||||
## v1.0.0
|
||||
|
||||
2025-09-??
|
||||
2025-09-08
|
||||
|
||||
### New Features
|
||||
|
||||
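- Added a timeout shutdown to the caption engine: if the engine does not start successfully within the allotted time it is closed automatically, and the engine can also be closed while it is starting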
|
||||
- Added non-real-time translation: supports translation with a local Ollama model, and supports translation with the Google Translate API
|
||||
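- Added a timeout shutdown to the caption engine: if the engine does not start successfully within the allotted time, it is closed automatically; the engine can also be closed while it is starting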
|
||||
- Added non-real-time translation: supports translation with a local Ollama model; supports translation with the Google Translate API
|
||||
- Added a new model: the SOSV model, which supports recognition of English, Chinese, Japanese, Korean, and Cantonese
|
||||
- Added audio recording: the audio recognized by the caption engine can be saved as a .wav file
|
||||
- Added multi-line captions: users can set the number of caption lines shown in the caption window
|
||||
|
||||
### Experience Improvements
|
||||
|
||||
- Improved the display position of some prompt messages
|
||||
- Replaced the resampling model to improve audio resampling quality
|
||||
- Labels that carry extra information now use the theme color
|
||||
@@ -21,6 +21,8 @@
|
||||
- [x] Option to copy only the most recent caption records when copying caption records *2025/08/18*
|
||||
- [x] Add color theme settings *2025/08/18*
|
||||
- [x] Show log output in the frontend page *2025/08/19*
|
||||
- [x] Add Ollama models for translation in the local caption engines *2025/09/04*
|
||||
- [x] Evaluate / add a caption engine based on sherpa-onnx *2025/09/06*
|
||||
|
||||
## To Do
|
||||
|
||||
@@ -29,7 +31,6 @@
|
||||
|
||||
## Future Plans
|
||||
|
||||
- [ ] Add Ollama models for translation in the local caption engines
|
||||
- [ ] Evaluate / add a caption engine based on FunASR
|
||||
- [ ] Reduce unnecessary software size
|
||||
|
||||
|
||||
@@ -1,175 +1,207 @@
|
||||
# Caption Engine Documentation
|
||||
# Caption Engine Documentation
|
||||
|
||||
Corresponding Version: v0.6.0
|
||||
Corresponding version: v1.0.0
|
||||
|
||||

|
||||

|
||||
|
||||
## Introduction to the Caption Engine
|
||||
## Introduction to the Caption Engine
|
||||
|
||||
The so-called caption engine is essentially a subprogram that continuously captures real-time streaming data from the system's audio input (microphone) or output (speakers) and invokes an audio-to-text model to generate corresponding captions for the audio. The generated captions are converted into JSON-formatted string data and passed to the main program via standard output (ensuring the string can be correctly interpreted as a JSON object by the main program). The main program reads and interprets the caption data, processes it, and displays it in the window.
|
||||
The caption engine is a subprocess that captures streaming data from the system audio input (microphone) or output (speaker) in real time and invokes an audio-to-text model to generate captions for that audio. The generated captions are serialized as JSON strings and passed to the main program through standard output; each string must be parseable by the main program as a JSON object. The main program reads and parses the caption data, processes it, and displays it in the window.
|
||||
|
||||
**The communication standard between the caption engine process and the Electron main process is: [caption engine api-doc](../api-docs/caption-engine.md).**
|
||||
**Communication between the caption engine process and the Electron main process follows this standard: [caption engine api-doc](../api-docs/caption-engine.md).**
|
||||
|
||||
## Workflow
|
||||
## Execution Flow
|
||||
|
||||
The communication flow between the main process and the caption engine:
|
||||
Communication between the main process and the caption engine proceeds as follows:
|
||||
|
||||
### Starting the Engine
|
||||
### Starting the Engine
|
||||
|
||||
- **Main Process**: Uses `child_process.spawn()` to launch the caption engine process.
|
||||
- **Caption Engine Process**: Creates a TCP Socket server thread. After creation, it outputs a JSON object string via standard output, containing a `command` field with the value `connect`.
|
||||
- **Main Process**: Monitors the standard output of the caption engine process, attempts to split it line by line, parses it into a JSON object, and checks if the `command` field value is `connect`. If so, it connects to the TCP Socket server.
|
||||
- Electron main process: uses `child_process.spawn()` to start the caption engine process.
|
||||
- Caption engine process: creates a TCP Socket server thread. Once the server is created, it writes a JSON object, serialized as a string, to standard output. The object contains a `command` field with the value `connect`.
|
||||
- Electron main process: listens to the caption engine process's standard output, splits it into lines, tries to parse each line as a JSON object, and checks whether the object's `command` field is `connect`. If so, it connects to the TCP Socket server.
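
The line-splitting and `connect` check described above can be sketched as follows. The real main process runs in Electron, so this Python version only illustrates the logic; `split_json_lines` and `find_connect_port` are hypothetical names, and reading the `content` field as the port is an assumption based on the `stdout_cmd("connect", "8080")` example in this document.

```python
import json
from typing import List, Optional, Tuple


def split_json_lines(buffer: str, chunk: str) -> Tuple[str, List[dict]]:
    """Append a new stdout chunk to the buffer and parse every complete line.

    Returns the unconsumed remainder (a partial last line) and the parsed objects.
    """
    buffer += chunk
    *complete, remainder = buffer.split("\n")
    objects = []
    for line in complete:
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if isinstance(obj, dict):
            objects.append(obj)
    return remainder, objects


def find_connect_port(objects: List[dict]) -> Optional[str]:
    """Return the content of the first `connect` command, if any (assumed to be the port)."""
    for obj in objects:
        if obj.get("command") == "connect":
            return obj.get("content")
    return None
```

Keeping the partial last line in the buffer matters because a single read from a pipe may end in the middle of a JSON string.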
|
||||
|
||||
### Caption Recognition
|
||||
### Caption Recognition
|
||||
|
||||
- **Caption Engine Process**: The main thread monitors system audio output, sends audio data chunks to the caption engine for parsing, and outputs the parsed caption data object strings via standard output.
|
||||
- **Main Process**: Continues to monitor the standard output of the caption engine and performs different operations based on the `command` field of the parsed object.
|
||||
- Caption engine process: starts a new thread that monitors system audio output and puts the captured audio chunks into a shared queue (`shared_data.chunk_queue`). The caption engine continuously reads audio chunks from the shared queue and transcribes them. It may also start another thread to perform translation. Finally, it writes the resulting caption data objects, serialized as strings, to standard output.
|
||||
- Electron main process: keeps listening to the caption engine's standard output and acts according to the `command` field of each parsed object.
|
||||
|
||||
### Closing the Engine
|
||||
### Stopping the Engine
|
||||
|
||||
- **Main Process**: When the user closes the caption engine via the frontend, the main process sends a JSON object string with the `command` field set to `stop` to the caption engine process via Socket communication.
|
||||
- **Caption Engine Process**: Receives the object string, parses it, and if the `command` field is `stop`, sets the global variable `thread_data.status` to `stop`.
|
||||
- **Caption Engine Process**: The main thread's loop for monitoring system audio output ends when `thread_data.status` is not `running`, releases resources, and terminates.
|
||||
- **Main Process**: Detects the termination of the caption engine process, performs corresponding cleanup, and provides feedback to the frontend.
|
||||
- Electron main process: when the user closes the caption engine from the frontend, the main process sends the caption engine process an object string whose `command` field is `stop`, over the Socket connection.
|
||||
- Caption engine process: receives the object string sent by the Electron main process and parses it into an object. If the object's `command` field is `stop`, it sets the global variable `shared_data.status` to `stop`.
|
||||
- Caption engine process: the main thread keeps monitoring system audio output in a loop; when `shared_data.status` is no longer `running`, it exits the loop, releases resources, and terminates.
|
||||
- Electron main process: when it detects that the caption engine process has terminated, it performs the corresponding cleanup and reports back to the frontend.
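
The stop handshake above can be sketched on the engine side as follows. This is only an illustration: `SharedState` is a hypothetical stand-in for the project's `shared_data` object, and `handle_message` and `run_loop` are made-up names, not the project's functions.

```python
import json
from dataclasses import dataclass
from typing import Iterable


@dataclass
class SharedState:
    """Hypothetical stand-in for the project's shared status holder."""
    status: str = "running"


def handle_message(raw: str, state: SharedState) -> None:
    """Parse one message from the main process and apply a `stop` command."""
    try:
        message = json.loads(raw)
    except json.JSONDecodeError:
        return  # ignore malformed input
    if isinstance(message, dict) and message.get("command") == "stop":
        state.status = "stop"


def run_loop(state: SharedState, incoming: Iterable[str]) -> int:
    """Process messages while the status is `running`; return how many were handled."""
    handled = 0
    for raw in incoming:
        if state.status != "running":
            break  # the real engine would release resources here
        handle_message(raw, state)
        handled += 1
    return handled
```

In the real engine the Socket server thread and the audio loop run concurrently; the sketch serializes them so that the status check is easy to follow.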
|
||||
|
||||
## Implemented Features
|
||||
## Implemented Features
|
||||
|
||||
The following features are already implemented and can be reused directly.
|
||||
The following features have been implemented and can be directly reused.
|
||||
|
||||
### Standard Output
|
||||
### Standard Output
|
||||
|
||||
Supports printing general information, commands, and error messages.
|
||||
These helper functions output general information, commands, and error messages.
|
||||
|
||||
Example:
|
||||
Examples:
|
||||
|
||||
```python
|
||||
from utils import stdout, stdout_cmd, stdout_obj, stderr
|
||||
stdout("Hello") # {"command": "print", "content": "Hello"}\n
|
||||
stdout_cmd("connect", "8080") # {"command": "connect", "content": "8080"}\n
|
||||
stdout_obj({"command": "print", "content": "Hello"})
|
||||
stderr("Error Info")
|
||||
```
|
||||
```python
|
||||
from utils import stdout, stdout_cmd, stdout_obj, stderr
|
||||
# {"command": "print", "content": "Hello"}\n
|
||||
stdout("Hello")
|
||||
# {"command": "connect", "content": "8080"}\n
|
||||
stdout_cmd("connect", "8080")
|
||||
# {"command": "print", "content": "print"}\n
|
||||
stdout_obj({"command": "print", "content": "print"})
|
||||
# sys.stderr.write("Error Info" + "\n")
|
||||
stderr("Error Info")
|
||||
```
|
||||
|
||||
### Creating a Socket Service
|
||||
### Creating Socket Service
|
||||
|
||||
This Socket service listens on a specified port, parses content sent by the Electron main program, and may modify the value of `thread_data.status`.
|
||||
This Socket service listens on a specified port, parses content sent by the Electron main program, and may change the value of `shared_data.status`.
|
||||
|
||||
Example:
|
||||
Example:
|
||||
|
||||
```python
|
||||
from utils import start_server
|
||||
from utils import thread_data
|
||||
port = 8080
|
||||
start_server(port)
|
||||
while thread_data.status == 'running':
    # do something
    pass
|
||||
```
|
||||
```python
|
||||
from utils import start_server
|
||||
from utils import shared_data
|
||||
port = 8080
|
||||
start_server(port)
|
||||
while shared_data.status == 'running':
    # do something
    pass
|
||||
```
|
||||
|
||||
### Audio Capture
|
||||
### Audio Acquisition
|
||||
|
||||
The `AudioStream` class captures audio data and is cross-platform, supporting Windows, Linux, and macOS. Its initialization includes two parameters:
|
||||
The `AudioStream` class acquires audio data. Its implementation is cross-platform and supports Windows, Linux, and macOS. The class is initialized with two parameters:
|
||||
|
||||
- `audio_type`: The type of audio to capture. `0` for system output audio (speakers), `1` for system input audio (microphone).
|
||||
- `chunk_rate`: The frequency of audio data capture, i.e., the number of audio chunks captured per second.
|
||||
- `audio_type`: Audio acquisition type, 0 for system output audio (speaker), 1 for system input audio (microphone)
|
||||
- `chunk_rate`: Audio data acquisition frequency, number of audio chunks acquired per second, default is 10
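
To make the meaning of `chunk_rate` concrete, the arithmetic can be sketched as below. The sample rate of 48000 Hz and the 16-bit sample width used in the examples are assumptions for illustration, not values confirmed by the project, and the helper names are hypothetical.

```python
def chunk_duration_s(chunk_rate: int) -> float:
    """Seconds of audio covered by one chunk."""
    return 1.0 / chunk_rate


def frames_per_chunk(sample_rate: int, chunk_rate: int) -> int:
    """Sample frames in one chunk (integer division drops any remainder)."""
    return sample_rate // chunk_rate


def bytes_per_chunk(sample_rate: int, chunk_rate: int,
                    channels: int, sample_width: int = 2) -> int:
    """Chunk size in bytes for interleaved PCM with `sample_width` bytes per sample."""
    return frames_per_chunk(sample_rate, chunk_rate) * channels * sample_width


# With the default chunk_rate of 10, each chunk covers 0.1 s of audio.
print(chunk_duration_s(10))
# Assumed 48 kHz stereo 16-bit stream: bytes in one chunk.
print(bytes_per_chunk(48000, 10, channels=2))
```

A higher `chunk_rate` gives smaller, more frequent chunks, which trades lower latency for more calls into the recognition model.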
|
||||
|
||||
The class includes three methods:
|
||||
The class contains four methods:
|
||||
|
||||
- `open_stream()`: Starts audio capture.
|
||||
- `read_chunk() -> bytes`: Reads an audio chunk.
|
||||
- `close_stream()`: Stops audio capture.
|
||||
- `open_stream()`: Start audio acquisition
|
||||
- `read_chunk() -> bytes`: Read an audio chunk
|
||||
- `close_stream()`: Close audio acquisition
|
||||
- `close_stream_signal()`: Thread-safe closing of system audio input stream
|
||||
|
||||
Example:
|
||||
Example:
|
||||
|
||||
```python
|
||||
from sysaudio import AudioStream
|
||||
audio_type = 0
|
||||
chunk_rate = 20
|
||||
stream = AudioStream(audio_type, chunk_rate)
|
||||
stream.open_stream()
|
||||
while True:
    data = stream.read_chunk()
    # do something with data
    pass
|
||||
stream.close_stream()
|
||||
```
|
||||
```python
|
||||
from sysaudio import AudioStream
|
||||
audio_type = 0
|
||||
chunk_rate = 20
|
||||
stream = AudioStream(audio_type, chunk_rate)
|
||||
stream.open_stream()
|
||||
while True:
    data = stream.read_chunk()
    # do something with data
    pass
|
||||
stream.close_stream()
|
||||
```
|
||||
|
||||
### Audio Processing
|
||||
### Audio Processing
|
||||
|
||||
The captured audio stream may require preprocessing before conversion to text. Typically, multi-channel audio needs to be converted to mono, and resampling may be necessary. This project provides three audio processing functions:
|
||||
Before converting audio streams to text, preprocessing may be required. Usually, multi-channel audio needs to be converted to single-channel audio, and resampling may also be needed. This project provides two audio processing functions:
|
||||
|
||||
- `merge_chunk_channels(chunk: bytes, channels: int) -> bytes`: Converts a multi-channel audio chunk to mono.
|
||||
- `resample_chunk_mono(chunk: bytes, channels: int, orig_sr: int, target_sr: int, mode="sinc_best") -> bytes`: Converts a multi-channel audio chunk to mono and resamples it.
|
||||
- `resample_mono_chunk(chunk: bytes, orig_sr: int, target_sr: int, mode="sinc_best") -> bytes`: Resamples a mono audio chunk.
|
||||
- `merge_chunk_channels(chunk: bytes, channels: int) -> bytes`: Convert multi-channel audio chunks to single-channel audio chunks
|
||||
- `resample_chunk_mono(chunk: bytes, channels: int, orig_sr: int, target_sr: int) -> bytes`: Convert current multi-channel audio data chunks to single-channel audio data chunks, then perform resampling
|
||||
|
||||
## Features to Be Implemented in the Caption Engine
|
||||
Example:
|
||||
|
||||
### Audio-to-Text Conversion
|
||||
```python
|
||||
from sysaudio import AudioStream
|
||||
from utils import resample_chunk_mono
stream = AudioStream(1)
stream.open_stream()
while True:
    raw_chunk = stream.read_chunk()
    # convert to mono and resample to 16 kHz
    chunk = resample_chunk_mono(raw_chunk, stream.CHANNELS, stream.RATE, 16000)
    # do something with chunk
|
||||
```
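
For intuition about what a channel merge does, here is a minimal standard-library sketch. It assumes the chunk holds interleaved signed 16-bit PCM samples in native byte order; it is not the project's `merge_chunk_channels` implementation, and `merge_interleaved_int16` is a hypothetical name.

```python
from array import array


def merge_interleaved_int16(chunk: bytes, channels: int) -> bytes:
    """Average the channels of interleaved 16-bit PCM into a mono chunk."""
    samples = array("h")          # "h" = signed 16-bit integers
    samples.frombytes(chunk)
    frames = len(samples) // channels
    mono = array("h", (
        # one output sample per frame: the mean of that frame's channel samples
        sum(samples[i * channels:(i + 1) * channels]) // channels
        for i in range(frames)
    ))
    return mono.tobytes()
```

Averaging rather than summing keeps the result inside the 16-bit range, so no clipping step is needed.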
|
||||
|
||||
After obtaining a suitable audio stream, it needs to be converted to text. Typically, various models (cloud-based or local) are used for this purpose. Choose the appropriate model based on requirements.
|
||||
## Features to be Implemented by the Caption Engine
|
||||
|
||||
This part is recommended to be encapsulated as a class with three methods:
|
||||
### Audio to Text Conversion
|
||||
|
||||
- `start(self)`: Starts the model.
|
||||
- `send_audio_frame(self, data: bytes)`: Processes the current audio chunk data. **The generated caption data is sent to the Electron main process via standard output.**
|
||||
- `stop(self)`: Stops the model.
|
||||
Once a suitable audio stream is available, it needs to be converted to text. This is generally done with a model, either cloud-based or local; choose one that fits your requirements.
|
||||
|
||||
Complete caption engine examples:
|
||||
It is recommended to encapsulate this as a class, implementing four methods:
|
||||
|
||||
- [gummy.py](../../engine/audio2text/gummy.py)
|
||||
- [vosk.py](../../engine/audio2text/vosk.py)
|
||||
- `start(self)`: Start the model
|
||||
- `send_audio_frame(self, data: bytes)`: Process current audio chunk data, **generated caption data is sent to Electron main process through standard output**
|
||||
- `translate(self)`: Continuously retrieve data chunks from `shared_data.chunk_queue` and call `send_audio_frame` method to process data chunks
|
||||
- `stop(self)`: Stop the model
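
A minimal skeleton with these four methods might look like the sketch below. `EchoEngine` is a hypothetical toy engine that "recognizes" each chunk as its byte length; the queue passed in stands in for `shared_data.chunk_queue`, and the `None` sentinel used to end the loop is an assumed convention, not the project's.

```python
import io
import json
import queue
import sys


class EchoEngine:
    """Hypothetical engine that reports the size of each audio chunk as its caption."""

    def __init__(self, chunk_queue: queue.Queue, out=sys.stdout):
        self.chunk_queue = chunk_queue  # stand-in for shared_data.chunk_queue
        self.out = out
        self.running = False
        self.index = 0

    def start(self):
        self.running = True

    def send_audio_frame(self, data: bytes):
        """Turn one chunk into a caption object and write it as one JSON line."""
        self.index += 1
        caption = {"command": "caption", "index": self.index,
                   "time_s": "", "time_t": "",  # timestamps omitted in this sketch
                   "text": f"{len(data)} bytes", "translation": ""}
        self.out.write(json.dumps(caption) + "\n")
        self.out.flush()  # flush so the reader receives a complete line

    def translate(self):
        """Keep taking chunks from the queue and processing them."""
        while self.running:
            try:
                chunk = self.chunk_queue.get(timeout=0.1)
            except queue.Empty:
                continue
            if chunk is None:  # assumed sentinel meaning "no more audio"
                break
            self.send_audio_frame(chunk)

    def stop(self):
        self.running = False


chunks = queue.Queue()
for item in (b"\x00" * 320, b"\x00" * 640, None):
    chunks.put(item)
buf = io.StringIO()
engine = EchoEngine(chunks, out=buf)
engine.start()
engine.translate()
engine.stop()
```

A real engine would replace the body of `send_audio_frame` with a call into the recognition model and would usually run `translate` in its own thread.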
|
||||
|
||||
### Caption Translation
|
||||
Complete caption engine examples:
|
||||
|
||||
Some speech-to-text models do not provide translation. If needed, a translation module must be added.
|
||||
- [gummy.py](../../engine/audio2text/gummy.py)
|
||||
- [vosk.py](../../engine/audio2text/vosk.py)
|
||||
- [sosv.py](../../engine/audio2text/sosv.py)
|
||||
|
||||
### Sending Caption Data
|
||||
### Caption Translation
|
||||
|
||||
After obtaining the text for the current audio stream, it must be sent to the main program. The caption engine process passes caption data to the Electron main process via standard output.
|
||||
Some speech-to-text models do not provide translation. If translation is needed, add a separate translation module or use the project's built-in translation functions.
|
||||
|
||||
The content must be a JSON string, with the JSON object including the following parameters:
|
||||
Example:
|
||||
|
||||
```typescript
|
||||
export interface CaptionItem {
|
||||
command: "caption",
|
||||
index: number, // Caption sequence number
|
||||
time_s: string, // Start time of the current caption
|
||||
time_t: string, // End time of the current caption
|
||||
text: string, // Caption content
|
||||
translation: string // Caption translation
|
||||
}
|
||||
```
|
||||
```python
|
||||
from utils import google_translate, ollama_translate
|
||||
text = "This is a translation test."
|
||||
google_translate("", "en", text, "time_s")
|
||||
ollama_translate("qwen3:0.6b", "en", text, "time_s")
|
||||
```
|
||||
|
||||
**Note: Ensure the buffer is flushed after each JSON output to guarantee the Electron main process receives a string that can be parsed as a JSON object.**
|
||||
### Caption Data Transmission
|
||||
|
||||
It is recommended to use the project's `stdout_obj` function for sending.
|
||||
After obtaining the text from the current audio stream, the text needs to be sent to the main program. The caption engine process transmits caption data to the Electron main process through standard output.
|
||||
|
||||
### Command-Line Parameter Specification
|
||||
The transmitted content must be a JSON string, where the JSON object needs to contain the following parameters:
|
||||
|
||||
Custom caption engine settings are provided via command-line arguments. The current project uses the following parameters:
|
||||
```typescript
|
||||
export interface CaptionItem {
|
||||
command: "caption",
|
||||
index: number, // Caption sequence number
|
||||
time_s: string, // Current caption start time
|
||||
time_t: string, // Current caption end time
|
||||
text: string, // Caption content
|
||||
translation: string // Caption translation
|
||||
}
|
||||
```
|
||||
|
||||
```python
|
||||
import argparse
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser(description='Convert system audio stream to text')
|
||||
# Common parameters
|
||||
parser.add_argument('-e', '--caption_engine', default='gummy', help='Caption engine: gummy or vosk')
|
||||
parser.add_argument('-a', '--audio_type', default=0, help='Audio stream source: 0 for output, 1 for input')
|
||||
parser.add_argument('-c', '--chunk_rate', default=10, help='Number of audio stream chunks collected per second')
|
||||
parser.add_argument('-p', '--port', default=8080, help='The port to run the server on, 0 for no server')
|
||||
# Gummy-specific parameters
|
||||
parser.add_argument('-s', '--source_language', default='en', help='Source language code')
|
||||
parser.add_argument('-t', '--target_language', default='zh', help='Target language code')
|
||||
parser.add_argument('-k', '--api_key', default='', help='API KEY for Gummy model')
|
||||
# Vosk-specific parameters
|
||||
parser.add_argument('-m', '--model_path', default='', help='The path to the vosk model.')
|
||||
```
|
||||
**Note that the buffer must be flushed after each caption JSON data output to ensure that the Electron main process receives strings that can be interpreted as JSON objects each time.** It is recommended to use the project's existing `stdout_obj` function for transmission.
|
||||
|
||||
For example, to use the Gummy model with Japanese as the source language, Chinese as the target language, and system audio output captions with 0.1s audio chunks, the command-line arguments would be:
|
||||
### Command Line Parameter Specification
|
||||
|
||||
```bash
|
||||
python main.py -e gummy -s ja -t zh -a 0 -c 10 -k <dashscope-api-key>
|
||||
```
|
||||
The settings of a custom caption engine are passed as command-line arguments, so the engine must define its parameters accordingly. This project currently uses the following parameters:
|
||||
|
||||
```python
|
||||
import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Convert system audio stream to text')
    # all
    parser.add_argument('-e', '--caption_engine', default='gummy', help='Caption engine: gummy or vosk')
    parser.add_argument('-a', '--audio_type', default=0, help='Audio stream source: 0 for output, 1 for input')
    parser.add_argument('-c', '--chunk_rate', default=10, help='Number of audio stream chunks collected per second')
    parser.add_argument('-p', '--port', default=0, help='The port to run the server on, 0 for no server')
    parser.add_argument('-t', '--target_language', default='zh', help='Target language code, "none" for no translation')
    parser.add_argument('-r', '--record', default=0, help='Whether to record the audio, 0 for no recording, 1 for recording')
    parser.add_argument('-rp', '--record_path', default='', help='Path to save the recorded audio')
    # gummy and sosv
    parser.add_argument('-s', '--source_language', default='auto', help='Source language code')
    # gummy only
    parser.add_argument('-k', '--api_key', default='', help='API KEY for Gummy model')
    # vosk and sosv
    parser.add_argument('-tm', '--translation_model', default='ollama', help='Model for translation: ollama or google')
    parser.add_argument('-omn', '--ollama_name', default='', help='Ollama model name for translation')
    # vosk only
    parser.add_argument('-vosk', '--vosk_model', default='', help='The path to the vosk model.')
    # sosv only
    parser.add_argument('-sosv', '--sosv_model', default=None, help='The SenseVoice model path')
|
||||
```
|
||||
|
||||
For example, to run this project's caption engine with the Gummy model, Japanese as the source language, Chinese as the target language, captions generated from system audio output, and 0.1 s of audio per chunk, the command line is:
|
||||
|
||||
```bash
|
||||
python main.py -e gummy -s ja -t zh -a 0 -c 10 -k <dashscope-api-key>
|
||||
```
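
One detail worth keeping in mind when adding parameters: `argparse` leaves a value as a string unless `type=` is given, while an unused default keeps its original Python type. The sketch below shows the difference on a two-option parser; `build_parser` is a hypothetical helper, and whether the project converts these values elsewhere is not stated in this document.

```python
import argparse


def build_parser(typed: bool) -> argparse.ArgumentParser:
    """Build a small parser with or without explicit int conversion."""
    kwargs = {"type": int} if typed else {}
    parser = argparse.ArgumentParser(description="chunk_rate demo")
    parser.add_argument("-a", "--audio_type", default=0, **kwargs)
    parser.add_argument("-c", "--chunk_rate", default=10, **kwargs)
    return parser


argv = ["-a", "0", "-c", "10"]
untyped = build_parser(typed=False).parse_args(argv)
typed = build_parser(typed=True).parse_args(argv)

print(repr(untyped.chunk_rate))  # '10' (a string)
print(repr(typed.chunk_rate))    # 10 (an int)
print(1 / typed.chunk_rate)      # 0.1 seconds of audio per chunk
```

Passing `type=int` (or converting explicitly after parsing) avoids comparing or dividing by a string when a value is supplied on the command line.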
|
||||
|
||||
## Additional Notes
|
||||
|
||||
@@ -183,7 +215,7 @@ python main.py -e gummy -s ja -t zh -a 0 -c 10 -k <dashscope-api-key>
|
||||
|
||||
### Development Recommendations
|
||||
|
||||
Apart from audio-to-text conversion, it is recommended to reuse the existing code. In this case, the following additions are needed:
|
||||
Apart from audio-to-text conversion and translation, it is recommended to reuse the project's code directly for everything else (audio acquisition, audio resampling, and communication with the main process). In that case, the following additions are needed:
|
||||
|
||||
- `engine/audio2text/`: Add a new audio-to-text class (file-level).
|
||||
- `engine/main.py`: Add new parameter settings and workflow functions (refer to `main_gummy` and `main_vosk` functions).
|
||||
|
||||
@@ -1,5 +1,7 @@
|
||||
# Caption Engine Documentation
|
||||
|
||||
## Note: This document is no longer maintained and the information in it is out of date. For the latest information, see the [Chinese version](./zh.md) or the [English version](./en.md).
|
||||
|
||||
Corresponding version: v0.6.0
|
||||
|
||||
This document was translated with a large language model, so parts of it may be inaccurate.
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
# Caption Engine Documentation
|
||||
|
||||
Corresponding version: v0.6.0
|
||||
Corresponding version: v1.0.0
|
||||
|
||||

|
||||
|
||||
@@ -16,22 +16,21 @@
|
||||
|
||||
### Starting the Engine
|
||||
|
||||
- Main process: uses `child_process.spawn()` to start the caption engine process
|
||||
- Electron main process: uses `child_process.spawn()` to start the caption engine process
|
||||
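- Caption engine process: creates a TCP Socket server thread, then writes a JSON object, serialized as a string, to standard output; the object contains a `command` field with the value `connect`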
|
||||
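- Main process: listens to the caption engine process's standard output, splits it into lines, tries to parse each line as a JSON object, and checks whether the object's `command` field is `connect`; if so, it connects to the TCP Socket server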
|
||||
|
||||
### Caption Recognition
|
||||
|
||||
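- Caption engine process: monitors system audio output in the main thread and passes audio chunks to the caption engine, which parses them and sends the resulting caption data object strings through standard output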
|
||||
- Main process: keeps listening to the caption engine's standard output and acts according to the `command` field of the parsed object
|
||||
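- Caption engine process: starts a new thread that monitors system audio output and puts the captured audio chunks into a shared queue (`shared_data.chunk_queue`). The caption engine continuously reads audio chunks from the shared queue and parses them. It may also start another thread to perform translation. Finally, it sends the resulting caption data object strings through standard output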
|
||||
- Electron main process: keeps listening to the caption engine's standard output and acts according to the `command` field of the parsed object
|
||||
|
||||
### Stopping the Engine
|
||||
|
||||
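- Main process: when the user closes the caption engine from the frontend, the main process sends the caption engine process an object string whose `command` field is `stop`, over the Socket connection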
|
||||
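- Caption engine process: receives the object string sent by the main process and parses it into an object; if the object's `command` field is `stop`, it sets the global variable `thread_data.status` to `stop`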
|
||||
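- Electron main process: when the user closes the caption engine from the frontend, the main process sends the caption engine process an object string whose `command` field is `stop`, over the Socket connection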
|
||||
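- Caption engine process: receives the object string sent by the Electron main process and parses it into an object; if the object's `command` field is `stop`, it sets the global variable `shared_data.status` to `stop`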
|
||||
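- Caption engine process: the main thread keeps monitoring system audio output in a loop; when `shared_data.status` is no longer `running`, it exits the loop, releases resources, and terminates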
|
||||
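- Main process: when it detects that the caption engine process has ended, it performs the corresponding handling and reports back to the frontend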
|
||||
|
||||
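- Electron main process: when it detects that the caption engine process has ended, it performs the corresponding handling and reports back to the frontend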
|
||||
|
||||
## Features Already Implemented in the Project
|
||||
|
||||
@@ -45,21 +44,25 @@
|
||||
|
||||
```python
|
||||
from utils import stdout, stdout_cmd, stdout_obj, stderr
|
||||
stdout("Hello") # {"command": "print", "content": "Hello"}\n
|
||||
stdout_cmd("connect", "8080") # {"command": "connect", "content": "8080"}\n
|
||||
stdout_obj({"command": "print", "content": "Hello"})
|
||||
# {"command": "print", "content": "Hello"}\n
|
||||
stdout("Hello")
|
||||
# {"command": "connect", "content": "8080"}\n
|
||||
stdout_cmd("connect", "8080")
|
||||
# {"command": "print", "content": "print"}\n
|
||||
stdout_obj({"command": "print", "content": "print"})
|
||||
# sys.stderr.write("Error Info" + "\n")
|
||||
stderr("Error Info")
|
||||
```
|
||||
|
||||
### Creating a Socket Service
|
||||
|
||||
This Socket service listens on the specified port, parses the content sent by the Electron main program, and may change the value of `thread_data.status`.
|
||||
This Socket service listens on the specified port, parses the content sent by the Electron main program, and may change the value of `shared_data.status`.
|
||||
|
||||
Example:
|
||||
|
||||
```python
|
||||
from utils import start_server
|
||||
from utils import thread_data
|
||||
from utils import shared_data
|
||||
port = 8080
|
||||
start_server(port)
|
||||
while shared_data.status == 'running':
|
||||
@@ -72,13 +75,14 @@ while thread_data == 'running':
|
||||
The `AudioStream` class acquires audio data. Its implementation is cross-platform and supports Windows, Linux, and macOS. The class is initialized with two parameters:
|
||||
|
||||
- `audio_type`: the audio source, 0 for system output audio (speaker), 1 for system input audio (microphone)
|
||||
- `chunk_rate`: the audio acquisition frequency, that is, the number of audio chunks acquired per second
|
||||
- `chunk_rate`: the audio acquisition frequency, that is, the number of audio chunks acquired per second; the default is 10
|
||||
|
||||
The class has three methods:
|
||||
The class has four methods:
|
||||
|
||||
- `open_stream()`: starts audio acquisition
|
||||
- `read_chunk() -> bytes`: reads one audio chunk
|
||||
- `close_stream()`: stops audio acquisition
|
||||
- `close_stream_signal()`: closes the system audio input stream in a thread-safe way
|
||||
|
||||
Example:
|
||||
|
||||
@@ -97,11 +101,22 @@ stream.close_stream()
|
||||
|
||||
### Audio Processing
|
||||
|
||||
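The captured audio stream may need preprocessing before it is converted to text. Multi-channel audio usually has to be converted to mono, and resampling may also be required. This project provides three audio processing functions: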
|
||||
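The captured audio stream may need preprocessing before it is converted to text. Multi-channel audio usually has to be converted to mono, and resampling may also be required. This project provides two audio processing functions: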
|
||||
|
||||
- `merge_chunk_channels(chunk: bytes, channels: int) -> bytes`: converts a multi-channel audio chunk to a mono audio chunk
|
||||
- `resample_chunk_mono(chunk: bytes, channels: int, orig_sr: int, target_sr: int, mode="sinc_best") -> bytes`: converts the current multi-channel audio chunk to mono, then resamples it
|
||||
- `resample_mono_chunk(chunk: bytes, orig_sr: int, target_sr: int, mode="sinc_best") -> bytes`: resamples the current mono audio chunk
|
||||
- `resample_chunk_mono(chunk: bytes, channels: int, orig_sr: int, target_sr: int) -> bytes`: converts the current multi-channel audio chunk to mono, then resamples it
|
||||
|
||||
Example:
|
||||
|
||||
```python
|
||||
from sysaudio import AudioStream
|
||||
from utils import resample_chunk_mono
stream = AudioStream(1)
stream.open_stream()
while True:
    raw_chunk = stream.read_chunk()
    # convert to mono and resample to 16 kHz
    chunk = resample_chunk_mono(raw_chunk, stream.CHANNELS, stream.RATE, 16000)
    # do something with chunk
|
||||
```
|
||||
|
||||
## Features the Caption Engine Needs to Implement
|
||||
|
||||
@@ -109,20 +124,31 @@ stream.close_stream()
|
||||
|
||||
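Once a suitable audio stream is available, it needs to be converted to text. This is generally done with a model, either cloud-based or local; choose one that fits your requirements.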
|
||||
|
||||
It is recommended to wrap this part in a class that implements three methods:
|
||||
It is recommended to wrap this part in a class that implements four methods:
|
||||
|
||||
- `start(self)`: starts the model
|
||||
- `send_audio_frame(self, data: bytes)`: processes the current audio chunk; **the generated caption data is sent to the Electron main process through standard output**
|
||||
- `translate(self)`: continuously takes chunks from `shared_data.chunk_queue` and calls `send_audio_frame` to process them
|
||||
- `stop(self)`: stops the model
|
||||
|
||||
Complete caption engine examples:
|
||||
|
||||
- [gummy.py](../../engine/audio2text/gummy.py)
|
||||
- [vosk.py](../../engine/audio2text/vosk.py)
|
||||
- [sosv.py](../../engine/audio2text/sosv.py)
|
||||
|
||||
### Caption Translation
|
||||
|
||||
Some speech-to-text models do not provide translation. If translation is needed, add a separate translation module.
|
||||
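Some speech-to-text models do not provide translation. If translation is needed, add a separate translation module or use the built-in translation module.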
|
||||
|
||||
Example:
|
||||
|
||||
```python
|
||||
from utils import google_translate, ollama_translate
|
||||
text = "这是一个翻译测试。"
|
||||
google_translate("", "en", text, "time_s")
|
||||
ollama_translate("qwen3:0.6b", "en", text, "time_s")
|
||||
```
|
||||
|
||||
### Sending Caption Data
|
||||
|
||||
@@ -133,37 +159,42 @@ stream.close_stream()
|
||||
```typescript
|
||||
export interface CaptionItem {
|
||||
command: "caption",
|
||||
index: number, // Caption sequence number
time_s: string, // Start time of the current caption
time_t: string, // End time of the current caption
text: string, // Caption content
translation: string // Caption translation
|
||||
index: number, // Caption sequence number
time_s: string, // Start time of the current caption
time_t: string, // End time of the current caption
text: string, // Caption content
translation: string // Caption translation
|
||||
}
|
||||
```
|
||||
|
||||
**Note: the buffer must be flushed after every caption JSON output, so that each string the Electron main process receives can be parsed as a JSON object.**
|
||||
|
||||
It is recommended to send the data with the project's existing `stdout_obj` function.
|
||||
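**Note: the buffer must be flushed after every caption JSON output, so that each string the Electron main process receives can be parsed as a JSON object.** It is recommended to send the data with the project's existing `stdout_obj` function.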
|
||||
|
||||
### Specifying Command-Line Parameters
|
||||
|
||||
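The settings of a custom caption engine are passed as command-line arguments, so the engine must define its parameters accordingly. This project currently uses the following parameters: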
|
||||
|
||||
```python
|
||||
import argparse
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser(description='Convert system audio stream to text')
|
||||
# both
|
||||
# all
|
||||
parser.add_argument('-e', '--caption_engine', default='gummy', help='Caption engine: gummy or vosk')
|
||||
parser.add_argument('-a', '--audio_type', default=0, help='Audio stream source: 0 for output, 1 for input')
|
||||
parser.add_argument('-c', '--chunk_rate', default=10, help='Number of audio stream chunks collected per second')
|
||||
parser.add_argument('-p', '--port', default=8080, help='The port to run the server on, 0 for no server')
|
||||
parser.add_argument('-p', '--port', default=0, help='The port to run the server on, 0 for no server')
|
||||
parser.add_argument('-t', '--target_language', default='zh', help='Target language code, "none" for no translation')
|
||||
parser.add_argument('-r', '--record', default=0, help='Whether to record the audio, 0 for no recording, 1 for recording')
|
||||
parser.add_argument('-rp', '--record_path', default='', help='Path to save the recorded audio')
|
||||
# gummy and sosv
|
||||
parser.add_argument('-s', '--source_language', default='auto', help='Source language code')
|
||||
# gummy only
|
||||
parser.add_argument('-s', '--source_language', default='en', help='Source language code')
|
||||
parser.add_argument('-t', '--target_language', default='zh', help='Target language code')
|
||||
parser.add_argument('-k', '--api_key', default='', help='API KEY for Gummy model')
|
||||
# vosk and sosv
|
||||
parser.add_argument('-tm', '--translation_model', default='ollama', help='Model for translation: ollama or google')
|
||||
parser.add_argument('-omn', '--ollama_name', default='', help='Ollama model name for translation')
|
||||
# vosk only
|
||||
parser.add_argument('-m', '--model_path', default='', help='The path to the vosk model.')
|
||||
parser.add_argument('-vosk', '--vosk_model', default='', help='The path to the vosk model.')
|
||||
# sosv only
|
||||
parser.add_argument('-sosv', '--sosv_model', default=None, help='The SenseVoice model path')
|
||||
```
|
||||
|
||||
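For example, to run this project's caption engine with the Gummy model, Japanese as the source language, Chinese as the target language, captions generated from system audio output, and 0.1 s of audio per chunk, the command line is: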
|
||||
@@ -184,7 +215,7 @@ python main.py -e gummy -s ja -t zh -a 0 -c 10 -k <dashscope-api-key>
|
||||
|
||||
### Development Recommendations
|
||||
|
||||
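Apart from audio-to-text conversion, it is recommended to reuse this project's code directly. In that case, the following needs to be added: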
|
||||
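Apart from audio-to-text conversion and translation, it is recommended to reuse this project's code directly for everything else (audio acquisition, audio resampling, and communication with the main process). In that case, the following needs to be added: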
|
||||
|
||||
- `engine/audio2text/`: add a new audio-to-text class (at file level)
|
||||
- `engine/main.py`: add the new parameter settings and workflow function (see the `main_gummy` and `main_vosk` functions)
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
# Auto Caption User Manual
|
||||
|
||||
Corresponding Version: v0.6.0
|
||||
Corresponding Version: v1.0.0
|
||||
|
||||
**Note: Due to limited personal resources, the English and Japanese documentation files for this project (except for the README document) will no longer be maintained. The content of this document may not be consistent with the latest version of the project. If you are willing to help with translation, please submit relevant Pull Requests.**
|
||||
|
||||
@@ -43,9 +43,13 @@ Alibaba Cloud provides detailed tutorials for this part, which can be referenced
|
||||
|
||||
## Preparation for Using Vosk Engine
|
||||
|
||||
To use the Vosk local caption engine, first download your required model from the [Vosk Models](https://alphacephei.com/vosk/models) page. Then extract the downloaded model package locally and add the corresponding model folder path to the software settings. Currently, the Vosk caption engine does not support translated caption content.
|
||||
To use the Vosk local caption engine, first download your required model from the [Vosk Models](https://alphacephei.com/vosk/models) page. Then extract the downloaded model package locally and add the corresponding model folder path to the software settings.
|
||||
|
||||

|
||||

|
||||
|
||||
## Using the SOSV Model
|
||||
|
||||
The SOSV model is used in the same way as Vosk. It can be downloaded from: https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model
|
||||
|
||||
## Capturing System Audio Output on macOS
|
||||
|
||||
|
||||
@@ -1,11 +1,9 @@
|
||||
# Auto Caption User Manual
|
||||
|
||||
Corresponding version: v0.6.0
|
||||
Corresponding version: v1.0.0
|
||||
|
||||
This document was translated with a large language model, so parts of it may be inaccurate.
|
||||
|
||||
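**Note: Due to limited personal resources, the English and Japanese documentation for this project (except the README) is no longer maintained. The content of this document may not match the latest version of the project. If you are willing to help with translation, please submit a Pull Request.**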
|
||||
|
||||
## Software Overview
|
||||
|
||||
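Auto Caption is a cross-platform caption display application that captures streaming data from the system's audio input (recording) or output (playback) in real time and uses an audio-to-text model to generate captions for that audio. The default caption engine provided by the software (which uses the Alibaba Cloud Gummy model) supports recognition and translation of nine languages (Chinese, English, Japanese, Korean, German, French, Russian, Spanish, and Italian).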
|
||||
@@ -45,9 +43,13 @@ macOS プラットフォームでオーディオ出力を取得するには追
|
||||
|
||||
## Preparation for Using the Vosk Engine
|
||||
|
||||
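To use the Vosk local caption engine, first download the model you need from the [Vosk Models](https://alphacephei.com/vosk/models) page. Then extract the downloaded model package locally and add the path of the model folder to the software settings. Currently, the Vosk caption engine does not support translating captions.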
|
||||
To use the Vosk local caption engine, first download the model you need from the [Vosk Models](https://alphacephei.com/vosk/models) page. Then extract the downloaded model package locally and add the path of the model folder to the software settings.
|
||||
|
||||

|
||||

|
||||
|
||||
## Using the SOSV Model
|
||||
|
||||
The SOSV model is used in the same way as Vosk. It can be downloaded from: https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model
|
||||
|
||||
## Capturing System Audio Output on macOS
|
||||
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
# Auto Caption User Manual
|
||||
|
||||
Corresponding version: v0.6.0
|
||||
Corresponding version: v1.0.0
|
||||
|
||||
## Software Introduction
|
||||
|
||||
@@ -41,9 +41,13 @@ Auto Caption 是一个跨平台的字幕显示软件,能够实时获取系统
|
||||
|
||||
## Preparation for Using the Vosk Engine
|
||||
|
||||
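To use the Vosk local caption engine, first download the model you need from the [Vosk Models](https://alphacephei.com/vosk/models) page. Then extract the downloaded model package locally and add the path of the model folder to the software settings. Currently, the Vosk caption engine does not support translating captions.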
|
||||
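To use the Vosk local caption engine, first download the model you need from the [Vosk Models](https://alphacephei.com/vosk/models) page. Then extract the downloaded model package locally and add the path of the model folder to the software settings.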
|
||||
|
||||

|
||||

|
||||
|
||||
## Using the SOSV Model
|
||||
|
||||
The SOSV model is used in the same way as Vosk. It can be downloaded from: https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model
|
||||
|
||||
## Capturing System Audio Output on macOS
|
||||
|
||||
|
||||