release v1.0.0
81
README.md
@@ -3,7 +3,7 @@
|
||||
<h1 align="center">auto-caption</h1>
|
||||
<p>Auto Caption 是一个跨平台的实时字幕显示软件。</p>
|
||||
<p>
|
||||
<a href="https://github.com/HiMeditator/auto-caption/releases"><img src="https://img.shields.io/badge/release-0.7.0-blue"></a>
|
||||
<a href="https://github.com/HiMeditator/auto-caption/releases"><img src="https://img.shields.io/badge/release-1.0.0-blue"></a>
|
||||
<a href="https://github.com/HiMeditator/auto-caption/issues"><img src="https://img.shields.io/github/issues/HiMeditator/auto-caption?color=orange"></a>
|
||||
<img src="https://img.shields.io/github/languages/top/HiMeditator/auto-caption?color=royalblue">
|
||||
<img src="https://img.shields.io/github/repo-size/HiMeditator/auto-caption?color=green">
|
||||
@@ -14,14 +14,18 @@
|
||||
| <a href="./README_en.md">English</a>
|
||||
| <a href="./README_ja.md">日本語</a> |
|
||||
</p>
|
||||
<p><i>v0.7.0 版本已经发布,优化了软件界面,添加了日志记录显示。本地的字幕引擎正在尝试开发中,预计以 Python 代码的形式进行发布...</i></p>
|
||||
<p><i>v1.0.0 版本已经发布,新增 SOSV 本地字幕模型。更多的字幕模型正在尝试开发中...</i></p>
|
||||
</div>
|
||||
|
||||

|
||||
|
||||
## 📥 下载
|
||||
|
||||
[GitHub Releases](https://github.com/HiMeditator/auto-caption/releases)
|
||||
软件下载:[GitHub Releases](https://github.com/HiMeditator/auto-caption/releases)
|
||||
|
||||
Vosk 模型下载:[Vosk Models](https://alphacephei.com/vosk/models)
|
||||
|
||||
SOSV 模型下载:[Sherpa-ONNX SenseVoice Model](https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model)
|
||||
|
||||
## 📚 相关文档
|
||||
|
||||
@@ -29,51 +33,83 @@
|
||||
|
||||
[字幕引擎说明文档](./docs/engine-manual/zh.md)
|
||||
|
||||
[项目 API 文档](./docs/api-docs/)
|
||||
|
||||
[更新日志](./docs/CHANGELOG.md)
|
||||
|
||||
## ✨ 特性
|
||||
|
||||
- 生成音频输出或麦克风输入的字幕
|
||||
- 支持调用本地 Ollama 模型或云端 Google 翻译 API 进行翻译
|
||||
- 跨平台(Windows、macOS、Linux)、多界面语言(中文、英语、日语)支持
|
||||
- 丰富的字幕样式设置(字体、字体大小、字体粗细、字体颜色、背景颜色等)
|
||||
- 灵活的字幕引擎选择(阿里云 Gummy 云端模型、本地 Vosk 模型、自己开发的模型)
|
||||
- 灵活的字幕引擎选择(阿里云 Gummy 云端模型、本地 Vosk 模型、本地 SOSV 模型、还可以自己开发模型)
|
||||
- 多语言识别与翻译(见下文“⚙️ 自带字幕引擎说明”)
|
||||
- 字幕记录展示与导出(支持导出 `.srt` 和 `.json` 格式)
|
||||
|
||||
## 📖 基本使用
|
||||
|
||||
软件已经适配了 Windows、macOS 和 Linux 平台。测试过的平台信息如下:
|
||||
软件已经适配了 Windows、macOS 和 Linux 平台。测试过的主流平台信息如下:
|
||||
|
||||
| 操作系统版本 | 处理器架构 | 获取系统音频输入 | 获取系统音频输出 |
|
||||
| ------------------ | ---------- | ---------------- | ---------------- |
|
||||
| Windows 11 24H2 | x64 | ✅ | ✅ |
|
||||
| macOS Sequoia 15.5 | arm64 | ✅ [需要额外配置](./docs/user-manual/zh.md#macos-获取系统音频输出) | ✅ |
|
||||
| Ubuntu 24.04.2 | x64 | ✅ | ✅ |
|
||||
| Kali Linux 2022.3 | x64 | ✅ | ✅ |
|
||||
| Kylin Server V10 SP3 | x64 | ✅ | ✅ |
|
||||
|
||||
macOS 平台和 Linux 平台获取系统音频输出需要进行额外设置,详见[Auto Caption 用户手册](./docs/user-manual/zh.md)。
|
||||
macOS 平台和 Linux 平台获取系统音频输出需要进行额外设置,详见 [Auto Caption 用户手册](./docs/user-manual/zh.md)。
|
||||
|
||||
> 国际版的阿里云服务并没有提供 Gummy 模型,因此目前非中国用户无法使用 Gummy 字幕引擎。
|
||||
下载软件后,需要根据自己的需求选择对应的模型,然后配置模型。
|
||||
|
||||
| | 识别效果 | 部署类型 | 支持语言 | 翻译 | 备注 |
|
||||
| ------------------------------------------------------------ | -------- | ------------- | ---------- | ---------- | ---------------------------------------------------------- |
|
||||
| [Gummy](https://help.aliyun.com/zh/model-studio/gummy-speech-recognition-translation) | 很好😊 | 云端 / 阿里云 | 10 种 | 自带翻译 | 收费,0.54CNY / 小时 |
|
||||
| [Vosk](https://alphacephei.com/vosk) | 较差😞 | 本地 / CPU | 超过 30 种 | 需额外配置 | 支持的语言非常多 |
|
||||
| [SOSV](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html) | 一般😐 | 本地 / CPU | 5 种 | 需额外配置 | 仅有一个模型 |
|
||||
| 自己开发 | 🤔 | 自定义 | 自定义 | 自定义 | 根据[文档](./docs/engine-manual/zh.md)使用 Python 自己开发 |
|
||||
|
||||
如果你选择使用 Vosk 或 SOSV 模型,你还需要配置自己的翻译模型。
|
||||
|
||||
### 配置翻译模型
|
||||
|
||||

|
||||
|
||||
> 注意:翻译不是实时的,翻译模型只会在每句话识别完成后再调用。
|
||||
|
||||
#### Ollama 本地模型
|
||||
|
||||
> 注意:使用参数量过大的模型会导致资源消耗和翻译延迟较大。建议使用参数量小于 1B 的模型,比如: `qwen2.5:0.5b`, `qwen3:0.6b`。
|
||||
|
||||
使用该模型之前你需要确定本机安装了 [Ollama](https://ollama.com/) 软件,并已经下载了需要的大语言模型。只需要将需要调用的大模型名称添加到设置中的 `Ollama` 字段中。
|
||||
|
||||
#### Google 翻译 API
|
||||
|
||||
> 注意:Google 翻译 API 在部分地区无法使用。
|
||||
|
||||
无需任何配置,联网即可使用。
|
||||
|
||||
### 使用 Gummy 模型
|
||||
|
||||
> 国际版的阿里云服务似乎并没有提供 Gummy 模型,因此目前非中国用户可能无法使用 Gummy 字幕引擎。
|
||||
|
||||
如果要使用默认的 Gummy 字幕引擎(使用云端模型进行语音识别和翻译),首先需要获取阿里云百炼平台的 API KEY,然后将 API KEY 添加到软件设置中或者配置到环境变量中(仅 Windows 平台支持读取环境变量中的 API KEY),这样才能正常使用该模型。相关教程:
|
||||
|
||||
- [获取 API KEY](https://help.aliyun.com/zh/model-studio/get-api-key)
|
||||
- [将 API Key 配置到环境变量](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables)
|
||||
|
||||
### 使用 Vosk 模型
|
||||
|
||||
> Vosk 模型的识别效果较差,请谨慎使用。
|
||||
|
||||
如果要使用 Vosk 本地字幕引擎,首先需要在 [Vosk Models](https://alphacephei.com/vosk/models) 页面下载你需要的模型,并将模型解压到本地,并将模型文件夹的路径添加到软件的设置中。目前 Vosk 字幕引擎还不支持翻译字幕内容。
|
||||
如果要使用 Vosk 本地字幕引擎,首先需要在 [Vosk Models](https://alphacephei.com/vosk/models) 页面下载你需要的模型,并将模型解压到本地,并将模型文件夹的路径添加到软件的设置中。
|
||||
|
||||

|
||||

|
||||
|
||||
**如果你觉得上述字幕引擎不能满足你的需求,而且你会 Python,那么你可以考虑开发自己的字幕引擎。详细说明请参考[字幕引擎说明文档](./docs/engine-manual/zh.md)。**
|
||||
### 使用 SOSV 模型
|
||||
|
||||
使用 SOSV 模型的方式和 Vosk 一样,下载地址如下:https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model
|
||||
|
||||
## ⚙️ 自带字幕引擎说明
|
||||
|
||||
目前软件自带 2 个字幕引擎,正在规划新的引擎。它们的详细信息如下。
|
||||
目前软件自带 3 个字幕引擎,正在规划新的引擎。它们的详细信息如下。
|
||||
|
||||
### Gummy 字幕引擎(云端)
|
||||
|
||||
@@ -102,7 +138,12 @@ $$
|
||||
|
||||
### Vosk 字幕引擎(本地)
|
||||
|
||||
基于 [vosk-api](https://github.com/alphacep/vosk-api) 开发。目前只支持生成音频对应的原文,不支持生成翻译内容。
|
||||
基于 [vosk-api](https://github.com/alphacep/vosk-api) 开发。该字幕引擎的优点是可选的语言模型非常多(超过 30 种),缺点是识别效果比较差,且生成内容没有标点符号。
|
||||
|
||||
|
||||
### SOSV 字幕引擎(本地)
|
||||
|
||||
[SOSV](https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model) 是一个整合包,该整合包主要基于 [Sherpa-ONNX SenseVoice](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html),并添加了端点检测模型和标点恢复模型。该模型支持识别的语言有:英语、中文、日语、韩语、粤语。
|
||||
|
||||
### 新规划字幕引擎
|
||||
|
||||
@@ -112,6 +153,7 @@ $$
|
||||
- [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx)
|
||||
- [SenseVoice](https://github.com/FunAudioLLM/SenseVoice)
|
||||
- [FunASR](https://github.com/modelscope/FunASR)
|
||||
- [WhisperLiveKit](https://github.com/QuentinFuxa/WhisperLiveKit)
|
||||
|
||||
## 🚀 项目运行
|
||||
|
||||
@@ -128,6 +170,7 @@ npm install
|
||||
首先进入 `engine` 文件夹,执行如下指令创建虚拟环境(需要使用大于等于 Python 3.10 的 Python 运行环境,建议使用 Python 3.12):
|
||||
|
||||
```bash
|
||||
cd ./engine
|
||||
# in ./engine folder
|
||||
python -m venv .venv
|
||||
# or
|
||||
@@ -149,12 +192,6 @@ source .venv/bin/activate
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
如果在 Linux 系统上安装 `samplerate` 模块报错,可以尝试使用以下命令单独安装:
|
||||
|
||||
```bash
|
||||
pip install samplerate --only-binary=:all:
|
||||
```
|
||||
|
||||
然后使用 `pyinstaller` 构建项目:
|
||||
|
||||
```bash
|
||||
|
||||
87
README_en.md
@@ -3,7 +3,7 @@
|
||||
<h1 align="center">auto-caption</h1>
|
||||
<p>Auto Caption is a cross-platform real-time caption display software.</p>
|
||||
<p>
|
||||
<a href="https://github.com/HiMeditator/auto-caption/releases"><img src="https://img.shields.io/badge/release-0.7.0-blue"></a>
|
||||
<a href="https://github.com/HiMeditator/auto-caption/releases"><img src="https://img.shields.io/badge/release-1.0.0-blue"></a>
|
||||
<a href="https://github.com/HiMeditator/auto-caption/issues"><img src="https://img.shields.io/github/issues/HiMeditator/auto-caption?color=orange"></a>
|
||||
<img src="https://img.shields.io/github/languages/top/HiMeditator/auto-caption?color=royalblue">
|
||||
<img src="https://img.shields.io/github/repo-size/HiMeditator/auto-caption?color=green">
|
||||
@@ -14,14 +14,18 @@
|
||||
| <b>English</b>
|
||||
| <a href="./README_ja.md">日本語</a> |
|
||||
</p>
|
||||
<p><i>Version 0.7.0 has been released, improving the software interface and adding software log display. The local caption engine is under development and is expected to be released in the form of Python code...</i></p>
|
||||
<p><i>Version 1.0.0 has been released, with the addition of the SOSV local caption model. More caption models are being developed...</i></p>
|
||||
</div>
|
||||
|
||||

|
||||
|
||||
## 📥 Download
|
||||
|
||||
[GitHub Releases](https://github.com/HiMeditator/auto-caption/releases)
|
||||
Software Download: [GitHub Releases](https://github.com/HiMeditator/auto-caption/releases)
|
||||
|
||||
Vosk Model Download: [Vosk Models](https://alphacephei.com/vosk/models)
|
||||
|
||||
SOSV Model Download: [Sherpa-ONNX SenseVoice Model](https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model)
|
||||
|
||||
## 📚 Documentation
|
||||
|
||||
@@ -29,16 +33,15 @@
|
||||
|
||||
[Caption Engine Documentation](./docs/engine-manual/en.md)
|
||||
|
||||
[Project API Documentation (Chinese)](./docs/api-docs/)
|
||||
|
||||
[Changelog](./docs/CHANGELOG.md)
|
||||
|
||||
## ✨ Features
|
||||
|
||||
- Generate captions from audio output or microphone input
|
||||
- Supports translation by calling local Ollama models or cloud-based Google Translate API
|
||||
- Cross-platform (Windows, macOS, Linux) and multi-language interface (Chinese, English, Japanese) support
|
||||
- Rich caption style settings (font, font size, font weight, font color, background color, etc.)
|
||||
- Flexible caption engine selection (Alibaba Cloud Gummy cloud model, local Vosk model, self-developed model)
|
||||
- Flexible caption engine selection (Alibaba Cloud Gummy cloud model, local Vosk model, local SOSV model, or you can develop your own model)
|
||||
- Multi-language recognition and translation (see below "⚙️ Built-in Subtitle Engines")
|
||||
- Subtitle record display and export (supports exporting `.srt` and `.json` formats)
|
||||
|
||||
@@ -51,29 +54,63 @@ The software has been adapted for Windows, macOS, and Linux platforms. The teste
|
||||
| Windows 11 24H2 | x64 | ✅ | ✅ |
|
||||
| macOS Sequoia 15.5 | arm64 | ✅ [Additional config required](./docs/user-manual/en.md#capturing-system-audio-output-on-macos) | ✅ |
|
||||
| Ubuntu 24.04.2 | x64 | ✅ | ✅ |
|
||||
| Kali Linux 2022.3 | x64 | ✅ | ✅ |
|
||||
| Kylin Server V10 SP3 | x64 | ✅ | ✅ |
|
||||
|
||||
Additional configuration is required to capture system audio output on macOS and Linux platforms. See [Auto Caption User Manual](./docs/user-manual/en.md) for details.
|
||||
|
||||
> The international version of Alibaba Cloud services does not provide the Gummy model, so non-Chinese users currently cannot use the Gummy caption engine.
|
||||
|
||||
To use the default Gummy caption engine (which uses cloud-based models for speech recognition and translation), you first need to obtain an API KEY from the Alibaba Cloud Bailian platform. Then add the API KEY to the software settings or configure it in environment variables (only Windows platform supports reading API KEY from environment variables) to properly use this model. Related tutorials:
|
||||
After downloading the software, you need to select the corresponding model according to your needs and then configure the model.
|
||||
|
||||
- [Obtaining API KEY (Chinese)](https://help.aliyun.com/zh/model-studio/get-api-key)
|
||||
- [Configuring API Key through Environment Variables (Chinese)](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables)
|
||||
| | Recognition Quality | Deployment Type | Supported Languages | Translation | Notes |
|
||||
| ------------------------------------------------------------ | ------------------- | ------------------ | ------------------- | ------------- | ---------------------------------------------------------- |
|
||||
| [Gummy](https://help.aliyun.com/zh/model-studio/gummy-speech-recognition-translation) | Excellent 😊 | Cloud / Alibaba Cloud | 10 languages | Built-in | Paid, 0.54 CNY/hour |
|
||||
| [Vosk](https://alphacephei.com/vosk) | Poor 😞 | Local / CPU | Over 30 languages | Requires setup | Supports many languages |
|
||||
| [SOSV](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html) | Fair 😐 | Local / CPU | 5 languages | Requires setup | Only one model available |
|
||||
| DIY Development | 🤔 | Custom | Custom | Custom | Develop your own using Python according to [documentation](./docs/engine-manual/zh.md) |
|
||||
|
||||
> The recognition performance of Vosk models is suboptimal, please use with caution.
|
||||
If you choose to use the Vosk or SOSV model, you also need to configure your own translation model.
|
||||
|
||||
To use the Vosk local caption engine, first download your required model from [Vosk Models](https://alphacephei.com/vosk/models) page, extract the model locally, and add the model folder path to the software settings. Currently, the Vosk caption engine does not support translated captions.
|
||||
### Configuring Translation Models
|
||||
|
||||

|
||||

|
||||
|
||||
**If you find the above caption engines don't meet your needs and you know Python, you may consider developing your own caption engine. For detailed instructions, please refer to the [Caption Engine Documentation](./docs/engine-manual/en.md).**
|
||||
> Note: Translation is not real-time. The translation model is only called after each sentence recognition is completed.
|
||||
|
||||
#### Ollama Local Model
|
||||
|
||||
> Note: Using models with too many parameters will lead to high resource consumption and translation delays. It is recommended to use models with less than 1B parameters, such as: `qwen2.5:0.5b`, `qwen3:0.6b`.
|
||||
|
||||
Before using this model, you need to ensure that [Ollama](https://ollama.com/) software is installed on your machine and the required large language model has been downloaded. Simply add the name of the large model you want to call to the `Ollama` field in the settings.
|
||||
|
||||
#### Google Translate API
|
||||
|
||||
> Note: Google Translate API is not available in some regions.
|
||||
|
||||
No configuration required, just connect to the internet to use.
|
||||
|
||||
### Using Gummy Model
|
||||
|
||||
> The international version of Alibaba Cloud services does not seem to provide the Gummy model, so non-Chinese users may not be able to use the Gummy caption engine at present.
|
||||
|
||||
To use the default Gummy caption engine (which uses cloud models for speech recognition and translation), first obtain an API KEY from the Alibaba Cloud Bailian platform, then add the API KEY to the software settings or configure it as an environment variable (only the Windows platform supports reading the API KEY from environment variables). Related tutorials:
|
||||
|
||||
- [Get API KEY](https://help.aliyun.com/zh/model-studio/get-api-key)
|
||||
- [Configure API Key through Environment Variables](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables)
|
||||
|
||||
### Using Vosk Model
|
||||
|
||||
> The recognition effect of the Vosk model is poor, please use it with caution.
|
||||
|
||||
To use the Vosk local caption engine, first download the model you need from the [Vosk Models](https://alphacephei.com/vosk/models) page, unzip the model locally, and add the path of the model folder to the software settings.
|
||||
|
||||

|
||||
|
||||
### Using SOSV Model
|
||||
|
||||
The way to use the SOSV model is the same as Vosk. The download address is as follows: https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model
|
||||
|
||||
## ⚙️ Built-in Subtitle Engines
|
||||
|
||||
Currently, the software comes with 2 subtitle engines, with new engines under development. Their detailed information is as follows.
|
||||
Currently, the software comes with 3 caption engines, with new engines under development. Their detailed information is as follows.
|
||||
|
||||
### Gummy Subtitle Engine (Cloud)
|
||||
|
||||
@@ -92,7 +129,7 @@ Developed based on Tongyi Lab's [Gummy Speech Translation Model](https://help.al
|
||||
|
||||
**Network Traffic Consumption:**
|
||||
|
||||
The subtitle engine uses native sample rate (assumed to be 48kHz) for sampling, with 16bit sample depth and mono channel, so the upload rate is approximately:
|
||||
The caption engine uses native sample rate (assumed to be 48kHz) for sampling, with 16bit sample depth and mono channel, so the upload rate is approximately:
|
||||
|
||||
$$
|
||||
48000\ \text{samples/second} \times 2\ \text{bytes/sample} \times 1\ \text{channel} = 96000\ \text{B/s} \approx 93.75\ \text{KB/s}
|
||||
@@ -102,7 +139,11 @@ The engine only uploads data when receiving audio streams, so the actual upload
|
||||
|
||||
### Vosk Subtitle Engine (Local)
|
||||
|
||||
Developed based on [vosk-api](https://github.com/alphacep/vosk-api). Currently only supports generating original text from audio, does not support translation content.
|
||||
Developed based on [vosk-api](https://github.com/alphacep/vosk-api). Its advantage is the large number of available language models (over 30 languages); its drawbacks are relatively poor recognition quality and output without punctuation.
|
||||
|
||||
### SOSV Subtitle Engine (Local)
|
||||
|
||||
[SOSV](https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model) is an integrated package based mainly on [Sherpa-ONNX SenseVoice](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html), with an added endpoint detection model and punctuation restoration model. It supports recognition of English, Chinese, Japanese, Korean, and Cantonese.
|
||||
|
||||
### Planned New Subtitle Engines
|
||||
|
||||
@@ -112,6 +153,7 @@ The following are candidate models that will be selected based on model performa
|
||||
- [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx)
|
||||
- [SenseVoice](https://github.com/FunAudioLLM/SenseVoice)
|
||||
- [FunASR](https://github.com/modelscope/FunASR)
|
||||
- [WhisperLiveKit](https://github.com/QuentinFuxa/WhisperLiveKit)
|
||||
|
||||
## 🚀 Project Setup
|
||||
|
||||
@@ -128,6 +170,7 @@ npm install
|
||||
First enter the `engine` folder and execute the following commands to create a virtual environment (requires Python 3.10 or higher, with Python 3.12 recommended):
|
||||
|
||||
```bash
|
||||
cd ./engine
|
||||
# in ./engine folder
|
||||
python -m venv .venv
|
||||
# or
|
||||
@@ -149,12 +192,6 @@ Then install dependencies (this step might result in errors on macOS and Linux,
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
If you encounter errors when installing the `samplerate` module on Linux systems, you can try installing it separately with this command:
|
||||
|
||||
```bash
|
||||
pip install samplerate --only-binary=:all:
|
||||
```
|
||||
|
||||
Then use `pyinstaller` to build the project:
|
||||
|
||||
```bash
|
||||
|
||||
82
README_ja.md
@@ -3,7 +3,7 @@
|
||||
<h1 align="center">auto-caption</h1>
|
||||
<p>Auto Caption はクロスプラットフォームのリアルタイム字幕表示ソフトウェアです。</p>
|
||||
<p>
|
||||
<a href="https://github.com/HiMeditator/auto-caption/releases"><img src="https://img.shields.io/badge/release-0.7.0-blue"></a>
|
||||
<a href="https://github.com/HiMeditator/auto-caption/releases"><img src="https://img.shields.io/badge/release-1.0.0-blue"></a>
|
||||
<a href="https://github.com/HiMeditator/auto-caption/issues"><img src="https://img.shields.io/github/issues/HiMeditator/auto-caption?color=orange"></a>
|
||||
<img src="https://img.shields.io/github/languages/top/HiMeditator/auto-caption?color=royalblue">
|
||||
<img src="https://img.shields.io/github/repo-size/HiMeditator/auto-caption?color=green">
|
||||
@@ -14,14 +14,18 @@
|
||||
| <a href="./README_en.md">English</a>
|
||||
| <b>日本語</b> |
|
||||
</p>
|
||||
<p><i>バージョン 0.7.0 がリリースされ、ソフトウェアインターフェースが最適化され、ログ記録表示機能が追加されました。ローカルの字幕エンジンは現在開発中であり、Pythonコードの形式でリリースされる予定です...</i></p>
|
||||
<p><i>v1.0.0 バージョンがリリースされ、SOSV ローカル字幕モデルが追加されました。より多くの字幕モデルが開発中です...</i></p>
|
||||
</div>
|
||||
|
||||

|
||||
|
||||
## 📥 ダウンロード
|
||||
|
||||
[GitHub Releases](https://github.com/HiMeditator/auto-caption/releases)
|
||||
ソフトウェアダウンロード: [GitHub Releases](https://github.com/HiMeditator/auto-caption/releases)
|
||||
|
||||
Vosk モデルダウンロード: [Vosk Models](https://alphacephei.com/vosk/models)
|
||||
|
||||
SOSV モデルダウンロード: [Sherpa-ONNX SenseVoice Model](https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model)
|
||||
|
||||
## 📚 関連ドキュメント
|
||||
|
||||
@@ -29,16 +33,15 @@
|
||||
|
||||
[字幕エンジン説明ドキュメント](./docs/engine-manual/ja.md)
|
||||
|
||||
[プロジェクト API ドキュメント(中国語)](./docs/api-docs/)
|
||||
|
||||
[更新履歴](./docs/CHANGELOG.md)
|
||||
|
||||
## ✨ 特徴
|
||||
|
||||
- 音声出力またはマイク入力からの字幕生成
|
||||
- ローカルのOllamaモデルまたはクラウドベースのGoogle翻訳APIを呼び出して翻訳をサポート
|
||||
- クロスプラットフォーム(Windows、macOS、Linux)、多言語インターフェース(中国語、英語、日本語)対応
|
||||
- 豊富な字幕スタイル設定(フォント、フォントサイズ、フォント太さ、フォント色、背景色など)
|
||||
- 柔軟な字幕エンジン選択(阿里雲 Gummy クラウドモデル、ローカル Vosk モデル、独自開発モデル)
|
||||
- 柔軟な字幕エンジン選択(阿里云Gummyクラウドモデル、ローカルVoskモデル、ローカルSOSVモデル、または独自にモデルを開発可能)
|
||||
- 多言語認識と翻訳(下記「⚙️ 字幕エンジン説明」参照)
|
||||
- 字幕記録表示とエクスポート(`.srt` および `.json` 形式のエクスポートに対応)
|
||||
|
||||
@@ -56,24 +59,59 @@
|
||||
|
||||
macOS および Linux プラットフォームでシステムオーディオ出力を取得するには追加設定が必要です。詳細は[Auto Captionユーザーマニュアル](./docs/user-manual/ja.md)をご覧ください。
|
||||
|
||||
> 阿里雲の国際版サービスでは Gummy モデルを提供していないため、現在中国以外のユーザーは Gummy 字幕エンジンを使用できません。
|
||||
ソフトウェアをダウンロードした後、自分のニーズに応じて対応するモデルを選択し、モデルを設定する必要があります。
|
||||
|
||||
デフォルトの Gummy 字幕エンジン(クラウドベースのモデルを使用した音声認識と翻訳)を使用するには、まず阿里雲百煉プラットフォームから API KEY を取得する必要があります。その後、API KEY をソフトウェア設定に追加するか、環境変数に設定します(Windows プラットフォームのみ環境変数からの API KEY 読み取りをサポート)。関連チュートリアル:
|
||||
| | 認識効果 | デプロイタイプ | 対応言語 | 翻訳 | 備考 |
|
||||
| ------------------------------------------------------------ | -------- | ----------------- | ---------- | ---------- | ---------------------------------------------------------- |
|
||||
| [Gummy](https://help.aliyun.com/zh/model-studio/gummy-speech-recognition-translation) | 良好😊 | クラウド / 阿里云 | 10種 | 内蔵翻訳 | 有料、0.54CNY / 時間 |
|
||||
| [Vosk](https://alphacephei.com/vosk) | 不良😞 | ローカル / CPU | 30種以上 | 追加設定必要 | 対応言語が非常に多い |
|
||||
| [SOSV](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html) | 一般😐 | ローカル / CPU | 5種 | 追加設定必要 | モデルは一つのみ |
|
||||
| 自前開発 | 🤔 | カスタム | カスタム | カスタム | [ドキュメント](./docs/engine-manual/zh.md)に従ってPythonで自前開発 |
|
||||
|
||||
- [API KEY の取得(中国語)](https://help.aliyun.com/zh/model-studio/get-api-key)
|
||||
- [環境変数を通じて API Key を設定(中国語)](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables)
|
||||
VoskまたはSOSVモデルを使用する場合、独自の翻訳モデルも設定する必要があります。
|
||||
|
||||
> Vosk モデルの認識精度は低いため、注意してご使用ください。
|
||||
### 翻訳モデルの設定
|
||||
|
||||
Vosk ローカル字幕エンジンを使用するには、まず [Vosk Models](https://alphacephei.com/vosk/models) ページから必要なモデルをダウンロードし、ローカルに解凍した後、モデルフォルダのパスをソフトウェア設定に追加してください。現在、Vosk 字幕エンジンは字幕の翻訳をサポートしていません。
|
||||

|
||||
|
||||

|
||||
> 注意:翻訳はリアルタイムではありません。翻訳モデルは各文の認識が完了した後にのみ呼び出されます。
|
||||
|
||||
**上記の字幕エンジンがご要望を満たさず、かつ Python の知識をお持ちの場合、独自の字幕エンジンを開発することも可能です。詳細な説明は[字幕エンジン説明書](./docs/engine-manual/ja.md)をご参照ください。**
|
||||
#### Ollama ローカルモデル
|
||||
|
||||
> 注意:パラメータ数が多すぎるモデルを使用すると、リソース消費と翻訳遅延が大きくなります。1B未満のパラメータ数のモデルを使用することを推奨します。例:`qwen2.5:0.5b`、`qwen3:0.6b`。
|
||||
|
||||
このモデルを使用する前に、ローカルマシンに[Ollama](https://ollama.com/)ソフトウェアがインストールされ、必要な大規模言語モデルがダウンロードされていることを確認してください。必要な大規模モデル名を設定の`Ollama`フィールドに追加するだけでOKです。
|
||||
|
||||
#### Google翻訳API
|
||||
|
||||
> 注意:Google翻訳APIは一部の地域では使用できません。
|
||||
|
||||
設定不要で、ネット接続があれば使用できます。
|
||||
|
||||
### Gummyモデルの使用
|
||||
|
||||
> 阿里云の国際版サービスにはGummyモデルが提供されていないため、現在中国以外のユーザーはGummy字幕エンジンを使用できない可能性があります。
|
||||
|
||||
デフォルトのGummy字幕エンジン(クラウドモデルを使用した音声認識と翻訳)を使用するには、まず阿里云百煉プラットフォームのAPI KEYを取得し、API KEYをソフトウェア設定に追加するか環境変数に設定する必要があります(Windowsプラットフォームのみ環境変数からのAPI KEY読み取りをサポート)。関連チュートリアル:
|
||||
|
||||
- [API KEYの取得](https://help.aliyun.com/zh/model-studio/get-api-key)
|
||||
- [環境変数へのAPI Keyの設定](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables)
|
||||
|
||||
### Voskモデルの使用
|
||||
|
||||
> Voskモデルの認識効果は不良のため、注意して使用してください。
|
||||
|
||||
Voskローカル字幕エンジンを使用するには、まず[Vosk Models](https://alphacephei.com/vosk/models)ページから必要なモデルをダウンロードし、ローカルにモデルを解凍し、モデルフォルダのパスをソフトウェア設定に追加してください。
|
||||
|
||||

|
||||
|
||||
### SOSVモデルの使用
|
||||
|
||||
SOSVモデルの使用方法はVoskと同じで、ダウンロードアドレスは以下の通りです:https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model
|
||||
|
||||
## ⚙️ 字幕エンジン説明
|
||||
|
||||
現在、ソフトウェアには2つの字幕エンジンが搭載されており、新しいエンジンが計画されています。それらの詳細情報は以下の通りです。
|
||||
現在、ソフトウェアには3つの字幕エンジンが搭載されており、新しいエンジンが計画されています。それらの詳細情報は以下の通りです。
|
||||
|
||||
### Gummy 字幕エンジン(クラウド)
|
||||
|
||||
@@ -102,7 +140,11 @@ $$
|
||||
|
||||
### Vosk字幕エンジン(ローカル)
|
||||
|
||||
[vosk-api](https://github.com/alphacep/vosk-api) をベースに開発されています。現在は音声に対応する原文の生成のみをサポートしており、翻訳コンテンツはサポートしていません。
|
||||
[vosk-api](https://github.com/alphacep/vosk-api)をベースに開発。この字幕エンジンの利点は選択可能な言語モデルが非常に多く(30言語以上)、欠点は認識効果が比較的悪く、生成内容に句読点がないことです。
|
||||
|
||||
### SOSV 字幕エンジン(ローカル)
|
||||
|
||||
[SOSV](https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model)は統合パッケージで、主に[Sherpa-ONNX SenseVoice](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html)をベースにし、エンドポイント検出モデルと句読点復元モデルを追加しています。このモデルが認識をサポートする言語は:英語、中国語、日本語、韓国語、広東語です。
|
||||
|
||||
### 新規計画字幕エンジン
|
||||
|
||||
@@ -112,6 +154,7 @@ $$
|
||||
- [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx)
|
||||
- [SenseVoice](https://github.com/FunAudioLLM/SenseVoice)
|
||||
- [FunASR](https://github.com/modelscope/FunASR)
|
||||
- [WhisperLiveKit](https://github.com/QuentinFuxa/WhisperLiveKit)
|
||||
|
||||
## 🚀 プロジェクト実行
|
||||
|
||||
@@ -128,6 +171,7 @@ npm install
|
||||
まず `engine` フォルダに入り、以下のコマンドを実行して仮想環境を作成します(Python 3.10 以上が必要で、Python 3.12 が推奨されます):
|
||||
|
||||
```bash
|
||||
cd ./engine
|
||||
# ./engine フォルダ内
|
||||
python -m venv .venv
|
||||
# または
|
||||
@@ -149,12 +193,6 @@ source .venv/bin/activate
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
Linux システムで `samplerate` モジュールのインストールに問題が発生した場合、以下のコマンドで個別にインストールを試すことができます:
|
||||
|
||||
```bash
|
||||
pip install samplerate --only-binary=:all:
|
||||
```
|
||||
|
||||
その後、`pyinstaller` を使用してプロジェクトをビルドします:
|
||||
|
||||
```bash
|
||||
|
||||
BIN
assets/media/config_en.png
Normal file
|
After Width: | Height: | Size: 68 KiB |
BIN
assets/media/config_ja.png
Normal file
|
After Width: | Height: | Size: 69 KiB |
BIN
assets/media/config_zh.png
Normal file
|
After Width: | Height: | Size: 72 KiB |
BIN
assets/media/engine_en.png
Normal file
|
After Width: | Height: | Size: 60 KiB |
BIN
assets/media/engine_ja.png
Normal file
|
After Width: | Height: | Size: 62 KiB |
BIN
assets/media/engine_zh.png
Normal file
|
After Width: | Height: | Size: 81 KiB |
|
Before Width: | Height: | Size: 400 KiB After Width: | Height: | Size: 404 KiB |
|
Before Width: | Height: | Size: 413 KiB After Width: | Height: | Size: 417 KiB |
|
Before Width: | Height: | Size: 416 KiB After Width: | Height: | Size: 417 KiB |
|
Before Width: | Height: | Size: 68 KiB |
|
Before Width: | Height: | Size: 76 KiB |
|
Before Width: | Height: | Size: 74 KiB |
@@ -156,15 +156,20 @@
|
||||
- 更清晰的日志输出
|
||||
|
||||
|
||||
## v0.8.0
|
||||
## v1.0.0
|
||||
|
||||
2025-09-??
|
||||
2025-09-08
|
||||
|
||||
### 新增功能
|
||||
|
||||
- 字幕引擎添加超时关闭功能:如果在规定时间字幕引擎没有启动成功会自动关闭、在字幕引擎启动过程中也可选择关闭字幕引擎
|
||||
- 添加非实时翻译功能:支持调用 Ollama 本地模型进行翻译、支持调用 Google 翻译 API 进行翻译
|
||||
- 字幕引擎添加超时关闭功能:如果在规定时间字幕引擎没有启动成功会自动关闭;在字幕引擎启动过程中可选择关闭字幕引擎
|
||||
- 添加非实时翻译功能:支持调用 Ollama 本地模型进行翻译;支持调用 Google 翻译 API 进行翻译
|
||||
- 添加新的字幕模型:添加 SOSV 模型,支持识别英语、中文、日语、韩语、粤语
|
||||
- 添加录音功能:可以将字幕引擎识别的音频保存为 .wav 文件
|
||||
- 添加多行字幕功能,用户可以设置字幕窗口显示的字幕的行数
|
||||
|
||||
### 优化体验
|
||||
|
||||
- 优化部分提示信息显示位置
|
||||
- 替换重采样模型,提高音频重采样质量
|
||||
- 带有额外信息的标签颜色改为与主题色一致
|
||||
@@ -21,6 +21,8 @@
|
||||
- [x] 复制字幕记录可选择只复制最近的字幕记录 *2025/08/18*
|
||||
- [x] 添加颜色主题设置 *2025/08/18*
|
||||
- [x] 前端页面添加日志内容展示 *2025/08/19*
|
||||
- [x] 添加 Ollama 模型用于本地字幕引擎的翻译 *2025/09/04*
|
||||
- [x] 验证 / 添加基于 sherpa-onnx 的字幕引擎 *2025/09/06*
|
||||
|
||||
## 待完成
|
||||
|
||||
@@ -29,7 +31,6 @@
|
||||
|
||||
## 后续计划
|
||||
|
||||
- [ ] 添加 Ollama 模型用于本地字幕引擎的翻译
|
||||
- [ ] 验证 / 添加基于 FunASR 的字幕引擎
|
||||
- [ ] 减小软件不必要的体积
|
||||
|
||||
|
||||
@@ -1,175 +1,207 @@
|
||||
# Caption Engine Documentation
|
||||
# Caption Engine Documentation
|
||||
|
||||
Corresponding Version: v0.6.0
|
||||
Corresponding Version: v1.0.0
|
||||
|
||||

|
||||

|
||||
|
||||
## Introduction to the Caption Engine
|
||||
## Introduction to the Caption Engine
|
||||
|
||||
The so-called caption engine is essentially a subprogram that continuously captures real-time streaming data from the system's audio input (microphone) or output (speakers) and invokes an audio-to-text model to generate corresponding captions for the audio. The generated captions are converted into JSON-formatted string data and passed to the main program via standard output (ensuring the string can be correctly interpreted as a JSON object by the main program). The main program reads and interprets the caption data, processes it, and displays it in the window.
|
||||
The so-called caption engine is actually a subprocess that captures streaming data from system audio input (microphone) or output (speaker) in real-time, and invokes an audio-to-text model to generate captions for the corresponding audio. The generated captions are converted into JSON-formatted string data and transmitted to the main program through standard output (ensuring that the string received by the main program can be correctly interpreted as a JSON object). The main program reads and interprets the caption data, processes it, and displays it in the window.
|
||||
|
||||
**The communication standard between the caption engine process and the Electron main process is: [caption engine api-doc](../api-docs/caption-engine.md).**
|
||||
**Communication between the caption engine process and Electron main process follows the standard: [caption engine api-doc](../api-docs/caption-engine.md).**
|
||||
|
||||
## Workflow
|
||||
## Execution Flow
|
||||
|
||||
The communication flow between the main process and the caption engine:
|
||||
The communication flow between the main process and the caption engine:
|
||||
|
||||
### Starting the Engine
|
||||
### Starting the Engine
|
||||
|
||||
- **Main Process**: Uses `child_process.spawn()` to launch the caption engine process.
|
||||
- **Caption Engine Process**: Creates a TCP Socket server thread. After creation, it outputs a JSON object string via standard output, containing a `command` field with the value `connect`.
|
||||
- **Main Process**: Monitors the standard output of the caption engine process, attempts to split it line by line, parses it into a JSON object, and checks if the `command` field value is `connect`. If so, it connects to the TCP Socket server.
|
||||
- Electron main process: Use `child_process.spawn()` to start the caption engine process
|
||||
- Caption engine process: Create a TCP Socket server thread; once it is created, write a JSON object serialized to a string to standard output, containing a `command` field with the value `connect` (see the sketch after this list)
|
||||
- Main process: Listen to the caption engine process's standard output, try to split the standard output by lines, parse it into a JSON object, and check if the object's `command` field value is `connect`. If so, connect to the TCP Socket server
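A minimal sketch of this startup handshake on the engine side, reusing the project's `start_server` and `stdout_cmd` helpers shown later in this document (the port value is illustrative):

```python
from utils import start_server, stdout_cmd

port = 8080                        # illustrative port
start_server(port)                 # start the TCP Socket server thread
stdout_cmd("connect", str(port))   # tell the Electron main process which port to connect to
```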
|
||||
|
||||
### Caption Recognition
|
||||
### Caption Recognition
|
||||
|
||||
- **Caption Engine Process**: The main thread monitors system audio output, sends audio data chunks to the caption engine for parsing, and outputs the parsed caption data object strings via standard output.
|
||||
- **Main Process**: Continues to monitor the standard output of the caption engine and performs different operations based on the `command` field of the parsed object.
|
||||
- Caption engine process: Create a new thread to monitor system audio output, put acquired audio data chunks into a shared queue (`shared_data.chunk_queue`). The caption engine continuously reads audio data chunks from the shared queue and parses them. The caption engine may also create a new thread to perform translation operations. Finally, the caption engine sends parsed caption data object strings through standard output
|
||||
- Electron main process: Continuously listen to the caption engine's standard output and take different actions based on the parsed object's `command` field
|
||||
|
||||
### Closing the Engine
|
||||
### Stopping the Engine
|
||||
|
||||
- **Main Process**: When the user closes the caption engine via the frontend, the main process sends a JSON object string with the `command` field set to `stop` to the caption engine process via Socket communication.
|
||||
- **Caption Engine Process**: Receives the object string, parses it, and if the `command` field is `stop`, sets the global variable `thread_data.status` to `stop`.
|
||||
- **Caption Engine Process**: The main thread's loop for monitoring system audio output ends when `thread_data.status` is not `running`, releases resources, and terminates.
|
||||
- **Main Process**: Detects the termination of the caption engine process, performs corresponding cleanup, and provides feedback to the frontend.
|
||||
- Electron main process: When the user closes the caption engine from the frontend, the main process sends an object string with the `command` field set to `stop` to the caption engine process over the Socket connection
|
||||
- Caption engine process: Receive the object string sent by the main process and parse it into an object. If the object's `command` field is `stop`, set the global variable `shared_data.status` to `stop`
|
||||
- Caption engine process: The main thread keeps monitoring the system audio output; once the value of `thread_data.status` is no longer `running`, it exits the loop, releases resources, and terminates
|
||||
- Electron main process: If the caption engine process termination is detected, perform corresponding processing and provide feedback to the frontend
|
||||
|
||||
## Implemented Features
|
||||
## Implemented Features
|
||||
|
||||
The following features are already implemented and can be reused directly.
|
||||
The following features have been implemented and can be directly reused.
|
||||
|
||||
### Standard Output
|
||||
### Standard Output
|
||||
|
||||
Supports printing general information, commands, and error messages.
|
||||
Can output regular information, commands, and error messages.
|
||||
|
||||
Example:
|
||||
Examples:
|
||||
|
||||
```python
|
||||
from utils import stdout, stdout_cmd, stdout_obj, stderr
|
||||
stdout("Hello") # {"command": "print", "content": "Hello"}\n
|
||||
stdout_cmd("connect", "8080") # {"command": "connect", "content": "8080"}\n
|
||||
stdout_obj({"command": "print", "content": "Hello"})
|
||||
stderr("Error Info")
|
||||
```
|
||||
```python
|
||||
from utils import stdout, stdout_cmd, stdout_obj, stderr
|
||||
# {"command": "print", "content": "Hello"}\n
|
||||
stdout("Hello")
|
||||
# {"command": "connect", "content": "8080"}\n
|
||||
stdout_cmd("connect", "8080")
|
||||
# {"command": "print", "content": "print"}\n
|
||||
stdout_obj({"command": "print", "content": "print"})
|
||||
# sys.stderr.write("Error Info" + "\n")
|
||||
stderr("Error Info")
|
||||
```
|
||||
|
||||
### Creating a Socket Service
|
||||
### Creating Socket Service
|
||||
|
||||
This Socket service listens on a specified port, parses content sent by the Electron main program, and may modify the value of `thread_data.status`.
|
||||
This Socket service listens on a specified port, parses content sent by the Electron main program, and may change the value of `shared_data.status`.
|
||||
|
||||
Example:
|
||||
Example:
|
||||
|
||||
```python
|
||||
from utils import start_server
|
||||
from utils import thread_data
|
||||
port = 8080
|
||||
start_server(port)
|
||||
while thread_data == 'running':
|
||||
# do something
|
||||
pass
|
||||
```
|
||||
```python
|
||||
from utils import start_server
|
||||
from utils import shared_data
|
||||
port = 8080
|
||||
start_server(port)
|
||||
while shared_data.status == 'running':
|
||||
# do something
|
||||
pass
|
||||
```
|
||||
|
||||
### Audio Capture
|
||||
### Audio Acquisition
|
||||
|
||||
The `AudioStream` class captures audio data and is cross-platform, supporting Windows, Linux, and macOS. Its initialization includes two parameters:
|
||||
The `AudioStream` class is used to acquire audio data, with cross-platform implementation supporting Windows, Linux, and macOS. The class initialization includes two parameters:
|
||||
|
||||
- `audio_type`: The type of audio to capture. `0` for system output audio (speakers), `1` for system input audio (microphone).
|
||||
- `chunk_rate`: The frequency of audio data capture, i.e., the number of audio chunks captured per second.
|
||||
- `audio_type`: Audio acquisition type, 0 for system output audio (speaker), 1 for system input audio (microphone)
|
||||
- `chunk_rate`: Audio data acquisition frequency, number of audio chunks acquired per second, default is 10
|
||||
|
||||
The class includes three methods:
|
||||
The class contains four methods:
|
||||
|
||||
- `open_stream()`: Starts audio capture.
|
||||
- `read_chunk() -> bytes`: Reads an audio chunk.
|
||||
- `close_stream()`: Stops audio capture.
|
||||
- `open_stream()`: Start audio acquisition
|
||||
- `read_chunk() -> bytes`: Read an audio chunk
|
||||
- `close_stream()`: Close audio acquisition
|
||||
- `close_stream_signal()`: Thread-safe closing of system audio input stream
|
||||
|
||||
Example:
|
||||
Example:
|
||||
|
||||
```python
|
||||
from sysaudio import AudioStream
|
||||
audio_type = 0
|
||||
chunk_rate = 20
|
||||
stream = AudioStream(audio_type, chunk_rate)
|
||||
stream.open_stream()
|
||||
while True:
|
||||
data = stream.read_chunk()
|
||||
# do something with data
|
||||
pass
|
||||
stream.close_stream()
|
||||
```
|
||||
```python
|
||||
from sysaudio import AudioStream
|
||||
audio_type = 0
|
||||
chunk_rate = 20
|
||||
stream = AudioStream(audio_type, chunk_rate)
|
||||
stream.open_stream()
|
||||
while True:
|
||||
data = stream.read_chunk()
|
||||
# do something with data
|
||||
pass
|
||||
stream.close_stream()
|
||||
```
|
||||
|
||||
### Audio Processing
|
||||
### Audio Processing
|
||||
|
||||
The captured audio stream may require preprocessing before conversion to text. Typically, multi-channel audio needs to be converted to mono, and resampling may be necessary. This project provides three audio processing functions:
|
||||
Before converting audio streams to text, preprocessing may be required. Usually, multi-channel audio needs to be converted to single-channel audio, and resampling may also be needed. This project provides two audio processing functions:
|
||||
|
||||
- `merge_chunk_channels(chunk: bytes, channels: int) -> bytes`: Converts a multi-channel audio chunk to mono.
|
||||
- `resample_chunk_mono(chunk: bytes, channels: int, orig_sr: int, target_sr: int, mode="sinc_best") -> bytes`: Converts a multi-channel audio chunk to mono and resamples it.
|
||||
- `resample_mono_chunk(chunk: bytes, orig_sr: int, target_sr: int, mode="sinc_best") -> bytes`: Resamples a mono audio chunk.
|
||||
- `merge_chunk_channels(chunk: bytes, channels: int) -> bytes`: Convert multi-channel audio chunks to single-channel audio chunks
|
||||
- `resample_chunk_mono(chunk: bytes, channels: int, orig_sr: int, target_sr: int) -> bytes`: Convert current multi-channel audio data chunks to single-channel audio data chunks, then perform resampling
|
||||
|
||||
## Features to Be Implemented in the Caption Engine
|
||||
Example:
|
||||
|
||||
### Audio-to-Text Conversion
|
||||
```python
|
||||
from sysaudio import AudioStream
|
||||
from utils import merge_chunk_channels
|
||||
stream = AudioStream(1)
|
||||
while True:
|
||||
raw_chunk = stream.read_chunk()
|
||||
chunk = resample_chunk_mono(raw_chunk, stream.CHANNELS, stream.RATE, 16000)
|
||||
# do something with chunk
|
||||
```
|
||||
|
||||
After obtaining a suitable audio stream, it needs to be converted to text. Typically, various models (cloud-based or local) are used for this purpose. Choose the appropriate model based on requirements.
|
||||
## Features to be Implemented by the Caption Engine
|
||||
|
||||
This part is recommended to be encapsulated as a class with three methods:
|
||||
### Audio to Text Conversion
|
||||
|
||||
- `start(self)`: Starts the model.
|
||||
- `send_audio_frame(self, data: bytes)`: Processes the current audio chunk data. **The generated caption data is sent to the Electron main process via standard output.**
|
||||
- `stop(self)`: Stops the model.
|
||||
After obtaining suitable audio streams, the audio stream needs to be converted to text. Generally, various models (cloud or local) are used to implement audio-to-text conversion. Appropriate models should be selected according to requirements.
|
||||
|
||||
Complete caption engine examples:
|
||||
It is recommended to encapsulate this as a class, implementing four methods:
|
||||
|
||||
- [gummy.py](../../engine/audio2text/gummy.py)
|
||||
- [vosk.py](../../engine/audio2text/vosk.py)
|
||||
- `start(self)`: Start the model
|
||||
- `send_audio_frame(self, data: bytes)`: Process current audio chunk data, **generated caption data is sent to Electron main process through standard output**
|
||||
- `translate(self)`: Continuously retrieve data chunks from `shared_data.chunk_queue` and call `send_audio_frame` method to process data chunks
|
||||
- `stop(self)`: Stop the model
|
||||
|
||||
### Caption Translation
|
||||
Complete caption engine examples:
|
||||
|
||||
Some speech-to-text models do not provide translation. If needed, a translation module must be added.
|
||||
- [gummy.py](../../engine/audio2text/gummy.py)
|
||||
- [vosk.py](../../engine/audio2text/vosk.py)
|
||||
- [sosv.py](../../engine/audio2text/sosv.py)
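For orientation, below is a minimal, illustrative sketch of such a class. It assumes the project's `stdout_obj` helper and the `shared_data` object (with `status` and `chunk_queue`) described above; `recognize()` is a hypothetical placeholder for a real speech model, and the timestamp format is only an example:

```python
import datetime
from utils import stdout_obj, shared_data  # assumed project helpers


class DummyEngine:
    """Illustrative audio-to-text engine skeleton, not a real recognizer."""

    def __init__(self):
        self.index = 0

    def start(self):
        # load or connect the actual recognition model here
        pass

    def send_audio_frame(self, data: bytes):
        text = self.recognize(data)  # hypothetical model call
        if not text:
            return
        now = datetime.datetime.now().strftime("%H:%M:%S.%f")[:-3]
        stdout_obj({
            "command": "caption",
            "index": self.index,
            "time_s": now,
            "time_t": now,
            "text": text,
            "translation": ""
        })
        self.index += 1

    def translate(self):
        # consume audio chunks queued by the capture thread
        while shared_data.status == 'running':
            chunk = shared_data.chunk_queue.get()
            self.send_audio_frame(chunk)

    def stop(self):
        # release model resources here
        pass

    def recognize(self, data: bytes) -> str:
        return ""  # placeholder: a real engine returns recognized text here
```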
|
||||
|
||||
### Sending Caption Data
|
||||
### Caption Translation
|
||||
|
||||
After obtaining the text for the current audio stream, it must be sent to the main program. The caption engine process passes caption data to the Electron main process via standard output.
|
||||
Some speech-to-text models do not provide translation. If needed, an additional translation module needs to be added, or built-in translation modules can be used.
|
||||
|
||||
The content must be a JSON string, with the JSON object including the following parameters:
|
||||
Example:
|
||||
|
||||
```typescript
|
||||
export interface CaptionItem {
|
||||
command: "caption",
|
||||
index: number, // Caption sequence number
|
||||
time_s: string, // Start time of the current caption
|
||||
time_t: string, // End time of the current caption
|
||||
text: string, // Caption content
|
||||
translation: string // Caption translation
|
||||
}
|
||||
```
|
||||
```python
|
||||
from utils import google_translate, ollama_translate
|
||||
text = "This is a translation test."
|
||||
google_translate("", "en", text, "time_s")
|
||||
ollama_translate("qwen3:0.6b", "en", text, "time_s")
|
||||
```
|
||||
|
||||
**Note: Ensure the buffer is flushed after each JSON output to guarantee the Electron main process receives a string that can be parsed as a JSON object.**
|
||||
### Caption Data Transmission
|
||||
|
||||
It is recommended to use the project's `stdout_obj` function for sending.
|
||||
After obtaining the text from the current audio stream, the text needs to be sent to the main program. The caption engine process transmits caption data to the Electron main process through standard output.
|
||||
|
||||
### Command-Line Parameter Specification
|
||||
The transmitted content must be a JSON string, where the JSON object needs to contain the following parameters:
|
||||
|
||||
Custom caption engine settings are provided via command-line arguments. The current project uses the following parameters:
|
||||
```typescript
|
||||
export interface CaptionItem {
|
||||
command: "caption",
|
||||
index: number, // Caption sequence number
|
||||
time_s: string, // Current caption start time
|
||||
time_t: string, // Current caption end time
|
||||
text: string, // Caption content
|
||||
translation: string // Caption translation
|
||||
}
|
||||
```
|
||||
|
||||
```python
|
||||
import argparse
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser(description='Convert system audio stream to text')
|
||||
# Common parameters
|
||||
parser.add_argument('-e', '--caption_engine', default='gummy', help='Caption engine: gummy or vosk')
|
||||
parser.add_argument('-a', '--audio_type', default=0, help='Audio stream source: 0 for output, 1 for input')
|
||||
parser.add_argument('-c', '--chunk_rate', default=10, help='Number of audio stream chunks collected per second')
|
||||
parser.add_argument('-p', '--port', default=8080, help='The port to run the server on, 0 for no server')
|
||||
# Gummy-specific parameters
|
||||
parser.add_argument('-s', '--source_language', default='en', help='Source language code')
|
||||
parser.add_argument('-t', '--target_language', default='zh', help='Target language code')
|
||||
parser.add_argument('-k', '--api_key', default='', help='API KEY for Gummy model')
|
||||
# Vosk-specific parameters
|
||||
parser.add_argument('-m', '--model_path', default='', help='The path to the vosk model.')
|
||||
```
|
||||
**Note that the buffer must be flushed after each caption JSON data output to ensure that the Electron main process receives strings that can be interpreted as JSON objects each time.** It is recommended to use the project's existing `stdout_obj` function for transmission.
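For example, a minimal sketch of sending one caption record with `stdout_obj` (the field values and time format are illustrative only):

```python
from utils import stdout_obj

# stdout_obj writes the object as one JSON line to standard output; the document
# recommends it precisely because it takes care of flushing the buffer.
stdout_obj({
    "command": "caption",
    "index": 1,
    "time_s": "00:00:01.200",
    "time_t": "00:00:03.800",
    "text": "hello world",
    "translation": "你好,世界"
})
```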
|
||||
|
||||
For example, to use the Gummy model with Japanese as the source language, Chinese as the target language, and system audio output captions with 0.1s audio chunks, the command-line arguments would be:
|
||||
### Command Line Parameter Specification
|
||||
|
||||
```bash
|
||||
python main.py -e gummy -s ja -t zh -a 0 -c 10 -k <dashscope-api-key>
|
||||
```
|
||||
The settings for a custom caption engine are passed in as command-line parameters, so the engine must define and handle these parameters. The parameters currently used by this project are as follows:
|
||||
|
||||
```python
|
||||
import argparse

if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser(description='Convert system audio stream to text')
|
||||
# all
|
||||
parser.add_argument('-e', '--caption_engine', default='gummy', help='Caption engine: gummy or vosk')
|
||||
parser.add_argument('-a', '--audio_type', default=0, help='Audio stream source: 0 for output, 1 for input')
|
||||
parser.add_argument('-c', '--chunk_rate', default=10, help='Number of audio stream chunks collected per second')
|
||||
parser.add_argument('-p', '--port', default=0, help='The port to run the server on, 0 for no server')
|
||||
parser.add_argument('-t', '--target_language', default='zh', help='Target language code, "none" for no translation')
|
||||
parser.add_argument('-r', '--record', default=0, help='Whether to record the audio, 0 for no recording, 1 for recording')
|
||||
parser.add_argument('-rp', '--record_path', default='', help='Path to save the recorded audio')
|
||||
# gummy and sosv
|
||||
parser.add_argument('-s', '--source_language', default='auto', help='Source language code')
|
||||
# gummy only
|
||||
parser.add_argument('-k', '--api_key', default='', help='API KEY for Gummy model')
|
||||
# vosk and sosv
|
||||
parser.add_argument('-tm', '--translation_model', default='ollama', help='Model for translation: ollama or google')
|
||||
parser.add_argument('-omn', '--ollama_name', default='', help='Ollama model name for translation')
|
||||
# vosk only
|
||||
parser.add_argument('-vosk', '--vosk_model', default='', help='The path to the vosk model.')
|
||||
# sosv only
|
||||
parser.add_argument('-sosv', '--sosv_model', default=None, help='The SenseVoice model path')
|
||||
```
|
||||
|
||||
For example, to run this project's caption engine with the Gummy model, Japanese as the source language, Chinese as the target language, captions captured from the system audio output, and 0.1 s audio chunks, the command-line arguments would be:
|
||||
|
||||
```bash
|
||||
python main.py -e gummy -s ja -t zh -a 0 -c 10 -k <dashscope-api-key>
|
||||
```
|
||||
|
||||
## Additional Notes
|
||||
|
||||
@@ -183,7 +215,7 @@ python main.py -e gummy -s ja -t zh -a 0 -c 10 -k <dashscope-api-key>
|
||||
|
||||
### Development Recommendations
|
||||
|
||||
Apart from audio-to-text conversion, it is recommended to reuse the existing code. In this case, the following additions are needed:
|
||||
Apart from audio-to-text conversion and translation, it is recommended to reuse this project's code directly for the other components (audio capture, audio resampling, and communication with the main process). If you take this approach, you need to add:
|
||||
|
||||
- `engine/audio2text/`: Add a new audio-to-text class (file-level).
|
||||
- `engine/main.py`: Add new parameter settings and workflow functions (refer to `main_gummy` and `main_vosk` functions).
|
||||
|
||||
@@ -1,5 +1,7 @@
|
||||
# 字幕エンジン説明ドキュメント
|
||||
|
||||
## 注意:このドキュメントはメンテナンスが行われていないため、記載されている情報は古くなっています。最新の情報については、[中国語版](./zh.md)または[英語版](./en.md)のドキュメントをご参照ください。
|
||||
|
||||
対応バージョン:v0.6.0
|
||||
|
||||
この文書は大規模モデルを使用して翻訳されていますので、内容に正確でない部分があるかもしれません。
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
# 字幕引擎说明文档
|
||||
|
||||
对应版本:v0.6.0
|
||||
对应版本:v1.0.0
|
||||
|
||||

|
||||
|
||||
@@ -16,22 +16,21 @@
|
||||
|
||||
### 启动引擎
|
||||
|
||||
- 主进程:使用 `child_process.spawn()` 启动字幕引擎进程
|
||||
- Electron 主进程:使用 `child_process.spawn()` 启动字幕引擎进程
|
||||
- 字幕引擎进程:创建 TCP Socket 服务器线程,创建后在标准输出中输出转化为字符串的 JSON 对象,该对象中包含 `command` 字段,值为 `connect`
|
||||
- 主进程:监听字幕引擎进程的标准输出,尝试将标准输出按行分割,解析为 JSON 对象,并判断对象的 `command` 字段值是否为 `connect`,如果是则连接 TCP Socket 服务器
|
||||
|
||||
### 字幕识别
|
||||
|
||||
- 字幕引擎进程:在主线程监听系统音频输出,并将音频数据块发送给字幕引擎解析,字幕引擎解析音频数据块,通过标准输出发送解析的字幕数据对象字符串
|
||||
- 主进程:继续监听字幕引擎的标准输出,并根据解析的对象的 `command` 字段采取不同的操作
|
||||
- 字幕引擎进程:新建线程监听系统音频输出,将获取的音频数据块放入共享队列中(`shared_data.chunk_queue`)。字幕引擎不断读取共享队列中的音频数据块并解析。字幕引擎还可能新建线程执行翻译操作。最后字幕引擎通过标准输出发送解析的字幕数据对象字符串
|
||||
- Electron 主进程:持续监听字幕引擎的标准输出,并根据解析的对象的 `command` 字段采取不同的操作
|
||||
|
||||
### 关闭引擎
|
||||
|
||||
- 主进程:当用户在前端操作关闭字幕引擎时,主进程通过 Socket 通信给字幕引擎进程发送 `command` 字段为 `stop` 的对象字符串
|
||||
- 字幕引擎进程:接收主引擎进程发送的字幕数据对象字符串,将字符串解析为对象,如果对象的 `command` 字段为 `stop`,则将全局变量 `thread_data.status` 的值设置为 `stop`
|
||||
- Electron 主进程:当用户在前端操作关闭字幕引擎时,主进程通过 Socket 通信给字幕引擎进程发送 `command` 字段为 `stop` 的对象字符串
|
||||
- 字幕引擎进程:接收主引擎进程发送的字幕数据对象字符串,将字符串解析为对象,如果对象的 `command` 字段为 `stop`,则将全局变量 `shared_data.status` 的值设置为 `stop`
|
||||
- 字幕引擎进程:主线程循环监听系统音频输出,当 `thread_data.status` 的值不为 `running` 时,则结束循环,释放资源,结束运行
|
||||
- 主进程:如果检测到字幕引擎进程结束,进行相应处理,并向前端反馈
|
||||
|
||||
- Electron 主进程:如果检测到字幕引擎进程结束,进行相应处理,并向前端反馈
|
||||
|
||||
## 项目已经实现的功能
|
||||
|
||||
@@ -45,21 +44,25 @@
|
||||
|
||||
```python
|
||||
from utils import stdout, stdout_cmd, stdout_obj, stderr
|
||||
stdout("Hello") # {"command": "print", "content": "Hello"}\n
|
||||
stdout_cmd("connect", "8080") # {"command": "connect", "content": "8080"}\n
|
||||
stdout_obj({"command": "print", "content": "Hello"})
|
||||
# {"command": "print", "content": "Hello"}\n
|
||||
stdout("Hello")
|
||||
# {"command": "connect", "content": "8080"}\n
|
||||
stdout_cmd("connect", "8080")
|
||||
# {"command": "print", "content": "print"}\n
|
||||
stdout_obj({"command": "print", "content": "print"})
|
||||
# sys.stderr.write("Error Info" + "\n")
|
||||
stderr("Error Info")
|
||||
```
|
||||
|
||||
### 创建 Socket 服务
|
||||
|
||||
该 Socket 服务会监听指定端口,会解析 Electron 主程序发送的内容,并可能改变 `thread_data.status` 的值。
|
||||
该 Socket 服务会监听指定端口,会解析 Electron 主程序发送的内容,并可能改变 `shared_data.status` 的值。
|
||||
|
||||
样例:
|
||||
|
||||
```python
|
||||
from utils import start_server
|
||||
from utils import thread_data
|
||||
from utils import shared_data
|
||||
port = 8080
|
||||
start_server(port)
|
||||
while thread_data == 'running':
|
||||
@@ -72,13 +75,14 @@ while thread_data == 'running':
|
||||
`AudioStream` 类用于获取音频数据,实现是跨平台的,支持 Windows、Linux 和 macOS。该类初始化包含两个参数:
|
||||
|
||||
- `audio_type`: 获取音频类型,0 表示系统输出音频(扬声器),1 表示系统输入音频(麦克风)
|
||||
- `chunk_rate`: 音频数据获取频率,每秒音频获取的音频块的数量
|
||||
- `chunk_rate`: 音频数据获取频率,每秒音频获取的音频块的数量,默认为 10
|
||||
|
||||
该类包含三个方法:
|
||||
该类包含四个方法:
|
||||
|
||||
- `open_stream()`: 开启音频获取
|
||||
- `read_chunk() -> bytes`: 读取一个音频块
|
||||
- `close_stream()`: 关闭音频获取
|
||||
- `close_stream_signal()` 线程安全的关闭系统音频输入流
|
||||
|
||||
样例:
|
||||
|
||||
@@ -97,11 +101,22 @@ stream.close_stream()
|
||||
|
||||
### 音频处理
|
||||
|
||||
获取到的音频流在转文字之前可能需要进行预处理。一般需要将多通道音频转换为单通道音频,还可能需要进行重采样。本项目提供了三个音频处理函数:
|
||||
获取到的音频流在转文字之前可能需要进行预处理。一般需要将多通道音频转换为单通道音频,还可能需要进行重采样。本项目提供了两个音频处理函数:
|
||||
|
||||
- `merge_chunk_channels(chunk: bytes, channels: int) -> bytes`: 将多通道音频块转换为单通道音频块
|
||||
- `resample_chunk_mono(chunk: bytes, channels: int, orig_sr: int, target_sr: int, mode="sinc_best") -> bytes`:将当前多通道音频数据块转换成单通道音频数据块,然后进行重采样
|
||||
- `resample_mono_chunk(chunk: bytes, orig_sr: int, target_sr: int, mode="sinc_best") -> bytes`:将当前单通道音频块进行重采样
|
||||
- `resample_chunk_mono(chunk: bytes, channels: int, orig_sr: int, target_sr: int) -> bytes`:将当前多通道音频数据块转换成单通道音频数据块,然后进行重采样
|
||||
|
||||
样例:
|
||||
|
||||
```python
|
||||
from sysaudio import AudioStream
|
||||
from utils import merge_chunk_channels, resample_chunk_mono  # assuming resample_chunk_mono is also exported from utils
|
||||
stream = AudioStream(1)
|
||||
while True:
|
||||
raw_chunk = stream.read_chunk()
|
||||
chunk = resample_chunk_mono(raw_chunk, stream.CHANNELS, stream.RATE, 16000)
|
||||
# do something with chunk
|
||||
```
|
||||
|
||||
## 字幕引擎需要实现的功能
|
||||
|
||||
@@ -109,20 +124,31 @@ stream.close_stream()
|
||||
|
||||
在得到了合适的音频流后,需要将音频流转换为文字了。一般使用各种模型(云端或本地)来实现音频流转文字。需要根据需求选择合适的模型。
|
||||
|
||||
这部分建议封装为一个类,需要实现三个方法:
|
||||
这部分建议封装为一个类,需要实现四个方法:
|
||||
|
||||
- `start(self)`:启动模型
|
||||
- `send_audio_frame(self, data: bytes)`:处理当前音频块数据,**生成的字幕数据通过标准输出发送给 Electron 主进程**
|
||||
- `translate(self)`:持续从 `shared_data.chunk_queue` 中取出数据块,并调用 `send_audio_frame` 方法处理数据块
|
||||
- `stop(self)`:停止模型
|
||||
|
||||
完整的字幕引擎实例如下:
|
||||
|
||||
- [gummy.py](../../engine/audio2text/gummy.py)
|
||||
- [vosk.py](../../engine/audio2text/vosk.py)
|
||||
- [sosv.py](../../engine/audio2text/sosv.py)
|
||||
|
||||
### 字幕翻译
|
||||
|
||||
有的语音转文字模型并不提供翻译,如果有需求,需要再添加一个翻译模块。
|
||||
有的语音转文字模型并不提供翻译,如果有需求,需要再添加一个翻译模块,也可以使用自带的翻译模块。
|
||||
|
||||
样例:
|
||||
|
||||
```python
|
||||
from utils import google_translate, ollama_translate
|
||||
text = "这是一个翻译测试。"
|
||||
google_translate("", "en", text, "time_s")
|
||||
ollama_translate("qwen3:0.6b", "en", text, "time_s")
|
||||
```
|
||||
|
||||
### 字幕数据发送
|
||||
|
||||
@@ -133,37 +159,42 @@ stream.close_stream()
|
||||
```typescript
|
||||
export interface CaptionItem {
|
||||
command: "caption",
|
||||
index: number, // 字幕序号
|
||||
time_s: string, // 当前字幕开始时间
|
||||
time_t: string, // 当前字幕结束时间
|
||||
text: string, // 字幕内容
|
||||
translation: string // 字幕翻译
|
||||
index: number, // 字幕序号
|
||||
time_s: string, // 当前字幕开始时间
|
||||
time_t: string, // 当前字幕结束时间
|
||||
text: string, // 字幕内容
|
||||
translation: string // 字幕翻译
|
||||
}
|
||||
```
|
||||
|
||||
**注意必须确保每输出一次字幕 JSON 数据就得刷新缓冲区,确保 electron 主进程每次接收到的字符串都可以被解释为 JSON 对象。**
|
||||
|
||||
建议使用项目已经实现的 `stdout_obj` 函数来发送。
|
||||
**注意必须确保每输出一次字幕 JSON 数据就得刷新缓冲区,确保 electron 主进程每次接收到的字符串都可以被解释为 JSON 对象。** 建议使用项目已经实现的 `stdout_obj` 函数来发送。
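例如,使用 `stdout_obj` 发送一条字幕数据的简单示意(字段取值与时间格式仅为示例):

```python
from utils import stdout_obj

# stdout_obj 会把对象转换为一行 JSON 字符串写入标准输出,
# 文档推荐使用它,正是因为它已处理好缓冲区刷新。
stdout_obj({
    "command": "caption",
    "index": 1,
    "time_s": "00:00:01.200",
    "time_t": "00:00:03.800",
    "text": "hello world",
    "translation": "你好,世界"
})
```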
|
||||
|
||||
### 命令行参数的指定
|
||||
|
||||
自定义字幕引擎的设置提供命令行参数指定,因此需要设置好字幕引擎的参数,本项目目前用到的参数如下:
|
||||
|
||||
```python
import argparse
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Convert system audio stream to text')
    # both
    # all
    parser.add_argument('-e', '--caption_engine', default='gummy', help='Caption engine: gummy or vosk')
    parser.add_argument('-a', '--audio_type', default=0, help='Audio stream source: 0 for output, 1 for input')
    parser.add_argument('-c', '--chunk_rate', default=10, help='Number of audio stream chunks collected per second')
    parser.add_argument('-p', '--port', default=8080, help='The port to run the server on, 0 for no server')
    parser.add_argument('-p', '--port', default=0, help='The port to run the server on, 0 for no server')
    parser.add_argument('-t', '--target_language', default='zh', help='Target language code, "none" for no translation')
    parser.add_argument('-r', '--record', default=0, help='Whether to record the audio, 0 for no recording, 1 for recording')
    parser.add_argument('-rp', '--record_path', default='', help='Path to save the recorded audio')
    # gummy and sosv
    parser.add_argument('-s', '--source_language', default='auto', help='Source language code')
    # gummy only
    parser.add_argument('-s', '--source_language', default='en', help='Source language code')
    parser.add_argument('-t', '--target_language', default='zh', help='Target language code')
    parser.add_argument('-k', '--api_key', default='', help='API KEY for Gummy model')
    # vosk and sosv
    parser.add_argument('-tm', '--translation_model', default='ollama', help='Model for translation: ollama or google')
    parser.add_argument('-omn', '--ollama_name', default='', help='Ollama model name for translation')
    # vosk only
    parser.add_argument('-m', '--model_path', default='', help='The path to the vosk model.')
    parser.add_argument('-vosk', '--vosk_model', default='', help='The path to the vosk model.')
    # sosv only
    parser.add_argument('-sosv', '--sosv_model', default=None, help='The SenseVoice model path')
```
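
For orientation, a small, self-contained illustration of how two of these arguments map to runtime behavior (this is not the project's actual `main.py`; the meanings are taken from the help strings above):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-a', '--audio_type', type=int, default=0)   # 0 = system output, 1 = microphone input
parser.add_argument('-c', '--chunk_rate', type=int, default=10)  # chunks captured per second
args = parser.parse_args([])  # use defaults here; a real engine would parse sys.argv[1:]

chunk_duration_s = 1.0 / args.chunk_rate   # -c 10 means each chunk covers 0.1 s of audio
use_microphone = args.audio_type == 1
print(chunk_duration_s, use_microphone)
```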

比如对于本项目的字幕引擎,我想使用 Gummy 模型,指定原文为日语,翻译为中文,获取系统音频输出的字幕,每次截取 0.1s 的音频数据,那么命令行参数如下:
@@ -184,7 +215,7 @@ python main.py -e gummy -s ja -t zh -a 0 -c 10 -k <dashscope-api-key>

### 开发建议

除音频转文字外,其他建议直接复用本项目代码。如果这样,那么需要添加的内容为:
除音频转文字和翻译外,其他(音频获取、音频重采样、与主进程通信)建议直接复用本项目代码。如果这样,那么需要添加的内容为:

- `engine/audio2text/`:添加新的音频转文字类(文件级别)
- `engine/main.py`:添加新参数设置、流程函数(参考 `main_gummy` 函数和 `main_vosk` 函数)

@@ -1,6 +1,6 @@
# Auto Caption User Manual

Corresponding Version: v0.6.0
Corresponding Version: v1.0.0

**Note: Due to limited personal resources, the English and Japanese documentation files for this project (except for the README document) will no longer be maintained. The content of this document may not be consistent with the latest version of the project. If you are willing to help with translation, please submit relevant Pull Requests.**

@@ -43,9 +43,13 @@ Alibaba Cloud provides detailed tutorials for this part, which can be referenced

## Preparation for Using Vosk Engine

To use the Vosk local caption engine, first download your required model from the [Vosk Models](https://alphacephei.com/vosk/models) page. Then extract the downloaded model package locally and add the corresponding model folder path to the software settings. Currently, the Vosk caption engine does not support translated caption content.
To use the Vosk local caption engine, first download your required model from the [Vosk Models](https://alphacephei.com/vosk/models) page. Then extract the downloaded model package locally and add the corresponding model folder path to the software settings.

![](img/en/vosk-model.png)
![](img/en/vosk-model.png)

## Using SOSV Model

The SOSV model is used in the same way as the Vosk model. It can be downloaded from: https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model

## Capturing System Audio Output on macOS

@@ -1,11 +1,9 @@
# Auto Caption ユーザーマニュアル

対応バージョン:v0.6.0
対応バージョン:v1.0.0

この文書は大規模モデルを使用して翻訳されていますので、内容に正確でない部分があるかもしれません。

**注意:個人のリソースが限られているため、このプロジェクトの英語および日本語のドキュメント(README ドキュメントを除く)のメンテナンスは行われません。このドキュメントの内容は最新版のプロジェクトと一致しない場合があります。翻訳のお手伝いをしていただける場合は、関連するプルリクエストを提出してください。**

## ソフトウェアの概要

Auto Caption は、クロスプラットフォームの字幕表示ソフトウェアで、システムの音声入力(録音)または出力(音声再生)のストリーミングデータをリアルタイムで取得し、音声からテキストに変換するモデルを利用して対応する音声の字幕を生成します。このソフトウェアが提供するデフォルトの字幕エンジン(アリババクラウド Gummy モデルを使用)は、9つの言語(中国語、英語、日本語、韓国語、ドイツ語、フランス語、ロシア語、スペイン語、イタリア語)の認識と翻訳をサポートしています。
@@ -45,9 +43,13 @@ macOS プラットフォームでオーディオ出力を取得するには追

## Voskエンジン使用前の準備

Voskローカル字幕エンジンを使用するには、まず[Vosk Models](https://alphacephei.com/vosk/models)ページから必要なモデルをダウンロードしてください。その後、ダウンロードしたモデルパッケージをローカルに解凍し、対応するモデルフォルダのパスをソフトウェア設定に追加します。現在、Vosk字幕エンジンは字幕の翻訳をサポートしていません。
Voskローカル字幕エンジンを使用するには、まず[Vosk Models](https://alphacephei.com/vosk/models)ページから必要なモデルをダウンロードしてください。その後、ダウンロードしたモデルパッケージをローカルに解凍し、対応するモデルフォルダのパスをソフトウェア設定に追加します。

![](img/ja/vosk-model.png)
![](img/ja/vosk-model.png)

## SOSVモデルの使用

SOSVモデルの使用方法はVoskと同じで、ダウンロードアドレスは以下の通りです:https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model

## macOS でのシステムオーディオ出力の取得方法

@@ -1,6 +1,6 @@
# Auto Caption 用户手册

对应版本:v0.6.0
对应版本:v1.0.0

## 软件简介

@@ -41,9 +41,13 @@ Auto Caption 是一个跨平台的字幕显示软件,能够实时获取系统

## Vosk 引擎使用前准备

如果要使用 Vosk 本地字幕引擎,首先需要在 [Vosk Models](https://alphacephei.com/vosk/models) 页面下载你需要的模型。然后将下载的模型安装包解压到本地,并将对应的模型文件夹的路径添加到软件的设置中。目前 Vosk 字幕引擎还不支持翻译字幕内容。
如果要使用 Vosk 本地字幕引擎,首先需要在 [Vosk Models](https://alphacephei.com/vosk/models) 页面下载你需要的模型。然后将下载的模型安装包解压到本地,并将对应的模型文件夹的路径添加到软件的设置中。

![](img/zh/vosk-model.png)
![](img/zh/vosk-model.png)

## 使用 SOSV 模型

使用 SOSV 模型的方式和 Vosk 一样,下载地址如下:https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model

## macOS 获取系统音频输出

4
package-lock.json
generated
@@ -1,12 +1,12 @@
{
"name": "auto-caption",
"version": "0.7.0",
"version": "1.0.0",
"lockfileVersion": 3,
"requires": true,
"packages": {
"": {
"name": "auto-caption",
"version": "0.7.0",
"version": "1.0.0",
"hasInstallScript": true,
"dependencies": {
"@electron-toolkit/preload": "^3.0.1",

@@ -1,7 +1,7 @@
{
"name": "auto-caption",
"productName": "Auto Caption",
"version": "0.7.0",
"version": "1.0.0",
"description": "A cross-platform subtitle display software.",
"main": "./out/main/index.js",
"author": "himeditator",

@@ -2,7 +2,7 @@
<html>
<head>
<meta charset="UTF-8" />
<title>Auto Caption v0.7.0</title>
<title>Auto Caption v1.0.0</title>
<!-- https://developer.mozilla.org/en-US/docs/Web/HTTP/CSP -->
<meta
http-equiv="Content-Security-Policy"

@@ -7,7 +7,7 @@
</template>

<div class="input-item">
<span class="input-label">{{ '字幕行数' }}</span>
<span class="input-label">{{ $t('style.lineNumber') }}</span>
<a-radio-group v-model:value="currentLineNumber">
<a-radio-button :value="1">1</a-radio-button>
<a-radio-button :value="2">2</a-radio-button>

@@ -101,7 +101,7 @@
<p class="about-desc">{{ $t('status.about.desc') }}</p>
<a-divider />
<div class="about-info">
<p><b>{{ $t('status.about.version') }}</b><a-tag color="green">v0.7.0</a-tag></p>
<p><b>{{ $t('status.about.version') }}</b><a-tag color="green">v1.0.0</a-tag></p>
<p>
<b>{{ $t('status.about.author') }}</b>
<a

@@ -85,6 +85,7 @@ export default {
"title": "Caption Style Settings",
"applyStyle": "Apply",
"cancelChange": "Cancel",
"lineNumber": "CaptionLines",
"resetStyle": "Reset",
"longCaption": "LongCaption",
"fontFamily": "Font Family",
@@ -142,7 +143,7 @@ export default {
"projLink": "Project Link",
"manual": "User Manual",
"engineDoc": "Caption Engine Manual",
"date": "August 20, 2025"
"date": "September 8th, 2025"
}
},
log: {

@@ -85,6 +85,7 @@ export default {
"applyStyle": "適用",
"cancelChange": "キャンセル",
"resetStyle": "リセット",
"lineNumber": "字幕行数",
"longCaption": "長い字幕",
"fontFamily": "フォント",
"fontColor": "カラー",
@@ -141,7 +142,7 @@ export default {
"projLink": "プロジェクトリンク",
"manual": "ユーザーマニュアル",
"engineDoc": "字幕エンジンマニュアル",
"date": "2025 年 8 月 20 日"
"date": "2025 年 9 月 8 日"
}
},
log: {

@@ -85,6 +85,7 @@ export default {
"applyStyle": "应用样式",
"cancelChange": "取消更改",
"resetStyle": "恢复默认",
"lineNumber": "字幕行数",
"longCaption": "长字幕",
"fontFamily": "字体族",
"fontColor": "字体颜色",
@@ -141,7 +142,7 @@ export default {
"projLink": "项目链接",
"manual": "用户手册",
"engineDoc": "字幕引擎手册",
"date": "2025 年 8 月 20 日"
"date": "2025 年 9 月 8 日"
}
},
log: {