21 Commits

Author SHA1 Message Date
himeditator
564954a834 release v1.1.1 2026-01-31 13:45:25 +08:00
himeditator
aed15af386 feat(window): make the caption window always stay on top; replace the always-on-top option with a mouse click-through option (#26) 2026-01-31 13:38:53 +08:00
himeditator
4f9d33abc1 feat(README): add preview video 2026-01-17 21:47:17 +08:00
himeditator
0dc70d491e release v1.1.0 2026-01-10 22:50:57 +08:00
HS RedWoods
086ea90a5f Merge pull request #25 from nocmt/dev_glmasr
feat(engine): add support for the GLM-ASR speech recognition engine
2026-01-10 20:17:35 +08:00
himeditator
3324b630d1 feat(app): adapt to the latest version
- Update some prompt text inside the app
- Mask the API KEY to keep it from being printed in plain text
2026-01-10 20:15:32 +08:00
nocmt
0825e48902 feat(engine): add support for the GLM-ASR speech recognition engine
- Add a GLM-ASR cloud speech recognition engine implementation
- Extend the settings UI with GLM-related parameters
- Let Ollama take a custom domain and API key so cloud and other LLMs can be used
- Adjust the audio processing logic to support the new engine
- Update dependencies and the build configuration
- Fix issues with the Ollama translation feature
2026-01-10 16:02:24 +08:00
himeditator
383e582a2d docs(readme): update the description and add a terminal usage guide 2025-11-02 20:53:56 +08:00
himeditator
e6a65f8362 feat(engine): let the caption engine display captions directly in the terminal 2025-11-01 21:06:28 +08:00
himeditator
77726753bb fix(renderer): fix some fonts being invisible in the dark theme 2025-09-15 22:17:47 +08:00
himeditator
4b47e50d9e release v1.0.0 2025-09-08 15:19:10 +08:00
himeditator
4494b2c68b feat(renderer): implement multi-line caption display
- Add a caption line count option to the CaptionStyle component
- Update components to support multi-line caption display
- Optimize caption data handling to display multiple captions in chronological order
2025-09-07 21:06:11 +08:00
himeditator
4abd6d0808 feat(renderer): add settings for new features to the user interface
- Add audio recording, with output saved as a WAV file
- Improve the caption engine settings UI with more configuration options
- Update the translations, adding model download links and other information
2025-09-07 14:35:18 +08:00
himeditator
6bff978b88 feat(engine): replace the resampling library; add punctuation restoration models to SOSV
- Replace the samplerate library with resampy for better resampling quality
- Add Chinese and English punctuation restoration models to Sherpa-ONNX SenseVoice
2025-09-06 23:15:33 +08:00
himeditator
eba2c5ca45 feat(engine): refactor the caption engine and add the Sherpa-ONNX SenseVoice speech recognition model
- Refactor the caption engine to capture audio on a separate thread
- Refactor the classes in audio2text and adjust the control flow
- Update the main function to support the Sosv model
- Change the AudioStream class to default to a 16000 Hz sample rate
2025-09-06 20:49:46 +08:00
himeditator
2b7ce06f04 feat(translation): add user interface components for non-real-time translation 2025-09-04 23:41:22 +08:00
himeditator
14987cbfc5 feat(vosk): add non-real-time translation for the Vosk model (#14)
- Add Ollama LLM translation and (non-real-time) Google translation, supporting multiple languages
- Add non-real-time translation to the Vosk engine
- Add and update interfaces for the new translation features
- Update the Electron build configuration so per-platform builds no longer require editing the build file
2025-09-02 23:19:53 +08:00
himeditator
56fdc348f8 fix(engine): fix force-closing the caption engine failing when the engine state is not "running"
- Merge the kill and forceKill methods of the CaptionEngine class and remove the early return in the state warning
- Update the macOS compatibility notes in the README and add a configuration link
2025-08-30 20:57:26 +08:00
Chen Janai
f42458124e Merge pull request #17 from xuemian168/main
feat(engine): add a startup timeout and support for force-terminating the engine
2025-08-28 12:25:33 +08:00
himeditator
2352bcee5d feat(engine): polish minor issues in the startup timeout feature
- Update the interface documentation
- Shorten internationalized text so it fits within label widths
- Fix the force-close button not responding to clicks
2025-08-28 12:22:19 +08:00
xuemian
051a497f3a feat(engine): add a startup timeout and support for force-terminating the engine
- Add a 'control.engine.forceKill' event handler in ControlWindow to allow force-terminating the engine
- Implement a startup timeout mechanism in CaptionEngine: if the engine times out during startup, it is automatically force-stopped and an error message is sent
- Update the internationalization files with startup-timeout messages
- Add a startup timeout input to the EngineControl component so users can set the timeout
- Update the related type definitions to support the new startup timeout configuration
2025-08-28 10:24:08 +10:00
74 changed files with 2953 additions and 757 deletions

.gitignore vendored

@@ -7,8 +7,13 @@ out
__pycache__
.venv
test.py
engine/build
engine/portaudio
engine/pyinstaller_cache
engine/models
engine/notebook
# engine/main.spec
.repomap
.virtualme

README.md

@@ -3,7 +3,7 @@
<h1 align="center">auto-caption</h1>
<p>Auto Caption 是一个跨平台的实时字幕显示软件。</p>
<p>
<a href="https://github.com/HiMeditator/auto-caption/releases"><img src="https://img.shields.io/badge/release-0.7.0-blue"></a>
<a href="https://github.com/HiMeditator/auto-caption/releases"><img src="https://img.shields.io/badge/release-1.1.1-blue"></a>
<a href="https://github.com/HiMeditator/auto-caption/issues"><img src="https://img.shields.io/github/issues/HiMeditator/auto-caption?color=orange"></a>
<img src="https://img.shields.io/github/languages/top/HiMeditator/auto-caption?color=royalblue">
<img src="https://img.shields.io/github/repo-size/HiMeditator/auto-caption?color=green">
@@ -14,14 +14,18 @@
| <a href="./README_en.md">English</a>
| <a href="./README_ja.md">日本語</a> |
</p>
<p><i>v0.7.0 版本已经发布,优化了软件界面,添加了日志记录显示。本地的字幕引擎正在尝试开发中,预计以 Python 代码的形式进行发布...</i></p>
<p><i>v1.1.1 版本已经发布,新增 GLM-ASR 云端字幕模型和 OpenAI 兼容模型翻译...</i></p>
</div>
![](./assets/media/main_zh.png)
## 📥 下载
[GitHub Releases](https://github.com/HiMeditator/auto-caption/releases)
软件下载:[GitHub Releases](https://github.com/HiMeditator/auto-caption/releases)
Vosk 模型下载:[Vosk Models](https://alphacephei.com/vosk/models)
SOSV 模型下载:[Sherpa-ONNX SenseVoice Model](https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model)
## 📚 相关文档
@@ -29,51 +33,130 @@
[字幕引擎说明文档](./docs/engine-manual/zh.md)
[项目 API 文档](./docs/api-docs/)
[更新日志](./docs/CHANGELOG.md)
## 👁️‍🗨️ 预览
https://github.com/user-attachments/assets/9c188d78-9520-4397-bacf-4c8fdcc54874
## ✨ 特性
- 生成音频输出或麦克风输入的字幕
- 支持调用本地 Ollama 模型、云端 OpenAI 兼容模型、或云端 Google 翻译 API 进行翻译
- 跨平台(Windows、macOS、Linux)、多界面语言(中文、英语、日语)支持
- 丰富的字幕样式设置(字体、字体大小、字体粗细、字体颜色、背景颜色等)
- 灵活的字幕引擎选择(阿里云 Gummy 云端模型、本地 Vosk 模型、自己开发模型)
- 灵活的字幕引擎选择(阿里云 Gummy 云端模型、GLM-ASR 云端模型、本地 Vosk 模型、本地 SOSV 模型、还可以自己开发模型)
- 多语言识别与翻译(见下文“⚙️ 自带字幕引擎说明”)
- 字幕记录展示与导出(支持导出 `.srt` 和 `.json` 格式)
## 📖 基本使用
软件已经适配了 Windows、macOS 和 Linux 平台。测试过的平台信息如下:
> ⚠️ 注意:目前只维护了 Windows 平台的软件的最新版本,其他平台的最后版本停留在 v1.0.0。
软件已经适配了 Windows、macOS 和 Linux 平台。测试过的主流平台信息如下:
| 操作系统版本 | 处理器架构 | 获取系统音频输入 | 获取系统音频输出 |
| ------------------ | ---------- | ---------------- | ---------------- |
| Windows 11 24H2 | x64 | ✅ | ✅ |
| macOS Sequoia 15.5 | arm64 | ✅需要额外配置 | ✅ |
| macOS Sequoia 15.5 | arm64 | ✅ [需要额外配置](./docs/user-manual/zh.md#macos-获取系统音频输出) | ✅ |
| Ubuntu 24.04.2 | x64 | ✅ | ✅ |
| Kali Linux 2022.3 | x64 | ✅ | ✅ |
| Kylin Server V10 SP3 | x64 | ✅ | ✅ |
macOS 平台和 Linux 平台获取系统音频输出需要进行额外设置,详见[Auto Caption 用户手册](./docs/user-manual/zh.md)。
macOS 平台和 Linux 平台获取系统音频输出需要进行额外设置,详见 [Auto Caption 用户手册](./docs/user-manual/zh.md)。
> 国际版的阿里云服务并没有提供 Gummy 模型,因此目前非中国用户无法使用 Gummy 字幕引擎。
下载软件后,需要根据自己的需求选择对应的模型,然后配置模型。
如果要使用默认的 Gummy 字幕引擎(使用云端模型进行语音识别和翻译),首先需要获取阿里云百炼平台的 API KEY,然后将 API KEY 添加到软件设置中或者配置到环境变量中(仅 Windows 平台支持读取环境变量中的 API KEY),这样才能正常使用该模型。相关教程:
| | 准确率 | 实时性 | 部署类型 | 支持语言 | 翻译 | 备注 |
| ------------------------------------------------------------ | -------- | ------------- | ---------- | ---------- | ---------------------------------------------------------- | ---------------------------------------------------------- |
| [Gummy](https://help.aliyun.com/zh/model-studio/gummy-speech-recognition-translation) | 很好😊 | 很好😊 | 云端 / 阿里云 | 10 种 | 自带翻译 | 收费0.54CNY / 小时 |
| [glm-asr-2512](https://docs.bigmodel.cn/cn/guide/models/sound-and-video/glm-asr-2512) | 很好😊 | 较差😞 | 云端 / 智谱 AI | 4 种 | 需额外配置 | 收费,约 0.72CNY / 小时 |
| [Vosk](https://alphacephei.com/vosk) | 较差😞 | 很好😊 | 本地 / CPU | 超过 30 种 | 需额外配置 | 支持的语言非常多 |
| [SOSV](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html) | 一般😐 | 一般😐 | 本地 / CPU | 5 种 | 需额外配置 | 仅有一个模型 |
| 自己开发 | 🤔 | 🤔 | 自定义 | 自定义 | 自定义 | 根据[文档](./docs/engine-manual/zh.md)使用 Python 自己开发 |
如果你选择的不是 Gummy 模型,你还需要配置自己的翻译模型。
### 配置翻译模型
![](./assets/media/engine_zh.png)
> 注意:翻译不是实时的,翻译模型只会在每句话识别完成后再调用。
#### Ollama 本地模型
> 注意:使用参数量过大的模型会导致资源消耗和翻译延迟较大。建议使用参数量小于 1B 的模型,比如: `qwen2.5:0.5b`, `qwen3:0.6b`。
使用该模型之前你需要确定本机安装了 [Ollama](https://ollama.com/) 软件,并已经下载了需要的大语言模型。只需要将需要调用的大模型名称添加到设置中的 `模型名称` 字段中,并保证 `Base URL` 字段为空。
#### OpenAI 兼容模型
如果觉得本地 Ollama 模型的翻译效果不佳,或者不想在本地安装 Ollama 模型,那么可以使用云端的 OpenAI 兼容模型。
以下是一些模型提供商的 `Base URL`
- OpenAI: https://api.openai.com/v1
- DeepSeek: https://api.deepseek.com
- 阿里云: https://dashscope.aliyuncs.com/compatible-mode/v1
API Key 需要在对应的模型提供商处获取。
#### Google 翻译 API
> 注意:Google 翻译 API 在无法访问国际网络的地区无法使用。
无需任何配置,联网即可使用。
### 使用 Gummy 模型
> 国际版的阿里云服务似乎并没有提供 Gummy 模型,因此目前非中国用户可能无法使用 Gummy 字幕引擎。
如果要使用默认的 Gummy 字幕引擎(使用云端模型进行语音识别和翻译),首先需要获取阿里云百炼平台的 API KEY,然后将 API KEY 添加到软件设置中(在字幕引擎设置的更多设置中)或者配置到环境变量中(仅 Windows 平台支持读取环境变量中的 API KEY),这样才能正常使用该模型。相关教程:
- [获取 API KEY](https://help.aliyun.com/zh/model-studio/get-api-key)
- [将 API Key 配置到环境变量](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables)
### 使用 GLM-ASR 模型
使用前需要获取智谱 AI 平台的 API KEY,并添加到软件设置中。
API KEY 获取相关链接:[快速开始](https://docs.bigmodel.cn/cn/guide/start/quick-start)。
### 使用 Vosk 模型
> Vosk 模型的识别效果较差,请谨慎使用。
如果要使用 Vosk 本地字幕引擎,首先需要在 [Vosk Models](https://alphacephei.com/vosk/models) 页面下载你需要的模型,并将模型解压到本地,并将模型文件夹的路径添加到软件的设置中。目前 Vosk 字幕引擎还不支持翻译字幕内容。
如果要使用 Vosk 本地字幕引擎,首先需要在 [Vosk Models](https://alphacephei.com/vosk/models) 页面下载你需要的模型,并将模型解压到本地,并将模型文件夹的路径添加到软件的设置中。
![](./assets/media/vosk_zh.png)
![](./assets/media/config_zh.png)
**如果你觉得上述字幕引擎不能满足你的需求,而且你会 Python那么你可以考虑开发自己的字幕引擎。详细说明请参考[字幕引擎说明文档](./docs/engine-manual/zh.md)。**
### 使用 SOSV 模型
使用 SOSV 模型的方式和 Vosk 一样,下载地址如下:https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model
## ⌨️ 在终端中使用
软件采用模块化设计,可分为软件主体和字幕引擎两部分,软件主体通过图形界面调用字幕引擎。核心的音频获取和音频识别功能都在字幕引擎中实现,而字幕引擎是可以脱离软件主体单独使用的。
字幕引擎使用 Python 开发,通过 PyInstaller 打包为可执行文件。因此字幕引擎有两种使用方式:
1. 使用项目字幕引擎部分的源代码,使用安装了对应库的 Python 环境进行运行
2. 使用打包好的字幕引擎的可执行文件,通过终端运行
运行参数和详细使用介绍请参考[用户手册](./docs/user-manual/zh.md#单独使用字幕引擎)。
```bash
python main.py \
-e gummy \
-k sk-******************************** \
-a 0 \
-d 1 \
-s en \
-t zh
```
![](./docs/img/07.png)
## ⚙️ 自带字幕引擎说明
目前软件自带 2 个字幕引擎,正在规划新的引擎。它们的详细信息如下。
目前软件自带 4 个字幕引擎。它们的详细信息如下。
### Gummy 字幕引擎(云端)
@@ -100,18 +183,18 @@ $$
而且引擎只有在获取到音频流的时候才会上传数据,因此实际上传速率可能更小。模型结果回传流量消耗较小,没有纳入考虑。
### GLM-ASR 字幕引擎(云端)
https://docs.bigmodel.cn/cn/guide/models/sound-and-video/glm-asr-2512
### Vosk 字幕引擎(本地)
基于 [vosk-api](https://github.com/alphacep/vosk-api) 开发。目前只支持生成音频对应的原文,不支持生成翻译内容
基于 [vosk-api](https://github.com/alphacep/vosk-api) 开发。该字幕引擎的优点是可选的语言模型非常多(超过 30 种),缺点是识别效果比较差,且生成内容没有标点符号
### 新规划字幕引擎
以下为备选模型,将根据模型效果和集成难易程度选择。
### SOSV 字幕引擎(本地)
- [faster-whisper](https://github.com/SYSTRAN/faster-whisper)
- [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx)
- [SenseVoice](https://github.com/FunAudioLLM/SenseVoice)
- [FunASR](https://github.com/modelscope/FunASR)
[SOSV](https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model) 是一个整合包,该整合包主要基于 [Sherpa-ONNX SenseVoice](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html),并添加了端点检测模型和标点恢复模型。该模型支持识别的语言有:英语、中文、日语、韩语、粤语。
## 🚀 项目运行
@@ -128,6 +211,7 @@ npm install
首先进入 `engine` 文件夹,执行如下指令创建虚拟环境(需要使用大于等于 Python 3.10 的 Python 运行环境,建议使用 Python 3.12):
```bash
cd ./engine
# in ./engine folder
python -m venv .venv
# or
@@ -149,12 +233,6 @@ source .venv/bin/activate
pip install -r requirements.txt
```
如果在 Linux 系统上安装 `samplerate` 模块报错,可以尝试使用以下命令单独安装:
```bash
pip install samplerate --only-binary=:all:
```
然后使用 `pyinstaller` 构建项目:
```bash
@@ -188,15 +266,3 @@ npm run build:mac
# For Linux
npm run build:linux
```
注意,根据不同的平台需要修改项目根目录下 `electron-builder.yml` 文件中的配置内容:
```yml
extraResources:
# For Windows
- from: ./engine/dist/main.exe
to: ./engine/main.exe
# For macOS and Linux
# - from: ./engine/dist/main
# to: ./engine/main
```

README_en.md

@@ -3,7 +3,7 @@
<h1 align="center">auto-caption</h1>
<p>Auto Caption is a cross-platform real-time caption display software.</p>
<p>
<a href="https://github.com/HiMeditator/auto-caption/releases"><img src="https://img.shields.io/badge/release-0.7.0-blue"></a>
<a href="https://github.com/HiMeditator/auto-caption/releases"><img src="https://img.shields.io/badge/release-1.1.1-blue"></a>
<a href="https://github.com/HiMeditator/auto-caption/issues"><img src="https://img.shields.io/github/issues/HiMeditator/auto-caption?color=orange"></a>
<img src="https://img.shields.io/github/languages/top/HiMeditator/auto-caption?color=royalblue">
<img src="https://img.shields.io/github/repo-size/HiMeditator/auto-caption?color=green">
@@ -14,14 +14,18 @@
| <b>English</b>
| <a href="./README_ja.md">日本語</a> |
</p>
<p><i>Version 0.7.0 has been released, improving the software interface and adding software log display. The local caption engine is under development and is expected to be released in the form of Python code...</i></p>
<p><i>v1.1.1 has been released, adding the GLM-ASR cloud caption model and OpenAI compatible model translation...</i></p>
</div>
![](./assets/media/main_en.png)
## 📥 Download
[GitHub Releases](https://github.com/HiMeditator/auto-caption/releases)
Software Download: [GitHub Releases](https://github.com/HiMeditator/auto-caption/releases)
Vosk Model Download: [Vosk Models](https://alphacephei.com/vosk/models)
SOSV Model Download: [Sherpa-ONNX SenseVoice Model](https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model)
## 📚 Documentation
@@ -29,51 +33,131 @@
[Caption Engine Documentation](./docs/engine-manual/en.md)
[Project API Documentation (Chinese)](./docs/api-docs/)
[Changelog](./docs/CHANGELOG.md)
## 👁️‍🗨️ Preview
https://github.com/user-attachments/assets/9c188d78-9520-4397-bacf-4c8fdcc54874
## ✨ Features
- Generate captions from audio output or microphone input
- Supports calling local Ollama models, cloud-based OpenAI compatible models, or cloud-based Google Translate API for translation
- Cross-platform (Windows, macOS, Linux) and multi-language interface (Chinese, English, Japanese) support
- Rich caption style settings (font, font size, font weight, font color, background color, etc.)
- Flexible caption engine selection (Alibaba Cloud Gummy cloud model, local Vosk model, self-developed model)
- Flexible caption engine selection (Alibaba Cloud Gummy cloud model, GLM-ASR cloud model, local Vosk model, local SOSV model, or a model you develop yourself)
- Multi-language recognition and translation (see below "⚙️ Built-in Subtitle Engines")
- Subtitle record display and export (supports exporting `.srt` and `.json` formats)
## 📖 Basic Usage
> ⚠️ Note: Currently, only the latest version of the software on Windows platform is maintained, while the last versions for other platforms remain at v1.0.0.
The software has been adapted for Windows, macOS, and Linux platforms. The tested platform information is as follows:
| OS Version | Architecture | System Audio Input | System Audio Output |
| ------------------ | ------------ | ------------------ | ------------------- |
| Windows 11 24H2 | x64 | ✅ | ✅ |
| macOS Sequoia 15.5 | arm64 | ✅ Additional config required | ✅ |
| macOS Sequoia 15.5 | arm64 | ✅ [Additional config required](./docs/user-manual/en.md#capturing-system-audio-output-on-macos) | ✅ |
| Ubuntu 24.04.2 | x64 | ✅ | ✅ |
| Kali Linux 2022.3 | x64 | ✅ | ✅ |
| Kylin Server V10 SP3 | x64 | ✅ | ✅ |
Additional configuration is required to capture system audio output on macOS and Linux platforms. See [Auto Caption User Manual](./docs/user-manual/en.md) for details.
> The international version of Alibaba Cloud services does not provide the Gummy model, so non-Chinese users currently cannot use the Gummy caption engine.
To use the default Gummy caption engine (which uses cloud-based models for speech recognition and translation), you first need to obtain an API KEY from the Alibaba Cloud Bailian platform. Then add the API KEY to the software settings or configure it in environment variables (only Windows platform supports reading API KEY from environment variables) to properly use this model. Related tutorials:
After downloading the software, you need to select the corresponding model according to your needs and then configure the model.
- [Obtaining API KEY (Chinese)](https://help.aliyun.com/zh/model-studio/get-api-key)
- [Configuring API Key through Environment Variables (Chinese)](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables)
| | Accuracy | Real-time | Deployment Type | Supported Languages | Translation | Notes |
| ------------------------------------------------------------ | -------- | --------- | --------------- | ------------------- | ----------- | ----- |
| [Gummy](https://help.aliyun.com/zh/model-studio/gummy-speech-recognition-translation) | Very good 😊 | Very good 😊 | Cloud / Alibaba Cloud | 10 languages | Built-in translation | Paid, 0.54 CNY/hour |
| [glm-asr-2512](https://docs.bigmodel.cn/cn/guide/models/sound-and-video/glm-asr-2512) | Very good 😊 | Poor 😞 | Cloud / Zhipu AI | 4 languages | Requires additional configuration | Paid, approximately 0.72 CNY/hour |
| [Vosk](https://alphacephei.com/vosk) | Poor 😞 | Very good 😊 | Local / CPU | Over 30 languages | Requires additional configuration | Supports many languages |
| [SOSV](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html) | Average 😐 | Average 😐 | Local / CPU | 5 languages | Requires additional configuration | Only one model |
| Self-developed | 🤔 | 🤔 | Custom | Custom | Custom | Develop your own using Python according to the [documentation](./docs/engine-manual/en.md) |
> The recognition performance of Vosk models is suboptimal, please use with caution.
If you choose a model other than Gummy, you also need to configure your own translation model.
To use the Vosk local caption engine, first download your required model from [Vosk Models](https://alphacephei.com/vosk/models) page, extract the model locally, and add the model folder path to the software settings. Currently, the Vosk caption engine does not support translated captions.
### Configuring Translation Models
![](./assets/media/vosk_en.png)
![](./assets/media/engine_en.png)
**If you find the above caption engines don't meet your needs and you know Python, you may consider developing your own caption engine. For detailed instructions, please refer to the [Caption Engine Documentation](./docs/engine-manual/en.md).**
> Note: Translation is not real-time. The translation model is only called after each sentence recognition is completed.
#### Ollama Local Model
> Note: Using models with too many parameters will lead to high resource consumption and translation delays. It is recommended to use models with less than 1B parameters, such as: `qwen2.5:0.5b`, `qwen3:0.6b`.
Before using this model, you need to confirm that the [Ollama](https://ollama.com/) software is installed on your local machine and that you have downloaded the required large language model. Simply add the name of the large model you want to call to the `Model Name` field in the settings, and ensure that the `Base URL` field is empty.
#### OpenAI Compatible Model
If you feel the translation effect of the local Ollama model is not good enough, or don't want to install the Ollama model locally, then you can use cloud-based OpenAI compatible models.
Here are some model provider `Base URL`s:
- OpenAI: https://api.openai.com/v1
- DeepSeek: https://api.deepseek.com
- Alibaba Cloud: https://dashscope.aliyuncs.com/compatible-mode/v1
The API Key needs to be obtained from the corresponding model provider.
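All of these providers expose the same OpenAI-compatible `/chat/completions` endpoint, so switching providers only changes the `Base URL` and the key. As a rough, hypothetical sketch (not Auto Caption's internal code; the model name, prompt, and key below are placeholders), a one-shot translation request looks like:

```python
# Illustrative sketch of calling an OpenAI-compatible endpoint for
# translation. Base URL, model name, prompt, and key are placeholders,
# not Auto Caption's actual internals.
import json
import urllib.request

def build_payload(sentence, target_lang, model="deepseek-chat"):
    """Standard OpenAI-compatible chat payload for a one-shot translation."""
    return {
        "model": model,  # placeholder model name
        "messages": [
            {"role": "system",
             "content": f"Translate the user's sentence into {target_lang}. "
                        "Reply with the translation only."},
            {"role": "user", "content": sentence},
        ],
    }

def translate(sentence, target_lang, base_url, api_key):
    """POST the payload to <base_url>/chat/completions and return the text."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(build_payload(sentence, target_lang)).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```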
#### Google Translate API
> Note: Google Translate API is not available in some regions.
No configuration required, just connect to the internet to use.
### Using Gummy Model
> The international version of Alibaba Cloud services does not seem to provide the Gummy model, so non-Chinese users may not be able to use the Gummy caption engine at present.
To use the default Gummy caption engine (using cloud models for speech recognition and translation), you first need to obtain an API KEY from Alibaba Cloud Bailian platform, then add the API KEY to the software settings or configure it in the environment variables (only Windows platform supports reading API KEY from environment variables), so that the model can be used normally. Related tutorials:
- [Get API KEY](https://help.aliyun.com/zh/model-studio/get-api-key)
- [Configure API Key through Environment Variables](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables)
### Using the GLM-ASR Model
Before using it, you need to obtain an API KEY from the Zhipu AI platform and add it to the software settings.
For API KEY acquisition, see: [Quick Start](https://docs.bigmodel.cn/en/guide/start/quick-start).
### Using Vosk Model
> The recognition effect of the Vosk model is poor, please use it with caution.
To use the Vosk local caption engine, first download the model you need from the [Vosk Models](https://alphacephei.com/vosk/models) page, unzip the model locally, and add the path of the model folder to the software settings.
![](./assets/media/config_en.png)
### Using SOSV Model
The way to use the SOSV model is the same as Vosk. The download address is as follows: https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model
## ⌨️ Using in Terminal
The software adopts a modular design and can be divided into two parts: the main application and the caption engine. The main application calls the caption engine through a graphical interface. The core audio capture and speech recognition functions are implemented in the caption engine, which can be used on its own without the main application.
The caption engine is developed in Python and packaged into executable files with PyInstaller, so there are two ways to use it:
1. Use the source code of the project's caption engine part and run it with a Python environment that has the required libraries installed
2. Run the packaged executable file of the caption engine through the terminal
For runtime parameters and detailed usage instructions, please refer to the [User Manual](./docs/user-manual/en.md#using-caption-engine-standalone).
```bash
python main.py \
-e gummy \
-k sk-******************************** \
-a 0 \
-d 1 \
-s en \
-t zh
```
![](./docs/img/07.png)
## ⚙️ Built-in Subtitle Engines
Currently, the software comes with 2 subtitle engines, with new engines under development. Their detailed information is as follows.
Currently, the software comes with 4 caption engines, with new engines under development. Their detailed information is as follows.
### Gummy Subtitle Engine (Cloud)
@@ -92,7 +176,7 @@ Developed based on Tongyi Lab's [Gummy Speech Translation Model](https://help.al
**Network Traffic Consumption:**
The subtitle engine uses native sample rate (assumed to be 48kHz) for sampling, with 16bit sample depth and mono channel, so the upload rate is approximately:
The caption engine uses native sample rate (assumed to be 48kHz) for sampling, with 16bit sample depth and mono channel, so the upload rate is approximately:
$$
48000\ \text{samples/second} \times 2\ \text{bytes/sample} \times 1\ \text{channel} = 93.75\ \text{KB/s}
@@ -100,18 +184,17 @@ $$
The engine only uploads data when receiving audio streams, so the actual upload rate may be lower. The return traffic consumption of model results is small and not considered here.
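The figure above is easy to verify (taking 1 KB = 1024 bytes):

```python
# Raw PCM upload rate: samples/s * bytes per sample * channels.
sample_rate = 48_000    # 48 kHz native sample rate
bytes_per_sample = 2    # 16-bit sample depth
channels = 1            # mono

rate_bytes = sample_rate * bytes_per_sample * channels  # bytes per second
rate_kb = rate_bytes / 1024
print(rate_kb)  # → 93.75 (KB/s)
```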
### GLM-ASR Caption Engine (Cloud)
https://docs.bigmodel.cn/en/guide/models/sound-and-video/glm-asr-2512
### Vosk Subtitle Engine (Local)
Developed based on [vosk-api](https://github.com/alphacep/vosk-api). Currently only supports generating original text from audio, does not support translation content.
Developed based on [vosk-api](https://github.com/alphacep/vosk-api). The advantage of this caption engine is that there are many optional language models (over 30 languages), but the disadvantage is that the recognition effect is relatively poor, and the generated content has no punctuation.
### Planned New Subtitle Engines
### SOSV Subtitle Engine (Local)
The following are candidate models that will be selected based on model performance and ease of integration.
- [faster-whisper](https://github.com/SYSTRAN/faster-whisper)
- [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx)
- [SenseVoice](https://github.com/FunAudioLLM/SenseVoice)
- [FunASR](https://github.com/modelscope/FunASR)
[SOSV](https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model) is an integrated package, mainly based on [Sherpa-ONNX SenseVoice](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html), with an added endpoint-detection model and punctuation-restoration model. It supports recognition of English, Chinese, Japanese, Korean, and Cantonese.
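If none of the built-in engines fit, the caption engine documentation linked earlier describes how to develop your own in Python. Purely as a hypothetical sketch of the general shape (the actual interface is specified in the engine manual; the `index`, `text`, and `translation` fields and the JSON-lines framing below are invented for illustration), a custom engine is a long-running process that emits one caption record at a time:

```python
# Hypothetical sketch of a custom caption engine's output loop.
# The real interface is specified in docs/engine-manual; the field names
# and JSON-lines framing here are invented for illustration only.
import json
import sys
import time

def make_caption(index: int, text: str, translation: str = "") -> dict:
    """Assemble one caption record as a plain dict."""
    return {
        "index": index,
        "time": time.strftime("%H:%M:%S"),
        "text": text,
        "translation": translation,
    }

def emit(caption: dict) -> None:
    """Write one caption as a single JSON line for a host app to parse."""
    sys.stdout.write(json.dumps(caption, ensure_ascii=False) + "\n")
    sys.stdout.flush()

if __name__ == "__main__":
    # A real engine would capture audio and run speech recognition here.
    emit(make_caption(0, "hello world"))
```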
## 🚀 Project Setup
@@ -128,6 +211,7 @@ npm install
First enter the `engine` folder and execute the following commands to create a virtual environment (requires Python 3.10 or higher, with Python 3.12 recommended):
```bash
cd ./engine
# in ./engine folder
python -m venv .venv
# or
@@ -149,12 +233,6 @@ Then install dependencies (this step might result in errors on macOS and Linux,
pip install -r requirements.txt
```
If you encounter errors when installing the `samplerate` module on Linux systems, you can try installing it separately with this command:
```bash
pip install samplerate --only-binary=:all:
```
Then use `pyinstaller` to build the project:
```bash
@@ -188,15 +266,3 @@ npm run build:mac
# For Linux
npm run build:linux
```
Note: You need to modify the configuration content in the `electron-builder.yml` file in the project root directory according to different platforms:
```yml
extraResources:
# For Windows
- from: ./engine/dist/main.exe
to: ./engine/main.exe
# For macOS and Linux
# - from: ./engine/dist/main
# to: ./engine/main
```

README_ja.md

@@ -3,7 +3,7 @@
<h1 align="center">auto-caption</h1>
<p>Auto Caption はクロスプラットフォームのリアルタイム字幕表示ソフトウェアです。</p>
<p>
<a href="https://github.com/HiMeditator/auto-caption/releases"><img src="https://img.shields.io/badge/release-0.7.0-blue"></a>
<a href="https://github.com/HiMeditator/auto-caption/releases"><img src="https://img.shields.io/badge/release-1.1.1-blue"></a>
<a href="https://github.com/HiMeditator/auto-caption/issues"><img src="https://img.shields.io/github/issues/HiMeditator/auto-caption?color=orange"></a>
<img src="https://img.shields.io/github/languages/top/HiMeditator/auto-caption?color=royalblue">
<img src="https://img.shields.io/github/repo-size/HiMeditator/auto-caption?color=green">
@@ -14,14 +14,18 @@
| <a href="./README_en.md">English</a>
| <b>日本語</b> |
</p>
<p><i>バージョン 0.7.0 がリリースされ、ソフトウェアインターフェースが最適化され、ログ記録表示機能が追加されました。ローカルの字幕エンジンは現在開発中であり、Pythonコードの形式でリリースされる予定です...</i></p>
<p><i>v1.1.1 バージョンがリリースされました。GLM-ASR クラウド字幕モデルと OpenAI 互換モデル翻訳が追加されました...</i></p>
</div>
![](./assets/media/main_ja.png)
## 📥 ダウンロード
[GitHub Releases](https://github.com/HiMeditator/auto-caption/releases)
ソフトウェアダウンロード: [GitHub Releases](https://github.com/HiMeditator/auto-caption/releases)
Vosk モデルダウンロード: [Vosk Models](https://alphacephei.com/vosk/models)
SOSV モデルダウンロード: [Sherpa-ONNX SenseVoice Model](https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model)
## 📚 関連ドキュメント
@@ -29,51 +33,132 @@
[字幕エンジン説明ドキュメント](./docs/engine-manual/ja.md)
[プロジェクト API ドキュメント(中国語)](./docs/api-docs/)
[更新履歴](./docs/CHANGELOG.md)
## 👁️‍🗨️ プレビュー
https://github.com/user-attachments/assets/9c188d78-9520-4397-bacf-4c8fdcc54874
## ✨ 特徴
- 音声出力またはマイク入力からの字幕生成
- ローカルのOllamaモデル、クラウド上のOpenAI互換モデル、またはクラウド上のGoogle翻訳APIを呼び出して翻訳を行うことをサポートしています
- クロスプラットフォーム(Windows、macOS、Linux)、多言語インターフェース(中国語、英語、日本語)対応
- 豊富な字幕スタイル設定(フォント、フォントサイズ、フォント太さ、フォント色、背景色など)
- 柔軟な字幕エンジン選択(阿里雲 Gummy クラウドモデル、ローカル Vosk モデル、独自開発モデル)
- 柔軟な字幕エンジン選択(阿里雲 Gummy クラウドモデル、GLM-ASR クラウドモデル、ローカル Vosk モデル、ローカル SOSV モデル、または独自にモデルを開発可能)
- 多言語認識と翻訳(下記「⚙️ 字幕エンジン説明」参照)
- 字幕記録表示とエクスポート(`.srt` および `.json` 形式のエクスポートに対応)
## 📖 基本使い方
> ⚠️ 注意:現在、Windows プラットフォームのソフトウェアの最新バージョンのみがメンテナンスされており、他のプラットフォームの最終バージョンは v1.0.0 のままです。
このソフトウェアは Windows、macOS、Linux プラットフォームに対応しています。テスト済みのプラットフォーム情報は以下の通りです:
| OS バージョン | アーキテクチャ | システムオーディオ入力 | システムオーディオ出力 |
| ------------------ | ------------ | ------------------ | ------------------- |
| Windows 11 24H2 | x64 | ✅ | ✅ |
| macOS Sequoia 15.5 | arm64 | ✅ 追加設定が必要 | ✅ |
| macOS Sequoia 15.5 | arm64 | ✅ [追加設定が必要](./docs/user-manual/ja.md#macos-でのシステムオーディオ出力の取得方法) | ✅ |
| Ubuntu 24.04.2 | x64 | ✅ | ✅ |
| Kali Linux 2022.3 | x64 | ✅ | ✅ |
| Kylin Server V10 SP3 | x64 | ✅ | ✅ |
macOS および Linux プラットフォームでシステムオーディオ出力を取得するには追加設定が必要です。詳細は[Auto Captionユーザーマニュアル](./docs/user-manual/ja.md)をご覧ください。
> 阿里雲の国際版サービスでは Gummy モデルを提供していないため、現在中国以外のユーザーは Gummy 字幕エンジンを使用できません。
ソフトウェアをダウンロードした後、自分のニーズに応じて対応するモデルを選択し、モデルを設定する必要があります。
デフォルトの Gummy 字幕エンジン(クラウドベースのモデルを使用した音声認識と翻訳)を使用するには、まず阿里雲百煉プラットフォームから API KEY を取得する必要があります。その後、API KEY をソフトウェア設定に追加するか、環境変数に設定します(Windows プラットフォームのみ環境変数からの API KEY 読み取りをサポート)。関連チュートリアル:
| | 正確性 | 実時間性 | デプロイタイプ | 対応言語 | 翻訳 | 備考 |
| ------------------------------------------------------------ | -------- | --------- | -------------- | -------- | ---- | ---- |
| [Gummy](https://help.aliyun.com/zh/model-studio/gummy-speech-recognition-translation) | とても良い😊 | とても良い😊 | クラウド / アリババクラウド | 10言語 | 内蔵翻訳 | 有料、0.54元/時間 |
| [glm-asr-2512](https://docs.bigmodel.cn/cn/guide/models/sound-and-video/glm-asr-2512) | とても良い😊 | 悪い😞 | クラウド / Zhipu AI | 4言語 | 追加設定が必要 | 有料、約0.72元/時間 |
| [Vosk](https://alphacephei.com/vosk) | 悪い😞 | とても良い😊 | ローカル / CPU | 30言語以上 | 追加設定が必要 | 多くの言語に対応 |
| [SOSV](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html) | 普通😐 | 普通😐 | ローカル / CPU | 5言語 | 追加設定が必要 | 1つのモデルのみ |
| 自分で開発 | 🤔 | 🤔 | カスタム | カスタム | カスタム | [ドキュメント](./docs/engine-manual/ja.md)に従ってPythonを使用して自分で開発 |
- [API KEY の取得(中国語)](https://help.aliyun.com/zh/model-studio/get-api-key)
- [環境変数を通じて API Key を設定(中国語)](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables)
Gummyモデル以外を選択した場合、独自の翻訳モデルを設定する必要があります。
> Vosk モデルの認識精度は低いため、注意してご使用ください。
### 翻訳モデルの設定
Vosk ローカル字幕エンジンを使用するには、まず [Vosk Models](https://alphacephei.com/vosk/models) ページから必要なモデルをダウンロードし、ローカルに解凍した後、モデルフォルダのパスをソフトウェア設定に追加してください。現在、Vosk 字幕エンジンは字幕の翻訳をサポートしていません。
![](./assets/media/engine_ja.png)
![](./assets/media/vosk_ja.png)
> 注意:翻訳はリアルタイムではありません。翻訳モデルは各文の認識が完了した後にのみ呼び出されます。
**上記の字幕エンジンがご要望を満たさず、かつ Python の知識をお持ちの場合、独自の字幕エンジンを開発することも可能です。詳細な説明は[字幕エンジン説明書](./docs/engine-manual/ja.md)をご参照ください。**
#### Ollama ローカルモデル
> 注意:パラメータ数が多すぎるモデルを使用すると、リソース消費と翻訳遅延が大きくなります。1B 未満のパラメータ数のモデルを使用することを推奨します。例:`qwen2.5:0.5b`、`qwen3:0.6b`。
このモデルを使用する前に、ローカルマシンに[Ollama](https://ollama.com/)ソフトウェアがインストールされており、必要な大規模言語モデルをダウンロード済みであることを確認してください。設定で呼び出す必要がある大規模モデル名を「モデル名」フィールドに入力し、「Base URL」フィールドが空であることを確認してください。
#### OpenAI互換モデル
ローカルのOllamaモデルの翻訳効果が良くないと感じる場合や、ローカルにOllamaモデルをインストールしたくない場合は、クラウド上のOpenAI互換モデルを使用できます。
いくつかのモデルプロバイダの「Base URL」:
- OpenAI: https://api.openai.com/v1
- DeepSeek: https://api.deepseek.com
- アリババクラウド: https://dashscope.aliyuncs.com/compatible-mode/v1
API Keyは対応するモデルプロバイダから取得する必要があります。
#### Google翻訳API
> 注意:Google翻訳APIは一部の地域では使用できません。
設定不要で、ネット接続があれば使用できます。
### Gummyモデルの使用
> 阿里云の国際版サービスにはGummyモデルが提供されていないため、現在中国以外のユーザーはGummy字幕エンジンを使用できない可能性があります。
デフォルトの Gummy 字幕エンジン(クラウドモデルを使用した音声認識と翻訳)を使用するには、まず阿里云百煉プラットフォームの API KEY を取得し、API KEY をソフトウェア設定に追加するか環境変数に設定する必要があります(Windows プラットフォームのみ環境変数からの API KEY 読み取りをサポート)。関連チュートリアル:
- [API KEYの取得](https://help.aliyun.com/zh/model-studio/get-api-key)
- [環境変数へのAPI Keyの設定](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables)
### GLM-ASR モデルの使用
使用前に、Zhipu AI プラットフォームから API キーを取得し、それをソフトウェアの設定に追加する必要があります。
API キーの取得についてはこちらをご覧ください:[クイックスタート](https://docs.bigmodel.cn/ja/guide/start/quick-start)。
### Voskモデルの使用
> Vosk モデルの認識精度は低いため、注意してご使用ください。
Voskローカル字幕エンジンを使用するには、まず[Vosk Models](https://alphacephei.com/vosk/models)ページから必要なモデルをダウンロードし、ローカルにモデルを解凍し、モデルフォルダのパスをソフトウェア設定に追加してください。
![](./assets/media/config_ja.png)
### SOSVモデルの使用
SOSV モデルの使用方法は Vosk と同じで、ダウンロードアドレスは以下の通りです:https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model
## ⌨️ ターミナルでの使用
ソフトウェアはモジュール化設計を採用しており、ソフトウェア本体と字幕エンジンの2つの部分に分けることができます。ソフトウェア本体はグラフィカルインターフェースを通じて字幕エンジンを呼び出します。コアとなる音声取得および音声認識機能はすべて字幕エンジンに実装されており、字幕エンジンはソフトウェア本体から独立して単独で使用できます。
字幕エンジンはPythonを使用して開発され、PyInstallerによって実行可能ファイルとしてパッケージ化されます。したがって、字幕エンジンの使用方法は以下の2つがあります
1. プロジェクトの字幕エンジン部分のソースコードを使用し、必要なライブラリがインストールされたPython環境で実行する
2. パッケージ化された字幕エンジンの実行可能ファイルをターミナルから実行する
実行引数および詳細な使用方法については、[User Manual](./docs/user-manual/en.md#using-caption-engine-standalone)をご参照ください。
```bash
python main.py \
-e gummy \
-k sk-******************************** \
-a 0 \
-d 1 \
-s en \
-t zh
```
![](./docs/img/07.png)
## ⚙️ 字幕エンジン説明
現在、ソフトウェアには2つの字幕エンジンが搭載されており、新しいエンジンが計画されています。それらの詳細情報は以下の通りです。
現在、ソフトウェアには4つの字幕エンジンが搭載されており、新しいエンジンが計画されています。それらの詳細情報は以下の通りです。
### Gummy 字幕エンジン(クラウド)
@@ -100,18 +185,17 @@ $$
また、エンジンはオーディオストリームを取得したときのみデータをアップロードするため、実際のアップロードレートはさらに小さくなる可能性があります。モデル結果の返信トラフィック消費量は小さく、ここでは考慮していません。
### GLM-ASR 字幕エンジン(クラウド)
https://docs.bigmodel.cn/ja/guide/models/sound-and-video/glm-asr-2512
### Vosk字幕エンジンローカル
[vosk-api](https://github.com/alphacep/vosk-api) をベースに開発されています。現在は音声に対応する原文の生成のみをサポートしており、翻訳コンテンツはサポートしていません
[vosk-api](https://github.com/alphacep/vosk-api)をベースに開発。この字幕エンジンの利点は選択可能な言語モデルが非常に多く30言語以上、欠点は認識効果が比較的悪く、生成内容に句読点がないことです
### SOSV Caption Engine (Local)
[SOSV](https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model) is a bundled package based mainly on [Sherpa-ONNX SenseVoice](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html), with an endpoint-detection model and a punctuation-restoration model added. The model supports recognition of English, Chinese, Japanese, Korean, and Cantonese.
## 🚀 Running the Project
@@ -128,6 +212,7 @@ npm install
First enter the `engine` folder and run the following commands to create a virtual environment (Python 3.10 or later is required; Python 3.12 is recommended):
```bash
cd ./engine
# inside the ./engine folder
python -m venv .venv
# or
@@ -149,12 +234,6 @@ source .venv/bin/activate
pip install -r requirements.txt
```
If installing the `samplerate` module fails on Linux, you can try installing it separately with the following command:
```bash
pip install samplerate --only-binary=:all:
```
Then build the project with `pyinstaller`:
```bash
@@ -188,15 +267,3 @@ npm run build:mac
# for Linux
npm run build:linux
```
Note: depending on your target platform, you need to adjust the configuration in the `electron-builder.yml` file in the project root:
```yml
extraResources:
# for Windows
- from: ./engine/dist/main.exe
to: ./engine/main.exe
# for macOS and Linux
# - from: ./engine/dist/main
# to: ./engine/main
```

BIN assets/media/config_en.png (new file)
BIN assets/media/config_ja.png (new file)
BIN assets/media/config_zh.png (new file)
BIN assets/media/engine_en.png (new file)
BIN assets/media/engine_ja.png (new file)
BIN assets/media/engine_zh.png (new file)
@@ -8,5 +8,9 @@
<true/>
<key>com.apple.security.cs.allow-dyld-environment-variables</key>
<true/>
<key>com.apple.security.cs.disable-library-validation</key>
<true/>
<key>com.apple.security.device.audio-input</key>
<true/>
</dict>
</plist>

@@ -153,4 +153,38 @@
### Improvements
- Improved some components of the software UI
- Clearer log output
## v1.0.0
2025-09-08
### New Features
- Added a startup timeout for the caption engine: if the engine does not start successfully within the allotted time it is closed automatically, and it can also be closed manually while starting
- Added non-real-time translation: supports translation with local Ollama models and with the Google Translate API
- Added a new recognition model: the SOSV model, supporting English, Chinese, Japanese, Korean, and Cantonese
- Added audio recording: audio recognized by the caption engine can be saved as a .wav file
- Added multi-line captions: users can set how many caption lines the caption window displays
### Improvements
- Improved the placement of some prompt messages
- Replaced the resampling model to improve audio resampling quality
- Changed the color of tags carrying extra information to match the theme color
## v1.1.0
### New Features
- Added a caption engine based on GLM-ASR
- Added OpenAI-API-compatible models as a new translation option
## v1.1.1
### Improvements
- Removed the always-on-top toggle for the caption window; the window now always stays on top
- Replaced the always-on-top toggle with a mouse pass-through toggle: when the pin icon is solid, mouse pass-through is enabled

@@ -21,18 +21,10 @@
- [x] Allow copying only the most recent records when copying caption history *2025/08/18*
- [x] Add color theme settings *2025/08/18*
- [x] Show log content in the frontend *2025/08/19*
- [x] Add Ollama models for local caption engine translation *2025/09/04*
- [x] Verify / add a caption engine based on sherpa-onnx *2025/09/06*
- [x] Add the GLM-ASR model *2026/01/10*
## TODO
- [ ] Investigate more cloud models (Volcengine, OpenAI, Google, etc.)
## Later Plans
- [ ] Verify / add a caption engine based on FunASR
- [ ] Reduce unnecessary application size
## Distant Future
None for now

@@ -58,6 +58,19 @@ Electron 主进程通过 TCP Socket 向 Python 进程发送数据。发送的数
Caption data converted from the audio stream captured on the Python side.
### `translation`
```js
{
command: "translation",
time_s: string,
text: string,
translation: string
}
```
The translation of recognized speech content; the corresponding caption can be identified by its start time.
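As an illustrative sketch (hypothetical data; the field layout mirrors the messages above), the main process can match a `translation` message to its caption by comparing `time_s`:

```python
# Hypothetical caption records already received from the engine.
captions = [
    {"command": "caption", "index": 0, "time_s": "00:00:01.2", "text": "hello", "translation": ""},
    {"command": "caption", "index": 1, "time_s": "00:00:03.8", "text": "world", "translation": ""},
]

# A later translation message; time_s identifies the caption it belongs to.
msg = {"command": "translation", "time_s": "00:00:03.8", "text": "world", "translation": "世界"}

for cap in captions:
    if cap["time_s"] == msg["time_s"]:
        cap["translation"] = msg["translation"]
```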
### `print`
```js
@@ -67,7 +80,7 @@ Python 端监听到的音频流转换为的字幕数据。
}
```
Outputs content printed on the Python side; not recorded in the log.
### `info`
@@ -78,7 +91,18 @@ Python 端监听到的音频流转换为的字幕数据。
}
```
Prompt messages printed on the Python side; recorded in the log.
### `warn`
```js
{
command: "warn",
content: string
}
```
Warning messages printed on the Python side; recorded in the log.
### `error`
@@ -89,7 +113,7 @@ Python 端打印的提示信息,比起 `print`,该信息更希望 Electron
}
```
Error messages printed on the Python side; displayed as a popup in the frontend.
### `usage`

@@ -182,6 +182,16 @@
**Data type:** no data
### `control.engine.forceKill`
**Description:** Force-close a caption engine whose startup has timed out
**Sender:** frontend control window
**Receiver:** backend control window instance
**Data type:** no data
### `caption.windowHeight.change`
**Description:** The caption window height has changed
@@ -192,9 +202,9 @@
**Data type:** `number`
### `caption.mouseEvents.ignore`
**Description:** Whether to enable mouse pass-through
**Sender:** frontend caption window

@@ -1,175 +1,207 @@
# Caption Engine Documentation
Corresponding version: v1.0.0
![](../../assets/media/structure_en.png)
## Introduction to the Caption Engine
The so-called caption engine is actually a subprocess that captures streaming data from the system audio input (microphone) or output (speaker) in real time and invokes an audio-to-text model to generate captions for the corresponding audio. The generated captions are converted into JSON-formatted string data and transmitted to the main program through standard output (ensuring that the string received by the main program can be correctly interpreted as a JSON object). The main program reads and interprets the caption data, processes it, and displays it in the window.
**Communication between the caption engine process and the Electron main process follows this standard: [caption engine api-doc](../api-docs/caption-engine.md).**
## Execution Flow
The communication flow between the main process and the caption engine:
### Starting the Engine
- Electron main process: starts the caption engine process with `child_process.spawn()`
- Caption engine process: creates a TCP Socket server thread; once created, it writes a JSON object string to standard output containing a `command` field with the value `connect`
- Electron main process: listens to the caption engine's standard output, splits it line by line, parses each line into a JSON object, and checks whether the object's `command` field is `connect`; if so, it connects to the TCP Socket server
### Caption Recognition
- Caption engine process: creates a new thread to monitor system audio output and puts the captured audio chunks into a shared queue (`shared_data.chunk_queue`); the engine continuously reads and parses chunks from the queue, possibly creating another thread for translation, and finally writes the parsed caption data object strings to standard output
- Electron main process: continuously listens to the caption engine's standard output and acts according to the `command` field of each parsed object
### Stopping the Engine
- Electron main process: when the user closes the caption engine in the frontend, the main process sends an object string with the `command` field set to `stop` to the caption engine process over the Socket connection
- Caption engine process: receives and parses the object string; if the object's `command` field is `stop`, it sets the global variable `shared_data.status` to `stop`
- Caption engine process: the main thread's audio capture loop ends once `shared_data.status` is no longer `running`; it then releases resources and exits
- Electron main process: on detecting that the caption engine process has exited, it performs the corresponding cleanup and notifies the frontend
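The shutdown handshake described above can be sketched in a self-contained way (simplified stand-in names; the real engine uses `start_server` and `shared_data` from its `utils` module):

```python
import json
import socket
import threading
import time

# Hypothetical stand-in for the engine's shared_data.status flag.
status = {"value": "running"}

def socket_server(sock):
    """Accept one connection and handle a {"command": "stop"} message."""
    conn, _ = sock.accept()
    msg = json.loads(conn.recv(1024).decode("utf-8"))
    if msg.get("command") == "stop":
        status["value"] = "stop"
    conn.close()
    sock.close()

sock = socket.socket()
sock.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
sock.listen(1)
port = sock.getsockname()[1]
threading.Thread(target=socket_server, args=(sock,), daemon=True).start()

# Simulate the Electron main process asking the engine to stop.
client = socket.socket()
client.connect(("127.0.0.1", port))
client.sendall(json.dumps({"command": "stop"}).encode("utf-8"))
client.close()

# Main-thread capture loop: exits once the status flag is flipped.
while status["value"] == "running":
    time.sleep(0.01)          # placeholder for read_chunk()/recognition
```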
## Implemented Features
The following features have been implemented and can be reused directly.
### Standard Output
Supports printing general information, commands, and error messages.
Examples:
```python
from utils import stdout, stdout_cmd, stdout_obj, stderr
# {"command": "print", "content": "Hello"}\n
stdout("Hello")
# {"command": "connect", "content": "8080"}\n
stdout_cmd("connect", "8080")
# {"command": "print", "content": "print"}\n
stdout_obj({"command": "print", "content": "print"})
# sys.stderr.write("Error Info" + "\n")
stderr("Error Info")
```
### Creating a Socket Service
This Socket service listens on a specified port, parses content sent by the Electron main program, and may change the value of `shared_data.status`.
Example:
```python
from utils import start_server
from utils import shared_data

port = 8080
start_server(port)
while shared_data.status == 'running':
    # do something
    pass
```
### Audio Capture
The `AudioStream` class captures audio data; the implementation is cross-platform, supporting Windows, Linux, and macOS. Its initializer takes two parameters:
- `audio_type`: audio capture type; `0` for system output audio (speaker), `1` for system input audio (microphone)
- `chunk_rate`: audio capture frequency, i.e. the number of audio chunks captured per second; defaults to 10
The class contains four methods:
- `open_stream()`: start audio capture
- `read_chunk() -> bytes`: read one audio chunk
- `close_stream()`: stop audio capture
- `close_stream_signal()`: thread-safe closing of the system audio input stream
Example:
```python
from sysaudio import AudioStream
audio_type = 0
chunk_rate = 20
stream = AudioStream(audio_type, chunk_rate)
stream.open_stream()
while True:
    data = stream.read_chunk()
    # do something with data
    pass
stream.close_stream()
```
### Audio Processing
Before converting the captured audio stream to text, preprocessing may be required: multi-channel audio usually needs to be converted to mono, and resampling may also be needed. This project provides two audio processing functions:
- `merge_chunk_channels(chunk: bytes, channels: int) -> bytes`: convert a multi-channel audio chunk to a mono chunk
- `resample_chunk_mono(chunk: bytes, channels: int, orig_sr: int, target_sr: int) -> bytes`: convert a multi-channel audio chunk to mono, then resample it
Example:
```python
from sysaudio import AudioStream
from utils import resample_chunk_mono

stream = AudioStream(1)
stream.open_stream()
while True:
    raw_chunk = stream.read_chunk()
    chunk = resample_chunk_mono(raw_chunk, stream.CHANNELS, stream.RATE, 16000)
    # do something with chunk
```
## Features to Be Implemented by the Caption Engine
### Audio-to-Text Conversion
After obtaining a suitable audio stream, it needs to be converted to text. Various models (cloud-based or local) are generally used for this; choose an appropriate model based on your requirements.
It is recommended to encapsulate this as a class implementing four methods:
- `start(self)`: start the model
- `send_audio_frame(self, data: bytes)`: process the current audio chunk; **the generated caption data is sent to the Electron main process via standard output**
- `translate(self)`: continuously take data chunks from `shared_data.chunk_queue` and call `send_audio_frame` to process them
- `stop(self)`: stop the model
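A toy skeleton of this four-method contract (hypothetical class; a real engine would wrap an actual cloud or local ASR model):

```python
class EchoEngine:
    """Hypothetical toy engine following the four-method contract.

    A real engine would feed audio frames to a recognizer and emit
    caption JSON on standard output."""

    def __init__(self):
        self.running = False
        self.frames = 0

    def start(self):
        self.running = True

    def send_audio_frame(self, data: bytes):
        # A real implementation would pass `data` to the recognizer and
        # print the resulting caption object; here we only count frames.
        if self.running:
            self.frames += 1

    def translate(self):
        # In the real engine this loops over shared_data.chunk_queue and
        # calls send_audio_frame(); omitted in this sketch.
        pass

    def stop(self):
        self.running = False
```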
Complete caption engine examples:
- [gummy.py](../../engine/audio2text/gummy.py)
- [vosk.py](../../engine/audio2text/vosk.py)
- [sosv.py](../../engine/audio2text/sosv.py)
### Caption Translation
Some speech-to-text models do not provide translation. If translation is needed, an additional translation module must be added, or the project's built-in translation modules can be used.
Example:
```python
from utils import google_translate, ollama_translate
text = "This is a translation test."
google_translate("", "en", text, "time_s")
ollama_translate("qwen3:0.6b", "en", text, "time_s")
```
### Caption Data Transmission
After obtaining the text for the current audio stream, it must be sent to the main program. The caption engine process passes caption data to the Electron main process via standard output.
The transmitted content must be a JSON string, and the JSON object must contain the following parameters:
```typescript
export interface CaptionItem {
command: "caption",
index: number, // Caption sequence number
time_s: string, // Current caption start time
time_t: string, // Current caption end time
text: string, // Caption content
translation: string // Caption translation
}
```
**Note that the buffer must be flushed after each caption JSON data output to ensure that the Electron main process receives strings that can be interpreted as JSON objects each time.** It is recommended to use the project's existing `stdout_obj` function for transmission.
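A minimal sketch of what such a helper does (assumed behavior modeled on the description above, not the project's actual implementation):

```python
import json
import sys

def stdout_obj(obj: dict) -> str:
    """Write one JSON object per line and flush immediately, so the
    Electron main process can parse each received line on its own."""
    line = json.dumps(obj, ensure_ascii=False)
    sys.stdout.write(line + "\n")
    sys.stdout.flush()   # without this, buffering could split or merge lines
    return line

stdout_obj({
    "command": "caption",
    "index": 0,
    "time_s": "00:00:01.2",
    "time_t": "00:00:03.8",
    "text": "Hello world",
    "translation": "",
})
```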
### Command Line Parameter Specification
Custom caption engine settings are provided via command-line parameters, so the engine's parameters must be set up correctly. The parameters currently used by this project are:
```python
import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Convert system audio stream to text')
    # all
    parser.add_argument('-e', '--caption_engine', default='gummy', help='Caption engine: gummy or vosk')
    parser.add_argument('-a', '--audio_type', default=0, help='Audio stream source: 0 for output, 1 for input')
    parser.add_argument('-c', '--chunk_rate', default=10, help='Number of audio stream chunks collected per second')
    parser.add_argument('-p', '--port', default=0, help='The port to run the server on, 0 for no server')
    parser.add_argument('-t', '--target_language', default='zh', help='Target language code, "none" for no translation')
    parser.add_argument('-r', '--record', default=0, help='Whether to record the audio, 0 for no recording, 1 for recording')
    parser.add_argument('-rp', '--record_path', default='', help='Path to save the recorded audio')
    # gummy and sosv
    parser.add_argument('-s', '--source_language', default='auto', help='Source language code')
    # gummy only
    parser.add_argument('-k', '--api_key', default='', help='API KEY for Gummy model')
    # vosk and sosv
    parser.add_argument('-tm', '--translation_model', default='ollama', help='Model for translation: ollama or google')
    parser.add_argument('-omn', '--ollama_name', default='', help='Ollama model name for translation')
    # vosk only
    parser.add_argument('-vosk', '--vosk_model', default='', help='The path to the vosk model.')
    # sosv only
    parser.add_argument('-sosv', '--sosv_model', default=None, help='The SenseVoice model path')
```
For example, to use this project's caption engine with the Gummy model, Japanese as the source language, Chinese as the translation target, captions captured from system audio output, and 0.1 s audio chunks, the command-line arguments are:
```bash
python main.py -e gummy -s ja -t zh -a 0 -c 10 -k <dashscope-api-key>
```
## Additional Notes
@@ -183,7 +215,7 @@ python main.py -e gummy -s ja -t zh -a 0 -c 10 -k <dashscope-api-key>
### Development Recommendations
Except for audio-to-text conversion and translation, the other components (audio capture, audio resampling, and communication with the main process) are best reused directly from this project's code. In that case, what you need to add is:
- `engine/audio2text/`: Add a new audio-to-text class (file-level).
- `engine/main.py`: Add new parameter settings and workflow functions (refer to `main_gummy` and `main_vosk` functions).

@@ -1,5 +1,7 @@
# Caption Engine Documentation
## Note: this document is no longer maintained, so the information in it is outdated. For the latest information, see the [Chinese](./zh.md) or [English](./en.md) version of this document.
Corresponding version: v0.6.0
This document was translated using a large language model, so some content may be inaccurate.

@@ -1,6 +1,6 @@
# Caption Engine Documentation
Corresponding version: v1.0.0
![](../../assets/media/structure_zh.png)
@@ -16,22 +16,21 @@
### Starting the Engine
- Electron main process: starts the caption engine process with `child_process.spawn()`
- Caption engine process: creates a TCP Socket server thread; once created, it writes a JSON object string to standard output containing a `command` field with the value `connect`
- Main process: listens to the caption engine's standard output, splits it line by line, parses each line into a JSON object, and checks whether the `command` field is `connect`; if so, it connects to the TCP Socket server
### Caption Recognition
- Caption engine process: creates a new thread to monitor system audio output and puts the captured audio chunks into a shared queue (`shared_data.chunk_queue`); the engine continuously reads and parses chunks from the queue, possibly creating another thread for translation, and finally writes the parsed caption data object strings to standard output
- Electron main process: continuously listens to the caption engine's standard output and acts according to the `command` field of each parsed object
### Stopping the Engine
- Electron main process: when the user closes the caption engine in the frontend, the main process sends an object string with the `command` field set to `stop` to the caption engine process over the Socket connection
- Caption engine process: receives and parses the object string; if the object's `command` field is `stop`, it sets the global variable `shared_data.status` to `stop`
- Caption engine process: the main thread's audio capture loop ends once `shared_data.status` is no longer `running`; it then releases resources and exits
- Electron main process: on detecting that the caption engine process has exited, it performs the corresponding cleanup and notifies the frontend
## Implemented Features
@@ -45,21 +44,25 @@
```python
from utils import stdout, stdout_cmd, stdout_obj, stderr
# {"command": "print", "content": "Hello"}\n
stdout("Hello")
# {"command": "connect", "content": "8080"}\n
stdout_cmd("connect", "8080")
# {"command": "print", "content": "print"}\n
stdout_obj({"command": "print", "content": "print"})
# sys.stderr.write("Error Info" + "\n")
stderr("Error Info")
```
### Creating a Socket Service
This Socket service listens on a specified port, parses content sent by the Electron main program, and may change the value of `shared_data.status`.
Example:
```python
from utils import start_server
from utils import shared_data
port = 8080
start_server(port)
while shared_data.status == 'running':
@@ -72,13 +75,14 @@ while thread_data == 'running':
The `AudioStream` class captures audio data; the implementation is cross-platform, supporting Windows, Linux, and macOS. Its initializer takes two parameters:
- `audio_type`: audio capture type; 0 for system output audio (speaker), 1 for system input audio (microphone)
- `chunk_rate`: audio capture frequency, i.e. the number of audio chunks captured per second; defaults to 10
The class contains four methods:
- `open_stream()`: start audio capture
- `read_chunk() -> bytes`: read one audio chunk
- `close_stream()`: stop audio capture
- `close_stream_signal()`: thread-safe closing of the system audio input stream
Example:
@@ -97,11 +101,22 @@ stream.close_stream()
### Audio Processing
Before converting the captured audio stream to text, preprocessing may be required: multi-channel audio usually needs to be converted to mono, and resampling may also be needed. This project provides two audio processing functions:
- `merge_chunk_channels(chunk: bytes, channels: int) -> bytes`: convert a multi-channel audio chunk to a mono chunk
- `resample_chunk_mono(chunk: bytes, channels: int, orig_sr: int, target_sr: int) -> bytes`: convert a multi-channel audio chunk to mono, then resample it
Example:
```python
from sysaudio import AudioStream
from utils import resample_chunk_mono

stream = AudioStream(1)
stream.open_stream()
while True:
    raw_chunk = stream.read_chunk()
    chunk = resample_chunk_mono(raw_chunk, stream.CHANNELS, stream.RATE, 16000)
    # do something with chunk
```
## Features to Be Implemented by the Caption Engine
@@ -109,20 +124,31 @@ stream.close_stream()
After obtaining a suitable audio stream, it needs to be converted to text. Various models (cloud-based or local) are generally used for this; choose an appropriate model based on your requirements.
It is recommended to encapsulate this part as a class implementing four methods:
- `start(self)`: start the model
- `send_audio_frame(self, data: bytes)`: process the current audio chunk; **the generated caption data is sent to the Electron main process via standard output**
- `translate(self)`: continuously take data chunks from `shared_data.chunk_queue` and call `send_audio_frame` to process them
- `stop(self)`: stop the model
Complete caption engine examples:
- [gummy.py](../../engine/audio2text/gummy.py)
- [vosk.py](../../engine/audio2text/vosk.py)
- [sosv.py](../../engine/audio2text/sosv.py)
### Caption Translation
Some speech-to-text models do not provide translation. If translation is needed, an additional translation module must be added, or the project's built-in translation modules can be used.
Example:
```python
from utils import google_translate, ollama_translate
text = "这是一个翻译测试。"
google_translate("", "en", text, "time_s")
ollama_translate("qwen3:0.6b", "en", text, "time_s")
```
### Caption Data Transmission
@@ -133,37 +159,42 @@ stream.close_stream()
```typescript
export interface CaptionItem {
command: "caption",
  index: number,       // caption sequence number
  time_s: string,      // caption start time
  time_t: string,      // caption end time
  text: string,        // caption content
  translation: string  // caption translation
}
```
**注意必须确保每输出一次字幕 JSON 数据就得刷新缓冲区,确保 electron 主进程每次接收到的字符串都可以被解释为 JSON 对象。**
建议使用项目已经实现的 `stdout_obj` 函数来发送。
**注意必须确保每输出一次字幕 JSON 数据就得刷新缓冲区,确保 electron 主进程每次接收到的字符串都可以被解释为 JSON 对象。** 建议使用项目已经实现的 `stdout_obj` 函数来发送。
### Command-Line Parameter Specification
Custom caption engine settings are provided via command-line parameters, so the engine's parameters must be set up correctly. The parameters currently used by this project are:
```python
import argparse
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Convert system audio stream to text')
    # all
    parser.add_argument('-e', '--caption_engine', default='gummy', help='Caption engine: gummy or vosk')
    parser.add_argument('-a', '--audio_type', default=0, help='Audio stream source: 0 for output, 1 for input')
    parser.add_argument('-c', '--chunk_rate', default=10, help='Number of audio stream chunks collected per second')
    parser.add_argument('-p', '--port', default=0, help='The port to run the server on, 0 for no server')
    parser.add_argument('-t', '--target_language', default='zh', help='Target language code, "none" for no translation')
    parser.add_argument('-r', '--record', default=0, help='Whether to record the audio, 0 for no recording, 1 for recording')
    parser.add_argument('-rp', '--record_path', default='', help='Path to save the recorded audio')
    # gummy and sosv
    parser.add_argument('-s', '--source_language', default='auto', help='Source language code')
    # gummy only
    parser.add_argument('-k', '--api_key', default='', help='API KEY for Gummy model')
    # vosk and sosv
    parser.add_argument('-tm', '--translation_model', default='ollama', help='Model for translation: ollama or google')
    parser.add_argument('-omn', '--ollama_name', default='', help='Ollama model name for translation')
    # vosk only
    parser.add_argument('-vosk', '--vosk_model', default='', help='The path to the vosk model.')
    # sosv only
    parser.add_argument('-sosv', '--sosv_model', default=None, help='The SenseVoice model path')
```
For example, to use this project's caption engine with the Gummy model, Japanese as the source language, Chinese as the translation target, captions captured from system audio output, and 0.1 s audio chunks, the command-line arguments are:
@@ -184,7 +215,7 @@ python main.py -e gummy -s ja -t zh -a 0 -c 10 -k <dashscope-api-key>
### Development Recommendations
Except for audio-to-text conversion and translation, the other components (audio capture, audio resampling, and communication with the main process) are best reused directly from this project's code. In that case, what you need to add is:
- `engine/audio2text/`: add a new audio-to-text class (file level)
- `engine/main.py`: add the new parameter settings and workflow function (see the `main_gummy` and `main_vosk` functions)

BIN docs/img/06.png (new file)
BIN docs/img/07.png (new file)
@@ -1,6 +1,6 @@
# Auto Caption User Manual
Corresponding Version: v1.1.1
**Note: Due to limited personal resources, the English and Japanese documentation files for this project (except for the README document) will no longer be maintained. The content of this document may not be consistent with the latest version of the project. If you are willing to help with translation, please submit relevant Pull Requests.**
@@ -41,11 +41,20 @@ Alibaba Cloud provides detailed tutorials for this part, which can be referenced
- [Obtaining API KEY (Chinese)](https://help.aliyun.com/zh/model-studio/get-api-key)
- [Configuring API Key through Environment Variables (Chinese)](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables)
## Preparation for GLM Engine
You need to obtain an API KEY first, refer to: [Quick Start](https://docs.bigmodel.cn/en/guide/start/quick-start).
## Preparation for Using Vosk Engine
To use the Vosk local caption engine, first download your required model from the [Vosk Models](https://alphacephei.com/vosk/models) page. Then extract the downloaded model package locally and add the corresponding model folder path to the software settings. Currently, the Vosk caption engine does not support translated caption content.
To use the Vosk local caption engine, first download your required model from the [Vosk Models](https://alphacephei.com/vosk/models) page. Then extract the downloaded model package locally and add the corresponding model folder path to the software settings.
![](../../assets/media/vosk_en.png)
![](../../assets/media/config_en.png)
## Using SOSV Model
The way to use the SOSV model is the same as Vosk. The download address is as follows: https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model
## Capturing System Audio Output on macOS
@@ -107,7 +116,7 @@ After completing all configurations, click the "Start Caption Engine" button on
### Adjusting the Caption Display Window
The following image shows the caption display window, which displays the latest captions in real-time. The three buttons in the upper right corner of the window have the following functions: pin the window to the front, open the caption control window, and close the caption display window. The width of the window can be adjusted by moving the mouse to the left or right edge of the window and dragging the mouse.
The following image shows the caption display window, which displays the latest captions in real-time. The functions of the three buttons in the upper right corner of the window are: to close the caption display window, to open the caption control window, and to enable mouse pass-through. The width of the window can be adjusted by moving the mouse to the left or right edge of the window and dragging the mouse.
![](../img/01.png)
@@ -126,3 +135,199 @@ The software provides two default caption engines. If you need other caption eng
Note that when using a custom caption engine, all previous caption engine settings will be ineffective, and the configuration of the custom caption engine is entirely done through the engine command.
If you are a developer and want to develop a custom caption engine, please refer to the [Caption Engine Explanation Document](../engine-manual/en.md).
## Using Caption Engine Standalone
### Runtime Parameter Description
> The following content assumes users have some knowledge of running programs via terminal.
The complete set of runtime parameters available for the caption engine is shown below:
![](../img/06.png)
However, when used standalone, some parameters may not need to be used or should not be modified.
The following parameter descriptions only include necessary parameters.
#### `-e, --caption_engine`
The caption engine model to select, currently four options are available: `gummy, glm, vosk, sosv`.
The default value is `gummy`.
This applies to all models.
#### `-a, --audio_type`
The audio type to recognize, where `0` represents system audio output and `1` represents microphone audio input.
The default value is `0`.
This applies to all models.
#### `-d, --display_caption`
Whether to display captions in the console, `0` means do not display, `1` means display.
The default value is `0`, but it's recommended to choose `1` when using only the caption engine.
This applies to all models.
#### `-t, --target_language`
> Note that Vosk and SOSV models have poor sentence segmentation, which can make translated content difficult to understand. It's not recommended to use translation with these two models.
Target language for translation. All models support the following translation languages:
- `none` No translation
- `zh` Simplified Chinese
- `en` English
- `ja` Japanese
- `ko` Korean
Additionally, `vosk` and `sosv` models also support the following translations:
- `de` German
- `fr` French
- `ru` Russian
- `es` Spanish
- `it` Italian
The default value is `none`.
This applies to all models.
#### `-s, --source_language`
Source language for recognition. Default value is `auto`, meaning no specific source language.
Specifying the source language can improve recognition accuracy to some extent. You can specify the source language using the language codes above.
This applies to Gummy, GLM and SOSV models.
The Gummy model can use all the languages mentioned above, plus Cantonese (`yue`).
The GLM model supports specifying the following languages: English, Chinese, Japanese, Korean.
The SOSV model supports specifying the following languages: English, Chinese, Japanese, Korean, and Cantonese.
#### `-k, --api_key`
Specify the Alibaba Cloud API KEY required for the `Gummy` model.
Default value is empty.
This only applies to the Gummy model.
#### `-gkey, --glm_api_key`
Specifies the API KEY required for the `glm` model. The default value is empty.
#### `-gmodel, --glm_model`
Specifies the model name to be used for the `glm` model. The default value is `glm-asr-2512`.
#### `-gurl, --glm_url`
Specifies the API URL required for the `glm` model. The default value is: `https://open.bigmodel.cn/api/paas/v4/audio/transcriptions`.
#### `-tm, --translation_model`
Specify the translation method for Vosk and SOSV models. Default is `ollama`.
Supported values are:
- `ollama` Use local Ollama model for translation. Users need to install Ollama software and corresponding models
- `google` Use Google Translate API for translation. No additional configuration needed, but requires network access to Google
This only applies to Vosk and SOSV models.
#### `-omn, --ollama_name`
Specifies the name of the translation model to be used, which can be either a local Ollama model or a cloud model compatible with the OpenAI API. If the Base URL field is not filled in, the local Ollama service will be called by default; otherwise, the API service at the specified address will be invoked via the Python OpenAI library.
If using an Ollama model, it is recommended to use a model with fewer than 1B parameters, such as `qwen2.5:0.5b` or `qwen3:0.6b`. The corresponding model must be downloaded in Ollama for normal use.
The default value is empty and applies to models other than Gummy.
#### `-ourl, --ollama_url`
The base request URL for calling the OpenAI API. If left blank, the local Ollama model on the default port will be called.
The default value is empty and applies to models other than Gummy.
#### `-okey, --ollama_api_key`
Specifies the API KEY for calling OpenAI-compatible models.
The default value is empty and applies to models other than Gummy.
#### `-vosk, --vosk_model`
Specify the path to the local folder of the Vosk model to call. Default value is empty.
This only applies to the Vosk model.
#### `-sosv, --sosv_model`
Specify the path to the local folder of the SOSV model to call. Default value is empty.
This only applies to the SOSV model.
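Taken together, the necessary flags above can be mirrored in a small `argparse` sketch. This is an illustration of the documented flags and defaults, not the engine's actual `main.py` (which defines additional options):

```python
import argparse

# Illustrative parser mirroring the documented flags and defaults
parser = argparse.ArgumentParser(description='Caption engine CLI (sketch)')
parser.add_argument('-e', '--caption_engine', default='gummy',
                    choices=['gummy', 'glm', 'vosk', 'sosv'])
parser.add_argument('-a', '--audio_type', type=int, default=0,
                    help='0 = system audio output, 1 = microphone input')
parser.add_argument('-d', '--display_caption', type=int, default=0)
parser.add_argument('-s', '--source_language', default='auto')
parser.add_argument('-t', '--target_language', default='none')
parser.add_argument('-k', '--api_key', default='')
parser.add_argument('-tm', '--translation_model', default='ollama')
parser.add_argument('-omn', '--ollama_name', default='')
parser.add_argument('-vosk', '--vosk_model', default='')
parser.add_argument('-sosv', '--sosv_model', default='')

# Example: Vosk engine, translate to English via Google Translate
args = parser.parse_args(['-e', 'vosk', '-t', 'en', '-tm', 'google'])
```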
### Running Caption Engine Using Source Code
> The following content assumes users who use this method have knowledge of Python environment configuration and usage.
First, download the project source code locally. The caption engine source code is located in the `engine` directory of the project. Then configure the Python environment, where the project dependencies are listed in the `requirements.txt` file in the `engine` directory.
After configuration, enter the `engine` directory and execute commands to run the caption engine.
For example, to use the Gummy model, specify audio type as system audio output, source language as English, and target language as Chinese, execute the following command:
> Note: For better visualization, the commands below are written on multiple lines. If execution fails, try removing backslashes and executing as a single line command.
```bash
python main.py \
-e gummy \
-k sk-******************************** \
-a 0 \
-d 1 \
-s en \
-t zh
```
To specify the Vosk model, audio type as system audio output, translate to English, and use Ollama `qwen3:0.6b` model for translation:
```bash
python main.py \
-e vosk \
-vosk D:\Projects\auto-caption\engine\models\vosk-model-small-cn-0.22 \
-a 0 \
-d 1 \
-t en \
-tm ollama \
-omn qwen3:0.6b
```
To specify the SOSV model, audio type as microphone, automatically select source language, and no translation:
```bash
python main.py \
-e sosv \
-sosv D:\\Projects\\auto-caption\\engine\\models\\sosv-int8 \
-a 1 \
-d 1 \
-s auto \
-t none
```
Running result using the Gummy model is shown below:
![](../img/07.png)
### Running Caption Engine Executable File
First, download the executable file for your platform from [GitHub Releases](https://github.com/HiMeditator/auto-caption/releases/tag/engine) (currently only Windows and Linux platform executable files are provided).
Then open a terminal in the directory containing the caption engine executable file and execute commands to run the caption engine.
Simply replace `python main.py` in the above commands with the executable file name (for example: `engine-win.exe`).
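When the standalone engine is driven from another program rather than watched in a terminal, each caption arrives on stdout as one JSON object per line. The wrapper below is a hypothetical sketch of consuming that stream; the field names match the engine's caption output, but the sample line itself is fabricated for illustration:

```python
import json

def parse_engine_line(line: str):
    """Return the caption dict if the line is a caption JSON object, else None."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        return None
    # The engine also emits command objects such as 'info' or 'kill';
    # only 'caption' objects carry subtitle text.
    return obj if obj.get('command') == 'caption' else None

sample = ('{"command": "caption", "index": 0, "time_s": "12:00:00.000", '
          '"time_t": "12:00:01.250", "text": "hello world", "translation": ""}')
caption = parse_engine_line(sample)
```

In practice the lines would come from the engine process's stdout, e.g. via `subprocess.Popen(..., stdout=subprocess.PIPE)`.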

View File

@@ -1,11 +1,9 @@
# Auto Caption ユーザーマニュアル
対応バージョン:v0.6.0
対応バージョン:v1.1.1
この文書は大規模モデルを使用して翻訳されていますので、内容に正確でない部分があるかもしれません。
**注意:個人のリソースが限られているため、このプロジェクトの英語および日本語のドキュメント(README ドキュメントを除く)のメンテナンスは行われません。このドキュメントの内容は最新版のプロジェクトと一致しない場合があります。翻訳のお手伝いをしていただける場合は、関連するプルリクエストを提出してください。**
## ソフトウェアの概要
Auto Caption は、クロスプラットフォームの字幕表示ソフトウェアで、システムの音声入力(録音)または出力(音声再生)のストリーミングデータをリアルタイムで取得し、音声からテキストに変換するモデルを利用して対応する音声の字幕を生成します。このソフトウェアが提供するデフォルトの字幕エンジン(アリババクラウド Gummy モデルを使用)は、9つの言語(中国語、英語、日本語、韓国語、ドイツ語、フランス語、ロシア語、スペイン語、イタリア語)の認識と翻訳をサポートしています。
@@ -43,11 +41,19 @@ macOS プラットフォームでオーディオ出力を取得するには追
- [API KEY の取得(中国語)](https://help.aliyun.com/zh/model-studio/get-api-key)
- [環境変数を通じて API Key を設定(中国語)](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables)
## GLM エンジン使用前の準備
まず、API KEYを取得する必要があります。参考:[クイックスタート](https://docs.bigmodel.cn/en/guide/start/quick-start)。
## Voskエンジン使用前の準備
Voskローカル字幕エンジンを使用するには、まず[Vosk Models](https://alphacephei.com/vosk/models)ページから必要なモデルをダウンロードしてください。その後、ダウンロードしたモデルパッケージをローカルに解凍し、対応するモデルフォルダのパスをソフトウェア設定に追加します。現在、Vosk字幕エンジンは字幕の翻訳をサポートしていません。
Voskローカル字幕エンジンを使用するには、まず[Vosk Models](https://alphacephei.com/vosk/models)ページから必要なモデルをダウンロードしてください。その後、ダウンロードしたモデルパッケージをローカルに解凍し、対応するモデルフォルダのパスをソフトウェア設定に追加します。
![](../../assets/media/vosk_ja.png)
![](../../assets/media/config_ja.png)
## SOSVモデルの使用
SOSVモデルの使用方法はVoskと同じで、ダウンロードアドレスは以下の通りです:https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model
## macOS でのシステムオーディオ出力の取得方法
@@ -110,7 +116,7 @@ sudo yum install pulseaudio pavucontrol
### 字幕表示ウィンドウの調整
下の図は字幕表示ウィンドウです。このウィンドウは現在の最新の字幕をリアルタイムで表示します。ウィンドウの右上にある3つのボタンの機能はそれぞれ次の通りです:ウィンドウを最前面に固定する、字幕制御ウィンドウを開く、字幕表示ウィンドウを閉じる。このウィンドウの幅は調整可能です。マウスをウィンドウの左右の端に移動し、ドラッグして幅を調整します。
下の図は字幕表示ウィンドウです。このウィンドウは現在の最新の字幕をリアルタイムで表示します。ウィンドウの右上にある3つのボタンの機能はそれぞれ字幕表示ウィンドウを閉じる、字幕制御ウィンドウを開く、マウス透過を有効化することです。このウィンドウの幅は調整可能です。マウスをウィンドウの左右の端に移動し、ドラッグして幅を調整します。
![](../img/01.png)

View File

@@ -1,6 +1,6 @@
# Auto Caption 用户手册
对应版本:v0.6.0
对应版本:v1.1.1
## 软件简介
@@ -39,11 +39,19 @@ Auto Caption 是一个跨平台的字幕显示软件,能够实时获取系统
- [获取 API KEY](https://help.aliyun.com/zh/model-studio/get-api-key)
- [将 API Key 配置到环境变量](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables)
## GLM 引擎使用前准备
需要先获取 API KEY,参考:[Quick Start](https://docs.bigmodel.cn/en/guide/start/quick-start)。
## Vosk 引擎使用前准备
如果要使用 Vosk 本地字幕引擎,首先需要在 [Vosk Models](https://alphacephei.com/vosk/models) 页面下载你需要的模型。然后将下载的模型安装包解压到本地,并将对应的模型文件夹的路径添加到软件的设置中。目前 Vosk 字幕引擎还不支持翻译字幕内容。
如果要使用 Vosk 本地字幕引擎,首先需要在 [Vosk Models](https://alphacephei.com/vosk/models) 页面下载你需要的模型。然后将下载的模型安装包解压到本地,并将对应的模型文件夹的路径添加到软件的设置中。
![](../../assets/media/vosk_zh.png)
![](../../assets/media/config_zh.png)
## 使用 SOSV 模型
使用 SOSV 模型的方式和 Vosk 一样,下载地址如下:https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model
## macOS 获取系统音频输出
@@ -105,7 +113,7 @@ sudo yum install pulseaudio pavucontrol
### 调整字幕展示窗口
如下图为字幕展示窗口,该窗口实时展示当前最新字幕。窗口右上角三个按钮的功能分别是:将窗口固定在最前面、打开字幕控制窗口、关闭字幕展示窗口。该窗口宽度可以调整,将鼠标移动至窗口的左右边缘,拖动鼠标即可调整宽度。
如下图为字幕展示窗口,该窗口实时展示当前最新字幕。窗口右上角三个按钮的功能分别是:关闭字幕展示窗口、打开字幕控制窗口、启用鼠标穿透。该窗口宽度可以调整,将鼠标移动至窗口的左右边缘,拖动鼠标即可调整宽度。
![](../img/01.png)
@@ -124,3 +132,199 @@ sudo yum install pulseaudio pavucontrol
注意使用自定义字幕引擎时,前面的字幕引擎的设置将全部不起作用,自定义字幕引擎的配置完全通过引擎指令进行配置。
如果你是开发者,想开发自定义字幕引擎,请查看[字幕引擎说明文档](../engine-manual/zh.md)。
## 单独使用字幕引擎
### 运行参数说明
> 以下内容默认用户对使用终端运行程序有一定了解。
字幕引擎可以使用的完整运行参数如下:
![](../img/06.png)
而在单独使用时,其中某些参数并不需要使用,或者不适合修改。
下面的运行参数说明仅包含必要的参数。
#### `-e, --caption_engine`
需要选择的字幕引擎模型,目前有四个可用,分别为:`gummy, glm, vosk, sosv`
该项的默认值为 `gummy`
该项适用于所有模型。
#### `-a, --audio_type`
需要识别的音频类型,其中 `0` 表示系统音频输出,`1` 表示麦克风音频输入。
该项的默认值为 `0`
该项适用于所有模型。
#### `-d, --display_caption`
是否在控制台显示字幕,`0` 表示不显示,`1` 表示显示。
该项默认值为 `0`,只使用字幕引擎的话建议选 `1`
该项适用于所有模型。
#### `-t, --target_language`
> 其中 Vosk 和 SOSV 模型分句效果较差,会导致翻译内容难以理解,不太建议这两个模型使用翻译。
需要翻译成的目标语言,所有模型都支持的翻译语言如下:
- `none` 不进行翻译
- `zh` 简体中文
- `en` 英语
- `ja` 日语
- `ko` 韩语
除此之外 `vosk``sosv` 模型还支持如下翻译:
- `de` 德语
- `fr` 法语
- `ru` 俄语
- `es` 西班牙语
- `it` 意大利语
该项的默认值为 `none`
该项适用于所有模型。
#### `-s, --source_language`
需要识别的音频的源语言,默认值为 `auto`,表示不指定源语言。
但指定源语言能在一定程度上提高识别准确率,可以使用上面的语言代码指定源语言。
该项适用于 Gummy、GLM 和 SOSV 模型。
其中 Gummy 模型可以使用上述全部语言,再加上粤语(`yue`)。
GLM 模型支持指定的语言有:英语、中文、日语、韩语。
SOSV 模型支持指定的语言有:英语、中文、日语、韩语、粤语。
#### `-k, --api_key`
指定 `Gummy` 模型需要使用的阿里云 API KEY。
该项默认值为空。
该项仅适用于 Gummy 模型。
#### `-gkey, --glm_api_key`
指定 `glm` 模型需要使用的 API KEY默认为空。
#### `-gmodel, --glm_model`
指定 `glm` 模型需要使用的模型名称,默认为 `glm-asr-2512`
#### `-gurl, --glm_url`
指定 `glm` 模型需要使用的 API URL默认值为`https://open.bigmodel.cn/api/paas/v4/audio/transcriptions`
#### `-tm, --translation_model`
指定 Vosk 和 SOSV 模型的翻译方式,默认为 `ollama`
该项支持的值有:
- `ollama` 使用本地 Ollama 模型进行翻译,需要用户安装 Ollama 软件和对应的模型
- `google` 使用 Google 翻译 API 进行翻译,无需额外配置,但是需要有能访问 Google 的网络
该项仅适用于 Vosk 和 SOSV 模型。
#### `-omn, --ollama_name`
指定要使用的翻译模型名称,可以是 Ollama 本地模型,也可以是 OpenAI API 兼容的云端模型。若未填写 Base URL 字段,则默认调用本地 Ollama 服务,否则会通过 Python OpenAI 库调用该地址指向的 API 服务。
如果使用 Ollama 模型,建议使用参数量小于 1B 的模型,比如: `qwen2.5:0.5b`, `qwen3:0.6b`。需要在 Ollama 中下载了对应的模型才能正常使用。
默认值为空,适用于除了 Gummy 外的其他模型。
#### `-ourl, --ollama_url`
调用 OpenAI API 的基础请求地址,如果不填写则调用本地默认端口的 Ollama 模型。
默认值为空,适用于除了 Gummy 外的其他模型。
#### `-okey, --ollama_api_key`
指定调用 OpenAI 兼容模型的 API KEY。
默认值为空,适用于除了 Gummy 外的其他模型。
#### `-vosk, --vosk_model`
指定需要调用的 Vosk 模型的本地文件夹的路径。该项默认值为空。
该项仅适用于 Vosk 模型。
#### `-sosv, --sosv_model`
指定需要调用的 SOSV 模型的本地文件夹的路径。该项默认值为空。
该项仅适用于 SOSV 模型。
### 使用源代码运行字幕引擎
> 以下内容默认使用该方式的用户对 Python 环境配置和使用有所了解。
首先下载项目源代码到本地,其中字幕引擎源代码在项目的 `engine` 目录下。然后配置 Python 环境,其中项目依赖的 Python 包在 `engine` 目录下 `requirements.txt` 文件中。
配置好后进入 `engine` 目录,执行命令运行字幕引擎。
比如要使用 Gummy 模型,指定音频类型为系统音频输出,源语言为英语,翻译语言为中文,执行的命令如下:
> 注意:为了更直观,下面的命令写在了多行,如果执行失败,尝试去掉反斜杠,并改换单行命令执行。
```bash
python main.py \
-e gummy \
-k sk-******************************** \
-a 0 \
-d 1 \
-s en \
-t zh
```
指定 Vosk 模型,指定音频类型为系统音频输出,翻译语言为英语,使用 Ollama `qwen3:0.6b` 模型进行翻译:
```bash
python main.py \
-e vosk \
-vosk D:\Projects\auto-caption\engine\models\vosk-model-small-cn-0.22 \
-a 0 \
-d 1 \
-t en \
-tm ollama \
-omn qwen3:0.6b
```
指定 SOSV 模型,指定音频类型为麦克风,自动选择源语言,不翻译,执行的命令如下:
```bash
python main.py \
-e sosv \
-sosv D:\\Projects\\auto-caption\\engine\\models\\sosv-int8 \
-a 1 \
-d 1 \
-s auto \
-t none
```
使用 Gummy 模型的运行效果如下:
![](../img/07.png)
### 运行字幕引擎可执行文件
首先在 [GitHub Release](https://github.com/HiMeditator/auto-caption/releases/tag/engine) 中下载对应平台的可执行文件(目前仅提供 Windows 和 Linux 平台的字幕引擎可执行文件)。
然后在字幕引擎可执行文件所在目录打开终端,执行命令运行字幕引擎。
只需要将上述指令中的 `python main.py` 替换为可执行文件名称即可(比如:`engine-win.exe`)。

View File

@@ -1,5 +1,5 @@
appId: com.himeditator.autocaption
productName: auto-caption
productName: Auto Caption
directories:
buildResources: build
files:
@@ -13,13 +13,15 @@ files:
- '!engine/*'
- '!docs/*'
- '!assets/*'
- '!.repomap/*'
- '!.virtualme/*'
extraResources:
# For Windows
- from: ./engine/dist/main.exe
to: ./engine/main.exe
# For macOS and Linux
# - from: ./engine/dist/main
# to: ./engine/main
- from: ./engine/dist/main
to: ./engine/main
win:
executableName: auto-caption
icon: build/icon.png

View File

@@ -1,3 +1,4 @@
from dashscope.common.error import InvalidParameter
from .gummy import GummyRecognizer
from .vosk import VoskRecognizer
from .sosv import SosvRecognizer
from .glm import GlmRecognizer

163
engine/audio2text/glm.py Normal file
View File

@@ -0,0 +1,163 @@
import threading
import io
import wave
import struct
import math
import audioop
import requests
from datetime import datetime
from utils import shared_data
from utils import stdout_cmd, stdout_obj, google_translate, ollama_translate
class GlmRecognizer:
"""
使用 GLM-ASR 引擎处理音频数据,并在标准输出中输出 Auto Caption 软件可读取的 JSON 字符串数据
初始化参数:
url: GLM-ASR API URL
model: GLM-ASR 模型名称
api_key: GLM-ASR API Key
source: 源语言
target: 目标语言
trans_model: 翻译模型名称
ollama_name: Ollama 模型名称
"""
def __init__(self, url: str, model: str, api_key: str, source: str, target: str | None, trans_model: str, ollama_name: str, ollama_url: str = '', ollama_api_key: str = ''):
self.url = url
self.model = model
self.api_key = api_key
self.source = source
self.target = target
if trans_model == 'google':
self.trans_func = google_translate
else:
self.trans_func = ollama_translate
self.ollama_name = ollama_name
self.ollama_url = ollama_url
self.ollama_api_key = ollama_api_key
self.audio_buffer = []
self.is_speech = False
self.silence_frames = 0
self.speech_start_time = None
self.time_str = ''
self.cur_id = 0
# VAD settings (假设 16k 16bit, chunk size 1024 or similar)
# 16bit = 2 bytes per sample.
# RMS threshold needs tuning. 500 is a conservative guess for silence.
self.threshold = 500
self.silence_limit = 15 # frames (approx 0.5-1s depending on chunk size)
self.min_speech_frames = 10 # frames
def start(self):
"""启动 GLM 引擎"""
stdout_cmd('info', 'GLM-ASR recognizer started.')
def stop(self):
"""停止 GLM 引擎"""
stdout_cmd('info', 'GLM-ASR recognizer stopped.')
def process_audio(self, chunk):
# chunk is bytes (int16)
rms = audioop.rms(chunk, 2)
if rms > self.threshold:
if not self.is_speech:
self.is_speech = True
self.time_str = datetime.now().strftime('%H:%M:%S.%f')[:-3]
self.audio_buffer = []
self.audio_buffer.append(chunk)
self.silence_frames = 0
else:
if self.is_speech:
self.audio_buffer.append(chunk)
self.silence_frames += 1
if self.silence_frames > self.silence_limit:
# Speech ended
if len(self.audio_buffer) > self.min_speech_frames:
self.recognize(self.audio_buffer, self.time_str)
self.is_speech = False
self.audio_buffer = []
self.silence_frames = 0
def recognize(self, audio_frames, time_s):
audio_bytes = b''.join(audio_frames)
wav_io = io.BytesIO()
with wave.open(wav_io, 'wb') as wav_file:
wav_file.setnchannels(1)
wav_file.setsampwidth(2)
wav_file.setframerate(16000)
wav_file.writeframes(audio_bytes)
wav_io.seek(0)
threading.Thread(
target=self._do_request,
args=(wav_io.read(), time_s, self.cur_id)
).start()
self.cur_id += 1
def _do_request(self, audio_content, time_s, index):
try:
files = {
'file': ('audio.wav', audio_content, 'audio/wav')
}
data = {
'model': self.model,
'stream': 'false'
}
headers = {
'Authorization': f'Bearer {self.api_key}'
}
response = requests.post(self.url, headers=headers, data=data, files=files, timeout=15)
if response.status_code == 200:
res_json = response.json()
text = res_json.get('text', '')
if text:
self.output_caption(text, time_s, index)
else:
try:
err_msg = response.json()
stdout_cmd('error', f"GLM API Error: {err_msg}")
except:
stdout_cmd('error', f"GLM API Error: {response.text}")
except Exception as e:
stdout_cmd('error', f"GLM Request Failed: {str(e)}")
def output_caption(self, text, time_s, index):
caption = {
'command': 'caption',
'index': index,
'time_s': time_s,
'time_t': datetime.now().strftime('%H:%M:%S.%f')[:-3],
'text': text,
'translation': ''
}
if self.target:
if self.trans_func == ollama_translate:
th = threading.Thread(
target=self.trans_func,
args=(self.ollama_name, self.target, caption['text'], time_s, self.ollama_url, self.ollama_api_key),
daemon=True
)
else:
th = threading.Thread(
target=self.trans_func,
args=(self.ollama_name, self.target, caption['text'], time_s),
daemon=True
)
th.start()
stdout_obj(caption)
def translate(self):
global shared_data
while shared_data.status == 'running':
chunk = shared_data.chunk_queue.get()
self.process_audio(chunk)
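The RMS gate that `process_audio` builds on can be sketched with the standard library alone. Note that `audioop.rms` (used above) was deprecated and removed in Python 3.13, so an equivalent is shown here, with the same conservative threshold of 500 from the class defaults:

```python
import math
import struct

THRESHOLD = 500  # mirrors the class default; a tuning guess for "silence"

def rms_int16(chunk: bytes) -> float:
    """Root-mean-square of little-endian signed 16-bit PCM samples."""
    n = len(chunk) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f'<{n}h', chunk)
    return math.sqrt(sum(s * s for s in samples) / n)

def is_speech(chunk: bytes) -> bool:
    # Same decision process_audio makes before buffering a chunk
    return rms_int16(chunk) > THRESHOLD

silence = struct.pack('<4h', 0, 0, 0, 0)
voiced = struct.pack('<4h', 3000, -3000, 3000, -3000)
```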

View File

@@ -5,9 +5,10 @@ from dashscope.audio.asr import (
TranslationRecognizerRealtime
)
import dashscope
from dashscope.common.error import InvalidParameter
from datetime import datetime
from utils import stdout_cmd, stdout_obj, stderr
from utils import stdout_cmd, stdout_obj, stdout_err
from utils import shared_data
class Callback(TranslationRecognizerCallback):
"""
@@ -90,9 +91,23 @@ class GummyRecognizer:
"""启动 Gummy 引擎"""
self.translator.start()
def send_audio_frame(self, data):
"""发送音频帧,擎将自动识别将识别结果输出到标准输出中"""
self.translator.send_audio_frame(data)
def translate(self):
"""持续读取共享数据中的音频帧,并进行语音识别将识别结果输出到标准输出中"""
global shared_data
restart_count = 0
while shared_data.status == 'running':
chunk = shared_data.chunk_queue.get()
try:
self.translator.send_audio_frame(chunk)
except InvalidParameter as e:
restart_count += 1
if restart_count > 5:
stdout_err(str(e))
shared_data.status = "kill"
stdout_cmd('kill')
break
else:
stdout_cmd('info', f'Gummy engine stopped, restart attempt: {restart_count}...')
def stop(self):
"""停止 Gummy 引擎"""

178
engine/audio2text/sosv.py Normal file
View File

@@ -0,0 +1,178 @@
"""
Sherpa-ONNX SenseVoice Model
This code file references the following:
https://github.com/k2-fsa/sherpa-onnx/blob/master/python-api-examples/simulate-streaming-sense-voice-microphone.py
"""
import time
from datetime import datetime
import sherpa_onnx
import threading
import numpy as np
from utils import shared_data
from utils import stdout_cmd, stdout_obj
from utils import google_translate, ollama_translate
class SosvRecognizer:
"""
使用 Sense Voice 非流式模型处理流式音频数据,并在标准输出中输出 Auto Caption 软件可读取的 JSON 字符串数据
初始化参数:
model_path: Sherpa-ONNX SenseVoice 识别模型路径
vad_model: Silero VAD 模型路径
source: 识别源语言(auto, zh, en, ja, ko, yue)
target: 翻译目标语言
trans_model: 翻译模型名称
ollama_name: Ollama 模型名称
"""
def __init__(self, model_path: str, source: str, target: str | None, trans_model: str, ollama_name: str, ollama_url: str = '', ollama_api_key: str = ''):
if model_path.startswith('"'):
model_path = model_path[1:]
if model_path.endswith('"'):
model_path = model_path[:-1]
self.model_path = model_path
self.ext = ""
if self.model_path[-4:] == "int8":
self.ext = ".int8"
self.source = source
self.target = target
if trans_model == 'google':
self.trans_func = google_translate
else:
self.trans_func = ollama_translate
self.ollama_name = ollama_name
self.ollama_url = ollama_url
self.ollama_api_key = ollama_api_key
self.time_str = ''
self.cur_id = 0
self.prev_content = ''
def start(self):
"""启动 Sense Voice 模型"""
self.recognizer = sherpa_onnx.OfflineRecognizer.from_sense_voice(
model=f"{self.model_path}/sensevoice/model{self.ext}.onnx",
tokens=f"{self.model_path}/sensevoice/tokens.txt",
language=self.source,
num_threads = 2,
)
vad_config = sherpa_onnx.VadModelConfig()
vad_config.silero_vad.model = f"{self.model_path}/silero_vad.onnx"
vad_config.silero_vad.threshold = 0.5
vad_config.silero_vad.min_silence_duration = 0.1
vad_config.silero_vad.min_speech_duration = 0.25
vad_config.silero_vad.max_speech_duration = 5
vad_config.sample_rate = 16000
self.window_size = vad_config.silero_vad.window_size
self.vad = sherpa_onnx.VoiceActivityDetector(vad_config, buffer_size_in_seconds=100)
if self.source == 'en':
model_config = sherpa_onnx.OnlinePunctuationModelConfig(
cnn_bilstm=f"{self.model_path}/punct-en/model{self.ext}.onnx",
bpe_vocab=f"{self.model_path}/punct-en/bpe.vocab"
)
punct_config = sherpa_onnx.OnlinePunctuationConfig(
model_config=model_config,
)
self.punct = sherpa_onnx.OnlinePunctuation(punct_config)
else:
punct_config = sherpa_onnx.OfflinePunctuationConfig(
model=sherpa_onnx.OfflinePunctuationModelConfig(
ct_transformer=f"{self.model_path}/punct/model{self.ext}.onnx"
),
)
self.punct = sherpa_onnx.OfflinePunctuation(punct_config)
self.buffer = []
self.offset = 0
self.started = False
self.started_time = .0
self.time_str = datetime.now().strftime('%H:%M:%S.%f')[:-3]
stdout_cmd('info', 'Sherpa-ONNX SenseVoice recognizer started.')
def send_audio_frame(self, data: bytes):
"""
发送音频帧给 SOSV 引擎,引擎将自动识别并将识别结果输出到标准输出中
Args:
data: 音频帧数据,采样率必须为 16000Hz
"""
caption = {}
caption['command'] = 'caption'
caption['translation'] = ''
data_np = np.frombuffer(data, dtype=np.int16).astype(np.float32)
self.buffer = np.concatenate([self.buffer, data_np])
while self.offset + self.window_size < len(self.buffer):
self.vad.accept_waveform(self.buffer[self.offset: self.offset + self.window_size])
if not self.started and self.vad.is_speech_detected():
self.started = True
self.started_time = time.time()
self.offset += self.window_size
if not self.started:
if len(self.buffer) > 10 * self.window_size:
self.offset -= len(self.buffer) - 10 * self.window_size
self.buffer = self.buffer[-10 * self.window_size:]
if self.started and time.time() - self.started_time > 0.2:
stream = self.recognizer.create_stream()
stream.accept_waveform(16000, self.buffer)
self.recognizer.decode_stream(stream)
text = stream.result.text.strip()
if text and self.prev_content != text:
caption['index'] = self.cur_id
caption['text'] = text
caption['time_s'] = self.time_str
caption['time_t'] = datetime.now().strftime('%H:%M:%S.%f')[:-3]
self.prev_content = text
stdout_obj(caption)
self.started_time = time.time()
while not self.vad.empty():
stream = self.recognizer.create_stream()
stream.accept_waveform(16000, self.vad.front.samples)
self.vad.pop()
self.recognizer.decode_stream(stream)
text = stream.result.text.strip()
if self.source == 'en':
text_with_punct = self.punct.add_punctuation_with_case(text)
else:
text_with_punct = self.punct.add_punctuation(text)
caption['index'] = self.cur_id
caption['text'] = text_with_punct
caption['time_s'] = self.time_str
caption['time_t'] = datetime.now().strftime('%H:%M:%S.%f')[:-3]
if text:
stdout_obj(caption)
if self.target:
th = threading.Thread(
target=self.trans_func,
args=(self.ollama_name, self.target, caption['text'], self.time_str, self.ollama_url, self.ollama_api_key),
daemon=True
)
th.start()
self.cur_id += 1
self.prev_content = ''
self.time_str = datetime.now().strftime('%H:%M:%S.%f')[:-3]
self.buffer = []
self.offset = 0
self.started = False
self.started_time = .0
def translate(self):
"""持续读取共享数据中的音频帧,并进行语音识别,将识别结果输出到标准输出中"""
global shared_data
while shared_data.status == 'running':
chunk = shared_data.chunk_queue.get()
self.send_audio_frame(chunk)
def stop(self):
"""停止 Sense Voice 模型"""
stdout_cmd('info', 'Sherpa-ONNX SenseVoice recognizer closed.')
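The buffer/offset bookkeeping in `send_audio_frame` reduces to fixed-size sliding consumption of accumulated samples. A minimal pure-Python sketch of that loop (list-based here, while the class uses NumPy arrays):

```python
def consume_windows(buffer, offset, window_size):
    """Hand out full windows starting at offset, mirroring the strict
    `offset + window_size < len(buffer)` condition used by the recognizer;
    a partial tail stays in the buffer for the next audio frame."""
    windows = []
    while offset + window_size < len(buffer):
        windows.append(buffer[offset: offset + window_size])
        offset += window_size
    return windows, offset

# Ten samples with a window of three leave a one-sample tail unconsumed
windows, offset = consume_windows(list(range(10)), 0, 3)
```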

View File

@@ -1,8 +1,11 @@
import json
import threading
import time
from datetime import datetime
from vosk import Model, KaldiRecognizer, SetLogLevel
from utils import stdout_cmd, stdout_obj
from utils import shared_data
from utils import stdout_cmd, stdout_obj, google_translate, ollama_translate
class VoskRecognizer:
@@ -11,14 +14,25 @@ class VoskRecognizer:
初始化参数:
model_path: Vosk 识别模型路径
target: 翻译目标语言
trans_model: 翻译模型名称
ollama_name: Ollama 模型名称
"""
def __init__(self, model_path: str):
def __init__(self, model_path: str, target: str | None, trans_model: str, ollama_name: str, ollama_url: str = '', ollama_api_key: str = ''):
SetLogLevel(-1)
if model_path.startswith('"'):
model_path = model_path[1:]
if model_path.endswith('"'):
model_path = model_path[:-1]
self.model_path = model_path
self.target = target
if trans_model == 'google':
self.trans_func = google_translate
else:
self.trans_func = ollama_translate
self.ollama_name = ollama_name
self.ollama_url = ollama_url
self.ollama_api_key = ollama_api_key
self.time_str = ''
self.cur_id = 0
self.prev_content = ''
@@ -48,7 +62,16 @@ class VoskRecognizer:
caption['time_s'] = self.time_str
caption['time_t'] = datetime.now().strftime('%H:%M:%S.%f')[:-3]
self.prev_content = ''
if content == '': return
self.cur_id += 1
if self.target:
th = threading.Thread(
target=self.trans_func,
args=(self.ollama_name, self.target, caption['text'], self.time_str, self.ollama_url, self.ollama_api_key),
daemon=True
)
th.start()
else:
content = json.loads(self.recognizer.PartialResult()).get('partial', '')
if content == '' or content == self.prev_content:
@@ -63,6 +86,13 @@ class VoskRecognizer:
stdout_obj(caption)
def translate(self):
"""持续读取共享数据中的音频帧,并进行语音识别,将识别结果输出到标准输出中"""
global shared_data
while shared_data.status == 'running':
chunk = shared_data.chunk_queue.get()
self.send_audio_frame(chunk)
def stop(self):
"""停止 Vosk 引擎"""
stdout_cmd('info', 'Vosk recognizer closed.')

View File

@@ -1,89 +1,223 @@
import wave
import argparse
from utils import stdout_cmd, stdout_err
from utils import thread_data, start_server
import threading
import datetime
from utils import stdout, stdout_cmd, change_caption_display
from utils import shared_data, start_server
from utils import merge_chunk_channels, resample_chunk_mono
from audio2text import InvalidParameter, GummyRecognizer
from audio2text import GummyRecognizer
from audio2text import VoskRecognizer
from audio2text import SosvRecognizer
from audio2text import GlmRecognizer
from sysaudio import AudioStream
def main_gummy(s: str, t: str, a: int, c: int, k: str):
global thread_data
def audio_recording(stream: AudioStream, resample: bool, record = False, path = ''):
global shared_data
stream.open_stream()
wf = None
full_name = ''
if record:
if path != '':
if path.startswith('"') and path.endswith('"'):
path = path[1:-1]
if path[-1] != '/':
path += '/'
cur_dt = datetime.datetime.now()
name = cur_dt.strftime("audio-%Y-%m-%dT%H-%M-%S")
full_name = f'{path}{name}.wav'
wf = wave.open(full_name, 'wb')
wf.setnchannels(stream.CHANNELS)
wf.setsampwidth(stream.SAMP_WIDTH)
wf.setframerate(stream.RATE)
stdout_cmd("info", "Audio recording...")
while shared_data.status == 'running':
raw_chunk = stream.read_chunk()
if raw_chunk is None: continue
if record: wf.writeframes(raw_chunk) # type: ignore
if resample:
chunk = resample_chunk_mono(raw_chunk, stream.CHANNELS, stream.RATE, 16000)
else:
chunk = merge_chunk_channels(raw_chunk, stream.CHANNELS)
shared_data.chunk_queue.put(chunk)
if record:
stdout_cmd("info", f"Audio saved to {full_name}")
wf.close() # type: ignore
stream.close_stream_signal()
def main_gummy(s: str, t: str, a: int, c: int, k: str, r: bool, rp: str):
"""
Parameters:
s: Source language
t: Target language
k: Aliyun Bailian API key
r: Whether to record the audio
rp: Path to save the recorded audio
"""
stream = AudioStream(a, c)
if t == 'none':
engine = GummyRecognizer(stream.RATE, s, None, k)
else:
engine = GummyRecognizer(stream.RATE, s, t, k)
stream.open_stream()
engine.start()
chunk_mono = bytes()
restart_count = 0
while thread_data.status == "running":
try:
chunk = stream.read_chunk()
if chunk is None: continue
chunk_mono = merge_chunk_channels(chunk, stream.CHANNELS)
try:
engine.send_audio_frame(chunk_mono)
except InvalidParameter as e:
restart_count += 1
if restart_count > 5:
stdout_err(str(e))
thread_data.status = "kill"
stdout_cmd('kill')
break
else:
stdout_cmd('info', f'Gummy engine stopped, restart attempt: {restart_count}...')
except KeyboardInterrupt:
break
engine.send_audio_frame(chunk_mono)
stream.close_stream()
stream_thread = threading.Thread(
target=audio_recording,
args=(stream, False, r, rp),
daemon=True
)
stream_thread.start()
try:
engine.translate()
except KeyboardInterrupt:
stdout("Keyboard interrupt detected. Exiting...")
engine.stop()
def main_vosk(a: int, c: int, m: str):
global thread_data
def main_vosk(a: int, c: int, vosk: str, t: str, tm: str, omn: str, ourl: str, okey: str, r: bool, rp: str):
"""
Parameters:
a: Audio source: 0 for output, 1 for input
c: Chunk number in 1 second
vosk: Vosk model path
t: Target language
tm: Translation model type, ollama or google
omn: Ollama model name
ourl: Ollama Base URL
okey: Ollama API Key
r: Whether to record the audio
rp: Path to save the recorded audio
"""
stream = AudioStream(a, c)
engine = VoskRecognizer(m)
if t == 'none':
engine = VoskRecognizer(vosk, None, tm, omn, ourl, okey)
else:
engine = VoskRecognizer(vosk, t, tm, omn, ourl, okey)
stream.open_stream()
engine.start()
while thread_data.status == "running":
try:
chunk = stream.read_chunk()
if chunk is None: continue
chunk_mono = resample_chunk_mono(chunk, stream.CHANNELS, stream.RATE, 16000)
engine.send_audio_frame(chunk_mono)
except KeyboardInterrupt:
break
stream.close_stream()
stream_thread = threading.Thread(
target=audio_recording,
args=(stream, True, r, rp),
daemon=True
)
stream_thread.start()
try:
engine.translate()
except KeyboardInterrupt:
stdout("Keyboard interrupt detected. Exiting...")
engine.stop()
def main_sosv(a: int, c: int, sosv: str, s: str, t: str, tm: str, omn: str, ourl: str, okey: str, r: bool, rp: str):
"""
Parameters:
a: Audio source: 0 for output, 1 for input
c: Chunk number in 1 second
sosv: Sherpa-ONNX SenseVoice model path
s: Source language
t: Target language
tm: Translation model type, ollama or google
omn: Ollama model name
ourl: Ollama API URL
okey: Ollama API Key
r: Whether to record the audio
rp: Path to save the recorded audio
"""
stream = AudioStream(a, c)
if t == 'none':
engine = SosvRecognizer(sosv, s, None, tm, omn, ourl, okey)
else:
engine = SosvRecognizer(sosv, s, t, tm, omn, ourl, okey)
engine.start()
stream_thread = threading.Thread(
target=audio_recording,
args=(stream, True, r, rp),
daemon=True
)
stream_thread.start()
try:
engine.translate()
except KeyboardInterrupt:
stdout("Keyboard interrupt detected. Exiting...")
engine.stop()
def main_glm(a: int, c: int, url: str, model: str, key: str, s: str, t: str, tm: str, omn: str, ourl: str, okey: str, r: bool, rp: str):
"""
Parameters:
a: Audio source
c: Chunk rate
url: GLM API URL
model: GLM Model Name
key: GLM API Key
s: Source language
t: Target language
tm: Translation model
omn: Ollama model name
ourl: Ollama API URL
okey: Ollama API Key
r: Record
rp: Record path
"""
stream = AudioStream(a, c)
if t == 'none':
engine = GlmRecognizer(url, model, key, s, None, tm, omn, ourl, okey)
else:
engine = GlmRecognizer(url, model, key, s, t, tm, omn, ourl, okey)
engine.start()
stream_thread = threading.Thread(
target=audio_recording,
args=(stream, True, r, rp),
daemon=True
)
stream_thread.start()
try:
engine.translate()
except KeyboardInterrupt:
stdout("Keyboard interrupt detected. Exiting...")
engine.stop()
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='Convert system audio stream to text')
# both
parser.add_argument('-e', '--caption_engine', default='gummy', help='Caption engine: gummy or vosk')
parser.add_argument('-a', '--audio_type', default=0, help='Audio stream source: 0 for output, 1 for input')
parser.add_argument('-c', '--chunk_rate', default=10, help='Number of audio stream chunks collected per second')
parser.add_argument('-p', '--port', default=8080, help='The port to run the server on, 0 for no server')
# all
parser.add_argument('-e', '--caption_engine', default='gummy', help='Caption engine: gummy, glm, vosk or sosv')
parser.add_argument('-a', '--audio_type', type=int, default=0, help='Audio stream source: 0 for output, 1 for input')
parser.add_argument('-c', '--chunk_rate', type=int, default=10, help='Number of audio stream chunks collected per second')
parser.add_argument('-p', '--port', type=int, default=0, help='The port to run the server on, 0 for no server')
parser.add_argument('-d', '--display_caption', type=int, default=0, help='Display caption on terminal, 0 for no display, 1 for display')
parser.add_argument('-t', '--target_language', default='none', help='Target language code, "none" for no translation')
parser.add_argument('-r', '--record', type=int, default=0, help='Whether to record the audio, 0 for no recording, 1 for recording')
parser.add_argument('-rp', '--record_path', default='', help='Path to save the recorded audio')
# gummy and sosv and glm
parser.add_argument('-s', '--source_language', default='auto', help='Source language code')
# gummy only
parser.add_argument('-s', '--source_language', default='en', help='Source language code')
parser.add_argument('-t', '--target_language', default='zh', help='Target language code')
parser.add_argument('-k', '--api_key', default='', help='API KEY for Gummy model')
# vosk and sosv
parser.add_argument('-tm', '--translation_model', default='ollama', help='Model for translation: ollama or google')
parser.add_argument('-omn', '--ollama_name', default='', help='Ollama model name for translation')
parser.add_argument('-ourl', '--ollama_url', default='', help='Ollama API URL')
parser.add_argument('-okey', '--ollama_api_key', default='', help='Ollama API Key')
# vosk only
parser.add_argument('-m', '--model_path', default='', help='The path to the vosk model.')
parser.add_argument('-vosk', '--vosk_model', default='', help='The path to the vosk model.')
# sosv only
parser.add_argument('-sosv', '--sosv_model', default=None, help='The SenseVoice model path')
# glm only
parser.add_argument('-gurl', '--glm_url', default='https://open.bigmodel.cn/api/paas/v4/audio/transcriptions', help='GLM API URL')
parser.add_argument('-gmodel', '--glm_model', default='glm-asr-2512', help='GLM Model Name')
parser.add_argument('-gkey', '--glm_api_key', default='', help='GLM API Key')
args = parser.parse_args()
if int(args.port) == 0:
thread_data.status = "running"
else:
start_server(int(args.port))
if args.port != 0:
threading.Thread(target=start_server, args=(args.port,), daemon=True).start()
if args.display_caption == 1:
change_caption_display(True)
if args.caption_engine == 'gummy':
main_gummy(
@@ -91,16 +225,55 @@ if __name__ == "__main__":
args.target_language,
int(args.audio_type),
int(args.chunk_rate),
args.api_key
args.api_key,
bool(int(args.record)),
args.record_path
)
elif args.caption_engine == 'vosk':
main_vosk(
int(args.audio_type),
int(args.chunk_rate),
args.model_path
args.vosk_model,
args.target_language,
args.translation_model,
args.ollama_name,
args.ollama_url,
args.ollama_api_key,
bool(int(args.record)),
args.record_path
)
elif args.caption_engine == 'sosv':
main_sosv(
int(args.audio_type),
int(args.chunk_rate),
args.sosv_model,
args.source_language,
args.target_language,
args.translation_model,
args.ollama_name,
args.ollama_url,
args.ollama_api_key,
bool(int(args.record)),
args.record_path
)
elif args.caption_engine == 'glm':
main_glm(
int(args.audio_type),
int(args.chunk_rate),
args.glm_url,
args.glm_model,
args.glm_api_key,
args.source_language,
args.target_language,
args.translation_model,
args.ollama_name,
args.ollama_url,
args.ollama_api_key,
bool(int(args.record)),
args.record_path
)
else:
raise ValueError('Invalid caption engine specified.')
if thread_data.status == "kill":
if shared_data.status == "kill":
stdout_cmd('kill')


@@ -6,7 +6,12 @@ import sys
if sys.platform == 'win32':
vosk_path = str(Path('./.venv/Lib/site-packages/vosk').resolve())
else:
vosk_path = str(Path('./.venv/lib/python3.12/site-packages/vosk').resolve())
venv_lib = Path('./.venv/lib')
python_dirs = list(venv_lib.glob('python*'))
if python_dirs:
vosk_path = str((python_dirs[0] / 'site-packages' / 'vosk').resolve())
else:
vosk_path = str(Path('./.venv/lib/python3.12/site-packages/vosk').resolve())
a = Analysis(
['main.py'],


@@ -1,7 +1,12 @@
dashscope
numpy
samplerate
resampy
vosk
pyinstaller
pyaudio; sys_platform == 'darwin'
pyaudiowpatch; sys_platform == 'win32'
googletrans
ollama
sherpa_onnx
requests
openai


@@ -37,14 +37,13 @@ class AudioStream:
self.FORMAT = pyaudio.paInt16
self.SAMP_WIDTH = pyaudio.get_sample_size(self.FORMAT)
self.CHANNELS = int(self.device["maxInputChannels"])
self.RATE = int(self.device["defaultSampleRate"])
self.CHUNK = self.RATE // chunk_rate
self.DEFAULT_RATE = int(self.device["defaultSampleRate"])
self.CHUNK_RATE = chunk_rate
def reset_chunk_size(self, chunk_size: int):
"""
Reset the audio chunk size
"""
self.CHUNK = chunk_size
self.RATE = 16000
self.CHUNK = self.RATE // self.CHUNK_RATE
self.open_stream()
self.close_stream()
def get_info(self):
dev_info = f"""
@@ -72,16 +71,27 @@ class AudioStream:
Open and return the system audio output stream
"""
if self.stream: return self.stream
self.stream = self.mic.open(
format = self.FORMAT,
channels = int(self.CHANNELS),
rate = self.RATE,
input = True,
input_device_index = int(self.INDEX)
)
try:
self.stream = self.mic.open(
format = self.FORMAT,
channels = int(self.CHANNELS),
rate = self.RATE,
input = True,
input_device_index = int(self.INDEX)
)
except OSError:
self.RATE = self.DEFAULT_RATE
self.CHUNK = self.RATE // self.CHUNK_RATE
self.stream = self.mic.open(
format = self.FORMAT,
channels = int(self.CHANNELS),
rate = self.RATE,
input = True,
input_device_index = int(self.INDEX)
)
return self.stream
def read_chunk(self):
def read_chunk(self) -> bytes | None:
"""
Read audio data
"""
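The try/except OSError fallback added to open_stream (try 16 kHz first, revert to the device's default rate if the driver refuses) can be exercised without real audio hardware. FakeMic and open_with_fallback below are hypothetical stand-ins for the pyaudio device, not the project's API:

```python
class FakeMic:
    """Stand-in for a pyaudio device that only accepts its native sample rate."""
    def __init__(self, native_rate: int):
        self.native_rate = native_rate

    def open(self, rate: int) -> str:
        if rate != self.native_rate:
            raise OSError(f"rate {rate} not supported")
        return f"stream@{rate}"

def open_with_fallback(mic, preferred: int = 16000, default: int = 48000,
                       chunk_rate: int = 10):
    """Try the preferred rate first; fall back to the device default on OSError."""
    try:
        rate = preferred
        stream = mic.open(rate=rate)
    except OSError:
        rate = default
        stream = mic.open(rate=rate)
    # CHUNK is recomputed from whichever rate actually worked
    return stream, rate // chunk_rate

stream, chunk = open_with_fallback(FakeMic(48000))
print(stream, chunk)  # stream@48000 4800
```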


@@ -55,15 +55,10 @@ class AudioStream:
self.FORMAT = 16
self.SAMP_WIDTH = 2
self.CHANNELS = 2
self.RATE = 48000
self.RATE = 16000
self.CHUNK_RATE = chunk_rate
self.CHUNK = self.RATE // chunk_rate
def reset_chunk_size(self, chunk_size: int):
"""
Reset the audio chunk size
"""
self.CHUNK = chunk_size
def get_info(self):
dev_info = f"""
Audio capture process:
@@ -84,7 +79,7 @@ class AudioStream:
启动音频捕获进程
"""
self.process = subprocess.Popen(
["parec", "-d", self.source, "--format=s16le", "--rate=48000", "--channels=2"],
["parec", "-d", self.source, "--format=s16le", "--rate=16000", "--channels=2"],
stdout=subprocess.PIPE
)


@@ -61,14 +61,13 @@ class AudioStream:
self.FORMAT = pyaudio.paInt16
self.SAMP_WIDTH = pyaudio.get_sample_size(self.FORMAT)
self.CHANNELS = int(self.device["maxInputChannels"])
self.RATE = int(self.device["defaultSampleRate"])
self.CHUNK = self.RATE // chunk_rate
self.DEFAULT_RATE = int(self.device["defaultSampleRate"])
self.CHUNK_RATE = chunk_rate
def reset_chunk_size(self, chunk_size: int):
"""
Reset the audio chunk size
"""
self.CHUNK = chunk_size
self.RATE = 16000
self.CHUNK = self.RATE // self.CHUNK_RATE
self.open_stream()
self.close_stream()
def get_info(self):
dev_info = f"""
@@ -96,13 +95,24 @@ class AudioStream:
Open and return the system audio output stream
"""
if self.stream: return self.stream
self.stream = self.mic.open(
format = self.FORMAT,
channels = self.CHANNELS,
rate = self.RATE,
input = True,
input_device_index = self.INDEX
)
try:
self.stream = self.mic.open(
format = self.FORMAT,
channels = self.CHANNELS,
rate = self.RATE,
input = True,
input_device_index = self.INDEX
)
except OSError:
self.RATE = self.DEFAULT_RATE
self.CHUNK = self.RATE // self.CHUNK_RATE
self.stream = self.mic.open(
format = self.FORMAT,
channels = self.CHANNELS,
rate = self.RATE,
input = True,
input_device_index = self.INDEX
)
return self.stream
def read_chunk(self) -> bytes | None:


@@ -1,9 +1,6 @@
from .audioprcs import (
merge_chunk_channels,
resample_chunk_mono,
resample_chunk_mono_np,
resample_mono_chunk
)
from .audioprcs import merge_chunk_channels, resample_chunk_mono
from .sysout import stdout, stdout_err, stdout_cmd, stdout_obj, stderr
from .thdata import thread_data
from .server import start_server
from .sysout import change_caption_display
from .shared import shared_data
from .server import start_server
from .translation import ollama_translate, google_translate


@@ -1,4 +1,4 @@
import samplerate
import resampy
import numpy as np
import numpy.core.multiarray # do not remove
@@ -24,16 +24,15 @@ def merge_chunk_channels(chunk: bytes, channels: int) -> bytes:
return chunk_mono.tobytes()
def resample_chunk_mono(chunk: bytes, channels: int, orig_sr: int, target_sr: int, mode="sinc_best") -> bytes:
def resample_chunk_mono(chunk: bytes, channels: int, orig_sr: int, target_sr: int) -> bytes:
"""
Convert the current multi-channel audio chunk to mono, then resample it
Convert the current multi-channel audio chunk to mono and resample it
Args:
chunk: multi-channel audio chunk
channels: number of channels
orig_sr: original sample rate
target_sr: target sample rate
mode: resampling mode, one of 'sinc_best' | 'sinc_medium' | 'sinc_fastest' | 'zero_order_hold' | 'linear'
Return:
mono audio chunk
@@ -49,60 +48,17 @@ def resample_chunk_mono(chunk: bytes, channels: int, orig_sr: int, target_sr: in
# (length,)
chunk_mono = np.mean(chunk_np.astype(np.float32), axis=1)
ratio = target_sr / orig_sr
chunk_mono_r = samplerate.resample(chunk_mono, ratio, converter_type=mode)
if orig_sr == target_sr:
return chunk_mono.astype(np.int16).tobytes()
chunk_mono_r = resampy.resample(chunk_mono, orig_sr, target_sr)
chunk_mono_r = np.round(chunk_mono_r).astype(np.int16)
return chunk_mono_r.tobytes()
def resample_chunk_mono_np(chunk: bytes, channels: int, orig_sr: int, target_sr: int, mode="sinc_best", dtype=np.float32) -> np.ndarray:
"""
Convert the current multi-channel audio chunk to mono, resample it, and return a NumPy array
Args:
chunk: multi-channel audio chunk
channels: number of channels
orig_sr: original sample rate
target_sr: target sample rate
mode: resampling mode, one of 'sinc_best' | 'sinc_medium' | 'sinc_fastest' | 'zero_order_hold' | 'linear'
dtype: dtype of the returned NumPy array
Return:
mono audio chunk
"""
if channels == 1:
chunk_mono = np.frombuffer(chunk, dtype=np.int16)
chunk_mono = chunk_mono.astype(np.float32)
real_len = round(chunk_mono.shape[0] * target_sr / orig_sr)
if chunk_mono_r.shape[0] > real_len:
chunk_mono_r = chunk_mono_r[:real_len]
else:
# (length * channels,)
chunk_np = np.frombuffer(chunk, dtype=np.int16)
# (length, channels)
chunk_np = chunk_np.reshape(-1, channels)
# (length,)
chunk_mono = np.mean(chunk_np.astype(np.float32), axis=1)
ratio = target_sr / orig_sr
chunk_mono_r = samplerate.resample(chunk_mono, ratio, converter_type=mode)
chunk_mono_r = chunk_mono_r.astype(dtype)
return chunk_mono_r
def resample_mono_chunk(chunk: bytes, orig_sr: int, target_sr: int, mode="sinc_best") -> bytes:
"""
Resample the current mono audio chunk
Args:
chunk: mono audio chunk
orig_sr: original sample rate
target_sr: target sample rate
mode: resampling mode, one of 'sinc_best' | 'sinc_medium' | 'sinc_fastest' | 'zero_order_hold' | 'linear'
Return:
mono audio chunk
"""
chunk_np = np.frombuffer(chunk, dtype=np.int16)
chunk_np = chunk_np.astype(np.float32)
ratio = target_sr / orig_sr
chunk_r = samplerate.resample(chunk_np, ratio, converter_type=mode)
chunk_r = np.round(chunk_r).astype(np.int16)
return chunk_r.tobytes()
while chunk_mono_r.shape[0] < real_len:
chunk_mono_r = np.append(chunk_mono_r, chunk_mono_r[-1])
return chunk_mono_r.tobytes()
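The channel-merge half of resample_chunk_mono can be sketched with the stdlib alone, matching the early-return path above where orig_sr == target_sr and no resampy call is needed. Integer averaging here is a simplification; the real code averages on float32 via np.mean:

```python
from array import array

def merge_chunk_channels(chunk: bytes, channels: int) -> bytes:
    """Average interleaved int16 frames down to mono (stdlib-only sketch)."""
    samples = array('h', chunk)
    mono = array('h', (
        sum(samples[i:i + channels]) // channels
        for i in range(0, len(samples), channels)
    ))
    return mono.tobytes()

# Two stereo frames: (100, 200) and (-50, 50) average to 150 and 0
stereo = array('h', [100, 200, -50, 50]).tobytes()
print(array('h', merge_chunk_channels(stereo, 2)).tolist())  # [150, 0]
```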


@@ -1,12 +1,12 @@
import socket
import threading
import json
from utils import thread_data, stdout_cmd, stderr
from utils import shared_data, stdout_cmd, stderr
def handle_client(client_socket):
global thread_data
while thread_data.status == 'running':
global shared_data
while shared_data.status == 'running':
try:
data = client_socket.recv(4096).decode('utf-8')
if not data:
@@ -14,13 +14,13 @@ def handle_client(client_socket):
data = json.loads(data)
if data['command'] == 'stop':
thread_data.status = 'stop'
shared_data.status = 'stop'
break
except Exception as e:
stderr(f'Communication error: {e}')
break
thread_data.status = 'stop'
shared_data.status = 'stop'
client_socket.close()

engine/utils/shared.py (new file)

@@ -0,0 +1,8 @@
import queue
class SharedData:
def __init__(self):
self.status = "running"
self.chunk_queue = queue.Queue()
shared_data = SharedData()
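SharedData centralizes the run state and the chunk queue that audio_recording fills on its capture thread and the recognizers drain. A minimal sketch of that handoff; the None sentinel is an assumption for illustration, whereas the real loop checks shared_data.status instead:

```python
import queue
import threading

class SharedData:
    def __init__(self):
        self.status = "running"
        self.chunk_queue = queue.Queue()

shared = SharedData()

def capture():
    # Capture thread pushes raw chunks; None marks end-of-stream in this sketch
    for i in range(3):
        shared.chunk_queue.put(f"chunk-{i}".encode())
    shared.chunk_queue.put(None)

threading.Thread(target=capture, daemon=True).start()

received = []
while True:
    chunk = shared.chunk_queue.get()
    if chunk is None:
        break
    received.append(chunk)
print(received)  # [b'chunk-0', b'chunk-1', b'chunk-2']
```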


@@ -1,5 +1,10 @@
import sys
import json
import sherpa_onnx
display_caption = False
caption_index = -1
display = sherpa_onnx.Display()
def stdout(text: str):
stdout_cmd("print", text)
@@ -12,7 +17,42 @@ def stdout_cmd(command: str, content = ""):
sys.stdout.write(json.dumps(msg) + "\n")
sys.stdout.flush()
def change_caption_display(val: bool):
global display_caption
display_caption = val
def caption_display(obj):
global display_caption
global caption_index
global display
if caption_index >= 0 and caption_index != int(obj['index']):
display.finalize_current_sentence()
caption_index = int(obj['index'])
full_text = f"{obj['text']}\n{obj['translation']}"
if obj['translation']:
full_text += "\n"
display.update_text(full_text)
display.display()
def translation_display(obj):
global display
full_text = f"{obj['text']}\n{obj['translation']}"
if obj['translation']:
full_text += "\n"
display.update_text(full_text)
display.display()
display.finalize_current_sentence()
def stdout_obj(obj):
global display_caption
if obj['command'] == 'caption' and display_caption:
caption_display(obj)
return
if obj['command'] == 'translation' and display_caption:
translation_display(obj)
return
sys.stdout.write(json.dumps(obj) + "\n")
sys.stdout.flush()


@@ -1,5 +0,0 @@
class ThreadData:
def __init__(self):
self.status = "running"
thread_data = ThreadData()


@@ -0,0 +1,83 @@
from ollama import chat, Client
from ollama import ChatResponse
try:
from openai import OpenAI
except ImportError:
OpenAI = None
import asyncio
from googletrans import Translator
from .sysout import stdout_cmd, stdout_obj
lang_map = {
'en': 'English',
'es': 'Spanish',
'fr': 'French',
'de': 'German',
'it': 'Italian',
'ru': 'Russian',
'ja': 'Japanese',
'ko': 'Korean',
'zh': 'Chinese',
'zh-cn': 'Chinese'
}
def ollama_translate(model: str, target: str, text: str, time_s: str, url: str = '', key: str = ''):
content = ""
try:
if url:
if OpenAI:
client = OpenAI(base_url=url, api_key=key if key else "ollama")
openai_response = client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": f"/no_think Translate the following content into {lang_map[target]}, and do not output any additional information."},
{"role": "user", "content": text}
]
)
content = openai_response.choices[0].message.content or ""
else:
client = Client(host=url)
response: ChatResponse = client.chat(
model=model,
messages=[
{"role": "system", "content": f"/no_think Translate the following content into {lang_map[target]}, and do not output any additional information."},
{"role": "user", "content": text}
]
)
content = response.message.content or ""
else:
response: ChatResponse = chat(
model=model,
messages=[
{"role": "system", "content": f"/no_think Translate the following content into {lang_map[target]}, and do not output any additional information."},
{"role": "user", "content": text}
]
)
content = response.message.content or ""
except Exception as e:
stdout_cmd("warn", f"Translation failed: {str(e)}")
return
if content.startswith('<think>'):
index = content.find('</think>')
if index != -1:
content = content[index+8:]
stdout_obj({
"command": "translation",
"time_s": time_s,
"text": text,
"translation": content.strip()
})
def google_translate(model: str, target: str, text: str, time_s: str):
translator = Translator()
try:
res = asyncio.run(translator.translate(text, dest=target))
stdout_obj({
"command": "translation",
"time_s": time_s,
"text": text,
"translation": res.text
})
except Exception as e:
stdout_cmd("warn", f"Google translation request failed: {str(e)}, please check your network connection...")
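The &lt;think&gt; stripping in ollama_translate guards against reasoning models that prefix their answer with a thought block. The same logic in isolation, as a sketch:

```python
def strip_think(content: str) -> str:
    """Drop a leading <think>...</think> block emitted by some reasoning models."""
    if content.startswith('<think>'):
        end = content.find('</think>')
        if end != -1:
            content = content[end + len('</think>'):]
    return content.strip()

print(strip_think('<think>pondering...</think>\nBonjour'))  # Bonjour
print(strip_think('Bonjour'))  # Bonjour
```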

package-lock.json (generated)

@@ -1,12 +1,12 @@
{
"name": "auto-caption",
"version": "0.7.0",
"version": "1.0.0",
"lockfileVersion": 3,
"requires": true,
"packages": {
"": {
"name": "auto-caption",
"version": "0.7.0",
"version": "1.0.0",
"hasInstallScript": true,
"dependencies": {
"@electron-toolkit/preload": "^3.0.1",
@@ -110,6 +110,7 @@
"integrity": "sha512-IaaGWsQqfsQWVLqMn9OB92MNN7zukfVA4s7KKAI0KfrrDsZ0yhi5uV4baBuLuN7n3vsZpwP8asPPcVwApxvjBQ==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"@ampproject/remapping": "^2.2.0",
"@babel/code-frame": "^7.27.1",
@@ -2274,6 +2275,7 @@
"resolved": "https://registry.npmmirror.com/@types/node/-/node-22.15.17.tgz",
"integrity": "sha512-wIX2aSZL5FE+MR0JlvF87BNVrtFWf6AE6rxSE9X7OwnVvoyCQjpzSRJ+M87se/4QCkCiebQAqrJ0y6fwIyi7nw==",
"license": "MIT",
"peer": true,
"dependencies": {
"undici-types": "~6.21.0"
}
@@ -2360,6 +2362,7 @@
"integrity": "sha512-B2MdzyWxCE2+SqiZHAjPphft+/2x2FlO9YBx7eKE1BCb+rqBlQdhtAEhzIEdozHd55DXPmxBdpMygFJjfjjA9A==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"@typescript-eslint/scope-manager": "8.32.0",
"@typescript-eslint/types": "8.32.0",
@@ -2791,6 +2794,7 @@
"integrity": "sha512-NZyJarBfL7nWwIq+FDL6Zp/yHEhePMNnnJ0y3qfieCrmNvYct8uvtiV41UvlSe6apAfk0fY1FbWx+NwfmpvtTg==",
"dev": true,
"license": "MIT",
"peer": true,
"bin": {
"acorn": "bin/acorn"
},
@@ -2851,6 +2855,7 @@
"integrity": "sha512-j3fVLgvTo527anyYyJOGTYJbG+vnnQYvE0m5mmkc1TK+nxAppkCLMIL0aZ4dblVCNoGShhm+kzE4ZUykBoMg4g==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"fast-deep-equal": "^3.1.1",
"fast-json-stable-stringify": "^2.0.0",
@@ -3064,7 +3069,6 @@
"integrity": "sha512-+25nxyyznAXF7Nef3y0EbBeqmGZgeN/BxHX29Rs39djAfaFalmQ89SE6CWyDCHzGL0yt/ycBtNOmGTW0FyGWNw==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"archiver-utils": "^2.1.0",
"async": "^3.2.4",
@@ -3084,7 +3088,6 @@
"integrity": "sha512-bEL/yUb/fNNiNTuUz979Z0Yg5L+LzLxGJz8x79lYmR54fmTIb6ob/hNQgkQnIUDWIFjZVQwl9Xs356I6BAMHfw==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"glob": "^7.1.4",
"graceful-fs": "^4.2.0",
@@ -3107,7 +3110,6 @@
"integrity": "sha512-8p0AUk4XODgIewSi0l8Epjs+EVnWiK7NoDIEGU0HhE7+ZyY8D1IMY7odu5lRrFXGg71L15KG8QrPmum45RTtdA==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"core-util-is": "~1.0.0",
"inherits": "~2.0.3",
@@ -3123,8 +3125,7 @@
"resolved": "https://registry.npmmirror.com/safe-buffer/-/safe-buffer-5.1.2.tgz",
"integrity": "sha512-Gd2UZBJDkXlY7GbJxfsE8/nvKkUEU1G38c1siN6QP6a9PT9MmHB8GnpscSmMJSoF8LOIrt8ud/wPtojys4G6+g==",
"dev": true,
"license": "MIT",
"peer": true
"license": "MIT"
},
"node_modules/archiver-utils/node_modules/string_decoder": {
"version": "1.1.1",
@@ -3132,7 +3133,6 @@
"integrity": "sha512-n/ShnvDi6FHbbVfviro+WojiFzv+s8MPMHBczVePfUpDJLwoLT0ht1l4YwBCbi8pJAveEEdnkHyPyTP/mzRfwg==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"safe-buffer": "~5.1.0"
}
@@ -3351,6 +3351,7 @@
}
],
"license": "MIT",
"peer": true,
"dependencies": {
"caniuse-lite": "^1.0.30001716",
"electron-to-chromium": "^1.5.149",
@@ -3848,7 +3849,6 @@
"integrity": "sha512-D3uMHtGc/fcO1Gt1/L7i1e33VOvD4A9hfQLP+6ewd+BvG/gQ84Yh4oftEhAdjSMgBgwGL+jsppT7JYNpo6MHHg==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"buffer-crc32": "^0.2.13",
"crc32-stream": "^4.0.2",
@@ -3994,7 +3994,6 @@
"integrity": "sha512-ROmzCKrTnOwybPcJApAA6WBWij23HVfGVNKqqrZpuyZOHqK2CwHSvpGuyt/UNNvaIjEd8X5IFGp4Mh+Ie1IHJQ==",
"dev": true,
"license": "Apache-2.0",
"peer": true,
"bin": {
"crc32": "bin/crc32.njs"
},
@@ -4008,7 +4007,6 @@
"integrity": "sha512-NT7w2JVU7DFroFdYkeq8cywxrgjPHWkdX1wjpRQXPX5Asews3tA+Ght6lddQO5Mkumffp3X7GEqku3epj2toIw==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"crc-32": "^1.2.0",
"readable-stream": "^3.4.0"
@@ -4248,6 +4246,7 @@
"integrity": "sha512-NoXo6Liy2heSklTI5OIZbCgXC1RzrDQsZkeEwXhdOro3FT1VBOvbubvscdPnjVuQ4AMwwv61oaH96AbiYg9EnQ==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"app-builder-lib": "25.1.8",
"builder-util": "25.1.7",
@@ -4410,6 +4409,7 @@
"integrity": "sha512-6dLslJrQYB1qvqVPYRv1PhAA/uytC66nUeiTcq2JXiBzrmTWCHppqtGUjZhvnSRVatBCT5/SFdizdzcBiEiYUg==",
"hasInstallScript": true,
"license": "MIT",
"peer": true,
"dependencies": {
"@electron/get": "^2.0.0",
"@types/node": "^22.7.7",
@@ -4454,7 +4454,6 @@
"integrity": "sha512-2ntkJ+9+0GFP6nAISiMabKt6eqBB0kX1QqHNWFWAXgi0VULKGisM46luRFpIBiU3u/TDmhZMM8tzvo2Abn3ayg==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"app-builder-lib": "25.1.8",
"archiver": "^5.3.1",
@@ -4468,7 +4467,6 @@
"integrity": "sha512-oRXApq54ETRj4eMiFzGnHWGy+zo5raudjuxN0b8H7s/RU2oW0Wvsx9O0ACRN/kRq9E8Vu/ReskGB5o3ji+FzHQ==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"graceful-fs": "^4.2.0",
"jsonfile": "^6.0.1",
@@ -4484,7 +4482,6 @@
"integrity": "sha512-5dgndWOriYSm5cnYaJNhalLNDKOqFwyDB/rr1E9ZsGciGvKPs8R2xYGCacuf3z6K1YKDz182fd+fY3cn3pMqXQ==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"universalify": "^2.0.0"
},
@@ -4498,7 +4495,6 @@
"integrity": "sha512-gptHNQghINnc/vTGIk0SOFGFNXw7JVrlRUtConJRlvaw6DuX0wO5Jeko9sWrMBhh+PsYAZ7oXAiOnf/UKogyiw==",
"dev": true,
"license": "MIT",
"peer": true,
"engines": {
"node": ">= 10.0.0"
}
@@ -4813,6 +4809,7 @@
"integrity": "sha512-LSehfdpgMeWcTZkWZVIJl+tkZ2nuSkyyB9C27MZqFWXuph7DvaowgcTvKqxvpLW1JZIk8PN7hFY3Rj9LQ7m7lg==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"@eslint-community/eslint-utils": "^4.2.0",
"@eslint-community/regexpp": "^4.12.1",
@@ -4874,6 +4871,7 @@
"integrity": "sha512-zc1UmCpNltmVY34vuLRV61r1K27sWuX39E+uyUnY8xS2Bex88VV9cugG+UZbRSRGtGyFboj+D8JODyme1plMpw==",
"dev": true,
"license": "MIT",
"peer": true,
"bin": {
"eslint-config-prettier": "bin/cli.js"
},
@@ -5351,8 +5349,7 @@
"resolved": "https://registry.npmmirror.com/fs-constants/-/fs-constants-1.0.0.tgz",
"integrity": "sha512-y6OAwoSIf7FyjMIv94u+b5rdheZEjzR63GTyZJm5qh4Bi+2YgwLCcI/fPFZkL5PSixOt6ZNKm+w+Hfp/Bciwow==",
"dev": true,
"license": "MIT",
"peer": true
"license": "MIT"
},
"node_modules/fs-extra": {
"version": "8.1.0",
@@ -6108,8 +6105,7 @@
"resolved": "https://registry.npmmirror.com/isarray/-/isarray-1.0.0.tgz",
"integrity": "sha512-VLghIWNM6ELQzo7zwmcg0NmTVyWKYjvIeM83yjp0wRDTmUnrM678fQbcKBo6n2CJEF0szoG//ytg+TKla89ALQ==",
"dev": true,
"license": "MIT",
"peer": true
"license": "MIT"
},
"node_modules/isbinaryfile": {
"version": "5.0.4",
@@ -6300,7 +6296,6 @@
"integrity": "sha512-b94GiNHQNy6JNTrt5w6zNyffMrNkXZb3KTkCZJb2V1xaEGCk093vkZ2jk3tpaeP33/OiXC+WvK9AxUebnf5nbw==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"readable-stream": "^2.0.5"
},
@@ -6314,7 +6309,6 @@
"integrity": "sha512-8p0AUk4XODgIewSi0l8Epjs+EVnWiK7NoDIEGU0HhE7+ZyY8D1IMY7odu5lRrFXGg71L15KG8QrPmum45RTtdA==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"core-util-is": "~1.0.0",
"inherits": "~2.0.3",
@@ -6330,8 +6324,7 @@
"resolved": "https://registry.npmmirror.com/safe-buffer/-/safe-buffer-5.1.2.tgz",
"integrity": "sha512-Gd2UZBJDkXlY7GbJxfsE8/nvKkUEU1G38c1siN6QP6a9PT9MmHB8GnpscSmMJSoF8LOIrt8ud/wPtojys4G6+g==",
"dev": true,
"license": "MIT",
"peer": true
"license": "MIT"
},
"node_modules/lazystream/node_modules/string_decoder": {
"version": "1.1.1",
@@ -6339,7 +6332,6 @@
"integrity": "sha512-n/ShnvDi6FHbbVfviro+WojiFzv+s8MPMHBczVePfUpDJLwoLT0ht1l4YwBCbi8pJAveEEdnkHyPyTP/mzRfwg==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"safe-buffer": "~5.1.0"
}
@@ -6391,32 +6383,28 @@
"resolved": "https://registry.npmmirror.com/lodash.defaults/-/lodash.defaults-4.2.0.tgz",
"integrity": "sha512-qjxPLHd3r5DnsdGacqOMU6pb/avJzdh9tFX2ymgoZE27BmjXrNy/y4LoaiTeAb+O3gL8AfpJGtqfX/ae2leYYQ==",
"dev": true,
"license": "MIT",
"peer": true
"license": "MIT"
},
"node_modules/lodash.difference": {
"version": "4.5.0",
"resolved": "https://registry.npmmirror.com/lodash.difference/-/lodash.difference-4.5.0.tgz",
"integrity": "sha512-dS2j+W26TQ7taQBGN8Lbbq04ssV3emRw4NY58WErlTO29pIqS0HmoT5aJ9+TUQ1N3G+JOZSji4eugsWwGp9yPA==",
"dev": true,
"license": "MIT",
"peer": true
"license": "MIT"
},
"node_modules/lodash.flatten": {
"version": "4.4.0",
"resolved": "https://registry.npmmirror.com/lodash.flatten/-/lodash.flatten-4.4.0.tgz",
"integrity": "sha512-C5N2Z3DgnnKr0LOpv/hKCgKdb7ZZwafIrsesve6lmzvZIRZRGaZ/l6Q8+2W7NaT+ZwO3fFlSCzCzrDCFdJfZ4g==",
"dev": true,
"license": "MIT",
"peer": true
"license": "MIT"
},
"node_modules/lodash.isplainobject": {
"version": "4.0.6",
"resolved": "https://registry.npmmirror.com/lodash.isplainobject/-/lodash.isplainobject-4.0.6.tgz",
"integrity": "sha512-oSXzaWypCMHkPC3NvBEaPHf0KsA5mvPrOPgQWDsbg8n7orZ290M0BmC/jgRZ4vcJ6DTAhjrsSYgdsW/F+MFOBA==",
"dev": true,
"license": "MIT",
"peer": true
"license": "MIT"
},
"node_modules/lodash.merge": {
"version": "4.6.2",
@@ -6430,8 +6418,7 @@
"resolved": "https://registry.npmmirror.com/lodash.union/-/lodash.union-4.6.0.tgz",
"integrity": "sha512-c4pB2CdGrGdjMKYLA+XiRDO7Y0PRQbm/Gzg8qMj+QH+pFVAoTp5sBpO0odL3FjoPCGjK96p6qsP+yQoiLoOBcw==",
"dev": true,
"license": "MIT",
"peer": true
"license": "MIT"
},
"node_modules/log-symbols": {
"version": "4.1.0",
@@ -6984,7 +6971,6 @@
"integrity": "sha512-6eZs5Ls3WtCisHWp9S2GUy8dqkpGi4BVSz3GaqiE6ezub0512ESztXUwUB6C6IKbQkY2Pnb/mD4WYojCRwcwLA==",
"dev": true,
"license": "MIT",
"peer": true,
"engines": {
"node": ">=0.10.0"
}
@@ -7408,6 +7394,7 @@
"integrity": "sha512-QQtaxnoDJeAkDvDKWCLiwIXkTgRhwYDEQCghU9Z6q03iyek/rxRh/2lC3HB7P8sWT2xC/y5JDctPLBIGzHKbhw==",
"dev": true,
"license": "MIT",
"peer": true,
"bin": {
"prettier": "bin/prettier.cjs"
},
@@ -7436,8 +7423,7 @@
"resolved": "https://registry.npmmirror.com/process-nextick-args/-/process-nextick-args-2.0.1.tgz",
"integrity": "sha512-3ouUOpQhtgrbOa17J7+uxOTpITYWaGP7/AhoR3+A+/1e9skrzelGi/dXzEYyvbxubEF6Wn2ypscTKiKJFFn1ag==",
"dev": true,
"license": "MIT",
"peer": true
"license": "MIT"
},
"node_modules/progress": {
"version": "2.0.3",
@@ -7556,7 +7542,6 @@
"integrity": "sha512-v05I2k7xN8zXvPD9N+z/uhXPaj0sUFCe2rcWZIpBsqxfP7xXFQ0tipAd/wjj1YxWyWtUS5IDJpOG82JKt2EAVA==",
"dev": true,
"license": "Apache-2.0",
"peer": true,
"dependencies": {
"minimatch": "^5.1.0"
}
@@ -7567,7 +7552,6 @@
"integrity": "sha512-lKwV/1brpG6mBUFHtb7NUmtABCb2WZZmm2wNiOA5hAb8VdCS4B3dtMWyvcoViccwAW/COERjXLt0zP1zXUN26g==",
"dev": true,
"license": "ISC",
"peer": true,
"dependencies": {
"brace-expansion": "^2.0.1"
},
@@ -8235,7 +8219,6 @@
"integrity": "sha512-ujeqbceABgwMZxEJnk2HDY2DlnUZ+9oEcb1KzTVfYHio0UE6dG71n60d8D2I4qNvleWrrXpmjpt7vZeF1LnMZQ==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"bl": "^4.0.3",
"end-of-stream": "^1.4.1",
@@ -8360,6 +8343,7 @@
"integrity": "sha512-M7BAV6Rlcy5u+m6oPhAPFgJTzAioX/6B0DxyvDlo9l8+T3nLKbrczg2WLUyzd45L8RqfUMyGPzekbMvX2Ldkwg==",
"dev": true,
"license": "MIT",
"peer": true,
"engines": {
"node": ">=12"
},
@@ -8462,6 +8446,7 @@
"integrity": "sha512-p1diW6TqL9L07nNxvRMM7hMMw4c5XOo/1ibL4aAIGmSAt9slTE1Xgw5KWuof2uTOvCg9BY7ZRi+GaF+7sfgPeQ==",
"devOptional": true,
"license": "Apache-2.0",
"peer": true,
"bin": {
"tsc": "bin/tsc",
"tsserver": "bin/tsserver"
@@ -8611,6 +8596,7 @@
"integrity": "sha512-cZn6NDFE7wdTpINgs++ZJ4N49W2vRp8LCKrn3Ob1kYNtOo21vfDoaV5GzBfLU4MovSAB8uNRm4jgzVQZ+mBzPQ==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"esbuild": "^0.25.0",
"fdir": "^6.4.4",
@@ -8701,6 +8687,7 @@
"integrity": "sha512-M7BAV6Rlcy5u+m6oPhAPFgJTzAioX/6B0DxyvDlo9l8+T3nLKbrczg2WLUyzd45L8RqfUMyGPzekbMvX2Ldkwg==",
"dev": true,
"license": "MIT",
"peer": true,
"engines": {
"node": ">=12"
},
@@ -8720,6 +8707,7 @@
"resolved": "https://registry.npmmirror.com/vue/-/vue-3.5.13.tgz",
"integrity": "sha512-wmeiSMxkZCSc+PM2w2VRsOYAZC8GdipNFRTsLSfodVqI9mbejKeXEGr8SckuLnrQPGe3oJN5c3K0vpoU9q/wCQ==",
"license": "MIT",
"peer": true,
"dependencies": {
"@vue/compiler-dom": "3.5.13",
"@vue/compiler-sfc": "3.5.13",
@@ -8742,6 +8730,7 @@
"integrity": "sha512-dbCBnd2e02dYWsXoqX5yKUZlOt+ExIpq7hmHKPb5ZqKcjf++Eo0hMseFTZMLKThrUk61m+Uv6A2YSBve6ZvuDQ==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"debug": "^4.4.0",
"eslint-scope": "^8.2.0",
@@ -9046,7 +9035,6 @@
"integrity": "sha512-9qv4rlDiopXg4E69k+vMHjNN63YFMe9sZMrdlvKnCjlCRWeCBswPPMPUfx+ipsAWq1LXHe70RcbaHdJJpS6hyQ==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"archiver-utils": "^3.0.4",
"compress-commons": "^4.1.2",
@@ -9062,7 +9050,6 @@
"integrity": "sha512-KVgf4XQVrTjhyWmx6cte4RxonPLR9onExufI1jhvw/MQ4BB6IsZD5gT8Lq+u/+pRkWna/6JoHpiQioaqFP5Rzw==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"glob": "^7.2.3",
"graceful-fs": "^4.2.0",


@@ -1,7 +1,7 @@
{
"name": "auto-caption",
"productName": "Auto Caption",
"version": "0.7.0",
"version": "1.1.1",
"description": "A cross-platform subtitle display software.",
"main": "./out/main/index.js",
"author": "himeditator",


@@ -77,10 +77,9 @@ class CaptionWindow {
}
})
ipcMain.on('caption.pin.set', (_, pinned) => {
ipcMain.on('caption.mouseEvents.ignore', (_, ignore: boolean) => {
if(this.window){
if(pinned) this.window.setAlwaysOnTop(true, 'screen-saver')
else this.window.setAlwaysOnTop(false)
this.window.setIgnoreMouseEvents(ignore, { forward: ignore })
}
})
}


@@ -159,6 +159,10 @@ class ControlWindow {
captionEngine.stop()
})
ipcMain.on('control.engine.forceKill', () => {
captionEngine.kill()
})
ipcMain.on('control.captionLog.clear', () => {
allConfig.captionLog.splice(0)
})

View File

@@ -4,5 +4,6 @@ export default {
"engine.start.error": "Caption engine failed to start: ",
"engine.output.parse.error": "Unable to parse caption engine output as a JSON object: ",
"engine.error": "Caption engine error: ",
"engine.shutdown.error": "Failed to shut down the caption engine process: "
"engine.shutdown.error": "Failed to shut down the caption engine process: ",
"engine.start.timeout": "Caption engine startup timeout, automatically force stopped"
}

View File

@@ -4,5 +4,6 @@ export default {
"engine.start.error": "字幕エンジンの起動に失敗しました: ",
"engine.output.parse.error": "字幕エンジンの出力を JSON オブジェクトとして解析できませんでした: ",
"engine.error": "字幕エンジンエラー: ",
"engine.shutdown.error": "字幕エンジンプロセスの終了に失敗しました: "
"engine.shutdown.error": "字幕エンジンプロセスの終了に失敗しました: ",
"engine.start.timeout": "字幕エンジンの起動がタイムアウトしました。自動的に強制停止しました"
}

View File

@@ -4,5 +4,6 @@ export default {
"engine.start.error": "字幕引擎启动失败:",
"engine.output.parse.error": "字幕引擎输出内容无法解析为 JSON 对象:",
"engine.error": "字幕引擎错误:",
"engine.shutdown.error": "字幕引擎进程关闭失败:"
"engine.shutdown.error": "字幕引擎进程关闭失败:",
"engine.start.timeout": "字幕引擎启动超时,已自动强制停止"
}

View File

@@ -6,17 +6,29 @@ export interface Controls {
engineEnabled: boolean,
sourceLang: string,
targetLang: string,
transModel: string,
ollamaName: string,
ollamaUrl: string,
ollamaApiKey: string,
engine: string,
audio: 0 | 1,
translation: boolean,
recording: boolean,
API_KEY: string,
modelPath: string,
voskModelPath: string,
sosvModelPath: string,
glmUrl: string,
glmModel: string,
glmApiKey: string,
recordingPath: string,
customized: boolean,
customizedApp: string,
customizedCommand: string
customizedCommand: string,
startTimeoutSeconds: number
}
export interface Styles {
lineNumber: number,
lineBreak: number,
fontFamily: string,
fontSize: number,

View File

@@ -4,10 +4,23 @@ import {
} from '../types'
import { Log } from './Log'
import { app, BrowserWindow } from 'electron'
import { passwordMaskingForObject } from './UtilsFunc'
import * as path from 'path'
import * as fs from 'fs'
import * as os from 'os'
interface CaptionTranslation {
time_s: string,
translation: string
}
function getDesktopPath() {
const homeDir = os.homedir()
return path.join(homeDir, 'Desktop')
}
const defaultStyles: Styles = {
lineNumber: 1,
lineBreak: 1,
fontFamily: 'sans-serif',
fontSize: 24,
@@ -31,15 +44,26 @@ const defaultStyles: Styles = {
const defaultControls: Controls = {
sourceLang: 'en',
targetLang: 'zh',
transModel: 'ollama',
ollamaName: 'qwen2.5:0.5b',
ollamaUrl: 'http://localhost:11434',
ollamaApiKey: '',
engine: 'gummy',
audio: 0,
engineEnabled: false,
API_KEY: '',
modelPath: '',
voskModelPath: '',
sosvModelPath: '',
glmUrl: 'https://open.bigmodel.cn/api/paas/v4/audio/transcriptions',
glmModel: 'glm-asr-2512',
glmApiKey: '',
recordingPath: getDesktopPath(),
translation: true,
recording: false,
customized: false,
customizedApp: '',
customizedCommand: ''
customizedCommand: '',
startTimeoutSeconds: 30
};
@@ -128,9 +152,7 @@ class AllConfig {
}
}
this.controls.engineEnabled = engineEnabled
let _controls = {...this.controls}
_controls.API_KEY = _controls.API_KEY.replace(/./g, '*')
Log.info('Set Controls:', _controls)
Log.info('Set Controls:', passwordMaskingForObject(this.controls))
}
public sendControls(window: BrowserWindow, info = true) {
@@ -157,12 +179,28 @@ class AllConfig {
}
}
public sendCaptionLog(window: BrowserWindow, command: 'add' | 'upd' | 'set') {
public updateCaptionTranslation(trans: CaptionTranslation){
for(let i = this.captionLog.length - 1; i >= 0; i--){
if(this.captionLog[i].time_s === trans.time_s){
this.captionLog[i].translation = trans.translation
for(const window of BrowserWindow.getAllWindows()){
this.sendCaptionLog(window, 'upd', i)
}
break
}
}
}
public sendCaptionLog(
window: BrowserWindow,
command: 'add' | 'upd' | 'set',
index: number | undefined = undefined
) {
if(command === 'add'){
window.webContents.send(`both.captionLog.add`, this.captionLog[this.captionLog.length - 1])
window.webContents.send(`both.captionLog.add`, this.captionLog.at(-1))
}
else if(command === 'upd'){
window.webContents.send(`both.captionLog.upd`, this.captionLog[this.captionLog.length - 1])
if(index !== undefined) window.webContents.send(`both.captionLog.upd`, this.captionLog[index])
else window.webContents.send(`both.captionLog.upd`, this.captionLog.at(-1))
}
else if(command === 'set'){
window.webContents.send(`both.captionLog.set`, this.captionLog)
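The `updateCaptionTranslation` method added above backfills a translation into an earlier log entry by scanning from the tail and then sending a targeted `upd` with the matched index. A minimal standalone sketch of that lookup pattern follows; the `CaptionEntry` shape and sample data are illustrative, not the app's real types.

```typescript
// Sketch of the translation-backfill pattern: scan the log from the end,
// match on time_s, patch the entry in place, and return its index so the
// caller can send a targeted 'upd' instead of always updating the last entry.
interface CaptionEntry {
  time_s: string
  text: string
  translation: string
}

function updateTranslation(log: CaptionEntry[], time_s: string, translation: string): number {
  // Search backwards: the matching caption is almost always near the tail.
  for (let i = log.length - 1; i >= 0; i--) {
    if (log[i].time_s === time_s) {
      log[i].translation = translation
      return i
    }
  }
  return -1 // no caption with that timestamp
}

const log: CaptionEntry[] = [
  { time_s: '00:01', text: 'hello', translation: '' },
  { time_s: '00:03', text: 'world', translation: '' }
]
console.log(updateTranslation(log, '00:01', 'bonjour')) // prints 0
console.log(log[0].translation) // prints bonjour
```

Returning the index is what makes the new optional `index` parameter on `sendCaptionLog` useful: non-realtime translations can arrive after newer captions have been appended.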

View File

@@ -1,12 +1,13 @@
import { exec, spawn } from 'child_process'
import { app } from 'electron'
import { is } from '@electron-toolkit/utils'
import path from 'path'
import net from 'net'
import * as path from 'path'
import * as net from 'net'
import { controlWindow } from '../ControlWindow'
import { allConfig } from './AllConfig'
import { i18n } from '../i18n'
import { Log } from './Log'
import { passwordMaskingForList } from './UtilsFunc'
export class CaptionEngine {
appPath: string = ''
@@ -14,8 +15,9 @@ export class CaptionEngine {
process: any | undefined
client: net.Socket | undefined
port: number = 8080
status: 'running' | 'starting' | 'stopping' | 'stopped' = 'stopped'
status: 'running' | 'starting' | 'stopping' | 'stopped' | 'starting-timeout' = 'stopped'
timerID: NodeJS.Timeout | undefined
startTimeoutID: NodeJS.Timeout | undefined
private getApp(): boolean {
if (allConfig.controls.customized) {
@@ -59,43 +61,70 @@ export class CaptionEngine {
this.appPath = path.join(process.resourcesPath, 'engine', 'main.exe')
}
else {
this.appPath = path.join(process.resourcesPath, 'engine', 'main')
this.appPath = path.join(process.resourcesPath, 'engine', 'main', 'main')
}
}
this.command.push('-a', allConfig.controls.audio ? '1' : '0')
if(allConfig.controls.recording) {
this.command.push('-r', '1')
this.command.push('-rp', `"${allConfig.controls.recordingPath}"`)
}
this.port = Math.floor(Math.random() * (65535 - 1024 + 1)) + 1024
this.command.push('-p', this.port.toString())
this.command.push(
'-t', allConfig.controls.translation ?
allConfig.controls.targetLang : 'none'
)
if(allConfig.controls.engine === 'gummy') {
this.command.push('-e', 'gummy')
this.command.push('-s', allConfig.controls.sourceLang)
this.command.push(
'-t', allConfig.controls.translation ?
allConfig.controls.targetLang : 'none'
)
if(allConfig.controls.API_KEY) {
this.command.push('-k', allConfig.controls.API_KEY)
}
}
else if(allConfig.controls.engine === 'vosk'){
this.command.push('-e', 'vosk')
this.command.push('-m', `"${allConfig.controls.modelPath}"`)
this.command.push('-vosk', `"${allConfig.controls.voskModelPath}"`)
this.command.push('-tm', allConfig.controls.transModel)
this.command.push('-omn', allConfig.controls.ollamaName)
if(allConfig.controls.ollamaUrl) this.command.push('-ourl', allConfig.controls.ollamaUrl)
if(allConfig.controls.ollamaApiKey) this.command.push('-okey', allConfig.controls.ollamaApiKey)
}
else if(allConfig.controls.engine === 'sosv'){
this.command.push('-e', 'sosv')
this.command.push('-s', allConfig.controls.sourceLang)
this.command.push('-sosv', `"${allConfig.controls.sosvModelPath}"`)
this.command.push('-tm', allConfig.controls.transModel)
this.command.push('-omn', allConfig.controls.ollamaName)
if(allConfig.controls.ollamaUrl) this.command.push('-ourl', allConfig.controls.ollamaUrl)
if(allConfig.controls.ollamaApiKey) this.command.push('-okey', allConfig.controls.ollamaApiKey)
}
else if(allConfig.controls.engine === 'glm'){
this.command.push('-e', 'glm')
this.command.push('-s', allConfig.controls.sourceLang)
this.command.push('-gurl', allConfig.controls.glmUrl)
this.command.push('-gmodel', allConfig.controls.glmModel)
if(allConfig.controls.glmApiKey) {
this.command.push('-gkey', allConfig.controls.glmApiKey)
}
this.command.push('-tm', allConfig.controls.transModel)
this.command.push('-omn', allConfig.controls.ollamaName)
if(allConfig.controls.ollamaUrl) this.command.push('-ourl', allConfig.controls.ollamaUrl)
if(allConfig.controls.ollamaApiKey) this.command.push('-okey', allConfig.controls.ollamaApiKey)
}
}
Log.info('Engine Path:', this.appPath)
if(this.command.length > 2 && this.command.at(-2) === '-k') {
const _command = [...this.command]
_command[_command.length -1] = _command[_command.length -1].replace(/./g, '*')
Log.info('Engine Command:', _command)
}
else Log.info('Engine Command:', this.command)
Log.info('Engine Command:', passwordMaskingForList(this.command))
return true
}
public connect() {
if(this.client) { Log.warn('Client already exists, ignoring...') }
if (this.startTimeoutID) {
clearTimeout(this.startTimeoutID)
this.startTimeoutID = undefined
}
this.client = net.createConnection({ port: this.port }, () => {
Log.info('Connected to caption engine server');
});
@@ -130,6 +159,16 @@ export class CaptionEngine {
this.process = spawn(this.appPath, this.command)
this.status = 'starting'
Log.info('Caption Engine Starting, PID:', this.process.pid)
const timeoutMs = allConfig.controls.startTimeoutSeconds * 1000
this.startTimeoutID = setTimeout(() => {
if (this.status === 'starting') {
Log.warn(`Engine start timeout after ${allConfig.controls.startTimeoutSeconds} seconds, forcing kill...`)
this.status = 'starting-timeout'
controlWindow.sendErrorMessage(i18n('engine.start.timeout'))
this.kill()
}
}, timeoutMs)
this.process.stdout.on('data', (data: any) => {
const lines = data.toString().split('\n')
@@ -139,7 +178,7 @@ export class CaptionEngine {
const data_obj = JSON.parse(line)
handleEngineData(data_obj)
} catch (e) {
controlWindow.sendErrorMessage(i18n('engine.output.parse.error') + e)
// controlWindow.sendErrorMessage(i18n('engine.output.parse.error') + e)
Log.error('Error parsing JSON:', e)
}
}
@@ -165,6 +204,10 @@ export class CaptionEngine {
}
this.status = 'stopped'
clearInterval(this.timerID)
if (this.startTimeoutID) {
clearTimeout(this.startTimeoutID)
this.startTimeoutID = undefined
}
Log.info(`Engine exited with code ${code}`)
});
}
@@ -172,7 +215,6 @@ export class CaptionEngine {
public stop() {
if(this.status !== 'running'){
Log.warn('Trying to stop engine which is not running, current status:', this.status)
return
}
this.sendCommand('stop')
if(this.client){
@@ -192,19 +234,29 @@ export class CaptionEngine {
if(this.status !== 'running'){
Log.warn('Trying to kill engine which is not running, current status:', this.status)
}
Log.warn('Trying to kill engine process, PID:', this.process.pid)
Log.warn('Killing engine process, PID:', this.process.pid)
if (this.startTimeoutID) {
clearTimeout(this.startTimeoutID)
this.startTimeoutID = undefined
}
if(this.client){
this.client.destroy()
this.client = undefined
}
if (this.process.pid) {
let cmd = `kill ${this.process.pid}`;
let cmd = `kill -9 ${this.process.pid}`;
if (process.platform === "win32") {
cmd = `taskkill /pid ${this.process.pid} /t /f`
}
exec(cmd)
exec(cmd, (error) => {
if (error) {
Log.error('Failed to kill process:', error)
} else {
Log.info('Process killed successfully')
}
})
}
this.status = 'stopping'
}
}
@@ -221,12 +273,18 @@ function handleEngineData(data: any) {
else if(data.command === 'caption') {
allConfig.updateCaptionLog(data);
}
else if(data.command === 'translation') {
allConfig.updateCaptionTranslation(data);
}
else if(data.command === 'print') {
Log.info('Engine Print:', data.content)
console.log(data.content)
}
else if(data.command === 'info') {
Log.info('Engine Info:', data.content)
}
else if(data.command === 'warn') {
Log.warn('Engine Warn:', data.content)
}
else if(data.command === 'error') {
Log.error('Engine Error:', data.content)
controlWindow.sendErrorMessage(/*i18n('engine.error') +*/ data.content)

View File

@@ -0,0 +1,24 @@
function passwordMasking(pwd: string) {
return pwd.replace(/./g, '*')
}
export function passwordMaskingForList(args: string[]) {
const maskedArgs = [...args]
for(let i = 1; i < maskedArgs.length; i++) {
if(maskedArgs[i-1] === '-k' || maskedArgs[i-1] === '-okey' || maskedArgs[i-1] === '-gkey') {
maskedArgs[i] = passwordMasking(maskedArgs[i])
}
}
return maskedArgs
}
export function passwordMaskingForObject(args: Record<string, any>) {
const maskedArgs = {...args}
for(const key in maskedArgs) {
const lKey = key.toLowerCase()
if(lKey.includes('api') && lKey.includes('key')) {
maskedArgs[key] = passwordMasking(maskedArgs[key])
}
}
return maskedArgs
}
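A quick usage sketch for the masking helpers in the new `UtilsFunc` file above (the two functions are reproduced verbatim so the snippet is self-contained):

```typescript
// Mask any value that follows a known API-key flag before logging the command.
function passwordMasking(pwd: string) {
  return pwd.replace(/./g, '*')
}

function passwordMaskingForList(args: string[]) {
  const maskedArgs = [...args]
  for (let i = 1; i < maskedArgs.length; i++) {
    if (maskedArgs[i - 1] === '-k' || maskedArgs[i - 1] === '-okey' || maskedArgs[i - 1] === '-gkey') {
      maskedArgs[i] = passwordMasking(maskedArgs[i])
    }
  }
  return maskedArgs
}

// 'sk-secret' is a made-up placeholder key, not a real credential format.
console.log(passwordMaskingForList(['-e', 'gummy', '-k', 'sk-secret']))
// prints [ '-e', 'gummy', '-k', '*********' ]
```

Note the copy-before-mutate (`[...args]`): the original command array is left untouched, so masking only affects what reaches the log.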

View File

@@ -2,7 +2,7 @@
<html>
<head>
<meta charset="UTF-8" />
<title>Auto Caption v0.7.0</title>
<title>Auto Caption v1.1.1</title>
<!-- https://developer.mozilla.org/en-US/docs/Web/HTTP/CSP -->
<meta
http-equiv="Content-Security-Policy"

View File

@@ -304,6 +304,7 @@ function clearCaptions() {
}
.caption-title {
color: var(--icon-color);
display: inline-block;
font-size: 24px;
font-weight: bold;

View File

@@ -6,6 +6,16 @@
<a @click="resetStyle">{{ $t('style.resetStyle') }}</a>
</template>
<div class="input-item">
<span class="input-label">{{ $t('style.lineNumber') }}</span>
<a-radio-group v-model:value="currentLineNumber">
<a-radio-button :value="1">1</a-radio-button>
<a-radio-button :value="2">2</a-radio-button>
<a-radio-button :value="3">3</a-radio-button>
<a-radio-button :value="4">4</a-radio-button>
</a-radio-group>
</div>
<div class="input-item">
<span class="input-label">{{ $t('style.longCaption') }}</span>
<a-select
@@ -178,28 +188,57 @@
textShadow: currentTextShadow ? `${currentOffsetX}px ${currentOffsetY}px ${currentBlur}px ${currentTextShadowColor}` : 'none'
}"
>
<p :class="[currentLineBreak?'':'left-ellipsis']"
:style="{
fontFamily: currentFontFamily,
fontSize: currentFontSize + 'px',
color: currentFontColor,
fontWeight: currentFontWeight * 100
}">
<span v-if="captionData.length">{{ captionData[captionData.length-1].text }}</span>
<span v-else>{{ $t('example.original') }}</span>
</p>
<p :class="[currentLineBreak?'':'left-ellipsis']"
v-if="currentTransDisplay"
:style="{
fontFamily: currentTransFontFamily,
fontSize: currentTransFontSize + 'px',
color: currentTransFontColor,
fontWeight: currentTransFontWeight * 100
}"
>
<span v-if="captionData.length">{{ captionData[captionData.length-1].translation }}</span>
<span v-else>{{ $t('example.translation') }}</span>
</p>
<template v-if="captionData.length">
<template
v-for="val in revArr[Math.min(currentLineNumber, captionData.length)]"
:key="captionData[captionData.length - val].time_s"
>
<p :class="[currentLineBreak?'':'left-ellipsis']"
:style="{
fontFamily: currentFontFamily,
fontSize: currentFontSize + 'px',
color: currentFontColor,
fontWeight: currentFontWeight * 100
}">
<span>{{ captionData[captionData.length - val].text }}</span>
</p>
<p :class="[currentLineBreak?'':'left-ellipsis']"
v-if="currentTransDisplay && captionData[captionData.length - val].translation"
:style="{
fontFamily: currentTransFontFamily,
fontSize: currentTransFontSize + 'px',
color: currentTransFontColor,
fontWeight: currentTransFontWeight * 100
}"
>
<span>{{ captionData[captionData.length - val].translation }}</span>
</p>
</template>
</template>
<template v-else>
<template v-for="val in currentLineNumber" :key="val">
<p :class="[currentLineBreak?'':'left-ellipsis']"
:style="{
fontFamily: currentFontFamily,
fontSize: currentFontSize + 'px',
color: currentFontColor,
fontWeight: currentFontWeight * 100
}">
<span>{{ $t('example.original') }}</span>
</p>
<p :class="[currentLineBreak?'':'left-ellipsis']"
v-if="currentTransDisplay"
:style="{
fontFamily: currentTransFontFamily,
fontSize: currentTransFontSize + 'px',
color: currentTransFontColor,
fontWeight: currentTransFontWeight * 100
}"
>
<span>{{ $t('example.translation') }}</span>
</p>
</template>
</template>
</div>
</Teleport>
@@ -212,6 +251,14 @@ import { storeToRefs } from 'pinia'
import { notification } from 'ant-design-vue'
import { useI18n } from 'vue-i18n'
import { useCaptionLogStore } from '@renderer/stores/captionLog';
const revArr = {
1: [1],
2: [2, 1],
3: [3, 2, 1],
4: [4, 3, 2, 1],
}
const captionLog = useCaptionLogStore();
const { captionData } = storeToRefs(captionLog);
@@ -220,6 +267,7 @@ const { t } = useI18n()
const captionStyle = useCaptionStyleStore()
const { changeSignal } = storeToRefs(captionStyle)
const currentLineNumber = ref<number>(1)
const currentLineBreak = ref<number>(0)
const currentFontFamily = ref<string>('sans-serif')
const currentFontSize = ref<number>(24)
@@ -253,6 +301,7 @@ function useSameStyle(){
}
function applyStyle(){
captionStyle.lineNumber = currentLineNumber.value;
captionStyle.lineBreak = currentLineBreak.value;
captionStyle.fontFamily = currentFontFamily.value;
captionStyle.fontSize = currentFontSize.value;
@@ -282,6 +331,7 @@ function applyStyle(){
}
function backStyle(){
currentLineNumber.value = captionStyle.lineNumber;
currentLineBreak.value = captionStyle.lineBreak;
currentFontFamily.value = captionStyle.fontFamily;
currentFontSize.value = captionStyle.fontSize;
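The multi-line rendering in the template above relies on `revArr`: `revArr[n]` yields offsets `[n..1]`, so indexing `captionData[captionData.length - val]` walks the last `n` captions in chronological order. Stripped of the Vue template, the selection logic looks like this (plain data, hypothetical function name):

```typescript
// Pick the last `lineNumber` entries in chronological (oldest-first) order,
// mirroring the template's revArr[Math.min(currentLineNumber, captionData.length)].
const revArr: Record<number, number[]> = {
  1: [1],
  2: [2, 1],
  3: [3, 2, 1],
  4: [4, 3, 2, 1]
}

function lastNInOrder<T>(data: T[], lineNumber: number): T[] {
  const n = Math.min(lineNumber, data.length)
  if (n === 0) return []
  // val counts down from n to 1, so data[length - val] ascends in time.
  return revArr[n].map(val => data[data.length - val])
}

console.log(lastNInOrder(['a', 'b', 'c', 'd'], 2)) // prints [ 'c', 'd' ]
```

Clamping with `Math.min` is what keeps the display correct while fewer captions exist than the configured line count.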

View File

@@ -5,23 +5,6 @@
<a @click="applyChange">{{ $t('engine.applyChange') }}</a> |
<a @click="cancelChange">{{ $t('engine.cancelChange') }}</a>
</template>
<div class="input-item">
<span class="input-label">{{ $t('engine.sourceLang') }}</span>
<a-select
class="input-area"
v-model:value="currentSourceLang"
:options="langList"
></a-select>
</div>
<div class="input-item">
<span class="input-label">{{ $t('engine.transLang') }}</span>
<a-select
:disabled="currentEngine === 'vosk'"
class="input-area"
v-model:value="currentTargetLang"
:options="langList.filter((item) => item.value !== 'auto')"
></a-select>
</div>
<div class="input-item">
<span class="input-label">{{ $t('engine.captionEngine') }}</span>
<a-select
@@ -30,6 +13,91 @@
:options="captionEngine"
></a-select>
</div>
<div class="input-item">
<span class="input-label">{{ $t('engine.sourceLang') }}</span>
<a-select
:disabled="currentEngine === 'vosk'"
class="input-area"
v-model:value="currentSourceLang"
:options="sLangList"
></a-select>
</div>
<div class="input-item">
<span class="input-label">{{ $t('engine.transLang') }}</span>
<a-select
class="input-area"
v-model:value="currentTargetLang"
:options="tLangList"
></a-select>
</div>
<div class="input-item" v-if="transModel">
<span class="input-label">{{ $t('engine.transModel') }}</span>
<a-select
class="input-area"
v-model:value="currentTransModel"
:options="transModel"
></a-select>
</div>
<div class="input-item" v-if="transModel && currentTransModel === 'ollama'">
<a-popover placement="right">
<template #content>
<p class="label-hover-info">{{ $t('engine.modelNameNote') }}</p>
</template>
<span class="input-label info-label"
:style="{color: uiColor}"
>{{ $t('engine.modelName') }}</span>
</a-popover>
<a-input
class="input-area"
v-model:value="currentOllamaName"
></a-input>
</div>
<div class="input-item" v-if="transModel && currentTransModel === 'ollama'">
<a-popover placement="right">
<template #content>
<p class="label-hover-info">{{ $t('engine.baseURL') }}</p>
</template>
<span class="input-label info-label"
:style="{color: uiColor}"
>Base URL</span>
</a-popover>
<a-input
class="input-area"
v-model:value="currentOllamaUrl"
placeholder="http://localhost:11434"
></a-input>
</div>
<div class="input-item" v-if="transModel && currentTransModel === 'ollama'">
<a-popover placement="right">
<template #content>
<p class="label-hover-info">{{ $t('engine.apiKey') }}</p>
</template>
<span class="input-label info-label"
:style="{color: uiColor}"
>API Key</span>
</a-popover>
<a-input
class="input-area"
type="password"
v-model:value="currentOllamaApiKey"
/>
</div>
<div class="input-item" v-if="currentEngine === 'glm'">
<span class="input-label">GLM API URL</span>
<a-input
class="input-area"
v-model:value="currentGlmUrl"
placeholder="https://open.bigmodel.cn/api/paas/v4/audio/transcriptions"
></a-input>
</div>
<div class="input-item" v-if="currentEngine === 'glm'">
<span class="input-label">GLM Model Name</span>
<a-input
class="input-area"
v-model:value="currentGlmModel"
placeholder="glm-asr-2512"
></a-input>
</div>
<div class="input-item">
<span class="input-label">{{ $t('engine.audioType') }}</span>
<a-select
@@ -43,9 +111,13 @@
<a-switch v-model:checked="currentTranslation" />
<span style="display:inline-block;width:10px;"></span>
<div style="display: inline-block;">
<span class="switch-label">{{ $t('engine.customEngine') }}</span>
<a-switch v-model:checked="currentCustomized" />
<span class="switch-label">{{ $t('engine.enableRecording') }}</span>
<a-switch v-model:checked="currentRecording" />
</div>
</div>
<div class="input-item">
<span class="input-label">{{ $t('engine.customEngine') }}</span>
<a-switch v-model:checked="currentCustomized" />
<span style="display:inline-block;width:10px;"></span>
<div style="display: inline-block;">
<span class="switch-label">{{ $t('engine.showMore') }}</span>
@@ -80,11 +152,16 @@
<a-card size="small" :title="$t('engine.showMore')" v-show="showMore" style="margin-top:10px;">
<div class="input-item">
<a-popover>
<a-popover placement="right">
<template #content>
<p class="label-hover-info">{{ $t('engine.apikeyInfo') }}</p>
<p><a href="https://bailian.console.aliyun.com" target="_blank">
https://bailian.console.aliyun.com
</a></p>
</template>
<span class="input-label info-label">{{ $t('engine.apikey') }}</span>
<span class="input-label info-label"
:style="{color: uiColor}"
>ALI {{ $t('engine.apikey') }}</span>
</a-popover>
<a-input
class="input-area"
@@ -93,20 +170,106 @@
/>
</div>
<div class="input-item">
<a-popover>
<a-popover placement="right">
<template #content>
<p class="label-hover-info">{{ $t('engine.modelPathInfo') }}</p>
<p class="label-hover-info">{{ $t('engine.glmApikeyInfo') }}</p>
<p><a href="https://open.bigmodel.cn/" target="_blank">
https://open.bigmodel.cn
</a></p>
</template>
<span class="input-label info-label">{{ $t('engine.modelPath') }}</span>
<span class="input-label info-label"
:style="{color: uiColor}"
>GLM {{ $t('engine.apikey') }}</span>
</a-popover>
<a-input
class="input-area"
type="password"
v-model:value="currentGlmApiKey"
/>
</div>
<div class="input-item">
<a-popover placement="right">
<template #content>
<p class="label-hover-info">{{ $t('engine.voskModelPathInfo') }}</p>
<p class="label-hover-info">
<a href="https://alphacephei.com/vosk/models" target="_blank">Vosk {{ $t('engine.modelDownload') }}</a>
</p>
</template>
<span class="input-label info-label"
:style="{color: uiColor}"
>{{ $t('engine.voskModelPath') }}</span>
</a-popover>
<span
class="input-folder"
@click="selectFolderPath"
:style="{color: uiColor}"
@click="selectFolderPath('vosk')"
><span><FolderOpenOutlined /></span></span>
<a-input
class="input-area"
style="width:calc(100% - 140px);"
v-model:value="currentModelPath"
v-model:value="currentVoskModelPath"
/>
</div>
<div class="input-item">
<a-popover placement="right">
<template #content>
<p class="label-hover-info">{{ $t('engine.sosvModelPathInfo') }}</p>
<p class="label-hover-info">
<a href="https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model" target="_blank">SOSV {{ $t('engine.modelDownload') }}</a>
</p>
</template>
<span class="input-label info-label"
:style="{color: uiColor}"
>{{ $t('engine.sosvModelPath') }}</span>
</a-popover>
<span
class="input-folder"
:style="{color: uiColor}"
@click="selectFolderPath('sosv')"
><span><FolderOpenOutlined /></span></span>
<a-input
class="input-area"
style="width:calc(100% - 140px);"
v-model:value="currentSosvModelPath"
/>
</div>
<div class="input-item">
<a-popover placement="right">
<template #content>
<p class="label-hover-info">{{ $t('engine.recordingPathInfo') }}</p>
</template>
<span class="input-label info-label"
:style="{color: uiColor}"
>{{ $t('engine.recordingPath') }}</span>
</a-popover>
<span
class="input-folder"
:style="{color: uiColor}"
@click="selectFolderPath('rec')"
><span><FolderOpenOutlined /></span></span>
<a-input
class="input-area"
style="width:calc(100% - 140px);"
v-model:value="currentRecordingPath"
/>
</div>
<div class="input-item">
<a-popover placement="right">
<template #content>
<p class="label-hover-info">{{ $t('engine.startTimeoutInfo') }}</p>
</template>
<span
class="input-label info-label"
:style="{color: uiColor, verticalAlign: 'middle'}"
>{{ $t('engine.startTimeout') }}</span>
</a-popover>
<a-input-number
class="input-area"
v-model:value="currentStartTimeoutSeconds"
:min="10"
:max="120"
:step="5"
:addon-after="$t('engine.seconds')"
/>
</div>
</a-card>
@@ -115,12 +278,12 @@
</template>
<script setup lang="ts">
import { ref, computed, watch } from 'vue'
import { ref, computed, watch, h } from 'vue'
import { storeToRefs } from 'pinia'
import { useGeneralSettingStore } from '@renderer/stores/generalSetting'
import { useEngineControlStore } from '@renderer/stores/engineControl'
import { notification } from 'ant-design-vue'
import { FolderOpenOutlined ,InfoCircleOutlined } from '@ant-design/icons-vue';
import { ExclamationCircleOutlined, FolderOpenOutlined ,InfoCircleOutlined } from '@ant-design/icons-vue';
import { useI18n } from 'vue-i18n'
const { t } = useI18n()
@@ -129,37 +292,93 @@ const showMore = ref(false)
const engineControl = useEngineControlStore()
const { captionEngine, audioType, changeSignal } = storeToRefs(engineControl)
const generalSetting = useGeneralSettingStore()
const { uiColor } = storeToRefs(generalSetting)
const currentSourceLang = ref('auto')
const currentTargetLang = ref('zh')
const currentEngine = ref<string>('gummy')
const currentAudio = ref<0 | 1>(0)
const currentTranslation = ref<boolean>(false)
const currentTranslation = ref<boolean>(true)
const currentRecording = ref<boolean>(false)
const currentTransModel = ref('ollama')
const currentOllamaName = ref('')
const currentOllamaUrl = ref('')
const currentOllamaApiKey = ref('')
const currentAPI_KEY = ref<string>('')
const currentModelPath = ref<string>('')
const currentVoskModelPath = ref<string>('')
const currentSosvModelPath = ref<string>('')
const currentGlmUrl = ref<string>('')
const currentGlmModel = ref<string>('')
const currentGlmApiKey = ref<string>('')
const currentRecordingPath = ref<string>('')
const currentCustomized = ref<boolean>(false)
const currentCustomizedApp = ref('')
const currentCustomizedCommand = ref('')
const currentStartTimeoutSeconds = ref<number>(30)
const langList = computed(() => {
const sLangList = computed(() => {
for(let item of captionEngine.value){
if(item.value === currentEngine.value) {
return item.languages
return item.languages.filter(item => item.type <= 0)
}
}
return []
})
const tLangList = computed(() => {
for(let item of captionEngine.value){
if(item.value === currentEngine.value) {
return item.languages.filter(item => item.type >= 0)
}
}
return []
})
const transModel = computed(() => {
for(let item of captionEngine.value){
if(item.value === currentEngine.value) {
return item.transModel
}
}
return []
})
function applyChange(){
if(
currentTranslation.value && transModel.value &&
currentTransModel.value === 'ollama' && !currentOllamaName.value.trim()
) {
notification.open({
message: t('noti.ollamaNameNull'),
description: t('noti.ollamaNameNullNote'),
duration: null,
icon: () => h(ExclamationCircleOutlined, { style: 'color: #ff4d4f' })
})
return
}
engineControl.sourceLang = currentSourceLang.value
engineControl.targetLang = currentTargetLang.value
engineControl.transModel = currentTransModel.value
engineControl.ollamaName = currentOllamaName.value
engineControl.engine = currentEngine.value
engineControl.ollamaUrl = currentOllamaUrl.value ?? "http://localhost:11434"
engineControl.ollamaApiKey = currentOllamaApiKey.value
engineControl.audio = currentAudio.value
engineControl.translation = currentTranslation.value
engineControl.recording = currentRecording.value
engineControl.API_KEY = currentAPI_KEY.value
engineControl.modelPath = currentModelPath.value
engineControl.voskModelPath = currentVoskModelPath.value
engineControl.sosvModelPath = currentSosvModelPath.value
engineControl.glmUrl = currentGlmUrl.value ?? "https://open.bigmodel.cn/api/paas/v4/audio/transcriptions"
engineControl.glmModel = currentGlmModel.value ?? "glm-asr-2512"
engineControl.glmApiKey = currentGlmApiKey.value
engineControl.recordingPath = currentRecordingPath.value
engineControl.customized = currentCustomized.value
engineControl.customizedApp = currentCustomizedApp.value
engineControl.customizedCommand = currentCustomizedCommand.value
engineControl.startTimeoutSeconds = currentStartTimeoutSeconds.value
engineControl.sendControlsChange()
@@ -173,20 +392,36 @@ function applyChange(){
function cancelChange(){
currentSourceLang.value = engineControl.sourceLang
currentTargetLang.value = engineControl.targetLang
currentTransModel.value = engineControl.transModel
currentOllamaName.value = engineControl.ollamaName
currentOllamaUrl.value = engineControl.ollamaUrl
currentOllamaApiKey.value = engineControl.ollamaApiKey
currentEngine.value = engineControl.engine
currentAudio.value = engineControl.audio
currentTranslation.value = engineControl.translation
currentRecording.value = engineControl.recording
currentAPI_KEY.value = engineControl.API_KEY
currentModelPath.value = engineControl.modelPath
currentVoskModelPath.value = engineControl.voskModelPath
currentSosvModelPath.value = engineControl.sosvModelPath
currentGlmUrl.value = engineControl.glmUrl
currentGlmModel.value = engineControl.glmModel
currentGlmApiKey.value = engineControl.glmApiKey
currentRecordingPath.value = engineControl.recordingPath
currentCustomized.value = engineControl.customized
currentCustomizedApp.value = engineControl.customizedApp
currentCustomizedCommand.value = engineControl.customizedCommand
currentStartTimeoutSeconds.value = engineControl.startTimeoutSeconds
}
function selectFolderPath() {
function selectFolderPath(type: 'vosk' | 'sosv' | 'rec') {
window.electron.ipcRenderer.invoke('control.folder.select').then((folderPath) => {
if(!folderPath) return
currentModelPath.value = folderPath
if(type == 'vosk')
currentVoskModelPath.value = folderPath
else if(type == 'sosv')
currentSosvModelPath.value = folderPath
else if(type == 'rec')
currentRecordingPath.value = folderPath
})
}
@@ -200,9 +435,12 @@ watch(changeSignal, (val) => {
watch(currentEngine, (val) => {
if(val == 'vosk'){
currentSourceLang.value = 'auto'
currentTargetLang.value = ''
currentTargetLang.value = useGeneralSettingStore().uiLanguage
if(currentTargetLang.value === 'zh') {
currentTargetLang.value = 'zh-cn'
}
}
else if(val == 'gummy'){
else{
currentSourceLang.value = 'auto'
currentTargetLang.value = useGeneralSettingStore().uiLanguage
}
@@ -218,8 +456,8 @@ watch(currentEngine, (val) => {
}
.info-label {
color: #1677ff;
cursor: pointer;
font-style: italic;
}
.input-folder {
@@ -230,20 +468,12 @@ watch(currentEngine, (val) => {
transition: all 0.25s;
}
.input-folder>span {
padding: 0 2px;
border: 2px solid #1677ff;
color: #1677ff;
border-radius: 30%;
}
.input-folder:hover {
transform: scale(1.1);
}
.customize-note {
padding: 10px 10px 0;
color: red;
max-width: min(40vw, 480px);
}
</style>

View File

@@ -67,11 +67,26 @@
@click="openCaptionWindow"
>{{ $t('status.openCaption') }}</a-button>
<a-button
v-if="!isStarting"
class="control-button"
:loading="pending && !engineEnabled"
:disabled="pending || engineEnabled"
@click="startEngine"
>{{ $t('status.startEngine') }}</a-button>
<a-popconfirm
v-if="isStarting"
:title="$t('status.forceKillConfirm')"
:ok-text="$t('status.confirm')"
:cancel-text="$t('status.cancel')"
@confirm="forceKillEngine"
>
<a-button
danger
class="control-button"
type="primary"
:icon="h(LoadingOutlined)"
>{{ $t('status.forceKillStarting') }}</a-button>
</a-popconfirm>
<a-button
danger class="control-button"
:loading="pending && engineEnabled"
@@ -86,7 +101,7 @@
<p class="about-desc">{{ $t('status.about.desc') }}</p>
<a-divider />
<div class="about-info">
<p><b>{{ $t('status.about.version') }}</b><a-tag color="green">v0.7.0</a-tag></p>
<p><b>{{ $t('status.about.version') }}</b><a-tag color="green">v1.1.1</a-tag></p>
<p>
<b>{{ $t('status.about.author') }}</b>
<a
@@ -128,15 +143,16 @@
<script setup lang="ts">
import { EngineInfo } from '@renderer/types'
import { ref, watch } from 'vue'
import { ref, watch, h } from 'vue'
import { storeToRefs } from 'pinia'
import { useCaptionLogStore } from '@renderer/stores/captionLog'
import { useSoftwareLogStore } from '@renderer/stores/softwareLog'
import { useEngineControlStore } from '@renderer/stores/engineControl'
import { GithubOutlined, InfoCircleOutlined, LoadingOutlined } from '@ant-design/icons-vue'
const showAbout = ref(false)
const pending = ref(false)
const isStarting = ref(false)
const captionLog = useCaptionLogStore()
const { captionData } = storeToRefs(captionLog)
@@ -158,8 +174,17 @@ function openCaptionWindow() {
function startEngine() {
pending.value = true
isStarting.value = true
if(engineControl.engine === 'vosk' && engineControl.voskModelPath.trim() === '') {
engineControl.emptyModelPathErr()
pending.value = false
isStarting.value = false
return
}
if(engineControl.engine === 'sosv' && engineControl.sosvModelPath.trim() === '') {
engineControl.emptyModelPathErr()
pending.value = false
isStarting.value = false
return
}
window.electron.ipcRenderer.send('control.engine.start')
@@ -170,6 +195,12 @@ function stopEngine() {
window.electron.ipcRenderer.send('control.engine.stop')
}
function forceKillEngine() {
pending.value = true
isStarting.value = false
window.electron.ipcRenderer.send('control.engine.forceKill')
}
function getEngineInfo() {
window.electron.ipcRenderer.invoke('control.engine.info').then((data: EngineInfo) => {
pid.value = data.pid
@@ -181,12 +212,16 @@ function getEngineInfo() {
})
}
watch(engineEnabled, (enabled) => {
pending.value = false
if (enabled) {
isStarting.value = false
}
})
watch(errorSignal, () => {
pending.value = false
isStarting.value = false
errorSignal.value = false
})
</script>
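The `pending`/`isStarting` flags above drive three button states: start, force-stop-while-starting, and stop. A minimal sketch of the same transitions, with an illustrative `EngineUI` holder standing in for the Vue refs:

```typescript
// `pending` guards against double clicks; `isStarting` swaps the start
// button for the force-stop variant while the engine boots.
interface EngineUI { pending: boolean; isStarting: boolean }

function onStart(ui: EngineUI): void {
  ui.pending = true
  ui.isStarting = true
}

function onEngineEnabled(ui: EngineUI): void {
  // The engine reported ready: clear both flags.
  ui.pending = false
  ui.isStarting = false
}

function onForceKill(ui: EngineUI): void {
  // Force kill leaves `pending` set until the stop completes.
  ui.pending = true
  ui.isStarting = false
}
```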

View File

@@ -96,6 +96,7 @@ const columns = [
<style scoped>
.log-title {
color: var(--icon-color);
display: inline-block;
font-size: 24px;
font-weight: bold;

View File

@@ -1,78 +1,228 @@
// type: -1 source language only, 0 either, 1 translation language only
export const engines = {
zh: [
{
value: 'gummy',
label: '云端 / 阿里云 / Gummy',
languages: [
{ value: 'auto', type: -1, label: '自动检测' },
{ value: 'en', type: 0, label: '英语' },
{ value: 'zh', type: 0, label: '中文' },
{ value: 'ja', type: 0, label: '日语' },
{ value: 'ko', type: 0, label: '韩语' },
{ value: 'de', type: -1, label: '德语' },
{ value: 'fr', type: -1, label: '法语' },
{ value: 'ru', type: -1, label: '俄语' },
{ value: 'es', type: -1, label: '西班牙语' },
{ value: 'it', type: -1, label: '意大利语' },
{ value: 'yue', type: -1, label: '粤语' },
]
},
{
value: 'vosk',
label: '本地 / Vosk',
languages: [
{ value: 'auto', type: -1, label: '需要自行配置模型' },
{ value: 'en', type: 1, label: '英语' },
{ value: 'zh-cn', type: 1, label: '中文' },
{ value: 'ja', type: 1, label: '日语' },
{ value: 'ko', type: 1, label: '韩语' },
{ value: 'de', type: 1, label: '德语' },
{ value: 'fr', type: 1, label: '法语' },
{ value: 'ru', type: 1, label: '俄语' },
{ value: 'es', type: 1, label: '西班牙语' },
{ value: 'it', type: 1, label: '意大利语' },
],
transModel: [
{ value: 'ollama', label: 'Ollama 模型或 OpenAI 兼容模型' },
{ value: 'google', label: 'Google API 调用' },
]
},
{
value: 'sosv',
label: '本地 / SOSV',
languages: [
{ value: 'auto', type: -1, label: '自动检测' },
{ value: 'en', type: 0, label: '英语' },
{ value: 'zh', type: 0, label: '中文' },
{ value: 'ja', type: 0, label: '日语' },
{ value: 'ko', type: 0, label: '韩语' },
{ value: 'yue', type: -1, label: '粤语' },
{ value: 'de', type: 1, label: '德语' },
{ value: 'fr', type: 1, label: '法语' },
{ value: 'ru', type: 1, label: '俄语' },
{ value: 'es', type: 1, label: '西班牙语' },
{ value: 'it', type: 1, label: '意大利语' },
],
transModel: [
{ value: 'ollama', label: 'Ollama 模型或 OpenAI 兼容模型' },
{ value: 'google', label: 'Google API 调用' },
]
},
{
value: 'glm',
label: '云端 / 智谱AI / GLM-ASR',
languages: [
{ value: 'auto', type: -1, label: '自动检测' },
{ value: 'en', type: 0, label: '英语' },
{ value: 'zh', type: 0, label: '中文' },
{ value: 'ja', type: 0, label: '日语' },
{ value: 'ko', type: 0, label: '韩语' },
],
transModel: [
{ value: 'ollama', label: 'Ollama 模型或 OpenAI 兼容模型' },
{ value: 'google', label: 'Google API 调用' },
]
}
],
en: [
{
value: 'gummy',
label: 'Cloud / Alibaba Cloud / Gummy',
languages: [
{ value: 'auto', type: -1, label: 'Auto Detect' },
{ value: 'en', type: 0, label: 'English' },
{ value: 'zh', type: 0, label: 'Chinese' },
{ value: 'ja', type: 0, label: 'Japanese' },
{ value: 'ko', type: 0, label: 'Korean' },
{ value: 'de', type: -1, label: 'German' },
{ value: 'fr', type: -1, label: 'French' },
{ value: 'ru', type: -1, label: 'Russian' },
{ value: 'es', type: -1, label: 'Spanish' },
{ value: 'it', type: -1, label: 'Italian' },
{ value: 'yue', type: -1, label: 'Cantonese' },
]
},
{
value: 'vosk',
label: 'Local / Vosk',
languages: [
{ value: 'auto', type: -1, label: 'Model needs to be configured manually' },
{ value: 'en', type: 1, label: 'English' },
{ value: 'zh-cn', type: 1, label: 'Chinese' },
{ value: 'ja', type: 1, label: 'Japanese' },
{ value: 'ko', type: 1, label: 'Korean' },
{ value: 'de', type: 1, label: 'German' },
{ value: 'fr', type: 1, label: 'French' },
{ value: 'ru', type: 1, label: 'Russian' },
{ value: 'es', type: 1, label: 'Spanish' },
{ value: 'it', type: 1, label: 'Italian' },
],
transModel: [
{ value: 'ollama', label: 'Ollama Model or OpenAI-compatible Model' },
{ value: 'google', label: 'Google API Call' },
]
},
{
value: 'sosv',
label: 'Local / SOSV',
languages: [
{ value: 'auto', type: -1, label: 'Auto Detect' },
{ value: 'en', type: 0, label: 'English' },
{ value: 'zh-cn', type: 0, label: 'Chinese' },
{ value: 'ja', type: 0, label: 'Japanese' },
{ value: 'ko', type: 0, label: 'Korean' },
{ value: 'yue', type: -1, label: 'Cantonese' },
{ value: 'de', type: 1, label: 'German' },
{ value: 'fr', type: 1, label: 'French' },
{ value: 'ru', type: 1, label: 'Russian' },
{ value: 'es', type: 1, label: 'Spanish' },
{ value: 'it', type: 1, label: 'Italian' },
],
transModel: [
{ value: 'ollama', label: 'Ollama Model or OpenAI-compatible Model' },
{ value: 'google', label: 'Google API Call' },
]
},
{
value: 'glm',
label: 'Cloud / Zhipu AI / GLM-ASR',
languages: [
{ value: 'auto', type: -1, label: 'Auto Detect' },
{ value: 'en', type: 0, label: 'English' },
{ value: 'zh', type: 0, label: 'Chinese' },
{ value: 'ja', type: 0, label: 'Japanese' },
{ value: 'ko', type: 0, label: 'Korean' },
],
transModel: [
{ value: 'ollama', label: 'Ollama Model or OpenAI-compatible Model' },
{ value: 'google', label: 'Google API Call' },
]
}
],
ja: [
{
value: 'gummy',
label: 'クラウド / アリババクラウド / Gummy',
languages: [
{ value: 'auto', type: -1, label: '自動検出' },
{ value: 'en', type: 0, label: '英語' },
{ value: 'zh', type: 0, label: '中国語' },
{ value: 'ja', type: 0, label: '日本語' },
{ value: 'ko', type: 0, label: '韓国語' },
{ value: 'de', type: -1, label: 'ドイツ語' },
{ value: 'fr', type: -1, label: 'フランス語' },
{ value: 'ru', type: -1, label: 'ロシア語' },
{ value: 'es', type: -1, label: 'スペイン語' },
{ value: 'it', type: -1, label: 'イタリア語' },
{ value: 'yue', type: -1, label: '広東語' },
]
},
{
value: 'vosk',
label: 'ローカル / Vosk',
languages: [
{ value: 'auto', type: -1, label: 'モデルを手動で設定する必要があります' },
{ value: 'en', type: 1, label: '英語' },
{ value: 'zh-cn', type: 1, label: '中国語' },
{ value: 'ja', type: 1, label: '日本語' },
{ value: 'ko', type: 1, label: '韓国語' },
{ value: 'de', type: 1, label: 'ドイツ語' },
{ value: 'fr', type: 1, label: 'フランス語' },
{ value: 'ru', type: 1, label: 'ロシア語' },
{ value: 'es', type: 1, label: 'スペイン語' },
{ value: 'it', type: 1, label: 'イタリア語' },
],
transModel: [
{ value: 'ollama', label: 'Ollama モデルまたは OpenAI 互換モデル' },
{ value: 'google', label: 'Google API 呼び出し' },
]
},
{
value: 'sosv',
label: 'ローカル / SOSV',
languages: [
{ value: 'auto', type: -1, label: '自動検出' },
{ value: 'en', type: 0, label: '英語' },
{ value: 'zh-cn', type: 0, label: '中国語' },
{ value: 'ja', type: 0, label: '日本語' },
{ value: 'ko', type: 0, label: '韓国語' },
{ value: 'yue', type: -1, label: '広東語' },
{ value: 'de', type: 1, label: 'ドイツ語' },
{ value: 'fr', type: 1, label: 'フランス語' },
{ value: 'ru', type: 1, label: 'ロシア語' },
{ value: 'es', type: 1, label: 'スペイン語' },
{ value: 'it', type: 1, label: 'イタリア語' },
],
transModel: [
{ value: 'ollama', label: 'Ollama モデルまたは OpenAI 互換モデル' },
{ value: 'google', label: 'Google API 呼び出し' },
]
},
{
value: 'glm',
label: 'クラウド / 智譜AI / GLM-ASR',
languages: [
{ value: 'auto', type: -1, label: '自動検出' },
{ value: 'en', type: 0, label: '英語' },
{ value: 'zh', type: 0, label: '中国語' },
{ value: 'ja', type: 0, label: '日本語' },
{ value: 'ko', type: 0, label: '韓国語' },
],
transModel: [
{ value: 'ollama', label: 'Ollama モデルまたは OpenAI 互換モデル' },
{ value: 'google', label: 'Google API 呼び出し' },
]
}
]
}
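Per the comment at the top of this file, each language entry's `type` gates where it may appear: `-1` source-only, `0` either list, `1` translation-target-only. A sketch of the filtering a settings UI could apply (the helper names are illustrative, not from this repo):

```typescript
// -1 = source only, 0 = both, 1 = translation target only.
interface LangOption { value: string; type: -1 | 0 | 1; label: string }

// Source dropdown shows types -1 and 0.
const sourceOptions = (langs: LangOption[]) =>
  langs.filter(l => l.type <= 0)

// Translation dropdown shows types 0 and 1.
const targetOptions = (langs: LangOption[]) =>
  langs.filter(l => l.type >= 0)
```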

View File

@@ -18,16 +18,19 @@ export default {
"args": ", command arguments: ",
"pidInfo": ", caption engine process PID: ",
"empty": "Model Path is Empty",
"emptyInfo": "The Vosk model path is empty. Please set the Vosk model path in the additional settings of the subtitle engine settings.",
"emptyInfo": "The model path for the selected local model is empty. Please download the corresponding model first, and then configure the model path in [Caption Engine Settings > More Settings].",
"stopped": "Caption Engine Stopped",
"stoppedInfo": "The caption engine has stopped. You can click the 'Start Caption Engine' button to restart it.",
"error": "An error occurred",
"engineError": "The subtitle engine encountered an error and requested a forced exit.",
"engineError": "The caption engine encountered an error and requested a forced exit.",
"socketError": "The Socket connection between the main program and the caption engine failed",
"engineChange": "Cpation Engine Configuration Changed",
"changeInfo": "If the caption engine is already running, you need to restart it for the changes to take effect.",
"styleChange": "Caption Style Changed",
"styleInfo": "Caption style changes have been saved and applied."
"styleInfo": "Caption style changes have been saved and applied.",
"engineStartTimeout": "Caption engine startup timeout, automatically force stopped",
"ollamaNameNull": "'Ollama' Field is Empty",
"ollamaNameNullNote": "When selecting Ollama model as the translation model, the 'Ollama' field cannot be empty and must be filled with the name of a locally configured Ollama model."
},
general: {
"title": "General Settings",
@@ -46,17 +49,33 @@ export default {
"cancelChange": "Cancel Changes",
"sourceLang": "Source",
"transLang": "Translation",
"transModel": "Model",
"modelName": "Model Name",
"modelNameNote": "Please enter the translation model name you wish to use, which can be either a local Ollama model or an OpenAI API compatible cloud model. If the Base URL field is left blank, the local Ollama service will be called by default; otherwise, the API service at the specified address will be called via the Python OpenAI library.",
"baseURL": "The base request URL for calling OpenAI API. If left empty, the local default port Ollama model will be used.",
"apiKey": "The API KEY required for the model corresponding to OpenAI API.",
"captionEngine": "Engine",
"audioType": "Audio Type",
"systemOutput": "System Audio Output (Speaker)",
"systemInput": "System Audio Input (Microphone)",
"enableTranslation": "Translation",
"enableRecording": "Enable Recording",
"showMore": "More Settings",
"apikey": "API KEY",
"modelPath": "Model Path",
"apikeyInfo": "API KEY required for the Gummy subtitle engine, which needs to be obtained from the Alibaba Cloud Bailing platform. For more details, see the project user manual.",
"modelPathInfo": "The folder path of the model required by the Vosk subtitle engine. You need to download the required model to your local machine in advance. For more details, see the project user manual.",
"customEngine": "Custom Engine",
"voskModelPath": "Vosk Path",
"sosvModelPath": "SOSV Path",
"recordingPath": "Save Path",
"startTimeout": "Timeout",
"seconds": "seconds",
"apikeyInfo": "API KEY required for the Gummy caption engine, which needs to be obtained from the Alibaba Cloud Bailing platform. For more details, see the project user manual.",
"glmApikeyInfo": "API KEY required for GLM caption engine, which needs to be obtained from the Zhipu AI platform.",
"voskModelPathInfo": "The folder path of the model required by the Vosk caption engine. You need to download the required model to your local machine in advance. For more details, see the project user manual.",
"sosvModelPathInfo": "The folder path of the model required by the SOSV caption engine. You need to download the required model to your local machine in advance. For more details, see the project user manual.",
"recordingPathInfo": "The path to save recording files, requiring a folder path. The software will automatically name the recording file and save it as .wav file.",
"modelDownload": "Model Download Link",
"startTimeoutInfo": "Caption engine startup timeout duration. Engine will be forcefully stopped if startup exceeds this time. Recommended range: 10-120 seconds.",
"customEngine": "Custom Eng",
custom: {
"title": "Custom Caption Engine",
"attention": "Attention",
@@ -69,6 +88,7 @@ export default {
"title": "Caption Style Settings",
"applyStyle": "Apply",
"cancelChange": "Cancel",
"lineNumber": "CaptionLines",
"resetStyle": "Reset",
"longCaption": "LongCaption",
"fontFamily": "Font Family",
@@ -112,6 +132,11 @@ export default {
"startEngine": "Start Caption Engine",
"restartEngine": "Restart Caption Engine",
"stopEngine": "Stop Caption Engine",
"forceKill": "Force Stop",
"forceKillStarting": "Starting Engine... (Force Stop)",
"forceKillConfirm": "Are you sure you want to force stop the caption engine? This will terminate the process immediately.",
"confirm": "Confirm",
"cancel": "Cancel",
about: {
"title": "About This Project",
"proj": "Auto Caption Project",
@@ -121,7 +146,7 @@ export default {
"projLink": "Project Link",
"manual": "User Manual",
"engineDoc": "Caption Engine Manual",
"date": "August 20, 2025"
"date": "January 31, 2026"
}
},
log: {

View File

@@ -18,7 +18,7 @@ export default {
"args": "、コマンド引数:",
"pidInfo": "、字幕エンジンプロセス PID",
"empty": "モデルパスが空です",
"emptyInfo": "Vosk モデルのパスが空です。字幕エンジン設定の追加設定で Vosk モデルのパスを設定してください。",
"emptyInfo": "選択されたローカルモデルに対応するモデルパスが空です。まず対応するモデルをダウンロードし、次に【字幕エンジン設定 > 詳細設定】でモデルのパスを設定してください。",
"stopped": "字幕エンジンが停止しました",
"stoppedInfo": "字幕エンジンが停止しました。再起動するには「字幕エンジンを開始」ボタンをクリックしてください。",
"error": "エラーが発生しました",
@@ -27,7 +27,10 @@ export default {
"engineChange": "字幕エンジンの設定が変更されました",
"changeInfo": "字幕エンジンがすでに起動している場合、変更を有効にするには再起動が必要です。",
"styleChange": "字幕のスタイルが変更されました",
"styleInfo": "字幕のスタイル変更が保存され、適用されました"
"styleInfo": "字幕のスタイル変更が保存され、適用されました",
"engineStartTimeout": "字幕エンジンの起動がタイムアウトしました。自動的に強制停止しました",
"ollamaNameNull": "Ollama フィールドが空です",
"ollamaNameNullNote": "Ollama モデルを翻訳モデルとして選択する場合、Ollama フィールドは空にできません。ローカルで設定された Ollama モデルの名前を入力してください。"
},
general: {
"title": "一般設定",
@@ -46,17 +49,32 @@ export default {
"cancelChange": "変更をキャンセル",
"sourceLang": "ソース言語",
"transLang": "翻訳言語",
"transModel": "翻訳モデル",
"modelName": "モデル名",
"modelNameNote": "使用する翻訳モデル名を入力してください。Ollama のローカルモデルでも OpenAI API 互換のクラウドモデルでも可能です。Base URL フィールドが未入力の場合、デフォルトでローカルの Ollama サービスが呼び出され、それ以外の場合は Python OpenAI ライブラリ経由で指定されたアドレスの API サービスが呼び出されます。",
"baseURL": "OpenAI API を呼び出すための基本リクエスト URL です。未記入の場合、ローカルのデフォルトポートの Ollama モデルが呼び出されます。",
"apiKey": "OpenAI API に対応するモデルを使用するために必要な API キーです。",
"captionEngine": "エンジン",
"audioType": "オーディオ",
"systemOutput": "システムオーディオ出力(スピーカー)",
"systemInput": "システムオーディオ入力(マイク)",
"enableTranslation": "翻訳",
"enableRecording": "録音",
"showMore": "詳細設定",
"apikey": "API KEY",
"modelPath": "モデルパス",
"voskModelPath": "Voskパス",
"sosvModelPath": "SOSVパス",
"recordingPath": "保存パス",
"startTimeout": "時間制限",
"seconds": "秒",
"apikeyInfo": "Gummy 字幕エンジンに必要な API KEY は、アリババクラウド百煉プラットフォームから取得する必要があります。詳細情報はプロジェクトのユーザーマニュアルをご覧ください。",
"modelPathInfo": "Vosk 字幕エンジンに必要なモデルのフォルダパスです。必要なモデルを事前にローカルマシンにダウンロードする必要があります。詳細情報はプロジェクトのユーザーマニュアルをご覧ください。",
"customEngine": "カスタムエンジン",
"glmApikeyInfo": "GLM 字幕エンジンに必要な API KEY で、智譜 AI プラットフォームから取得する必要があります。",
"voskModelPathInfo": "Vosk 字幕エンジンに必要なモデルのフォルダパスです。必要なモデルを事前にローカルマシンにダウンロードする必要があります。詳細情報はプロジェクトのユーザーマニュアルをご覧ください。",
"sosvModelPathInfo": "SOSV 字幕エンジンに必要なモデルのフォルダパスです。必要なモデルを事前にローカルマシンにダウンロードする必要があります。詳細情報はプロジェクトのユーザーマニュアルをご覧ください。",
"recordingPathInfo": "録音ファイルの保存パスで、フォルダパスを指定する必要があります。ソフトウェアが自動的に録音ファイルに名前を付けて .wav ファイルとして保存します。",
"modelDownload": "モデルダウンロードリンク",
"startTimeoutInfo": "字幕エンジンの起動タイムアウト時間です。この時間を超えると自動的に強制停止されます。10-120秒の範囲で設定することを推奨します。",
"customEngine": "カスタム",
custom: {
"title": "カスタムキャプションエンジン",
"attention": "注意事項",
@@ -70,6 +88,7 @@ export default {
"applyStyle": "適用",
"cancelChange": "キャンセル",
"resetStyle": "リセット",
"lineNumber": "字幕行数",
"longCaption": "長い字幕",
"fontFamily": "フォント",
"fontColor": "カラー",
@@ -101,7 +120,7 @@ export default {
"cpu": "CPU 使用率",
"mem": "メモリ使用量",
"elapsed": "稼働時間",
"customized": "カスタマイズ済み",
"customized": "カスタ",
"status": "エンジン状態",
"started": "開始済み",
"stopped": "未開始",
@@ -112,6 +131,11 @@ export default {
"startEngine": "字幕エンジンを開始",
"restartEngine": "字幕エンジンを再起動",
"stopEngine": "字幕エンジンを停止",
"forceKill": "強制停止",
"forceKillStarting": "エンジン起動中... (強制停止)",
"forceKillConfirm": "字幕エンジンを強制停止しますか?プロセスが直ちに終了されます。",
"confirm": "確認",
"cancel": "キャンセル",
about: {
"title": "このプロジェクトについて",
"proj": "Auto Caption プロジェクト",
@@ -121,7 +145,7 @@ export default {
"projLink": "プロジェクトリンク",
"manual": "ユーザーマニュアル",
"engineDoc": "字幕エンジンマニュアル",
"date": "2025820 日"
"date": "2026131 日"
}
},
log: {

View File

@@ -18,7 +18,7 @@ export default {
"args": ",命令参数:",
"pidInfo": ",字幕引擎进程 PID",
"empty": "模型路径为空",
"emptyInfo": "Vosk 模型模型路径为空,请在字幕引擎设置更多设置中设置 Vosk 模型的路径。",
"emptyInfo": "选择的本地模型对应的模型路径为空。请先下载对应模型,然后在【字幕引擎设置 > 更多设置】中配置对应模型的路径。",
"stopped": "字幕引擎停止",
"stoppedInfo": "字幕引擎已经停止,可点击“启动字幕引擎”按钮重新启动",
"error": "发生错误",
@@ -27,7 +27,10 @@ export default {
"engineChange": "字幕引擎配置已更改",
"changeInfo": "如果字幕引擎已经启动,需要重启字幕引擎修改才会生效",
"styleChange": "字幕样式已修改",
"styleInfo": "字幕样式修改已经保存并生效"
"styleInfo": "字幕样式修改已经保存并生效",
"engineStartTimeout": "字幕引擎启动超时,已自动强制停止",
"ollamaNameNull": "Ollama 字段为空",
"ollamaNameNullNote": "选择 Ollama 模型作为翻译模型时Ollama 字段不能为空,需要填写本地已经配置好的 Ollama 模型的名称。"
},
general: {
"title": "通用设置",
@@ -46,16 +49,31 @@ export default {
"cancelChange": "取消更改",
"sourceLang": "源语言",
"transLang": "翻译语言",
"transModel": "翻译模型",
"modelName": "模型名称",
"modelNameNote": "请输入要使用的翻译模型名称,可以是 Ollama 本地模型,也可以是 OpenAI API 兼容的云端模型。若未填写 Base URL 字段,则默认调用本地 Ollama 服务,否则会通过 Python OpenAI 库调用该地址指向的 API 服务。",
"baseURL": "调用 OpenAI API 的基础请求地址,如果不填写则调用本地默认端口的 Ollama 模型。",
"apiKey": "调用 OpenAI API 对应的模型需要使用的 API KEY。",
"captionEngine": "字幕引擎",
"audioType": "音频类型",
"systemOutput": "系统音频输出(扬声器)",
"systemInput": "系统音频输入(麦克风)",
"enableTranslation": "启用翻译",
"enableRecording": "启用录制",
"showMore": "更多设置",
"apikey": "API KEY",
"modelPath": "模型路径",
"voskModelPath": "Vosk路径",
"sosvModelPath": "SOSV路径",
"recordingPath": "保存路径",
"startTimeout": "启动超时",
"seconds": "秒",
"apikeyInfo": "Gummy 字幕引擎需要的 API KEY需要在阿里云百炼平台获取。详细信息见项目用户手册。",
"modelPathInfo": "Vosk 字幕引擎需要的模型的文件夹路径,需要提前下载需要的模型到本地。信息详情见项目用户手册。",
"glmApikeyInfo": "GLM 字幕引擎需要的 API KEY需要在智谱 AI 平台获取。",
"voskModelPathInfo": "Vosk 字幕引擎需要的模型的文件夹路径,需要提前下载需要的模型到本地。信息详情见项目用户手册。",
"sosvModelPathInfo": "SOSV 字幕引擎需要的模型的文件夹路径,需要提前下载需要的模型到本地。信息详情见项目用户手册。",
"recordingPathInfo": "录音文件保存路径,需要提供文件夹路径。软件会自动命名录音文件并保存为 .wav 文件。",
"modelDownload": "模型下载地址",
"startTimeoutInfo": "字幕引擎启动超时时间,超过此时间将自动强制停止。建议设置为 10-120 秒之间。",
"customEngine": "自定义引擎",
custom: {
"title": "自定义字幕引擎",
@@ -70,6 +88,7 @@ export default {
"applyStyle": "应用样式",
"cancelChange": "取消更改",
"resetStyle": "恢复默认",
"lineNumber": "字幕行数",
"longCaption": "长字幕",
"fontFamily": "字体族",
"fontColor": "字体颜色",
@@ -112,6 +131,11 @@ export default {
"startEngine": "启动字幕引擎",
"restartEngine": "重启字幕引擎",
"stopEngine": "关闭字幕引擎",
"forceKill": "强行停止",
"forceKillStarting": "正在启动引擎... (强行停止)",
"forceKillConfirm": "确定要强行停止字幕引擎吗?这将立即终止进程。",
"confirm": "确定",
"cancel": "取消",
about: {
"title": "关于本项目",
"proj": "Auto Caption 项目",
@@ -121,7 +145,7 @@ export default {
"projLink": "项目链接",
"manual": "用户手册",
"engineDoc": "字幕引擎手册",
"date": "2025820 日"
"date": "2026131 日"
}
},
log: {

View File

@@ -15,7 +15,12 @@ export const useCaptionLogStore = defineStore('captionLog', () => {
})
window.electron.ipcRenderer.on('both.captionLog.upd', (_, log) => {
for(let i = captionData.value.length - 1; i >= 0; i--) {
if(captionData.value[i].time_s === log.time_s){
captionData.value.splice(i, 1, log)
break
}
}
})
window.electron.ipcRenderer.on('both.captionLog.set', (_, logs) => {

View File

@@ -4,6 +4,7 @@ import { Styles } from '@renderer/types'
import { breakOptions } from '@renderer/i18n'
export const useCaptionStyleStore = defineStore('captionStyle', () => {
const lineNumber = ref<number>(1)
const lineBreak = ref<number>(1)
const fontFamily = ref<string>('sans-serif')
const fontSize = ref<number>(24)
@@ -38,6 +39,7 @@ export const useCaptionStyleStore = defineStore('captionStyle', () => {
function sendStylesChange() {
const styles: Styles = {
lineNumber: lineNumber.value,
lineBreak: lineBreak.value,
fontFamily: fontFamily.value,
fontSize: fontSize.value,
@@ -65,6 +67,7 @@ export const useCaptionStyleStore = defineStore('captionStyle', () => {
}
function setStyles(args: Styles){
lineNumber.value = args.lineNumber
lineBreak.value = args.lineBreak
fontFamily.value = args.fontFamily
fontSize.value = args.fontSize
@@ -91,6 +94,7 @@ export const useCaptionStyleStore = defineStore('captionStyle', () => {
})
return {
lineNumber, // number of caption lines displayed
lineBreak, // line-break mode
fontFamily, // font family
fontSize, // font size

View File

@@ -19,14 +19,25 @@ export const useEngineControlStore = defineStore('engineControl', () => {
const engineEnabled = ref(false)
const sourceLang = ref<string>('en')
const targetLang = ref<string>('zh')
const transModel = ref<string>('ollama')
const ollamaName = ref<string>('')
const ollamaUrl = ref<string>('')
const ollamaApiKey = ref<string>('')
const engine = ref<string>('gummy')
const audio = ref<0 | 1>(0)
const translation = ref<boolean>(true)
const recording = ref<boolean>(false)
const API_KEY = ref<string>('')
const voskModelPath = ref<string>('')
const sosvModelPath = ref<string>('')
const glmUrl = ref<string>('https://open.bigmodel.cn/api/paas/v4/audio/transcriptions')
const glmModel = ref<string>('glm-asr-2512')
const glmApiKey = ref<string>('')
const recordingPath = ref<string>('')
const customized = ref<boolean>(false)
const customizedApp = ref<string>('')
const customizedCommand = ref<string>('')
const startTimeoutSeconds = ref<number>(30)
const changeSignal = ref<boolean>(false)
const errorSignal = ref<boolean>(false)
@@ -36,14 +47,25 @@ export const useEngineControlStore = defineStore('engineControl', () => {
engineEnabled: engineEnabled.value,
sourceLang: sourceLang.value,
targetLang: targetLang.value,
transModel: transModel.value,
ollamaName: ollamaName.value,
ollamaUrl: ollamaUrl.value,
ollamaApiKey: ollamaApiKey.value,
engine: engine.value,
audio: audio.value,
translation: translation.value,
recording: recording.value,
API_KEY: API_KEY.value,
voskModelPath: voskModelPath.value,
sosvModelPath: sosvModelPath.value,
glmUrl: glmUrl.value,
glmModel: glmModel.value,
glmApiKey: glmApiKey.value,
recordingPath: recordingPath.value,
customized: customized.value,
customizedApp: customizedApp.value,
customizedCommand: customizedCommand.value,
startTimeoutSeconds: startTimeoutSeconds.value
}
window.electron.ipcRenderer.send('control.controls.change', controls)
}
@@ -66,23 +88,35 @@ export const useEngineControlStore = defineStore('engineControl', () => {
}
sourceLang.value = controls.sourceLang
targetLang.value = controls.targetLang
transModel.value = controls.transModel
ollamaName.value = controls.ollamaName
ollamaUrl.value = controls.ollamaUrl
ollamaApiKey.value = controls.ollamaApiKey
engine.value = controls.engine
audio.value = controls.audio
engineEnabled.value = controls.engineEnabled
translation.value = controls.translation
recording.value = controls.recording
API_KEY.value = controls.API_KEY
voskModelPath.value = controls.voskModelPath
sosvModelPath.value = controls.sosvModelPath
glmUrl.value = controls.glmUrl || 'https://open.bigmodel.cn/api/paas/v4/audio/transcriptions'
glmModel.value = controls.glmModel || 'glm-asr-2512'
glmApiKey.value = controls.glmApiKey
recordingPath.value = controls.recordingPath
customized.value = controls.customized
customizedApp.value = controls.customizedApp
customizedCommand.value = controls.customizedCommand
startTimeoutSeconds.value = controls.startTimeoutSeconds
changeSignal.value = true
}
function emptyModelPathErr() {
notification.open({
placement: 'topLeft',
message: t('noti.empty'),
description: t('noti.emptyInfo'),
duration: null,
icon: () => h(ExclamationCircleOutlined, { style: 'color: #ff4d4f' })
});
}
@@ -129,14 +163,25 @@ export const useEngineControlStore = defineStore('engineControl', () => {
engineEnabled, // whether the caption engine is enabled
sourceLang, // source language
targetLang, // target language
transModel, // translation model
ollamaName, // Ollama model
ollamaUrl,
ollamaApiKey,
engine, // caption engine
audio, // selected audio
translation, // whether translation is enabled
recording, // whether recording is enabled
API_KEY, // API KEY
voskModelPath, // Vosk model path
sosvModelPath, // SOSV model path
glmUrl, // GLM API URL
glmModel, // GLM model name
glmApiKey, // GLM API Key
recordingPath, // recording save path
customized, // whether a custom caption engine is used
customizedApp, // custom caption engine application
customizedCommand, // custom caption engine command
startTimeoutSeconds, // startup timeout (seconds)
setControls, // set engine configuration
sendControlsChange, // send the latest control message to the backend
emptyModelPathErr, // show a warning when the model path is empty
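When `setControls` loads a persisted config written before the GLM engine existed, `glmUrl` and `glmModel` arrive undefined, so the store backfills defaults with `||`. That migration pattern in isolation (the `PersistedControls` type is illustrative):

```typescript
// Older saved configs predate the GLM fields; fill them with defaults.
interface PersistedControls { glmUrl?: string; glmModel?: string }

const GLM_URL_DEFAULT =
  'https://open.bigmodel.cn/api/paas/v4/audio/transcriptions'
const GLM_MODEL_DEFAULT = 'glm-asr-2512'

function withGlmDefaults(c: PersistedControls): Required<PersistedControls> {
  return {
    glmUrl: c.glmUrl || GLM_URL_DEFAULT,
    glmModel: c.glmModel || GLM_MODEL_DEFAULT,
  }
}
```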

View File

@@ -6,17 +6,29 @@ export interface Controls {
engineEnabled: boolean,
sourceLang: string,
targetLang: string,
transModel: string,
ollamaName: string,
ollamaUrl: string,
ollamaApiKey: string,
engine: string,
audio: 0 | 1,
translation: boolean,
recording: boolean,
API_KEY: string,
voskModelPath: string,
sosvModelPath: string,
glmUrl: string,
glmModel: string,
glmApiKey: string,
recordingPath: string,
customized: boolean,
customizedApp: string,
customizedCommand: string,
startTimeoutSeconds: number
}
export interface Styles {
lineNumber: number,
lineBreak: number,
fontFamily: string,
fontSize: number,

View File

@@ -12,29 +12,60 @@
textShadow: captionStyle.textShadow ? `${captionStyle.offsetX}px ${captionStyle.offsetY}px ${captionStyle.blur}px ${captionStyle.textShadowColor}` : 'none'
}"
>
<p :class="[captionStyle.lineBreak?'':'left-ellipsis']" :style="{
fontFamily: captionStyle.fontFamily,
fontSize: captionStyle.fontSize + 'px',
color: captionStyle.fontColor,
fontWeight: captionStyle.fontWeight * 100
}">
<span v-if="captionData.length">{{ captionData[captionData.length-1].text }}</span>
<span v-else>{{ $t('example.original') }}</span>
</p>
<p :class="[captionStyle.lineBreak?'':'left-ellipsis']"
v-if="captionStyle.transDisplay"
:style="{
fontFamily: captionStyle.transFontFamily,
fontSize: captionStyle.transFontSize + 'px',
color: captionStyle.transFontColor,
fontWeight: captionStyle.transFontWeight * 100
}">
<span v-if="captionData.length">{{ captionData[captionData.length-1].translation }}</span>
<span v-else>{{ $t('example.translation') }}</span>
</p>
<template v-if="captionData.length">
<template
v-for="val in revArr[Math.min(captionStyle.lineNumber, captionData.length)]"
:key="captionData[captionData.length - val].time_s"
>
<p :class="[captionStyle.lineBreak?'':'left-ellipsis']" :style="{
fontFamily: captionStyle.fontFamily,
fontSize: captionStyle.fontSize + 'px',
color: captionStyle.fontColor,
fontWeight: captionStyle.fontWeight * 100
}">
<span>{{ captionData[captionData.length - val].text }}</span>
</p>
<p :class="[captionStyle.lineBreak?'':'left-ellipsis']"
v-if="captionStyle.transDisplay && captionData[captionData.length - val].translation"
:style="{
fontFamily: captionStyle.transFontFamily,
fontSize: captionStyle.transFontSize + 'px',
color: captionStyle.transFontColor,
fontWeight: captionStyle.transFontWeight * 100
}">
<span>{{ captionData[captionData.length - val].translation }}</span>
</p>
</template>
</template>
<template v-else>
<template v-for="val in captionStyle.lineNumber" :key="val">
<p :class="[captionStyle.lineBreak?'':'left-ellipsis']" :style="{
fontFamily: captionStyle.fontFamily,
fontSize: captionStyle.fontSize + 'px',
color: captionStyle.fontColor,
fontWeight: captionStyle.fontWeight * 100
}">
<span>{{ $t('example.original') }}</span>
</p>
<p :class="[captionStyle.lineBreak?'':'left-ellipsis']"
v-if="captionStyle.transDisplay"
:style="{
fontFamily: captionStyle.transFontFamily,
fontSize: captionStyle.transFontSize + 'px',
color: captionStyle.transFontColor,
fontWeight: captionStyle.transFontWeight * 100
}">
<span>{{ $t('example.translation') }}</span>
</p>
</template>
</template>
</div>
<div class="title-bar" :style="{color: captionStyle.fontColor}">
<div class="title-bar"
:style="{color: captionStyle.fontColor}"
@mouseenter="onTitleBarEnter()"
@mouseleave="onTitleBarLeave()"
>
<div class="option-item" @click="closeCaptionWindow">
<CloseOutlined />
</div>
@@ -56,12 +87,20 @@ import { ref, onMounted } from 'vue';
import { useCaptionStyleStore } from '@renderer/stores/captionStyle';
import { useCaptionLogStore } from '@renderer/stores/captionLog';
import { storeToRefs } from 'pinia';
const revArr = {
1: [1],
2: [2, 1],
3: [3, 2, 1],
4: [4, 3, 2, 1],
}
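`revArr[n]` holds end-offsets in chronological order (`[3, 2, 1]` → third-from-last first), so the template renders the last `n` captions oldest-first. The same selection can be expressed directly as a slice:

```typescript
// Return the last `lines` entries in chronological (oldest-first) order,
// clamped to the data length — what revArr indexing achieves in the template.
function lastNChronological<T>(data: T[], lines: number): T[] {
  const n = Math.min(lines, data.length)
  return data.slice(data.length - n)
}
```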
const captionStyle = useCaptionStyleStore();
const captionLog = useCaptionLogStore();
const { captionData } = storeToRefs(captionLog);
const caption = ref();
const windowHeight = ref(100);
const pinned = ref(false);
onMounted(() => {
const resizeObserver = new ResizeObserver(entries => {
@@ -79,7 +118,7 @@ onMounted(() => {
function pinCaptionWindow() {
pinned.value = !pinned.value;
window.electron.ipcRenderer.send('caption.mouseEvents.ignore', pinned.value)
}
function openControlWindow() {
@@ -89,6 +128,18 @@ function openControlWindow() {
function closeCaptionWindow() {
window.electron.ipcRenderer.send('caption.window.close')
}
function onTitleBarEnter() {
if(pinned.value) {
window.electron.ipcRenderer.send('caption.mouseEvents.ignore', false)
}
}
function onTitleBarLeave() {
if(pinned.value) {
window.electron.ipcRenderer.send('caption.mouseEvents.ignore', true)
}
}
</script>
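With the pin button now toggling click-through instead of always-on-top, hovering the title bar temporarily re-enables mouse events so the controls stay usable. The decision reduces to a small predicate; the `caption.mouseEvents.ignore` IPC message is expected to reach a main-process handler that calls Electron's `BrowserWindow.setIgnoreMouseEvents` (that handler is not shown in this diff):

```typescript
// Ignore mouse events only when the window is pinned (click-through mode)
// and the pointer is NOT over the title bar.
function shouldIgnoreMouse(pinned: boolean, overTitleBar: boolean): boolean {
  return pinned && !overTitleBar
}
```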
<style scoped>