36 Commits

Author SHA1 Message Date
himeditator
383e582a2d docs(readme): 更新说明并添加终端使用指南 2025-11-02 20:53:56 +08:00
himeditator
e6a65f8362 feat(engine): 字幕引擎添加在终端直接显示字幕的功能 2025-11-01 21:06:28 +08:00
himeditator
77726753bb fix(renderer): 修复暗色主题下部分字体不可见的问题 2025-09-15 22:17:47 +08:00
himeditator
4b47e50d9e release v1.0.0 2025-09-08 15:19:10 +08:00
himeditator
4494b2c68b feat(renderer):实现多行字幕显示功能
- 在 CaptionStyle 组件中添加字幕行数设置选项
- 修改组件以支持多行字幕显示
- 优化字幕数据处理逻辑,支持按时间顺序显示多条字幕
2025-09-07 21:06:11 +08:00
himeditator
4abd6d0808 feat(renderer): 在用户界面中添加新功能的设置
- 添加录音功能,可保存为 WAV 文件
- 优化字幕引擎设置界面,支持更多配置选项
- 更新多语言翻译,增加模型下载链接等信息
2025-09-07 14:35:18 +08:00
himeditator
6bff978b88 feat(engine): 替换重采样模型、SOSV 添加标点恢复模型
- 将 samplerate 库替换为 resampy 库,提高重采样质量
- Shepra-ONNX SenseVoice 添加中文和英语标点恢复模型
2025-09-06 23:15:33 +08:00
himeditator
eba2c5ca45 feat(engine): 重构字幕引擎,新增 Sherpa-ONNX SenseVoice 语音识别模型
- 重构字幕引擎,将音频采集改为在新线程上进行
- 重构 audio2text 中的类,调整运行逻辑
- 更新 main 函数,添加对 Sosv 模型的支持
- 修改 AudioStream 类,默认使用 16000Hz 采样率
2025-09-06 20:49:46 +08:00
himeditator
2b7ce06f04 feat(translation): 添加非实时翻译功能用户界面组件 2025-09-04 23:41:22 +08:00
himeditator
14987cbfc5 feat(vosk): 为 Vosk 模型添加非实时翻译功能 (#14)
- 添加 Ollama 大模型翻译和 Google 翻译(非实时),支持多种语言
- 为 Vosk 引擎添加非实时翻译
- 为新增的翻译功能添加和修改接口
- 修改 Electron 构建配置,之后不同平台构建无需修改构建文件
2025-09-02 23:19:53 +08:00
himeditator
56fdc348f8 fix(engine): 解决在引擎状态不为 running 时强制关闭字幕引擎失败的问题
- 合并了 CaptionEngine 类中的 kill 和 forceKill 方法,删除了状态警告中的提前  return
- 更新了 README 文件中的macOS兼容性说明,添加了配置链接
2025-08-30 20:57:26 +08:00
Chen Janai
f42458124e Merge pull request #17 from xuemian168/main
feat(engine): 添加启动超时功能和强制终止引擎的支持
2025-08-28 12:25:33 +08:00
himeditator
2352bcee5d feat(engine): 优化超时启动功能的小问题
- 更新接口文档
- 修改国际化文本使得内容不超过标签长度
- 解决强制关闭按钮点击无效的问题
2025-08-28 12:22:19 +08:00
xuemian
051a497f3a feat(engine): 添加启动超时功能和强制终止引擎的支持
- 在 ControlWindow 中添加了 'control.engine.forceKill' 事件处理,允许强制终止引擎。
- 在 CaptionEngine 中实现了启动超时机制,若引擎启动超时,将自动强制停止并发送错误消息。
- 更新了国际化文件,添加了与启动超时相关的提示信息。
- 在 EngineControl 组件中添加了启动超时的输入选项,允许用户设置超时时间。
- 更新了相关类型定义以支持新的启动超时配置。
2025-08-28 10:24:08 +10:00
himeditator
34362fea3d feat(auto-caption): 发布 v0.7.0 版本 2025-08-20 00:53:06 +08:00
himeditator
771f7ad002 feat(log): 添加软件日志功能
- 新增 SoftwareLog 相关接口和数据结构
- 实现日志数据的收集和展示
- 添加日志相关的国际化支持
- 优化控制页面布局,支持日志切换显示
2025-08-19 22:23:54 +08:00
himeditator
01936d5f12 feat(renderer): 添加界面主题颜色功能,添加复制最新字幕选项(#13)
- 新增界面主题颜色功能,支持自定义主题颜色
- 使用 antd 滑块替代原生 input 元素
- 添加复制字幕记录功能,可选择复制最近的字幕记录
2025-08-18 16:03:46 +08:00
himeditator
1c0bf1f9c4 refactor(engine): 修改虚拟环境设置,修改音频工具函数
- 更新虚拟环境目录名为 .venv
- 调整音频块采集速率默认值为 10
- 为 AudioStream 类添加重设音频块大小的方法
- 更新依赖文件 requirements.txt
2025-08-03 16:40:26 +08:00
himeditator
38b4b15cec feat(engine): 添加字幕窗口宽度记忆功能并优化字幕引擎关闭逻辑
- 添加 captionWindowWidth 属性,用于保存字幕窗口宽度
- 修改 CaptionEngine 中的 stop 和 kill 方法,优化字幕引擎关闭逻辑
- 更新 README,添加预备模型列表
2025-08-02 15:57:07 +08:00
Chen Janai
64ea2f0daf Merge pull request #12 from HiMeditator/dev-0.6.0-engine
Release v0.6.0 with new caption engine structure
2025-07-30 00:32:00 +08:00
himeditator mac
a7a60da260 fix(engine): 字幕引擎启动路径适配、音频重采样函数适配 2025-07-30 00:16:54 +08:00
himeditator
1b7ff33656 feat(docs): 更新项目文档和图片 2025-07-29 23:20:15 +08:00
himeditator mac
d5d692188e feat(engine): 优化字幕引擎、提升程序健壮性
- 优化服务器启动流程,增加异常处理
- 主程序和字幕引擎的 WebSocket 端口号改为随机生成
2025-07-29 19:37:03 +08:00
himeditator
e4f937e6b6 feat(engine): 优化字幕引擎通信和控制逻辑,优化窗口信息展示
- 优化错误处理和引擎重启逻辑
- 添加字幕引擎强制终止功能
- 调整通知和错误提示的显示位置
- 优化日志记录精度到毫秒级
2025-07-28 21:44:49 +08:00
himeditator
cd9f3a847d feat(engine): 重构字幕引擎并实现 WebSocket 通信
- 重构了 Gummy 和 Vosk 字幕引擎的代码,提高了可扩展性和可读性
- 合并 Gummy 和 Vosk 引擎为单个可执行文件
- 实现了字幕引擎和主程序之间的 WebSocket 通信,避免了孤儿进程问题
2025-07-28 15:49:52 +08:00
himeditator
b658ef5440 feat(engine): 优化字幕引擎输出格式、准备合并两个字幕引擎
- 重构字幕引擎相关代码
- 准备合并两个字幕引擎
2025-07-27 17:15:12 +08:00
himeditator
3792eb88b6 refactor(engine): 重构字幕引擎
- 更新 GummyTranslator 类,优化字幕生成逻辑
- 移除 audioprcs 模块,音频处理功能转移到 utils 模块
- 重构 sysaudio 模块,提高音频流管理的灵活性和稳定性
- 修改 TODO.md,完成按时间降序排列字幕记录的功能
- 更新文档,说明因资源限制将不再维护英文和日文文档
2025-07-26 23:37:24 +08:00
himeditator
8e575a9ba3 refactor(engine): 字幕引擎文件夹重命名,字幕记录添加降序选择
- 字幕记录表格可以按时间降序排列
- 将 caption-engine 重命名为 engine
- 更新了相关文件和文件夹的路径
- 修改了 README 和 TODO 文档中的相关内容
- 更新了 Electron 构建配置
2025-07-26 21:29:16 +08:00
himeditator
697488ce84 docs: update README, add TODO 2025-07-20 00:32:57 +08:00
himeditator
f7d2df938d fix(engine): 修复自定义字幕引擎相关问题 2025-07-17 20:52:27 +08:00
himeditator
5513c7e84c docs(compatibility): 添加 Kylin OS 支持、更新文档 2025-07-16 20:55:03 +08:00
himeditator
25b6ad5ed2 release v0.5.0
- 更新了发行说明和用户手册
- 优化了界面显示和功能
- 过滤 Gummy 字幕引擎输出的不完整字幕
2025-07-15 18:48:16 +08:00
himeditator mac
760c01d79e feat(engine): 添加字幕引擎资源消耗监控功能
- 在控制窗口添加引擎状态显示,包括 PID、PPID、CPU 使用率、内存使用量和运行时间
- 优化字幕记录导出和复制功能,支持选择导出内容类型
2025-07-15 13:52:10 +08:00
himeditator
a0a0a2e66d feat(caption): 调整字幕窗口、添加字幕时间轴修改 (#8)
- 新增修改字幕时间功能
- 添加导出字幕记录类型,支持 srt 和 json 格式
- 调整字幕窗口右上角图标为竖向排布
2025-07-14 20:07:22 +08:00
himeditator
665c47d24f feat(linux): 支持 Linux 系统音频输出
- 添加了对 Linux 系统音频输出的支持
- 更新了 README 和用户手册中的平台兼容性信息
- 修改了 AudioStream 类以支持 Linux 平台
2025-07-13 23:28:40 +08:00
himeditator
7f8766b13e docs(engine-manual): 更新字幕引擎开发文档
- 添加了命令行参数指定的详细说明
- 增加了字幕引擎打包和运行的步骤说明
- 修复了一些文档中的错误和拼写问题
2025-07-11 13:25:52 +08:00
107 changed files with 4847 additions and 2964 deletions

10
.gitignore vendored
View File

@@ -5,8 +5,10 @@ out
.eslintcache
*.log*
__pycache__
subenv
caption-engine/build
caption-engine/models
output.wav
.venv
test.py
engine/build
engine/models
engine/notebook
.repomap
.virtualme

4
.npmrc
View File

@@ -1,2 +1,2 @@
# electron_mirror=https://npmmirror.com/mirrors/electron/
# electron_builder_binaries_mirror=https://npmmirror.com/mirrors/electron-builder-binaries/
electron_mirror=https://npmmirror.com/mirrors/electron/
electron_builder_binaries_mirror=https://npmmirror.com/mirrors/electron-builder-binaries/

View File

@@ -9,6 +9,6 @@
"editor.defaultFormatter": "esbenp.prettier-vscode"
},
"python.analysis.extraPaths": [
"./caption-engine"
"./engine"
]
}

178
README.md
View File

@@ -3,12 +3,8 @@
<h1 align="center">auto-caption</h1>
<p>Auto Caption 是一个跨平台的实时字幕显示软件。</p>
<p>
<a href="https://github.com/HiMeditator/auto-caption/releases">
<img src="https://img.shields.io/badge/release-0.4.0-blue">
</a>
<a href="https://github.com/HiMeditator/auto-caption/issues">
<img src="https://img.shields.io/github/issues/HiMeditator/auto-caption?color=orange">
</a>
<a href="https://github.com/HiMeditator/auto-caption/releases"><img src="https://img.shields.io/badge/release-1.0.0-blue"></a>
<a href="https://github.com/HiMeditator/auto-caption/issues"><img src="https://img.shields.io/github/issues/HiMeditator/auto-caption?color=orange"></a>
<img src="https://img.shields.io/github/languages/top/HiMeditator/auto-caption?color=royalblue">
<img src="https://img.shields.io/github/repo-size/HiMeditator/auto-caption?color=green">
<img src="https://img.shields.io/github/stars/HiMeditator/auto-caption?style=social">
@@ -18,16 +14,18 @@
| <a href="./README_en.md">English</a>
| <a href="./README_ja.md">日本語</a> |
</p>
<p><i>包含 Vosk 本地字幕引擎的 v0.4.0 版本已经发布。<b>目前本地字幕引擎不含翻译</b>,本地翻译模块仍正在开发中...</i></p>
<p><i>v1.0.0 版本已经发布,新增 SOSV 本地字幕模型。当前功能已经基本完整,暂无继续开发计划...</i></p>
</div>
![](./assets/media/main_zh.png)
![](./assets/media/main_mac_zh.png)
## 📥 下载
[GitHub Releases](https://github.com/HiMeditator/auto-caption/releases)
软件下载:[GitHub Releases](https://github.com/HiMeditator/auto-caption/releases)
Vosk 模型下载:[Vosk Models](https://alphacephei.com/vosk/models)
SOSV 模型下载:[ Shepra-ONNX SenseVoice Model](https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model)
## 📚 相关文档
@@ -35,44 +33,106 @@
[字幕引擎说明文档](./docs/engine-manual/zh.md)
[项目 API 文档](./docs/api-docs/electron-ipc.md)
[更新日志](./docs/CHANGELOG.md)
## ✨ 特性
- 生成音频输出或麦克风输入的字幕
- 支持调用本地 Ollama 模型或云端 Google 翻译 API 进行翻译
- 跨平台Windows、macOS、Linux、多界面语言中文、英语、日语支持
- 丰富的字幕样式设置(字体、字体大小、字体粗细、字体颜色、背景颜色等)
- 灵活的字幕引擎选择(阿里云 Gummy 云端模型、本地 Vosk 模型、本地 SOSV 模型、还可以自己开发模型)
- 多语言识别与翻译(见下文“⚙️ 自带字幕引擎说明”)
- 字幕记录展示与导出(支持导出 `.srt``.json` 格式)
## 📖 基本使用
目前提供了 WindowsmacOS 平台的可安装版本。
软件已经适配了 WindowsmacOS 和 Linux 平台。测试过的主流平台信息如下:
> 国际版的阿里云服务并没有提供 Gummy 模型,因此目前非中国用户无法使用 Gummy 字幕引擎。
| 操作系统版本 | 处理器架构 | 获取系统音频输入 | 获取系统音频输出 |
| ------------------ | ---------- | ---------------- | ---------------- |
| Windows 11 24H2 | x64 | ✅ | ✅ |
| macOS Sequoia 15.5 | arm64 | ✅ [需要额外配置](./docs/user-manual/zh.md#macos-获取系统音频输出) | ✅ |
| Ubuntu 24.04.2 | x64 | ✅ | ✅ |
macOS 平台和 Linux 平台获取系统音频输出需要进行额外设置,详见 [Auto Caption 用户手册](./docs/user-manual/zh.md)。
下载软件后,需要根据自己的需求选择对应的模型,然后配置模型。
| | 识别效果 | 部署类型 | 支持语言 | 翻译 | 备注 |
| ------------------------------------------------------------ | -------- | ------------- | ---------- | ---------- | ---------------------------------------------------------- |
| [Gummy](https://help.aliyun.com/zh/model-studio/gummy-speech-recognition-translation) | 很好😊 | 云端 / 阿里云 | 10 种 | 自带翻译 | 收费0.54CNY / 小时 |
| [Vosk](https://alphacephei.com/vosk) | 较差😞 | 本地 / CPU | 超过 30 种 | 需额外配置 | 支持的语言非常多 |
| [SOSV](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html) | 一般😐 | 本地 / CPU | 5 种 | 需额外配置 | 仅有一个模型 |
| 自己开发 | 🤔 | 自定义 | 自定义 | 自定义 | 根据[文档](./docs/engine-manual/zh.md)使用 Python 自己开发 |
如果你选择使用 Vosk 或 SOSV 模型,你还需要配置自己的翻译模型。
### 配置翻译模型
![](./assets/media/engine_zh.png)
> 注意:翻译不是实时的,翻译模型只会在每句话识别完成后再调用。
#### Ollama 本地模型
> 注意:使用参数量过大的模型会导致资源消耗和翻译延迟较大。建议使用参数量小于 1B 的模型,比如: `qwen2.5:0.5b`, `qwen3:0.6b`。
使用该模型之前你需要确定本机安装了 [Ollama](https://ollama.com/) 软件,并已经下载了需要的大语言模型。只需要将需要调用的大模型名称添加到设置中的 `Ollama` 字段中。
#### Google 翻译 API
> 注意Google 翻译 API 在部分地区无法使用。
无需任何配置,联网即可使用。
### 使用 Gummy 模型
> 国际版的阿里云服务似乎并没有提供 Gummy 模型,因此目前非中国用户可能无法使用 Gummy 字幕引擎。
如果要使用默认的 Gummy 字幕引擎(使用云端模型进行语音识别和翻译),首先需要获取阿里云百炼平台的 API KEY然后将 API KEY 添加到软件设置中或者配置到环境变量中(仅 Windows 平台支持读取环境变量中的 API KEY这样才能正常使用该模型。相关教程
- [获取 API KEY](https://help.aliyun.com/zh/model-studio/get-api-key)
- [将 API Key 配置到环境变量](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables)
### 使用 Vosk 模型
> Vosk 模型的识别效果较差,请谨慎使用。
如果要使用 Vosk 本地字幕引擎,首先需要在 [Vosk Models](https://alphacephei.com/vosk/models) 页面下载你需要的模型,并将模型解压到本地,并将模型文件夹的路径添加到软件的设置中。目前 Vosk 字幕引擎还不支持翻译字幕内容。
如果要使用 Vosk 本地字幕引擎,首先需要在 [Vosk Models](https://alphacephei.com/vosk/models) 页面下载你需要的模型,并将模型解压到本地,并将模型文件夹的路径添加到软件的设置中。
![](./assets/media/vosk_zh.png)
![](./assets/media/config_zh.png)
### 使用 SOSV 模型
**如果你觉得上述字幕引擎不能满足你的需求,而且你会 Python那么你可以考虑开发自己的字幕引擎。详细说明请参考[字幕引擎说明文档](./docs/engine-manual/zh.md)。**
使用 SOSV 模型的方式和 Vosk 一样下载地址如下https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model
## ✨ 特性
## ⌨️ 在终端中使用
- 跨平台、多界面语言支持
- 丰富的字幕样式设置
- 灵活的字幕引擎选择
- 多语言识别与翻译
- 字幕记录展示与导出
- 生成音频输出或麦克风输入的字幕
软件采用模块化设计,可用分为软件主体和字幕引擎两部分,软件主体通过图形界面调用字幕引擎。核心的音频获取和音频识别功能都在字幕引擎中实现,而字幕引擎是可用脱离软件主体单独使用的。
说明
- Windows 和 macOS 平台支持生成音频输出和麦克风输入的字幕,但是 **macOS 平台获取系统音频输出需要进行设置,详见[Auto Caption 用户手册](./docs/user-manual/zh.md)**
- Linux 平台目前无法获取系统音频输出,仅支持生成麦克风输入的字幕
字幕引擎使用 Python 开发,通过 PyInstaller 打包为可执行文件。因此字幕引擎有两种使用方式
1. 使用项目字幕引擎部分的源代码,使用安装了对应库的 Python 环境进行运行
2. 使用打包好的字幕引擎的可执行文件,通过终端运行
运行参数和详细使用介绍请参考[用户手册](./docs/user-manual/zh.md#单独使用字幕引擎)。
```bash
python main.py \
-e gummy \
-k sk-******************************** \
-a 0 \
-d 1 \
-s en \
-t zh
```
![](./docs/img/07.png)
## ⚙️ 自带字幕引擎说明
目前软件自带 2 个字幕引擎,正在规划 1 个新的引擎。它们的详细信息如下。
目前软件自带 3 个字幕引擎,正在规划新的引擎。它们的详细信息如下。
### Gummy 字幕引擎(云端)
@@ -101,11 +161,22 @@ $$
### Vosk 字幕引擎(本地)
基于 [vosk-api](https://github.com/alphacep/vosk-api) 开发。目前只支持生成音频对应的原文,不支持生成翻译内容
基于 [vosk-api](https://github.com/alphacep/vosk-api) 开发。该字幕引擎的优点是可选的语言模型非常多(超过 30 种),缺点是识别效果比较差,且生成内容没有标点符号
### FunASR 字幕引擎(本地)
如果可行,将基于 [FunASR](https://github.com/modelscope/FunASR) 进行开发。还未进行调研和可行性验证。
### SOSV 字幕引擎(本地)
[SOSV](https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model) 是一个整合包,该整合包主要基于 [Shepra-ONNX SenseVoice](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html),并添加了端点检测模型和标点恢复模型。该模型支持识别的语言有:英语、中文、日语、韩语、粤语。
### 新规划字幕引擎
以下为备选模型,将根据模型效果和集成难易程度选择。
- [faster-whisper](https://github.com/SYSTRAN/faster-whisper)
- [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx)
- [SenseVoice](https://github.com/FunAudioLLM/SenseVoice)
- [FunASR](https://github.com/modelscope/FunASR)
- [WhisperLiveKit](https://github.com/QuentinFuxa/WhisperLiveKit)
## 🚀 项目运行
@@ -119,27 +190,26 @@ npm install
### 构建字幕引擎
首先进入 `caption-engine` 文件夹,执行如下指令创建虚拟环境:
首先进入 `engine` 文件夹,执行如下指令创建虚拟环境(需要使用大于等于 Python 3.10 的 Python 运行环境,建议使用 Python 3.12
```bash
# in ./caption-engine folder
python -m venv subenv
cd ./engine
# in ./engine folder
python -m venv .venv
# or
python3 -m venv subenv
python3 -m venv .venv
```
然后激活虚拟环境:
```bash
# Windows
subenv/Scripts/activate
.venv/Scripts/activate
# Linux or macOS
source subenv/bin/activate
source .venv/bin/activate
```
然后安装依赖(注意如果是 Linux 或 macOS 环境,需要注释掉 `requirements.txt` 中的 `PyAudioWPatch`,该模块仅适用于 Windows 环境)。
> 这一步可能会报错,一般是因为构建失败,需要根据报错信息安装对应的构建工具包。
然后安装依赖(这一步在 macOS 和 Linux 可能会报错,一般是因为构建失败,需要根据报错信息进行处理):
```bash
pip install -r requirements.txt
@@ -148,29 +218,27 @@ pip install -r requirements.txt
然后使用 `pyinstaller` 构建项目:
```bash
pyinstaller ./main-gummy.spec
pyinstaller ./main-vosk.spec
pyinstaller ./main.spec
```
注意 `main-vosk.spec` 文件中 `vsok` 库的路径可能不正确,需要根据实际状况配置。
注意 `main.spec` 文件中 `vosk` 库的路径可能不正确,需要根据实际状况配置(与 Python 环境的版本相关)
```
# Windows
vosk_path = str(Path('./subenv/Lib/site-packages/vosk').resolve())
vosk_path = str(Path('./.venv/Lib/site-packages/vosk').resolve())
# Linux or macOS
vosk_path = str(Path('./subenv/lib/python3.x/site-packages/vosk').resolve())
vosk_path = str(Path('./.venv/lib/python3.x/site-packages/vosk').resolve())
```
此时项目构建完成,进入 `caption-engine/dist` 文件夹可见对应的可执行文件。即可进行后续操作。
此时项目构建完成,进入 `engine/dist` 文件夹可见对应的可执行文件。即可进行后续操作。
### 运行项目
```bash
npm run dev
```
### 构建项目
注意目前软件只在 Windows 和 macOS 平台上进行了构建和测试,无法保证软件在 Linux 平台下的正确性。
### 构建项目
```bash
# For windows
@@ -180,19 +248,3 @@ npm run build:mac
# For Linux
npm run build:linux
```
注意,根据不同的平台需要修改项目根目录下 `electron-builder.yml` 文件中的配置内容:
```yml
extraResources:
# For Windows
- from: ./caption-engine/dist/main-gummy.exe
to: ./caption-engine/main-gummy.exe
- from: ./caption-engine/dist/main-vosk.exe
to: ./caption-engine/main-vosk.exe
# For macOS and Linux
# - from: ./caption-engine/dist/main-gummy
# to: ./caption-engine/main-gummy
# - from: ./caption-engine/dist/main-vosk
# to: ./caption-engine/main-vosk
```

View File

@@ -3,12 +3,8 @@
<h1 align="center">auto-caption</h1>
<p>Auto Caption is a cross-platform real-time caption display software.</p>
<p>
<a href="https://github.com/HiMeditator/auto-caption/releases">
<img src="https://img.shields.io/badge/release-0.4.0-blue">
</a>
<a href="https://github.com/HiMeditator/auto-caption/issues">
<img src="https://img.shields.io/github/issues/HiMeditator/auto-caption?color=orange">
</a>
<a href="https://github.com/HiMeditator/auto-caption/releases"><img src="https://img.shields.io/badge/release-1.0.0-blue"></a>
<a href="https://github.com/HiMeditator/auto-caption/issues"><img src="https://img.shields.io/github/issues/HiMeditator/auto-caption?color=orange"></a>
<img src="https://img.shields.io/github/languages/top/HiMeditator/auto-caption?color=royalblue">
<img src="https://img.shields.io/github/repo-size/HiMeditator/auto-caption?color=green">
<img src="https://img.shields.io/github/stars/HiMeditator/auto-caption?style=social">
@@ -18,16 +14,18 @@
| <b>English</b>
| <a href="./README_ja.md">日本語</a> |
</p>
<p><i>The v0.4.0 version with Vosk local caption engine has been released. <b>Currently the local caption engine does not include translation</b>, the local translation module is still under development...</i></p>
<p><i>Version 1.0.0 has been released, with the addition of the SOSV local caption model. The current features are basically complete, and there are no further development plans...</i></p>
</div>
![](./assets/media/main_en.png)
![](./assets/media/main_mac_en.png)
## 📥 Download
[GitHub Releases](https://github.com/HiMeditator/auto-caption/releases)
Software Download: [GitHub Releases](https://github.com/HiMeditator/auto-caption/releases)
Vosk Model Download: [Vosk Models](https://alphacephei.com/vosk/models)
SOSV Model Download: [Shepra-ONNX SenseVoice Model](https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model)
## 📚 Documentation
@@ -35,43 +33,107 @@
[Caption Engine Documentation](./docs/engine-manual/en.md)
[Project API Documentation (Chinese)](./docs/api-docs/electron-ipc.md)
## 📖 Basic Usage
Currently, installable versions are available for Windows and macOS platforms.
> The international version of Alibaba Cloud services does not provide the Gummy model, so non-Chinese users currently cannot use the Gummy caption engine.
To use the default Gummy caption engine (which uses cloud-based models for speech recognition and translation), you first need to obtain an API KEY from the Alibaba Cloud Bailian platform. Then add the API KEY to the software settings or configure it in environment variables (only Windows platform supports reading API KEY from environment variables) to properly use this model. Related tutorials:
- [Obtaining API KEY (Chinese)](https://help.aliyun.com/zh/model-studio/get-api-key)
- [Configuring API Key through Environment Variables (Chinese)](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables)
> The recognition performance of Vosk models is suboptimal, please use with caution.
To use the Vosk local caption engine, first download your required model from [Vosk Models](https://alphacephei.com/vosk/models) page, extract the model locally, and add the model folder path to the software settings. Currently, the Vosk caption engine does not support translated captions.
![](./assets/media/vosk_en.png)
**If you find the above caption engines don't meet your needs and you know Python, you may consider developing your own caption engine. For detailed instructions, please refer to the [Caption Engine Documentation](./docs/engine-manual/en.md).**
[Changelog](./docs/CHANGELOG.md)
## ✨ Features
- Cross-platform, multi-language UI support
- Rich caption style settings
- Flexible caption engine selection
- Multi-language recognition and translation
- Caption recording display and export
- Generate captions for audio output or microphone input
- Generate captions from audio output or microphone input
- Supports translation by calling local Ollama models or cloud-based Google Translate API
- Cross-platform (Windows, macOS, Linux) and multi-language interface (Chinese, English, Japanese) support
- Rich caption style settings (font, font size, font weight, font color, background color, etc.)
- Flexible caption engine selection (Alibaba Cloud Gummy cloud model, local Vosk model, local SOSV model, or you can develop your own model)
- Multi-language recognition and translation (see below "⚙️ Built-in Subtitle Engines")
- Subtitle record display and export (supports exporting `.srt` and `.json` formats)
Notes:
- Windows and macOS platforms support generating captions for both audio output and microphone input, but **macOS requires additional setup to capture system audio output. See [Auto Caption User Manual](./docs/user-manual/en.md) for details.**
- Linux platform currently cannot capture system audio output, only supports generating subtitles for microphone input.
## 📖 Basic Usage
The software has been adapted for Windows, macOS, and Linux platforms. The tested platform information is as follows:
| OS Version | Architecture | System Audio Input | System Audio Output |
| ------------------ | ------------ | ------------------ | ------------------- |
| Windows 11 24H2 | x64 | ✅ | ✅ |
| macOS Sequoia 15.5 | arm64 | ✅ [Additional config required](./docs/user-manual/en.md#capturing-system-audio-output-on-macos) | ✅ |
| Ubuntu 24.04.2 | x64 | ✅ | ✅ |
Additional configuration is required to capture system audio output on macOS and Linux platforms. See [Auto Caption User Manual](./docs/user-manual/en.md) for details.
After downloading the software, you need to select the corresponding model according to your needs and then configure the model.
| | Recognition Quality | Deployment Type | Supported Languages | Translation | Notes |
| ------------------------------------------------------------ | ------------------- | ------------------ | ------------------- | ------------- | ---------------------------------------------------------- |
| [Gummy](https://help.aliyun.com/zh/model-studio/gummy-speech-recognition-translation) | Excellent 😊 | Alibaba Cloud | 10 languages | Built-in | Paid, 0.54 CNY/hour |
| [Vosk](https://alphacephei.com/vosk) | Poor 😞 | Local / CPU | Over 30 languages | Requires setup | Supports many languages |
| [SOSV](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html) | Fair 😐 | Local / CPU | 5 languages | Requires setup | Only one model available |
| DIY Development | 🤔 | Custom | Custom | Custom | Develop your own using Python according to [documentation](./docs/engine-manual/zh.md) |
If you choose to use the Vosk or SOSV model, you also need to configure your own translation model.
### Configuring Translation Models
![](./assets/media/engine_en.png)
> Note: Translation is not real-time. The translation model is only called after each sentence recognition is completed.
#### Ollama Local Model
> Note: Using models with too many parameters will lead to high resource consumption and translation delays. It is recommended to use models with less than 1B parameters, such as: `qwen2.5:0.5b`, `qwen3:0.6b`.
Before using this model, you need to ensure that [Ollama](https://ollama.com/) software is installed on your machine and the required large language model has been downloaded. Simply add the name of the large model you want to call to the `Ollama` field in the settings.
#### Google Translate API
> Note: Google Translate API is not available in some regions.
No configuration required, just connect to the internet to use.
### Using Gummy Model
> The international version of Alibaba Cloud services does not seem to provide the Gummy model, so non-Chinese users may not be able to use the Gummy caption engine at present.
To use the default Gummy caption engine (using cloud models for speech recognition and translation), you first need to obtain an API KEY from Alibaba Cloud Bailian platform, then add the API KEY to the software settings or configure it in the environment variables (only Windows platform supports reading API KEY from environment variables), so that the model can be used normally. Related tutorials:
- [Get API KEY](https://help.aliyun.com/zh/model-studio/get-api-key)
- [Configure API Key through Environment Variables](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables)
### Using Vosk Model
> The recognition effect of the Vosk model is poor, please use it with caution.
To use the Vosk local caption engine, first download the model you need from the [Vosk Models](https://alphacephei.com/vosk/models) page, unzip the model locally, and add the path of the model folder to the software settings.
![](./assets/media/config_en.png)
### Using SOSV Model
The way to use the SOSV model is the same as Vosk. The download address is as follows: https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model
## ⌨️ Using in Terminal
The software adopts a modular design and can be divided into two parts: the main software body and caption engine. The main software calls caption engine through a graphical interface. Audio acquisition and speech recognition functions are implemented in the caption engine, which can be used independently without the main software.
Caption engine is developed using Python and packaged as executable files via PyInstaller. Therefore, there are two ways to use caption engine:
1. Use the source code of the project's caption engine part and run it with a Python environment that has the required libraries installed
2. Run the packaged executable file of the caption engine through the terminal
For runtime parameters and detailed usage instructions, please refer to the [User Manual](./docs/user-manual/en.md#using-caption-engine-standalone).
```bash
python main.py \
-e gummy \
-k sk-******************************** \
-a 0 \
-d 1 \
-s en \
-t zh
```
![](./docs/img/07.png)
## ⚙️ Built-in Subtitle Engines
Currently, the software comes with 2 subtitle engines, with 1 new engine planned. Details are as follows.
Currently, the software comes with 3 caption engines, with new engines under development. Their detailed information is as follows.
### Gummy Subtitle Engine (Cloud)
@@ -90,7 +152,7 @@ Developed based on Tongyi Lab's [Gummy Speech Translation Model](https://help.al
**Network Traffic Consumption:**
The subtitle engine uses native sample rate (assumed to be 48kHz) for sampling, with 16bit sample depth and mono channel, so the upload rate is approximately:
The caption engine uses native sample rate (assumed to be 48kHz) for sampling, with 16bit sample depth and mono channel, so the upload rate is approximately:
$$
48000\ \text{samples/second} \times 2\ \text{bytes/sample} \times 1\ \text{channel} = 93.75\ \text{KB/s}
@@ -100,11 +162,21 @@ The engine only uploads data when receiving audio streams, so the actual upload
### Vosk Subtitle Engine (Local)
Developed based on [vosk-api](https://github.com/alphacep/vosk-api). Currently only supports generating original text from audio, does not support translation content.
Developed based on [vosk-api](https://github.com/alphacep/vosk-api). The advantage of this caption engine is that there are many optional language models (over 30 languages), but the disadvantage is that the recognition effect is relatively poor, and the generated content has no punctuation.
### FunASR Subtitle Engine (Local)
### SOSV Subtitle Engine (Local)
If feasible, will be developed based on [FunASR](https://github.com/modelscope/FunASR). Not yet researched or verified for feasibility.
[SOSV](https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model) is an integrated package, mainly based on [Shepra-ONNX SenseVoice](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html), with added endpoint detection model and punctuation restoration model. The languages supported by this model for recognition are: English, Chinese, Japanese, Korean, and Cantonese.
### Planned New Subtitle Engines
The following are candidate models that will be selected based on model performance and ease of integration.
- [faster-whisper](https://github.com/SYSTRAN/faster-whisper)
- [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx)
- [SenseVoice](https://github.com/FunAudioLLM/SenseVoice)
- [FunASR](https://github.com/modelscope/FunASR)
- [WhisperLiveKit](https://github.com/QuentinFuxa/WhisperLiveKit)
## 🚀 Project Setup
@@ -118,27 +190,26 @@ npm install
### Build Subtitle Engine
First enter the `caption-engine` folder and execute the following commands to create a virtual environment:
First enter the `engine` folder and execute the following commands to create a virtual environment (requires Python 3.10 or higher, with Python 3.12 recommended):
```bash
# in ./caption-engine folder
python -m venv subenv
cd ./engine
# in ./engine folder
python -m venv .venv
# or
python3 -m venv subenv
python3 -m venv .venv
```
Then activate the virtual environment:
```bash
# Windows
subenv/Scripts/activate
.venv/Scripts/activate
# Linux or macOS
source subenv/bin/activate
source .venv/bin/activate
```
Then install dependencies (note: for Linux or macOS environments, you need to comment out `PyAudioWPatch` in `requirements.txt`, as this module is only for Windows environments).
> This step may report errors, usually due to build failures. You need to install corresponding build tools based on the error messages.
Then install dependencies (this step might result in errors on macOS and Linux, usually due to build failures, and you need to handle them based on the error messages):
```bash
pip install -r requirements.txt
@@ -147,20 +218,19 @@ pip install -r requirements.txt
Then use `pyinstaller` to build the project:
```bash
pyinstaller ./main-gummy.spec
pyinstaller ./main-vosk.spec
pyinstaller ./main.spec
```
Note that the path to the `vosk` library in `main-vosk.spec` might be incorrect and needs to be configured according to the actual situation.
Note that the path to the `vosk` library in `main-vosk.spec` might be incorrect and needs to be configured according to the actual situation (related to the version of the Python environment).
```
# Windows
vosk_path = str(Path('./subenv/Lib/site-packages/vosk').resolve())
vosk_path = str(Path('./.venv/Lib/site-packages/vosk').resolve())
# Linux or macOS
vosk_path = str(Path('./subenv/lib/python3.x/site-packages/vosk').resolve())
vosk_path = str(Path('./.venv/lib/python3.x/site-packages/vosk').resolve())
```
After the build completes, you can find the executable file in the `caption-engine/dist` folder. Then proceed with subsequent operations.
After the build completes, you can find the executable file in the `engine/dist` folder. Then proceed with subsequent operations.
### Run Project
@@ -170,8 +240,6 @@ npm run dev
### Build Project
Note: Currently the software has only been built and tested on Windows and macOS platforms. Correct operation on Linux platform is not guaranteed.
```bash
# For windows
npm run build:win
@@ -180,19 +248,3 @@ npm run build:mac
# For Linux
npm run build:linux
```
Note: You need to modify the configuration content in the `electron-builder.yml` file in the project root directory according to different platforms:
```yml
extraResources:
# For Windows
- from: ./caption-engine/dist/main-gummy.exe
to: ./caption-engine/main-gummy.exe
- from: ./caption-engine/dist/main-vosk.exe
to: ./caption-engine/main-vosk.exe
# For macOS and Linux
# - from: ./caption-engine/dist/main-gummy
# to: ./caption-engine/main-gummy
# - from: ./caption-engine/dist/main-vosk
# to: ./caption-engine/main-vosk
```

View File

@@ -3,12 +3,8 @@
<h1 align="center">auto-caption</h1>
<p>Auto Caption はクロスプラットフォームのリアルタイム字幕表示ソフトウェアです。</p>
<p>
<a href="https://github.com/HiMeditator/auto-caption/releases">
<img src="https://img.shields.io/badge/release-0.4.0-blue">
</a>
<a href="https://github.com/HiMeditator/auto-caption/issues">
<img src="https://img.shields.io/github/issues/HiMeditator/auto-caption?color=orange">
</a>
<a href="https://github.com/HiMeditator/auto-caption/releases"><img src="https://img.shields.io/badge/release-1.0.0-blue"></a>
<a href="https://github.com/HiMeditator/auto-caption/issues"><img src="https://img.shields.io/github/issues/HiMeditator/auto-caption?color=orange"></a>
<img src="https://img.shields.io/github/languages/top/HiMeditator/auto-caption?color=royalblue">
<img src="https://img.shields.io/github/repo-size/HiMeditator/auto-caption?color=green">
<img src="https://img.shields.io/github/stars/HiMeditator/auto-caption?style=social">
@@ -18,16 +14,18 @@
| <a href="./README_en.md">English</a>
| <b>日本語</b> |
</p>
<p><i>Voskローカル字幕エンジンを含む v0.4.0 バージョンがリリースされました。<b>現在、ローカル字幕エンジンには翻訳機能が含まれておりません</b>。ローカル翻訳モジュールは現在も開発中です...</i></p>
<p><i>v1.0.0 バージョンがリリースされ、SOSV ローカル字幕モデルが追加されました。現在の機能は基本的に完了しており、今後の開発計画はありません...</i></p>
</div>
![](./assets/media/main_ja.png)
![](./assets/media/main_mac_ja.png)
## 📥 ダウンロード
[GitHub Releases](https://github.com/HiMeditator/auto-caption/releases)
ソフトウェアダウンロード: [GitHub Releases](https://github.com/HiMeditator/auto-caption/releases)
Vosk モデルダウンロード: [Vosk Models](https://alphacephei.com/vosk/models)
SOSV モデルダウンロード: [Shepra-ONNX SenseVoice Model](https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model)
## 📚 関連ドキュメント
@@ -35,43 +33,108 @@
[字幕エンジン説明ドキュメント](./docs/engine-manual/ja.md)
[プロジェクト API ドキュメント(中国語)](./docs/api-docs/electron-ipc.md)
## 📖 基本使い方
現在、Windows と macOS プラットフォーム向けのインストール可能なバージョンを提供しています。
> 阿里雲の国際版サービスでは Gummy モデルを提供していないため、現在中国以外のユーザーは Gummy 字幕エンジンを使用できません。
デフォルトの Gummy 字幕エンジン(クラウドベースのモデルを使用した音声認識と翻訳)を使用するには、まず阿里雲百煉プラットフォームから API KEY を取得する必要があります。その後、API KEY をソフトウェア設定に追加するか、環境変数に設定しますWindows プラットフォームのみ環境変数からの API KEY 読み取りをサポート)。関連チュートリアル:
- [API KEY の取得(中国語)](https://help.aliyun.com/zh/model-studio/get-api-key)
- [環境変数を通じて API Key を設定(中国語)](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables)
> Vosk モデルの認識精度は低いため、注意してご使用ください。
Vosk ローカル字幕エンジンを使用するには、まず [Vosk Models](https://alphacephei.com/vosk/models) ページから必要なモデルをダウンロードし、ローカルに解凍した後、モデルフォルダのパスをソフトウェア設定に追加してください。現在、Vosk 字幕エンジンは字幕の翻訳をサポートしていません。
![](./assets/media/vosk_ja.png)
**上記の字幕エンジンがご要望を満たさず、かつ Python の知識をお持ちの場合、独自の字幕エンジンを開発することも可能です。詳細な説明は[字幕エンジン説明書](./docs/engine-manual/ja.md)をご参照ください。**
[更新履歴](./docs/CHANGELOG.md)
## ✨ 特徴
- クロスプラットフォーム、多言語 UI サポート
- 豊富な字幕スタイル設定
- 柔軟な字幕エンジン選択
- 多言語認識と翻訳
- 字幕記録の表示とエクスポート
- オーディオ出力またはマイク入力からの字幕生成
- 音声出力またはマイク入力からの字幕生成
- ローカルのOllamaモデルまたはクラウドベースのGoogle翻訳APIを呼び出して翻訳をサポート
- クロスプラットフォームWindows、macOS、Linux、多言語インターフェース中国語、英語、日本語対応
- 豊富な字幕スタイル設定(フォント、フォントサイズ、フォント太さ、フォント色、背景色など)
- 柔軟な字幕エンジン選択阿里云Gummyクラウドモデル、ローカルVoskモデル、ローカルSOSVモデル、または独自にモデルを開発可能
- 多言語認識と翻訳(下記「⚙️ 字幕エンジン説明」参照)
- 字幕記録表示とエクスポート(`.srt` および `.json` 形式のエクスポートに対応)
注記:
- Windows と macOS プラットフォームはオーディオ出力とマイク入力の両方からの字幕生成をサポートしていますが、**macOS プラットフォームでシステムオーディオ出力を取得するには設定が必要です。詳細は[Auto Caption ユーザーマニュアル](./docs/user-manual/ja.md)をご覧ください。**
- Linux プラットフォームは現在システムオーディオ出力を取得できず、マイク入力からの字幕生成のみをサポートしています。
## 📖 基本使い方
このソフトウェアは Windows、macOS、Linux プラットフォームに対応しています。テスト済みのプラットフォーム情報は以下の通りです:
| OS バージョン | アーキテクチャ | システムオーディオ入力 | システムオーディオ出力 |
| ------------------ | ------------ | ------------------ | ------------------- |
| Windows 11 24H2 | x64 | ✅ | ✅ |
| macOS Sequoia 15.5 | arm64 | ✅ [追加設定が必要](./docs/user-manual/ja.md#macos-でのシステムオーディオ出力の取得方法) | ✅ |
| Ubuntu 24.04.2 | x64 | ✅ | ✅ |
| Kali Linux 2022.3 | x64 | ✅ | ✅ |
| Kylin Server V10 SP3 | x64 | ✅ | ✅ |
macOS および Linux プラットフォームでシステムオーディオ出力を取得するには追加設定が必要です。詳細は[Auto Captionユーザーマニュアル](./docs/user-manual/ja.md)をご覧ください。
ソフトウェアをダウンロードした後、自分のニーズに応じて対応するモデルを選択し、モデルを設定する必要があります。
| | 認識効果 | デプロイタイプ | 対応言語 | 翻訳 | 備考 |
| ------------------------------------------------------------ | -------- | ----------------- | ---------- | ---------- | ---------------------------------------------------------- |
| [Gummy](https://help.aliyun.com/zh/model-studio/gummy-speech-recognition-translation) | 良好😊 | クラウド / 阿里云 | 10種 | 内蔵翻訳 | 有料、0.54CNY / 時間 |
| [Vosk](https://alphacephei.com/vosk) | 不良😞 | ローカル / CPU | 30種以上 | 追加設定必要 | 対応言語が非常に多い |
| [SOSV](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html) | 一般😐 | ローカル / CPU | 5種 | 追加設定必要 | モデルは一つのみ |
| 自前開発 | 🤔 | カスタム | カスタム | カスタム | [ドキュメント](./docs/engine-manual/zh.md)に従ってPythonで自前開発 |
VoskまたはSOSVモデルを使用する場合、独自の翻訳モデルも設定する必要があります。
### 翻訳モデルの設定
![](./assets/media/engine_ja.png)
> 注意:翻訳はリアルタイムではありません。翻訳モデルは各文の認識が完了した後にのみ呼び出されます。
#### Ollama ローカルモデル
> 注意パラメータ数が多すぎるモデルを使用すると、リソース消費と翻訳遅延が大きくなります。1B未満のパラメータ数のモデルを使用することを推奨します。例`qwen2.5:0.5b`、`qwen3:0.6b`。
このモデルを使用する前に、ローカルマシンに[Ollama](https://ollama.com/)ソフトウェアがインストールされ、必要な大規模言語モデルがダウンロードされていることを確認してください。必要な大規模モデル名を設定の`Ollama`フィールドに追加するだけでOKです。
#### Google翻訳API
> 注意Google翻訳APIは一部の地域では使用できません。
設定不要で、ネット接続があれば使用できます。
### Gummyモデルの使用
> 阿里云の国際版サービスにはGummyモデルが提供されていないため、現在中国以外のユーザーはGummy字幕エンジンを使用できない可能性があります。
デフォルトのGummy字幕エンジンクラウドモデルを使用した音声認識と翻訳を使用するには、まず阿里云百煉プラットフォームのAPI KEYを取得し、API KEYをソフトウェア設定に追加するか環境変数に設定する必要がありますWindowsプラットフォームのみ環境変数からのAPI KEY読み取りをサポート。関連チュートリアル
- [API KEYの取得](https://help.aliyun.com/zh/model-studio/get-api-key)
- [環境変数へのAPI Keyの設定](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables)
### Voskモデルの使用
> Voskモデルの認識効果は不良のため、注意して使用してください。
Voskローカル字幕エンジンを使用するには、まず[Vosk Models](https://alphacephei.com/vosk/models)ページから必要なモデルをダウンロードし、ローカルにモデルを解凍し、モデルフォルダのパスをソフトウェア設定に追加してください。
![](./assets/media/config_ja.png)
### SOSVモデルの使用
SOSVモデルの使用方法はVoskと同じで、ダウンロードアドレスは以下の通りですhttps://github.com/HiMeditator/auto-caption/releases/tag/sosv-model
## ⌨️ ターミナルでの使用
ソフトウェアはモジュール化設計を採用しており、ソフトウェア本体と字幕エンジンの2つの部分に分けることができます。ソフトウェア本体はグラフィカルインターフェースを通じて字幕エンジンを呼び出します。コアとなる音声取得および音声認識機能はすべて字幕エンジンに実装されており、字幕エンジンはソフトウェア本体から独立して単独で使用できます。
字幕エンジンはPythonを使用して開発され、PyInstallerによって実行可能ファイルとしてパッケージ化されます。したがって、字幕エンジンの使用方法は以下の2つがあります
1. プロジェクトの字幕エンジン部分のソースコードを使用し、必要なライブラリがインストールされたPython環境で実行する
2. パッケージ化された字幕エンジンの実行可能ファイルをターミナルから実行する
実行引数および詳細な使用方法については、[User Manual](./docs/user-manual/en.md#using-caption-engine-standalone)をご参照ください。
```bash
python main.py \
-e gummy \
-k sk-******************************** \
-a 0 \
-d 1 \
-s en \
-t zh
```
![](./docs/img/07.png)
## ⚙️ 字幕エンジン説明
現在ソフトウェアには2つの字幕エンジンが組み込まれており、1つの新しいエンジン計画中です。詳細は以下の通りです。
現在ソフトウェアには3つの字幕エンジンが搭載されており、新しいエンジン計画されています。それらの詳細情報は以下の通りです。
### Gummy 字幕エンジン(クラウド)
@@ -100,11 +163,21 @@ $$
### Vosk字幕エンジンローカル
[vosk-api](https://github.com/alphacep/vosk-api) をベースに開発されています。現在は音声に対応する原文の生成のみをサポートしており、翻訳コンテンツはサポートしていません
[vosk-api](https://github.com/alphacep/vosk-api)をベースに開発。この字幕エンジンの利点は選択可能な言語モデルが非常に多く30言語以上、欠点は認識効果が比較的悪く、生成内容に句読点がないことです
### FunASR字幕エンジン(ローカル)
### SOSV 字幕エンジン(ローカル)
可能であれば、[FunASR](https://github.com/modelscope/FunASR) をベースに開発予定です。まだ調査と実現可能性の検証を行っていません
[SOSV](https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model)は統合パッケージで、主に[Shepra-ONNX SenseVoice](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html)をベースにし、エンドポイント検出モデルと句読点復元モデルを追加しています。このモデルが認識をサポートする言語は:英語、中国語、日本語、韓国語、広東語です
### 新規計画字幕エンジン
以下は候補モデルであり、モデルの性能と統合の容易さに基づいて選択されます。
- [faster-whisper](https://github.com/SYSTRAN/faster-whisper)
- [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx)
- [SenseVoice](https://github.com/FunAudioLLM/SenseVoice)
- [FunASR](https://github.com/modelscope/FunASR)
- [WhisperLiveKit](https://github.com/QuentinFuxa/WhisperLiveKit)
## 🚀 プロジェクト実行
@@ -118,27 +191,26 @@ npm install
### 字幕エンジンの構築
まず `caption-engine` フォルダに入り、以下のコマンドを実行して仮想環境を作成します:
まず `engine` フォルダに入り、以下のコマンドを実行して仮想環境を作成しますPython 3.10 以上が必要で、Python 3.12 が推奨されます)
```bash
# ./caption-engine フォルダ内
python -m venv subenv
cd ./engine
# ./engine フォルダ内
python -m venv .venv
# または
python3 -m venv subenv
python3 -m venv .venv
```
次に仮想環境をアクティブにします:
```bash
# Windows
subenv/Scripts/activate
.venv/Scripts/activate
# Linux または macOS
source subenv/bin/activate
source .venv/bin/activate
```
その後、依存関係をインストールします(Linux または macOS 環境の場合、`requirements.txt` 内の `PyAudioWPatch` をコメントアウトする必要があります。このモジュールは Windows 環境専用です)。
> このステップでエラーが発生する場合があります。一般的にはビルド失敗が原因で、エラーメッセージに基づいて対応するビルドツールパッケージをインストールする必要があります。
次に依存関係をインストールします(このステップでは macOS と Linux でエラーが発生する可能性があります。通常はビルド失敗によるもので、エラーメッセージに基づいて対処する必要があります):
```bash
pip install -r requirements.txt
@@ -147,20 +219,19 @@ pip install -r requirements.txt
その後、`pyinstaller` を使用してプロジェクトをビルドします:
```bash
pyinstaller ./main-gummy.spec
pyinstaller ./main-vosk.spec
pyinstaller ./main.spec
```
`main-vosk.spec` ファイル内の `vosk` ライブラリのパスが正しくない可能性があるため、実際の状況に応じて設定する必要があります。
`main-vosk.spec` ファイル内の `vosk` ライブラリのパスが正しくない可能性があるため、実際の状況Python 環境のバージョンに関連)に応じて設定する必要があります。
```
# Windows
vosk_path = str(Path('./subenv/Lib/site-packages/vosk').resolve())
# LinuxまたはmacOS
vosk_path = str(Path('./subenv/lib/python3.x/site-packages/vosk').resolve())
vosk_path = str(Path('./.venv/Lib/site-packages/vosk').resolve())
# Linux または macOS
vosk_path = str(Path('./.venv/lib/python3.x/site-packages/vosk').resolve())
```
これでプロジェクトのビルドが完了し、`caption-engine/dist` フォルダ内に対応する実行可能ファイルが確認できます。その後、次の操作に進むことができます。
これでプロジェクトのビルドが完了し、`engine/dist` フォルダ内に対応する実行可能ファイルが確認できます。その後、次の操作に進むことができます。
### プロジェクト実行
@@ -170,8 +241,6 @@ npm run dev
### プロジェクト構築
現在、ソフトウェアは Windows と macOS プラットフォームでのみ構築とテストが行われており、Linux プラットフォームでの正しい動作は保証できません。
```bash
# Windows 用
npm run build:win
@@ -180,19 +249,3 @@ npm run build:mac
# Linux 用
npm run build:linux
```
注意: プラットフォームに応じて、プロジェクトルートディレクトリにある `electron-builder.yml` ファイルの設定内容を変更する必要があります:
```yml
extraResources:
# Windows用
- from: ./caption-engine/dist/main-gummy.exe
to: ./caption-engine/main-gummy.exe
- from: ./caption-engine/dist/main-vosk.exe
to: ./caption-engine/main-vosk.exe
# macOSとLinux用
# - from: ./caption-engine/dist/main-gummy
# to: ./caption-engine/main-gummy
# - from: ./caption-engine/dist/main-vosk
# to: ./caption-engine/main-vosk
```

BIN
assets/media/config_en.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 68 KiB

BIN
assets/media/config_ja.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 69 KiB

BIN
assets/media/config_zh.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 72 KiB

BIN
assets/media/engine_en.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 60 KiB

BIN
assets/media/engine_ja.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 62 KiB

BIN
assets/media/engine_zh.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 81 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 462 KiB

After

Width:  |  Height:  |  Size: 404 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 477 KiB

After

Width:  |  Height:  |  Size: 417 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 2.7 MiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 2.7 MiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 2.8 MiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 468 KiB

After

Width:  |  Height:  |  Size: 417 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 321 KiB

After

Width:  |  Height:  |  Size: 323 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 324 KiB

After

Width:  |  Height:  |  Size: 324 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 323 KiB

After

Width:  |  Height:  |  Size: 324 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 73 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 76 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 79 KiB

Binary file not shown.

View File

@@ -1,2 +0,0 @@
from dashscope.common.error import InvalidParameter
from .gummy import GummyTranslator

View File

@@ -1 +0,0 @@
from .process import mergeChunkChannels, resampleRawChunk, resampleMonoChunk

View File

@@ -1,68 +0,0 @@
import samplerate
import numpy as np
def mergeChunkChannels(chunk, channels):
"""
将当前多通道音频数据块转换为单通道音频数据块
Args:
chunk: (bytes)多通道音频数据块
channels: 通道数
Returns:
(bytes)单通道音频数据块
"""
# (length * channels,)
chunk_np = np.frombuffer(chunk, dtype=np.int16)
# (length, channels)
chunk_np = chunk_np.reshape(-1, channels)
# (length,)
chunk_mono_f = np.mean(chunk_np.astype(np.float32), axis=1)
chunk_mono = np.round(chunk_mono_f).astype(np.int16)
return chunk_mono.tobytes()
def resampleRawChunk(chunk, channels, orig_sr, target_sr, mode="sinc_best"):
"""
将当前多通道音频数据块转换成单通道音频数据块,然后进行重采样
Args:
chunk: (bytes)多通道音频数据块
channels: 通道数
orig_sr: 原始采样率
target_sr: 目标采样率
mode: 重采样模式,可选:'sinc_best' | 'sinc_medium' | 'sinc_fastest' | 'zero_order_hold' | 'linear'
Return:
(bytes)单通道音频数据块
"""
# (length * channels,)
chunk_np = np.frombuffer(chunk, dtype=np.int16)
# (length, channels)
chunk_np = chunk_np.reshape(-1, channels)
# (length,)
chunk_mono_f = np.mean(chunk_np.astype(np.float32), axis=1)
chunk_mono = chunk_mono_f.astype(np.int16)
ratio = target_sr / orig_sr
chunk_mono_r = samplerate.resample(chunk_mono, ratio, converter_type=mode)
chunk_mono_r = np.round(chunk_mono_r).astype(np.int16)
return chunk_mono_r.tobytes()
def resampleMonoChunk(chunk, orig_sr, target_sr, mode="sinc_best"):
"""
将当前单通道音频块进行重采样
Args:
chunk: (bytes)单通道音频数据块
orig_sr: 原始采样率
target_sr: 目标采样率
mode: 重采样模式,可选:'sinc_best' | 'sinc_medium' | 'sinc_fastest' | 'zero_order_hold' | 'linear'
Return:
(bytes)单通道音频数据块
"""
chunk_np = np.frombuffer(chunk, dtype=np.int16)
ratio = target_sr / orig_sr
chunk_r = samplerate.resample(chunk_np, ratio, converter_type=mode)
chunk_r = np.round(chunk_r).astype(np.int16)
return chunk_r.tobytes()

View File

@@ -1,58 +0,0 @@
import sys
import argparse
if sys.platform == 'win32':
from sysaudio.win import AudioStream
elif sys.platform == 'darwin':
from sysaudio.darwin import AudioStream
elif sys.platform == 'linux':
from sysaudio.linux import AudioStream
else:
raise NotImplementedError(f"Unsupported platform: {sys.platform}")
from audioprcs import mergeChunkChannels
from audio2text import InvalidParameter, GummyTranslator
def convert_audio_to_text(s_lang, t_lang, audio_type, chunk_rate, api_key):
sys.stdout.reconfigure(line_buffering=True) # type: ignore
stream = AudioStream(audio_type, chunk_rate)
if t_lang == 'none':
gummy = GummyTranslator(stream.RATE, s_lang, None, api_key)
else:
gummy = GummyTranslator(stream.RATE, s_lang, t_lang, api_key)
stream.openStream()
gummy.start()
while True:
try:
chunk = stream.read_chunk()
chunk_mono = mergeChunkChannels(chunk, stream.CHANNELS)
try:
gummy.send_audio_frame(chunk_mono)
except InvalidParameter:
gummy.start()
gummy.send_audio_frame(chunk_mono)
except KeyboardInterrupt:
stream.closeStream()
gummy.stop()
break
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='Convert system audio stream to text')
parser.add_argument('-s', '--source_language', default='en', help='Source language code')
parser.add_argument('-t', '--target_language', default='zh', help='Target language code')
parser.add_argument('-a', '--audio_type', default=0, help='Audio stream source: 0 for output audio stream, 1 for input audio stream')
parser.add_argument('-c', '--chunk_rate', default=20, help='The number of audio stream chunks collected per second.')
parser.add_argument('-k', '--api_key', default='', help='API KEY for Gummy model')
args = parser.parse_args()
convert_audio_to_text(
args.source_language,
args.target_language,
int(args.audio_type),
int(args.chunk_rate),
args.api_key
)

View File

@@ -1,38 +0,0 @@
# -*- mode: python ; coding: utf-8 -*-
a = Analysis(
['main-gummy.py'],
pathex=[],
binaries=[],
datas=[],
hiddenimports=[],
hookspath=[],
hooksconfig={},
runtime_hooks=[],
excludes=[],
noarchive=False,
optimize=0,
)
pyz = PYZ(a.pure)
exe = EXE(
pyz,
a.scripts,
a.binaries,
a.datas,
[],
name='main-gummy',
debug=False,
bootloader_ignore_signals=False,
strip=False,
upx=True,
upx_exclude=[],
runtime_tmpdir=None,
console=True,
disable_windowed_traceback=False,
argv_emulation=False,
target_arch=None,
codesign_identity=None,
entitlements_file=None,
)

View File

@@ -1,83 +0,0 @@
import sys
import json
import argparse
from datetime import datetime
import numpy.core.multiarray
if sys.platform == 'win32':
from sysaudio.win import AudioStream
elif sys.platform == 'darwin':
from sysaudio.darwin import AudioStream
elif sys.platform == 'linux':
from sysaudio.linux import AudioStream
else:
raise NotImplementedError(f"Unsupported platform: {sys.platform}")
from vosk import Model, KaldiRecognizer, SetLogLevel
from audioprcs import resampleRawChunk
SetLogLevel(-1)
def convert_audio_to_text(audio_type, chunk_rate, model_path):
sys.stdout.reconfigure(line_buffering=True) # type: ignore
if model_path.startswith('"'):
model_path = model_path[1:]
if model_path.endswith('"'):
model_path = model_path[:-1]
model = Model(model_path)
recognizer = KaldiRecognizer(model, 16000)
stream = AudioStream(audio_type, chunk_rate)
stream.openStream()
time_str = ''
cur_id = 0
prev_content = ''
while True:
chunk = stream.read_chunk()
chunk_mono = resampleRawChunk(chunk, stream.CHANNELS, stream.RATE, 16000)
caption = {}
if recognizer.AcceptWaveform(chunk_mono):
content = json.loads(recognizer.Result()).get('text', '')
caption['index'] = cur_id
caption['text'] = content
caption['time_s'] = time_str
caption['time_t'] = datetime.now().strftime('%H:%M:%S.%f')[:-3]
caption['translation'] = ''
prev_content = ''
cur_id += 1
else:
content = json.loads(recognizer.PartialResult()).get('partial', '')
if content == '' or content == prev_content:
continue
if prev_content == '':
time_str = datetime.now().strftime('%H:%M:%S.%f')[:-3]
caption['index'] = cur_id
caption['text'] = content
caption['time_s'] = time_str
caption['time_t'] = datetime.now().strftime('%H:%M:%S.%f')[:-3]
caption['translation'] = ''
prev_content = content
try:
json_str = json.dumps(caption) + '\n'
sys.stdout.write(json_str)
sys.stdout.flush()
except Exception as e:
print(e)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='Convert system audio stream to text')
parser.add_argument('-a', '--audio_type', default=0, help='Audio stream source: 0 for output audio stream, 1 for input audio stream')
parser.add_argument('-c', '--chunk_rate', default=20, help='The number of audio stream chunks collected per second.')
parser.add_argument('-m', '--model_path', default='', help='The path to the vosk model.')
args = parser.parse_args()
convert_audio_to_text(
int(args.audio_type),
int(args.chunk_rate),
args.model_path
)

View File

@@ -1,7 +0,0 @@
dashscope
numpy
samplerate
PyAudio
PyAudioWPatch # Windows only
vosk
pyinstaller

View File

@@ -1,85 +0,0 @@
"""获取 MacOS 系统音频输入/输出流"""
import pyaudio
class AudioStream:
"""
获取系统音频流(支持 BlackHole 作为系统音频输出捕获)
初始化参数:
audio_type: 0-系统音频输出流(需配合 BlackHole1-系统音频输入流
chunk_rate: 每秒采集音频块的数量默认为20
"""
def __init__(self, audio_type=0, chunk_rate=20):
self.audio_type = audio_type
self.mic = pyaudio.PyAudio()
if self.audio_type == 0:
self.device = self.getOutputDeviceInfo()
else:
self.device = self.mic.get_default_input_device_info()
self.stream = None
self.SAMP_WIDTH = pyaudio.get_sample_size(pyaudio.paInt16)
self.FORMAT = pyaudio.paInt16
self.CHANNELS = self.device["maxInputChannels"]
self.RATE = int(self.device["defaultSampleRate"])
self.CHUNK = self.RATE // chunk_rate
self.INDEX = self.device["index"]
def getOutputDeviceInfo(self):
"""查找指定关键词的输入设备"""
device_count = self.mic.get_device_count()
for i in range(device_count):
dev_info = self.mic.get_device_info_by_index(i)
if 'blackhole' in dev_info["name"].lower():
return dev_info
raise Exception("The device containing BlackHole was not found.")
def printInfo(self):
dev_info = f"""
采样输入设备:
- 设备类型:{ "音频输出" if self.audio_type == 0 else "音频输入" }
- 序号:{self.device['index']}
- 名称:{self.device['name']}
- 最大输入通道数:{self.device['maxInputChannels']}
- 默认低输入延迟:{self.device['defaultLowInputLatency']}s
- 默认高输入延迟:{self.device['defaultHighInputLatency']}s
- 默认采样率:{self.device['defaultSampleRate']}Hz
音频样本块大小:{self.CHUNK}
样本位宽:{self.SAMP_WIDTH}
采样格式:{self.FORMAT}
音频通道数:{self.CHANNELS}
音频采样率:{self.RATE}
"""
print(dev_info)
def openStream(self):
"""
打开并返回系统音频输出流
"""
if self.stream: return self.stream
self.stream = self.mic.open(
format = self.FORMAT,
channels = int(self.CHANNELS),
rate = self.RATE,
input = True,
input_device_index = int(self.INDEX)
)
return self.stream
def read_chunk(self):
"""
读取音频数据
"""
if not self.stream: return None
return self.stream.read(self.CHUNK, exception_on_overflow=False)
def closeStream(self):
"""
关闭系统音频输出流
"""
if self.stream is None: return
self.stream.stop_stream()
self.stream.close()
self.stream = None

View File

@@ -1,73 +0,0 @@
"""获取 Linux 系统音频输入流"""
import pyaudio
class AudioStream:
"""
获取系统音频流
初始化参数:
audio_type: 0-系统音频输出流不支持不会生效1-系统音频输入流(默认)
chunk_rate: 每秒采集音频块的数量默认为20
"""
def __init__(self, audio_type=1, chunk_rate=20):
self.audio_type = audio_type
self.mic = pyaudio.PyAudio()
self.device = self.mic.get_default_input_device_info()
self.stream = None
self.SAMP_WIDTH = pyaudio.get_sample_size(pyaudio.paInt16)
self.FORMAT = pyaudio.paInt16
self.CHANNELS = self.device["maxInputChannels"]
self.RATE = int(self.device["defaultSampleRate"])
self.CHUNK = self.RATE // chunk_rate
self.INDEX = self.device["index"]
def printInfo(self):
dev_info = f"""
采样输入设备:
- 设备类型:{ "音频输入Linux平台目前仅支持该项" }
- 序号:{self.device['index']}
- 名称:{self.device['name']}
- 最大输入通道数:{self.device['maxInputChannels']}
- 默认低输入延迟:{self.device['defaultLowInputLatency']}s
- 默认高输入延迟:{self.device['defaultHighInputLatency']}s
- 默认采样率:{self.device['defaultSampleRate']}Hz
音频样本块大小:{self.CHUNK}
样本位宽:{self.SAMP_WIDTH}
采样格式:{self.FORMAT}
音频通道数:{self.CHANNELS}
音频采样率:{self.RATE}
"""
print(dev_info)
def openStream(self):
"""
打开并返回系统音频输出流
"""
if self.stream: return self.stream
self.stream = self.mic.open(
format = self.FORMAT,
channels = int(self.CHANNELS),
rate = self.RATE,
input = True,
input_device_index = int(self.INDEX)
)
return self.stream
def read_chunk(self):
"""
读取音频数据
"""
if not self.stream: return None
return self.stream.read(self.CHUNK)
def closeStream(self):
"""
关闭系统音频输出流
"""
if self.stream is None: return
self.stream.stop_stream()
self.stream.close()
self.stream = None

View File

@@ -87,3 +87,89 @@
### 优化体验
- 字幕窗口右上角图标的颜色改为和字幕原文字体颜色一致
## v0.5.0
2025-07-15
为软件本体添加了更多功能、适配了 Linux。
### 新增功能
- 适配了 Linux 平台
- 新增修改字幕时间功能,可调整字幕时间
- 支持导出 srt 格式的字幕记录
- 支持显示字幕引擎状态pid、ppid、CPU占用率、内存占用、运行时间
### 优化体验
- 调整字幕窗口右上角图标为竖向排布
- 过滤 Gummy 字幕引擎输出的不完整字幕
## v0.5.1
2025-07-17
### 修复 bug
- 修复无法调用自定义字幕引擎的 bug
- 修复自定义字幕引擎的参数失效 bug
## v0.6.0
2025-07-29
### 新增功能
- 新增字幕记录排序功能,可选择字幕记录正序或倒叙显示
### 优化体验
- 减小了软件安装包的体积
- 微调字幕引擎设置界面布局
- 交换窗口界面信息弹窗和错误弹窗的位置,防止提示信息挡住操作
- 提高程序健壮性,完全避免字幕引擎进程成为孤儿进程
- 修改字幕引擎文档,添加更详细的开发说明
### 项目优化
- 重构字幕引擎,提示字幕引擎代码的可扩展性和可读性
- 合并 Gummy 和 Vosk 引擎为单个可执行文件
- 字幕引擎和主程序添加 Socket 通信,完全避免字幕引擎成为孤儿进程
## v0.7.0
2025-08-20
### 新增功能
- 添加字幕窗口宽度记忆,重新打开时与上次字幕窗口宽度一致
- 在尝试关闭字幕引擎 4s 后字幕引擎仍未关闭,则强制关闭字幕引擎
- 添加复制最新字幕选项用户可以选择只复制最近1~3条字幕 (#13)
- 添加主题颜色设置,支持六种颜色:蓝色、绿色、橙色、紫色、粉色、暗色/明色
- 添加日志记录显示:可以查看软件的字幕引擎输出的日志记录
### 优化体验
- 优化软件用户界面的部分组件
- 更清晰的日志输出
## v1.0.0
2025-09-08
### 新增功能
- 字幕引擎添加超时关闭功能:如果在规定时间字幕引擎没有启动成功会自动关闭;在字幕引擎启动过程中可选择关闭字幕引擎
- 添加非实时翻译功能:支持调用 Ollama 本地模型进行翻译;支持调用 Google 翻译 API 进行翻译
- 添加新的翻译模型:添加 SOSV 模型,支持识别英语、中文、日语、韩语、粤语
- 添加录音功能:可以将字幕引擎识别的音频保存为 .wav 文件
- 添加多行字幕功能,用户可以设置字幕窗口显示的字幕的行数
### 优化体验
- 优化部分提示信息显示位置
- 替换重采样模型,提高音频重采样质量
- 带有额外信息的标签颜色改为与主题色一致

View File

@@ -10,12 +10,28 @@
- [x] 适配 macOS 平台 *2025/07/08*
- [x] 添加字幕文字描边 *2025/07/09*
- [x] 添加基于 Vosk 的字幕引擎 *2025/07/09*
- [x] 适配 Linux 平台 *2025/07/13*
- [x] 字幕窗口右上角图标改为竖向排布 *2025/07/14*
- [x] 可以调整字幕时间轴 *2025/07/14*
- [x] 可以导出 srt 格式的字幕记录 *2025/07/14*
- [x] 可以获取字幕引擎的系统资源消耗情况 *2025/07/15*
- [x] 添加字幕记录按时间降序排列选择 *2025/07/26*
- [x] 重构字幕引擎 *2025/07/28*
- [x] 优化前端界面提示消息 *2025/07/29*
- [x] 复制字幕记录可选择只复制最近的字幕记录 *2025/08/18*
- [x] 添加颜色主题设置 *2025/08/18*
- [x] 前端页面添加日志内容展示 *2025/08/19*
- [x] 添加 Ollama 模型用于本地字幕引擎的翻译 *2025/09/04*
- [x] 验证 / 添加基于 sherpa-onnx 的字幕引擎 *2025/09/06*
## 待完成
- [ ] 添加 Ollama 模型用于本地字幕引擎的翻译
- [ ] 添加本地字幕引擎
- [ ] 验证 / 添加基于 FunASR 的字幕引擎
- [ ] 调研更多的云端模型火山、OpenAI、Google等
- [ ] 验证 / 添加基于 sherpa-onnx 的字幕引擎
## 后续计划
- [ ] 验证 / 添加基于 FunASR 的字幕引擎
- [ ] 减小软件不必要的体积
## 遥远的未来

View File

@@ -0,0 +1,144 @@
# caption engine api-doc
本文档主要介绍字幕引擎和 Electron 主进程进程的通信约定。
## 原理说明
本项目的 Python 进程通过标准输出向 Electron 主进程发送数据。Python 进程标准输出 (`sys.stdout`) 的内容一定为一行一行的字符串。且每行字符串均可以解释为一个 JSON 对象。每个 JSON 对象一定有 `command` 参数。
Electron 主进程通过 TCP Socket 向 Python 进程发送数据。发送的数据均是转化为字符串的对象,对象格式一定为:
```js
{
command: string,
content: string
}
```
## 标准输出约定
> 数据传递方向:字幕引擎进程 => Electron 主进程
当 JSON 对象的 `command` 参数为下列值时,表示的对应的含义:
### `connect`
```js
{
command: "connect",
content: ""
}
```
字幕引擎 TCP Socket 服务已经准备好,命令 Electron 主进程连接字幕引擎 Socket 服务
### `kill`
```js
{
command: "connect",
content: ""
}
```
命令 Electron 主进程强制结束字幕引擎进程。
### `caption`
```js
{
command: "caption",
index: number,
time_s: string,
time_t: string,
text: string,
translation: string
}
```
Python 端监听到的音频流转换为的字幕数据。
### `translation`
```js
{
command: "translation",
time_s: string,
text: string,
translation: string
}
```
语音识别的内容的翻译,可以根据起始时间确定对应的字幕。
### `print`
```js
{
command: "print",
content: string
}
```
输出 Python 端打印的内容,不计入日志。
### `info`
```js
{
command: "info",
content: string
}
```
Python 端打印的提示信息,会计入日志。
### `warn`
```js
{
command: "warn",
content: string
}
```
Python 端打印的警告信息,会计入日志。
### `error`
```js
{
command: "error",
content: string
}
```
Python 端打印的错误信息,该错误信息会在前端弹窗显示。
### `usage`
```js
{
command: "usage",
content: string
}
```
Gummy 字幕引擎结束时打印计费消耗信息。
## TCP Socket
> 数据传递方向Electron 主进程 => 字幕引擎进程
当 JSON 对象的 `command` 参数为下列值时,表示的对应的含义:
### `stop`
```js
{
command: "stop",
content: ""
}
```
命令当前字幕引擎停止监听并结束任务。

View File

@@ -57,6 +57,19 @@
- 发送:无数据
- 接收:`string`
### `control.engine.info`
**介绍:** 获取字幕引擎的资源消耗情况
**发起方:** 前端控制窗口
**接收方:** 后端控制窗口实例
**数据类型:**
- 发送:无数据
- 接收:`EngineInfo`
## 前端 ==> 后端
### `control.uiLanguage.change`
@@ -71,7 +84,7 @@
### `control.uiTheme.change`
**介绍:** 前端修改界面主题,将修改同步给后端
**介绍:** 前端修改界面主题,将修改同步给后端
**发起方:** 前端控制窗口
@@ -79,6 +92,16 @@
**数据类型:** `UITheme`
### `control.uiColor.change`
**介绍:** 前端修改界面主题颜色,将修改同步给后端
**发起方:** 前端控制窗口
**接收方:** 后端控制窗口实例
**数据类型:** `string`
### `control.leftBarWidth.change`
**介绍:** 前端修改边栏宽度,将修改同步给后端
@@ -159,6 +182,16 @@
**数据类型:** 无数据
### `control.engine.forceKill`
**介绍:** 强制关闭启动超时的字幕引擎
**发起方:** 前端控制窗口
**接收方:** 后端控制窗口实例
**数据类型:** 无数据
### `caption.windowHeight.change`
**介绍:** 字幕窗口宽度发生改变
@@ -223,13 +256,13 @@
### `control.engine.started`
**介绍:** 引擎启动成功
**介绍:** 引擎启动成功,参数为引擎的进程 ID
**发起方:** 后端
**接收方:** 前端控制窗口
**数据类型:** 无数据
**数据类型:** `number`
### `control.engine.stopped`
@@ -261,6 +294,16 @@
**数据类型:** `Controls`
### `control.softwareLog.add`
**介绍:** 添加一条新的日志数据
**发起方:** 后端
**接收方:** 前端控制窗口
**数据类型:** `SoftwareLog`
### `both.styles.set`
**介绍:** 后端将最新字幕样式发送给前端,前端进行设置

View File

@@ -1,156 +1,231 @@
# Caption Engine Documentation
Corresponding Version: v0.4.0
Corresponding version: v1.0.0
![](../../assets/media/structure_en.png)
## Introduction to the Caption Engine
The so-called caption engine is actually a subprogram that captures real-time streaming data from the system's audio input (recording) or output (playing sound) and calls an audio-to-text model to generate captions for the corresponding audio. The generated captions are converted into a JSON-formatted string and passed to the main program through standard output (it must be ensured that the string read by the main program can be correctly interpreted as a JSON object). The main program reads and interprets the caption data, processes it, and then displays it on the window.
The so-called caption engine is actually a subprocess that captures streaming data from system audio input (microphone) or output (speaker) in real-time, and invokes an audio-to-text model to generate captions for the corresponding audio. The generated captions are converted into JSON-formatted string data and transmitted to the main program through standard output (ensuring that the string received by the main program can be correctly interpreted as a JSON object). The main program reads and interprets the caption data, processes it, and displays it in the window.
## Functions Required by the Caption Engine
**Communication between the caption engine process and Electron main process follows the standard: [caption engine api-doc](../api-docs/caption-engine.md).**
## Execution Flow
Process of communication between main process and caption engine:
### Starting the Engine
- Electron main process: Use `child_process.spawn()` to start the caption engine process
- Caption engine process: Create a TCP Socket server thread, after creation output a JSON object converted to string via standard output, containing the `command` field with value `connect`
- Main process: Listen to the caption engine process's standard output, try to split the standard output by lines, parse it into a JSON object, and check if the object's `command` field value is `connect`. If so, connect to the TCP Socket server
### Caption Recognition
- Caption engine process: Create a new thread to monitor system audio output, put acquired audio data chunks into a shared queue (`shared_data.chunk_queue`). The caption engine continuously reads audio data chunks from the shared queue and parses them. The caption engine may also create a new thread to perform translation operations. Finally, the caption engine sends parsed caption data object strings through standard output
- Electron main process: Continuously listen to the caption engine's standard output and take different actions based on the parsed object's `command` field
### Stopping the Engine
- Electron main process: When the user operates to close the caption engine in the frontend, the main process sends an object string with `command` field set to `stop` to the caption engine process through Socket communication
- Caption engine process: Receive the caption data object string sent by the main engine process, parse the string into an object. If the object's `command` field is `stop`, set the value of global variable `shared_data.status` to `stop`
- Caption engine process: Main thread continuously monitors system audio output, when `thread_data.status` value is not `running`, end the loop, release resources, and terminate execution
- Electron main process: If the caption engine process termination is detected, perform corresponding processing and provide feedback to the frontend
## Implemented Features
The following features have been implemented and can be directly reused.
### Standard Output
Can output regular information, commands, and error messages.
Examples:
```python
from utils import stdout, stdout_cmd, stdout_obj, stderr
# {"command": "print", "content": "Hello"}\n
stdout("Hello")
# {"command": "connect", "content": "8080"}\n
stdout_cmd("connect", "8080")
# {"command": "print", "content": "print"}\n
stdout_obj({"command": "print", "content": "print"})
# sys.stderr.write("Error Info" + "\n")
stderr("Error Info")
```
### Creating Socket Service
This Socket service listens on a specified port, parses content sent by the Electron main program, and may change the value of `shared_data.status`.
Example:
```python
from utils import start_server
from utils import shared_data
port = 8080
start_server(port)
while thread_data == 'running':
# do something
pass
```
### Audio Acquisition
First, your caption engine needs to capture streaming data from the system's audio input (recording) or output (playing sound). If using Python for development, you can use the PyAudio library to obtain microphone audio input data (cross-platform). Use the PyAudioWPatch library to get system audio output (Windows platform only).
The `AudioStream` class is used to acquire audio data, with cross-platform implementation supporting Windows, Linux, and macOS. The class initialization includes two parameters:
Generally, the captured audio stream data consists of short audio chunks, and the size of these chunks should be adjusted according to the model. For example, Alibaba Cloud's Gummy model performs better with 0.05-second audio chunks compared to 0.2-second ones.
- `audio_type`: Audio acquisition type, 0 for system output audio (speaker), 1 for system input audio (microphone)
- `chunk_rate`: Audio data acquisition frequency, number of audio chunks acquired per second, default is 10
The class contains four methods:
- `open_stream()`: Start audio acquisition
- `read_chunk() -> bytes`: Read an audio chunk
- `close_stream()`: Close audio acquisition
- `close_stream_signal()`: Thread-safe closing of system audio input stream
Example:
```python
from sysaudio import AudioStream
audio_type = 0
chunk_rate = 20
stream = AudioStream(audio_type, chunk_rate)
stream.open_stream()
while True:
data = stream.read_chunk()
# do something with data
pass
stream.close_stream()
```
### Audio Processing
The acquired audio stream may need preprocessing before being converted to text. For instance, Alibaba Cloud's Gummy model can only recognize single-channel audio streams, while the collected audio streams are typically dual-channel, thus requiring conversion from dual-channel to single-channel. Channel conversion can be achieved using methods in the NumPy library.
Before converting audio streams to text, preprocessing may be required. Usually, multi-channel audio needs to be converted to single-channel audio, and resampling may also be needed. This project provides two audio processing functions:
You can directly use the audio acquisition (`caption-engine/sysaudio`) and audio processing (`caption-engine/audioprcs`) modules I have developed.
- `merge_chunk_channels(chunk: bytes, channels: int) -> bytes`: Convert multi-channel audio chunks to single-channel audio chunks
- `resample_chunk_mono(chunk: bytes, channels: int, orig_sr: int, target_sr: int) -> bytes`: Convert current multi-channel audio data chunks to single-channel audio data chunks, then perform resampling
Example:
```python
from sysaudio import AudioStream
from utils import merge_chunk_channels
stream = AudioStream(1)
while True:
raw_chunk = stream.read_chunk()
chunk = resample_chunk_mono(raw_chunk, stream.CHANNELS, stream.RATE, 16000)
# do something with chunk
```
## Features to be Implemented by the Caption Engine
### Audio to Text Conversion
After obtaining the appropriate audio stream, you can convert it into text. This is generally done using various models based on your requirements.
After obtaining suitable audio streams, the audio stream needs to be converted to text. Generally, various models (cloud or local) are used to implement audio-to-text conversion. Appropriate models should be selected according to requirements.
A nearly complete implementation of a caption engine is as follows:
It is recommended to encapsulate this as a class, implementing four methods:
```python
import sys
import argparse
- `start(self)`: Start the model
- `send_audio_frame(self, data: bytes)`: Process current audio chunk data, **generated caption data is sent to Electron main process through standard output**
- `translate(self)`: Continuously retrieve data chunks from `shared_data.chunk_queue` and call `send_audio_frame` method to process data chunks
- `stop(self)`: Stop the model
# Import system audio acquisition module
if sys.platform == 'win32':
from sysaudio.win import AudioStream
elif sys.platform == 'darwin':
from sysaudio.darwin import AudioStream
elif sys.platform == 'linux':
from sysaudio.linux import AudioStream
else:
raise NotImplementedError(f"Unsupported platform: {sys.platform}")
Complete caption engine examples:
# Import audio processing functions
from audioprcs import mergeChunkChannels
# Import audio-to-text module
from audio2text import InvalidParameter, GummyTranslator
def convert_audio_to_text(s_lang, t_lang, audio_type, chunk_rate, api_key):
# Set standard output to line buffering
sys.stdout.reconfigure(line_buffering=True) # type: ignore
# Create instances for audio acquisition and speech-to-text
stream = AudioStream(audio_type, chunk_rate)
if t_lang == 'none':
gummy = GummyTranslator(stream.RATE, s_lang, None, api_key)
else:
gummy = GummyTranslator(stream.RATE, s_lang, t_lang, api_key)
# Start instances
stream.openStream()
gummy.start()
while True:
try:
# Read audio stream data
chunk = stream.read_chunk()
chunk_mono = mergeChunkChannels(chunk, stream.CHANNELS)
try:
# Call the model for translation
gummy.send_audio_frame(chunk_mono)
except InvalidParameter:
gummy.start()
gummy.send_audio_frame(chunk_mono)
except KeyboardInterrupt:
stream.closeStream()
gummy.stop()
break
```
- [gummy.py](../../engine/audio2text/gummy.py)
- [vosk.py](../../engine/audio2text/vosk.py)
- [sosv.py](../../engine/audio2text/sosv.py)
### Caption Translation
Some speech-to-text models don't provide translation functionality, requiring an additional translation module. This part can use either cloud-based translation APIs or local translation models.
Some speech-to-text models do not provide translation. If needed, an additional translation module needs to be added, or built-in translation modules can be used.
### Data Transmission
Example:
After obtaining the text of the current audio stream, it needs to be transmitted to the main program. The caption engine process passes the caption data to the Electron main process through standard output.
```python
from utils import google_translate, ollama_translate
text = "This is a translation test."
google_translate("", "en", text, "time_s")
ollama_translate("qwen3:0.6b", "en", text, "time_s")
```
The content transmitted must be a JSON string, where the JSON object must contain the following parameters:
### Caption Data Transmission
After obtaining the text from the current audio stream, the text needs to be sent to the main program. The caption engine process transmits caption data to the Electron main process through standard output.
The transmitted content must be a JSON string, where the JSON object needs to contain the following parameters:
```typescript
export interface CaptionItem {
index: number, // Caption sequence number
time_s: string, // Caption start time
time_t: string, // Caption end time
text: string, // Caption content
translation: string // Caption translation
command: "caption",
index: number, // Caption sequence number
time_s: string, // Current caption start time
time_t: string, // Current caption end time
text: string, // Caption content
translation: string // Caption translation
}
```
**It is essential to ensure that each time we output caption JSON data, the buffer is flushed, ensuring that the string received by the Electron main process can always be interpreted as a JSON object.**
**Note that the buffer must be flushed after each caption JSON data output to ensure that the Electron main process receives strings that can be interpreted as JSON objects each time.** It is recommended to use the project's existing `stdout_obj` function for transmission.
If using Python, you can refer to the following method to pass data to the main program:
### Command Line Parameter Specification
Custom caption engine settings provide command line parameter specification, so the caption engine parameters need to be set properly. Currently used parameters in this project are as follows:
```python
# caption-engine\main-gummy.py
sys.stdout.reconfigure(line_buffering=True)
# caption-engine\audio2text\gummy.py
...
def send_to_node(self, data):
"""
Send data to the Node.js process
"""
try:
json_data = json.dumps(data) + '\n'
sys.stdout.write(json_data)
sys.stdout.flush()
except Exception as e:
print(f"Error sending data to Node.js: {e}", file=sys.stderr)
...
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='Convert system audio stream to text')
# all
parser.add_argument('-e', '--caption_engine', default='gummy', help='Caption engine: gummy or vosk')
parser.add_argument('-a', '--audio_type', default=0, help='Audio stream source: 0 for output, 1 for input')
parser.add_argument('-c', '--chunk_rate', default=10, help='Number of audio stream chunks collected per second')
parser.add_argument('-p', '--port', default=0, help='The port to run the server on, 0 for no server')
parser.add_argument('-t', '--target_language', default='zh', help='Target language code, "none" for no translation')
parser.add_argument('-r', '--record', default=0, help='Whether to record the audio, 0 for no recording, 1 for recording')
parser.add_argument('-rp', '--record_path', default='', help='Path to save the recorded audio')
# gummy and sosv
parser.add_argument('-s', '--source_language', default='auto', help='Source language code')
# gummy only
parser.add_argument('-k', '--api_key', default='', help='API KEY for Gummy model')
# vosk and sosv
parser.add_argument('-tm', '--translation_model', default='ollama', help='Model for translation: ollama or google')
parser.add_argument('-omn', '--ollama_name', default='', help='Ollama model name for translation')
# vosk only
parser.add_argument('-vosk', '--vosk_model', default='', help='The path to the vosk model.')
# sosv only
parser.add_argument('-sosv', '--sosv_model', default=None, help='The SenseVoice model path')
```
Data receiver code is as follows:
For example, for this project's caption engine, if I want to use the Gummy model, specify the original text as Japanese, translate to Chinese, and capture captions from system audio output, with 0.1s audio data segments each time, the command line parameters would be:
```typescript
// src\main\utils\engine.ts
...
this.process.stdout.on('data', (data) => {
const lines = data.toString().split('\n');
lines.forEach((line: string) => {
if (line.trim()) {
try {
const caption = JSON.parse(line);
addCaptionLog(caption);
} catch (e) {
controlWindow.sendErrorMessage('Unable to parse the output from the caption engine as a JSON object: ' + e)
console.error('[ERROR] Error parsing JSON:', e);
}
}
});
});
this.process.stderr.on('data', (data) => {
controlWindow.sendErrorMessage('Caption engine error: ' + data)
console.error(`[ERROR] Subprocess Error: ${data}`);
});
...
```bash
python main.py -e gummy -s ja -t zh -a 0 -c 10 -k <dashscope-api-key>
```
## Reference Code
## Additional Notes
The `main-gummy.py` file under the `caption-engine` folder in this project serves as the entry point for the default caption engine. The `src\main\utils\engine.ts` file contains the server-side code for acquiring and processing data from the caption engine. You can read and understand the implementation details and the complete execution process of the caption engine as needed.
### Communication Standards
[caption engine api-doc](../api-docs/caption-engine.md)
### Program Entry
[main.py](../../engine/main.py)
### Development Recommendations
Except for audio-to-text conversion and translation, other components (audio acquisition, audio resampling, and communication with the main process) are recommended to directly reuse the project's code. If this approach is taken, the content that needs to be added includes:
- `engine/audio2text/`: Add a new audio-to-text class (file-level).
- `engine/main.py`: Add new parameter settings and workflow functions (refer to `main_gummy` and `main_vosk` functions).
### Packaging
After development and testing, the caption engine must be packaged into an executable. Typically, `pyinstaller` is used. If the packaged executable reports errors, check for missing dependencies.
### Execution
With a functional caption engine, it can be launched in the caption software window by specifying the engine's path and runtime arguments.
![](../img/02_en.png)

View File

@@ -1,128 +1,203 @@
# 字幕エンジン説明文書
# 字幕エンジン説明ドキュメント
対応バージョンv0.4.0
## 注意:このドキュメントはメンテナンスが行われていないため、記載されている情報は古くなっています。最新の情報については、[中国語版](./zh.md)または[英語版](./en.md)のドキュメントをご参照ください。
対応バージョンv0.6.0
この文書は大規模モデルを使用して翻訳されていますので、内容に正確でない部分があるかもしれません。
![](../../assets/media/structure_ja.png)
## 字幕エンジン紹介
## 字幕エンジン紹介
所謂字幕エンジンは実際にはサブプログラムであり、システムの音声入力(録音)または出力(音声再生)のストリーミングデータをリアルタイムで取得し、音声からテキストへの変換モデルを使って対応する音声の字幕を生成します。生成された字幕はJSON形式の文字列データに変換され、標準出力を通じてメインプログラムに渡されます(メインプログラムが読み取った文字列が正しJSONオブジェクトとして解釈されることが保証される必要があります)。メインプログラムは字幕データを読み取り、解釈して処理し、ウィンドウに表示します。
字幕エンジンとは、システムのオーディオ入力(マイク)または出力(スピーカー)のストリーミングデータをリアルタイムで取得し、音声を文字に変換するモデルを呼び出して対応する字幕を生成するサブプログラムです。生成された字幕はJSON形式の文字列データに変換され、標準出力を介してメインプログラムに渡されます(メインプログラムが受け取る文字列が正しJSONオブジェクトとして解釈できる必要があります)。メインプログラムは字幕データを読み取り、解釈して処理した後、ウィンドウに表示します。
## 字幕エンジンが必要な機能
**字幕エンジンプロセスとElectronメインプロセス間の通信は、[caption engine api-doc](../api-docs/caption-engine.md)に準拠しています。**
### 音声の取得
## 実行フロー
まず、あなたの字幕エンジンはシステムの音声入力録音または出力音声再生のストリーミングデータを取得する必要があります。Pythonを使用して開発する場合、PyAudioライブラリを使ってマイクからの音声入力データを取得できます全プラットフォーム共通。また、WindowsプラットフォームではPyAudioWPatchライブラリを使ってシステムの音声出力を取得することもできます。
メインプロセスと字幕エンジンの通信フロー:
一般的に取得される音声ストリームデータは、比較的短い時間間隔の音声ブロックで構成されています。モデルに合わせて音声ブロックのサイズを調整する必要があります。例えば、アリババクラウドのGummyモデルでは、0.05秒の音声ブロックを使用した認識結果の方が0.2秒の音声ブロックよりも優れています。
### エンジンの起動
### 音声の処理
- メインプロセス:`child_process.spawn()`を使用して字幕エンジンプロセスを起動
- 字幕エンジンプロセスTCP Socketサーバースレッドを作成し、作成後に標準出力にJSONオブジェクトを文字列化して出力。このオブジェクトには`command`フィールドが含まれ、値は`connect`
- メインプロセス字幕エンジンプロセスの標準出力を監視し、標準出力を行ごとに分割してJSONオブジェクトとして解析し、オブジェクトの`command`フィールドの値が`connect`かどうかを判断。`connect`の場合はTCP Socketサーバーに接続
取得した音声ストリームは、テキストに変換する前に前処理が必要な場合があります。例えば、アリババクラウドのGummyモデルは単一チャンネルの音声ストリームしか認識できませんが、収集された音声ストリームは通常二重チャンネルであるため、二重チャンネルの音声ストリームを単一チャンネルに変換する必要があります。チャンネル数の変換はNumPyライブラリのメソッドを使って行うことができます。
### 字幕認識
あなたは私によって開発された音声の取得(`caption-engine/sysaudio`)と音声の処理(`caption-engine/audioprcs`)モジュールを直接使用することができます。
- 字幕エンジンプロセス:メインスレッドでシステムオーディオ出力を監視し、オーディオデータブロックを字幕エンジンに送信して解析。字幕エンジンはオーディオデータブロックを解析し、標準出力を介して解析された字幕データオブジェクト文字列を送信
- メインプロセス:字幕エンジンの標準出力を引き続き監視し、解析されたオブジェクトの`command`フィールドに基づいて異なる操作を実行
### 音声からテキストへの変換
### エンジンの停止
適切な音声ストリームを得た後、それをテキストに変換することができます。通常、様々なモデルを使って音声ストリームをテキストに変換します。必要に応じてモデルを選択することができます。
- メインプロセスユーザーがフロントエンドで字幕エンジンを停止する操作を実行すると、メインプロセスはSocket通信を介して字幕エンジンプロセスに`command`フィールドが`stop`のオブジェクト文字列を送信
- 字幕エンジンプロセス:メインエンジンプロセスから送信された字幕データオブジェクト文字列を受信し、文字列をオブジェクトとして解析。オブジェクトの`command`フィールドが`stop`の場合、グローバル変数`thread_data.status`の値を`stop`に設定
- 字幕エンジンプロセス:メインスレッドでシステムオーディオ出力をループ監視し、`thread_data.status`の値が`running`でない場合、ループを終了し、リソースを解放して実行を終了
- メインプロセス:字幕エンジンプロセスの終了を検出した場合、対応する処理を実行し、フロントエンドにフィードバック
ほぼ完全な字幕エンジンの実装例:
## プロジェクトで実装済みの機能
以下の機能はすでに実装されており、直接再利用できます。
### 標準出力
通常情報、コマンド、エラー情報を出力できます。
サンプル:
```python
import sys
import argparse
# システム音声の取得に関する設定
if sys.platform == 'win32':
from sysaudio.win import AudioStream
elif sys.platform == 'darwin':
from sysaudio.darwin import AudioStream
elif sys.platform == 'linux':
from sysaudio.linux import AudioStream
else:
raise NotImplementedError(f"Unsupported platform: {sys.platform}")
# 音声処理関数のインポート
from audioprcs import mergeChunkChannels
# 音声からテキストへの変換モジュールのインポート
from audio2text import InvalidParameter, GummyTranslator
def convert_audio_to_text(s_lang, t_lang, audio_type, chunk_rate, api_key):
# 標準出力をラインバッファリングに設定
sys.stdout.reconfigure(line_buffering=True) # type: ignore
# 音声の取得と音声からテキストへの変換のインスタンスを作成
stream = AudioStream(audio_type, chunk_rate)
if t_lang == 'none':
gummy = GummyTranslator(stream.RATE, s_lang, None, api_key)
else:
gummy = GummyTranslator(stream.RATE, s_lang, t_lang, api_key)
# インスタンスを開始
stream.openStream()
gummy.start()
while True:
try:
# 音声ストリームデータを読み込む
chunk = stream.read_chunk()
chunk_mono = mergeChunkChannels(chunk, stream.CHANNELS)
try:
# モデルを使って翻訳を行う
gummy.send_audio_frame(chunk_mono)
except InvalidParameter:
gummy.start()
gummy.send_audio_frame(chunk_mono)
except KeyboardInterrupt:
stream.closeStream()
gummy.stop()
break
from utils import stdout, stdout_cmd, stdout_obj, stderr
stdout("Hello") # {"command": "print", "content": "Hello"}\n
stdout_cmd("connect", "8080") # {"command": "connect", "content": "8080"}\n
stdout_obj({"command": "print", "content": "Hello"})
stderr("Error Info")
```
### Socketサービスの作成
このSocketサービスは指定されたポートを監視し、Electronメインプログラムから送信された内容を解析し、`thread_data.status`の値を変更する可能性があります。
サンプル:
```python
from utils import start_server
from utils import thread_data
port = 8080
start_server(port)
while thread_data == 'running':
# 何か処理
pass
```
### オーディオ取得
`AudioStream`クラスはオーディオデータを取得するために使用され、Windows、Linux、macOSでクロスプラットフォームで実装されています。このクラスの初期化には2つのパラメータが含まれます
- `audio_type`取得するオーディオのタイプ。0はシステム出力オーディオスピーカー、1はシステム入力オーディオマイク
- `chunk_rate`オーディオデータの取得頻度。1秒あたりに取得するオーディオブロックの数
このクラスには3つのメソッドがあります
- `open_stream()`:オーディオ取得を開始
- `read_chunk() -> bytes`1つのオーディオブロックを読み取り
- `close_stream()`:オーディオ取得を閉じる
サンプル:
```python
from sysaudio import AudioStream
audio_type = 0
chunk_rate = 20
stream = AudioStream(audio_type, chunk_rate)
stream.open_stream()
while True:
data = stream.read_chunk()
# データで何か処理
pass
stream.close_stream()
```
### オーディオ処理
取得したオーディオストリームは、文字に変換する前に前処理が必要な場合があります。一般的に、マルチチャンネルオーディオをシングルチャンネルオーディオに変換し、リサンプリングが必要な場合もあります。このプロジェクトでは、3つのオーディオ処理関数を提供しています
- `merge_chunk_channels(chunk: bytes, channels: int) -> bytes`:マルチチャンネルオーディオブロックをシングルチャンネルオーディオブロックに変換
- `resample_chunk_mono(chunk: bytes, channels: int, orig_sr: int, target_sr: int, mode="sinc_best") -> bytes`:現在のマルチチャンネルオーディオデータブロックをシングルチャンネルオーディオデータブロックに変換し、リサンプリングを実行
- `resample_mono_chunk(chunk: bytes, orig_sr: int, target_sr: int, mode="sinc_best") -> bytes`:現在のシングルチャンネルオーディオブロックをリサンプリング
## 字幕エンジンで実装が必要な機能
### オーディオから文字への変換
適切なオーディオストリームを取得した後、オーディオストリームを文字に変換する必要があります。一般的に、さまざまなモデル(クラウドまたはローカル)を使用してオーディオストリームを文字に変換します。要件に応じて適切なモデルを選択する必要があります。
この部分はクラスとしてカプセル化することをお勧めします。以下の3つのメソッドを実装する必要があります
- `start(self)`:モデルを起動
- `send_audio_frame(self, data: bytes)`:現在のオーディオブロックデータを処理し、**生成された字幕データを標準出力を介してElectronメインプロセスに送信**
- `stop(self)`:モデルを停止
完全な字幕エンジンの実例:
- [gummy.py](../../engine/audio2text/gummy.py)
- [vosk.py](../../engine/audio2text/vosk.py)
### 字幕翻訳
音声認識モデルによっては翻訳機能を提供していないため、別途翻訳モジュールを追加する必要があります。この部分にはクラウドベースの翻訳APIを使用することも、ローカルの翻訳モデルを使用することも可能です。
一部の音声文字変換モデルは翻訳を提供していません。必要がある場合、翻訳モジュールを追加する必要があります。
### データの
### 字幕データの送
現在の音声ストリームのテキストを得たら、それをメインプログラムに渡す必要があります。字幕エンジンプロセスは標準出力を通じて電子メール主プロセスに字幕データを渡します。
現在のオーディオストリームのテキストを取得した後、そのテキストをメインプログラムに送信する必要があります。字幕エンジンプロセスは標準出力を介して字幕データをElectronメインプロセスに渡します。
渡す内容はJSON文字列でなければなりません。JSONオブジェクトには以下のパラメータを含める必要があります
送信する内容はJSON文字列でなければなりません。JSONオブジェクトには以下のパラメータを含める必要があります
```typescript
export interface CaptionItem {
index: number, // 字幕番号
time_s: string, // 現在の字幕開始時間
time_t: string, // 現在の字幕終了時間
text: string, // 字幕内容
translation: string // 字幕翻訳
command: "caption",
index: number, // 字幕のシーケンス番号
time_s: string, // 現在の字幕の開始時間
time_t: string, // 現在の字幕の終了時間
text: string, // 字幕の内容
translation: string // 字幕の翻訳
}
```
**必ず、字幕JSONデータを出力するたびにバッファをフラッシュし、electronプロセスが受け取る文字列が常にJSONオブジェクトとして解釈できるようにする必要があります。**
**JSONデータを出力するたびにバッファをフラッシュし、electronメインプロセスが受信する文字列が常にJSONオブジェクトとして解釈できるようにする必要があります。**
Python言語を使用する場合、以下の方法でデータをメインプログラムに渡すことができます
プロジェクトで既に実装されている`stdout_obj`関数を使用して送信することをお勧めします
### コマンドラインパラメータの指定
カスタム字幕エンジンの設定はコマンドラインパラメータで指定するため、字幕エンジンのパラメータを設定する必要があります。このプロジェクトで現在使用されているパラメータは以下のとおりです:
```python
# caption-engine\main-gummy.py
sys.stdout.reconfigure(line_buffering=True)
# caption-engine\audio2text\gummy.py
...
def send_to_node(self, data):
"""
Node.jsプロセスにデータを送信する
"""
try:
json_data = json.dumps(data) + '\n'
sys.stdout.write(json_data)
sys.stdout.flush()
except Exception as e:
print(f"Error sending data to Node.js: {e}", file=sys.stderr)
...
import argparse
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='システムオーディオストリームをテキストに変換')
# 共通
parser.add_argument('-e', '--caption_engine', default='gummy', help='字幕エンジン: gummyまたはvosk')
parser.add_argument('-a', '--audio_type', default=0, help='オーディオストリームソース: 0は出力、1は入力')
parser.add_argument('-c', '--chunk_rate', default=10, help='1秒あたりに収集するオーディオストリームブロックの数')
parser.add_argument('-p', '--port', default=8080, help='サーバーを実行するポート、0はサーバーなし')
# gummy専用
parser.add_argument('-s', '--source_language', default='en', help='ソース言語コード')
parser.add_argument('-t', '--target_language', default='zh', help='ターゲット言語コード')
parser.add_argument('-k', '--api_key', default='', help='GummyモデルのAPI KEY')
# vosk専用
parser.add_argument('-m', '--model_path', default='', help='voskモデルのパス')
```
データ受信側のコードは
たとえば、このプロジェクトの字幕エンジンでGummyモデルを使用し、原文を日本語、翻訳を中国語に指定し、システムオーディオ出力の字幕を取得し、毎回0.1秒のオーディオデータをキャプチャする場合、コマンドラインパラメータは以下のようになります:
```bash
python main.py -e gummy -s ja -t zh -a 0 -c 10 -k <dashscope-api-key>
```
## その他
### 通信規格
[caption engine api-doc](../api-docs/caption-engine.md)
### プログラムエントリ
[main.py](../../engine/main.py)
### 開発の推奨事項
オーディオから文字への変換以外は、このプロジェクトのコードを直接再利用することをお勧めします。その場合、追加する必要がある内容は:
- `engine/audio2text/`:新しいオーディオから文字への変換クラスを追加(ファイルレベル)
- `engine/main.py`:新しいパラメータ設定とプロセス関数を追加(`main_gummy`関数と`main_vosk`関数を参照)
### パッケージ化
字幕エンジンの開発とテストが完了した後、字幕エンジンを実行可能ファイルにパッケージ化する必要があります。一般的に`pyinstaller`を使用してパッケージ化します。パッケージ化された字幕エンジンファイルの実行でエラーが発生した場合、依存ライブラリが不足している可能性があります。不足している依存ライブラリを確認してください。
### 実行
使用可能な字幕エンジンを取得したら、字幕ソフトウェアウィンドウで字幕エンジンのパスと字幕エンジンの実行コマンド(パラメータ)を指定して字幕エンジンを起動できます。
![](../img/02_ja.png)

View File

@@ -1,156 +1,231 @@
# 字幕引擎说明文档
对应版本v0.4.0
对应版本v1.0.0
![](../../assets/media/structure_zh.png)
## 字幕引擎介绍
所谓的字幕引擎实际上是一个子程序,它会实时获取系统音频输入(录音)或输出(播放声音)的流式数据,并调用音频转文字的模型生成对应音频的字幕。生成的字幕转换为 JSON 格式的字符串数据,并通过标准输出传递给主程序(需要保证主程序读取到的字符串可以被正确解释为 JSON 对象)。主程序读取并解释字幕数据,处理后显示在窗口上。
所谓的字幕引擎实际上是一个子程序,它会实时获取系统音频输入(麦克风)或输出(扬声器)的流式数据,并调用音频转文字的模型生成对应音频的字幕。生成的字幕转换为 JSON 格式的字符串数据,并通过标准输出传递给主程序(需要保证主程序读取到的字符串可以被正确解释为 JSON 对象)。主程序读取并解释字幕数据,处理后显示在窗口上。
## 字幕引擎需要实现的功能
**字幕引擎进程和 Electron 主进程之间的通信遵循的标准为:[caption engine api-doc](../api-docs/caption-engine.md)。**
## 运行流程
主进程和字幕引擎通信的流程:
### 启动引擎
- Electron 主进程:使用 `child_process.spawn()` 启动字幕引擎进程
- 字幕引擎进程:创建 TCP Socket 服务器线程,创建后在标准输出中输出转化为字符串的 JSON 对象,该对象中包含 `command` 字段,值为 `connect`
- 主进程:监听字幕引擎进程的标准输出,尝试将标准输出按行分割,解析为 JSON 对象,并判断对象的 `command` 字段值是否为 `connect`,如果是则连接 TCP Socket 服务器
### 字幕识别
- 字幕引擎进程:新建线程监听系统音频输出,将获取的音频数据块放入共享队列中(`shared_data.chunk_queue`)。字幕引擎不断读取共享队列中的音频数据块并解析。字幕引擎还可能新建线程执行翻译操作。最后字幕引擎通过标准输出发送解析的字幕数据对象字符串
- Electron 主进程:持续监听字幕引擎的标准输出,并根据解析的对象的 `command` 字段采取不同的操作
### 关闭引擎
- Electron 主进程:当用户在前端操作关闭字幕引擎时,主进程通过 Socket 通信给字幕引擎进程发送 `command` 字段为 `stop` 的对象字符串
- 字幕引擎进程:接收主引擎进程发送的字幕数据对象字符串,将字符串解析为对象,如果对象的 `command` 字段为 `stop`,则将全局变量 `shared_data.status` 的值设置为 `stop`
- 字幕引擎进程:主线程循环监听系统音频输出,当 `thread_data.status` 的值不为 `running` 时,则结束循环,释放资源,结束运行
- Electron 主进程:如果检测到字幕引擎进程结束,进行相应处理,并向前端反馈
## 项目已经实现的功能
以下功能已经实现,可以直接复用。
### 标准输出
可以输出普通信息、命令和错误信息。
样例:
```python
from utils import stdout, stdout_cmd, stdout_obj, stderr
# {"command": "print", "content": "Hello"}\n
stdout("Hello")
# {"command": "connect", "content": "8080"}\n
stdout_cmd("connect", "8080")
# {"command": "print", "content": "print"}\n
stdout_obj({"command": "print", "content": "print"})
# sys.stderr.write("Error Info" + "\n")
stderr("Error Info")
```
### 创建 Socket 服务
该 Socket 服务会监听指定端口,会解析 Electron 主程序发送的内容,并可能改变 `shared_data.status` 的值。
样例:
```python
from utils import start_server
from utils import shared_data
port = 8080
start_server(port)
while thread_data == 'running':
# do something
pass
```
### 音频获取
首先,你的字幕引擎需要获取系统音频输入(录音)或输出(播放声音)的流式数据。如果使用 Python 开发,可以使用 PyAudio 库获取麦克风音频输入数据(全平台通用)。使用 PyAudioWPatch 库获取系统音频输出(仅适用于 Windows 平台)。
`AudioStream` 类用于获取音频数据,实现是跨平台的,支持 Windows、Linux 和 macOS。该类初始化包含两个参数
一般获取音频流数据实际上是一个一个的时间比较短的音频块,需要根据模型调整音频块的大小。比如阿里云的 Gummy 模型使用 0.05 秒大小的音频块识别效果优于使用 0.2 秒大小的音频块。
- `audio_type`: 获取音频类型0 表示系统输出音频扬声器1 表示系统输入音频(麦克风)
- `chunk_rate`: 音频数据获取频率,每秒音频获取的音频块的数量,默认为 10
该类包含四个方法:
- `open_stream()`: 开启音频获取
- `read_chunk() -> bytes`: 读取一个音频块
- `close_stream()`: 关闭音频获取
- `close_stream_signal()` 线程安全的关闭系统音频输入流
样例:
```python
from sysaudio import AudioStream
audio_type = 0
chunk_rate = 20
stream = AudioStream(audio_type, chunk_rate)
stream.open_stream()
while True:
data = stream.read_chunk()
# do something with data
pass
stream.close_stream()
```
### 音频处理
获取到的音频流在转文字之前可能需要进行预处理。比如阿里云的 Gummy 模型只能识别单通道的音频流,而收集的音频流一般是双通道的,因此要将通道音频转换为单通道。通道数的转换可以使用 NumPy 库中的方法实现。
获取到的音频流在转文字之前可能需要进行预处理。一般需要将通道音频转换为单通道音频,还可能需要进行重采样。本项目提供了两个音频处理函数:
你可以直接使用我开发好的音频获取(`caption-engine/sysaudio`)和音频处理(`caption-engine/audioprcs`)模块。
- `merge_chunk_channels(chunk: bytes, channels: int) -> bytes` 将多通道音频块转换为单通道音频块
- `resample_chunk_mono(chunk: bytes, channels: int, orig_sr: int, target_sr: int) -> bytes`:将当前多通道音频数据块转换成单通道音频数据块,然后进行重采样
样例:
```python
from sysaudio import AudioStream
from utils import merge_chunk_channels
stream = AudioStream(1)
while True:
raw_chunk = stream.read_chunk()
chunk = resample_chunk_mono(raw_chunk, stream.CHANNELS, stream.RATE, 16000)
# do something with chunk
```
## 字幕引擎需要实现的功能
### 音频转文字
在得到了合适的音频流后,就可以将音频流转换为文字了。一般使用各种模型来实现音频流转文字。根据需求自行选择模型。
在得到了合适的音频流后,需要将音频流转换为文字了。一般使用各种模型(云端或本地)来实现音频流转文字。需要根据需求选择合适的模型。
一个接近完整的字幕引擎实例如下
这部分建议封装为一个类,需要实现四个方法
```python
import sys
import argparse
- `start(self)`:启动模型
- `send_audio_frame(self, data: bytes)`:处理当前音频块数据,**生成的字幕数据通过标准输出发送给 Electron 主进程**
- `translate(self)`:持续从 `shared_data.chunk_queue` 中取出数据块,并调用 `send_audio_frame` 方法处理数据块
- `stop(self)`:停止模型
# 引入系统音频获取勒
if sys.platform == 'win32':
from sysaudio.win import AudioStream
elif sys.platform == 'darwin':
from sysaudio.darwin import AudioStream
elif sys.platform == 'linux':
from sysaudio.linux import AudioStream
else:
raise NotImplementedError(f"Unsupported platform: {sys.platform}")
完整的字幕引擎实例如下:
# 引入音频处理函数
from audioprcs import mergeChunkChannels
# 引入音频转文本模块
from audio2text import InvalidParameter, GummyTranslator
def convert_audio_to_text(s_lang, t_lang, audio_type, chunk_rate, api_key):
# 设置标准输出为行缓冲
sys.stdout.reconfigure(line_buffering=True) # type: ignore
# 创建音频获取和语音转文字实例
stream = AudioStream(audio_type, chunk_rate)
if t_lang == 'none':
gummy = GummyTranslator(stream.RATE, s_lang, None, api_key)
else:
gummy = GummyTranslator(stream.RATE, s_lang, t_lang, api_key)
# 启动实例
stream.openStream()
gummy.start()
while True:
try:
# 读取音频流数据
chunk = stream.read_chunk()
chunk_mono = mergeChunkChannels(chunk, stream.CHANNELS)
try:
# 调用模型进行翻译
gummy.send_audio_frame(chunk_mono)
except InvalidParameter:
gummy.start()
gummy.send_audio_frame(chunk_mono)
except KeyboardInterrupt:
stream.closeStream()
gummy.stop()
break
```
- [gummy.py](../../engine/audio2text/gummy.py)
- [vosk.py](../../engine/audio2text/vosk.py)
- [sosv.py](../../engine/audio2text/sosv.py)
### 字幕翻译
有的语音转文字模型并不提供翻译,需要再添加一个翻译模块。这部分可以使用云端翻译 API 也可以使用本地翻译模
有的语音转文字模型并不提供翻译,如果有需求,需要再添加一个翻译模块也可以使用自带的翻译模
### 数据传递
样例:
在获取到当前音频流的文字后,需要将文字传递给主程序。字幕引擎进程通过标准输出将字幕数据传递给 electron 主进程。
```python
from utils import google_translate, ollama_translate
text = "这是一个翻译测试。"
google_translate("", "en", text, "time_s")
ollama_translate("qwen3:0.6b", "en", text, "time_s")
```
### 字幕数据发送
在获取到当前音频流的文字后,需要将文字发送给主程序。字幕引擎进程通过标准输出将字幕数据传递给 Electron 主进程。
传递的内容必须是 JSON 字符串,其中 JSON 对象需要包含的参数如下:
```typescript
export interface CaptionItem {
index: number, // 字幕序号
time_s: string, // 当前字幕开始时间
time_t: string, // 当前字幕结束时间
text: string, // 字幕内容
translation: string // 字幕翻译
command: "caption",
index: number, // 字幕序号
time_s: string, // 当前字幕开始时间
time_t: string, // 当前字幕结束时间
text: string, // 字幕内容
translation: string // 字幕翻译
}
```
**注意必须确保咱们一起每输出一次字幕 JSON 数据就得刷新缓冲区,确保 electron 主进程每次接收到的字符串都可以被解释为 JSON 对象。**
**注意必须确保每输出一次字幕 JSON 数据就得刷新缓冲区,确保 electron 主进程每次接收到的字符串都可以被解释为 JSON 对象。** 建议使用项目已经实现的 `stdout_obj` 函数来发送。
如果使用 python 语言,可以参考以下方式将数据传递给主程序:
### 命令行参数的指定
自定义字幕引擎的设置提供命令行参数指定,因此需要设置好字幕引擎的参数,本项目目前用到的参数如下:
```python
# caption-engine\main-gummy.py
sys.stdout.reconfigure(line_buffering=True)
# caption-engine\audio2text\gummy.py
...
def send_to_node(self, data):
"""
将数据发送到 Node.js 进程
"""
try:
json_data = json.dumps(data) + '\n'
sys.stdout.write(json_data)
sys.stdout.flush()
except Exception as e:
print(f"Error sending data to Node.js: {e}", file=sys.stderr)
...
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='Convert system audio stream to text')
# all
parser.add_argument('-e', '--caption_engine', default='gummy', help='Caption engine: gummy or vosk')
parser.add_argument('-a', '--audio_type', default=0, help='Audio stream source: 0 for output, 1 for input')
parser.add_argument('-c', '--chunk_rate', default=10, help='Number of audio stream chunks collected per second')
parser.add_argument('-p', '--port', default=0, help='The port to run the server on, 0 for no server')
parser.add_argument('-t', '--target_language', default='zh', help='Target language code, "none" for no translation')
parser.add_argument('-r', '--record', default=0, help='Whether to record the audio, 0 for no recording, 1 for recording')
parser.add_argument('-rp', '--record_path', default='', help='Path to save the recorded audio')
# gummy and sosv
parser.add_argument('-s', '--source_language', default='auto', help='Source language code')
# gummy only
parser.add_argument('-k', '--api_key', default='', help='API KEY for Gummy model')
# vosk and sosv
parser.add_argument('-tm', '--translation_model', default='ollama', help='Model for translation: ollama or google')
parser.add_argument('-omn', '--ollama_name', default='', help='Ollama model name for translation')
# vosk only
parser.add_argument('-vosk', '--vosk_model', default='', help='The path to the vosk model.')
# sosv only
parser.add_argument('-sosv', '--sosv_model', default=None, help='The SenseVoice model path')
```
数据接收端代码如下:
比如对于本项目的字幕引擎,我想使用 Gummy 模型,指定原文为日语,翻译为中文,获取系统音频输出的字幕,每次截取 0.1s 的音频数据,那么命令行参数如下:
```typescript
// src\main\utils\engine.ts
...
this.process.stdout.on('data', (data) => {
const lines = data.toString().split('\n');
lines.forEach((line: string) => {
if (line.trim()) {
try {
const caption = JSON.parse(line);
addCaptionLog(caption);
} catch (e) {
controlWindow.sendErrorMessage('字幕引擎输出内容无法解析为 JSON 对象:' + e)
console.error('[ERROR] Error parsing JSON:', e);
}
}
});
});
this.process.stderr.on('data', (data) => {
controlWindow.sendErrorMessage('字幕引擎错误:' + data)
console.error(`[ERROR] Subprocess Error: ${data}`);
});
...
```bash
python main.py -e gummy -s ja -t zh -a 0 -c 10 -k <dashscope-api-key>
```
## 参考代码
## 其他
本项目 `caption-engine` 文件夹下的 `main-gummy.py` 文件为默认字幕引擎的入口代码。`src\main\utils\engine.ts` 为服务端获取字幕引擎数据和进行处理的代码。可以根据需要阅读了解字幕引擎的实现细节和完整运行过程。
### 通信规范
[caption engine api-doc](../api-docs/caption-engine.md)
### 程序入口
[main.py](../../engine/main.py)
### 开发建议
除音频转文字和翻译外,其他(音频获取、音频重采样、与主进程通信)建议直接复用本项目代码。如果这样,那么需要添加的内容为:
- `engine/audio2text/`:添加新的音频转文字类(文件级别)
- `engine/main.py`:添加新参数设置、流程函数(参考 `main_gummy` 函数和 `main_vosk` 函数)
### 打包
在完成字幕引擎的开发和测试后,需要将字幕引擎打包成可执行文件。一般使用 `pyinstaller` 进行打包。如果打包好的字幕引擎文件执行报错,可能是打包漏掉了某些依赖库,请检查是否缺少了依赖库。
### 运行
有了可以使用的字幕引擎,就可以在字幕软件窗口中通过指定字幕引擎的路径和字幕引擎的运行指令(参数)来启动字幕引擎了。
![](../img/02_zh.png)

Binary file not shown.

Before

Width:  |  Height:  |  Size: 26 KiB

After

Width:  |  Height:  |  Size: 57 KiB

BIN
docs/img/06.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 118 KiB

BIN
docs/img/07.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 94 KiB

View File

@@ -1,14 +1,24 @@
# Auto Caption User Manual
Corresponding Version: v0.4.0
Corresponding Version: v1.0.0
**Note: Due to limited personal resources, the English and Japanese documentation files for this project (except for the README document) will no longer be maintained. The content of this document may not be consistent with the latest version of the project. If you are willing to help with translation, please submit relevant Pull Requests.**
## Software Introduction
Auto Caption is a cross-platform caption display software that can real-time capture system audio input (recording) or output (playback) streaming data and use an audio-to-text model to generate captions for the corresponding audio. The default caption engine provided by the software (using Alibaba Cloud Gummy model) supports recognition and translation in nine languages (Chinese, English, Japanese, Korean, German, French, Russian, Spanish, Italian).
Currently, the default caption engine of the software only has full functionality on Windows and macOS platforms. Additional configuration is required to capture system audio output on macOS.
The default caption engine currently has full functionality on Windows, macOS, and Linux platforms. Additional configuration is required to capture system audio output on macOS.
On Linux platforms, it can only generate captions for audio input (microphone), and currently does not support generating captions for audio output (playback).
The following operating system versions have been tested and confirmed to work properly. The software cannot guarantee normal operation on untested OS versions.
| OS Version | Architecture | Audio Input Capture | Audio Output Capture |
| ------------------ | ------------ | ------------------- | -------------------- |
| Windows 11 24H2 | x64 | ✅ | ✅ |
| macOS Sequoia 15.5 | arm64 | ✅ Additional config required | ✅ |
| Ubuntu 24.04.2 | x64 | ✅ | ✅ |
| Kali Linux 2022.3 | x64 | ✅ | ✅ |
| Kylin Server V10 SP3 | x64 | ✅ | ✅ |
![](../../assets/media/main_en.png)
@@ -33,9 +43,13 @@ Alibaba Cloud provides detailed tutorials for this part, which can be referenced
## Preparation for Using Vosk Engine
To use the Vosk local caption engine, first download your required model from the [Vosk Models](https://alphacephei.com/vosk/models) page. Then extract the downloaded model package locally and add the corresponding model folder path to the software settings. Currently, the Vosk caption engine does not support translated caption content.
To use the Vosk local caption engine, first download your required model from the [Vosk Models](https://alphacephei.com/vosk/models) page. Then extract the downloaded model package locally and add the corresponding model folder path to the software settings.
![](../../assets/media/vosk_en.png)
![](../../assets/media/config_en.png)
## Using SOSV Model
The way to use the SOSV model is the same as Vosk. The download address is as follows: https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model
## Capturing System Audio Output on macOS
@@ -61,6 +75,30 @@ Once BlackHole is confirmed installed, in the `Audio MIDI Setup` page, click the
Now the caption engine can capture system audio output and generate captions.
## Getting System Audio Output on Linux
First execute in the terminal:
```bash
pactl list short sources
```
If you see output similar to the following, no additional configuration is needed:
```bash
220 alsa_output.pci-0000_02_02.0.3.analog-stereo.monitor PipeWire s16le 2ch 48000Hz SUSPENDED
221 alsa_input.pci-0000_02_02.0.3.analog-stereo PipeWire s16le 2ch 48000Hz SUSPENDED
```
Otherwise, install `pulseaudio` and `pavucontrol` using the following commands:
```bash
# For Debian/Ubuntu etc.
sudo apt install pulseaudio pavucontrol
# For CentOS etc.
sudo yum install pulseaudio pavucontrol
```
## Software Usage
### Modifying Settings
@@ -79,7 +117,7 @@ The following image shows the caption display window, which displays the latest
### Exporting Caption Records
In the caption control window, you can see the records of all collected captions. Click the "Export Caption Records" button to export the caption records as a JSON file.
In the caption control window, you can see the records of all collected captions. Click the "Export Log" button to export the caption records as a JSON or SRT file.
## Caption Engine
@@ -92,3 +130,175 @@ The software provides two default caption engines. If you need other caption eng
Note that when using a custom caption engine, all previous caption engine settings will be ineffective, and the configuration of the custom caption engine is entirely done through the engine command.
If you are a developer and want to develop a custom caption engine, please refer to the [Caption Engine Explanation Document](../engine-manual/en.md).
## Using Caption Engine Standalone
### Runtime Parameter Description
> The following content assumes users have some knowledge of running programs via terminal.
The complete set of runtime parameters available for the caption engine is shown below:
![](../img/06.png)
However, when used standalone, some parameters may not need to be used or should not be modified.
The following parameter descriptions only include necessary parameters.
#### `-e , --caption_engine`
The caption engine model to select, currently three options are available: `gummy, vosk, sosv`.
The default value is `gummy`.
This applies to all models.
#### `-a, --audio_type`
The audio type to recognize, where `0` represents system audio output and `1` represents microphone audio input.
The default value is `0`.
This applies to all models.
#### `-d, --display_caption`
Whether to display captions in the console, `0` means do not display, `1` means display.
The default value is `0`, but it's recommended to choose `1` when using only the caption engine.
This applies to all models.
#### `-t, --target_language`
> Note that Vosk and SOSV models have poor sentence segmentation, which can make translated content difficult to understand. It's not recommended to use translation with these two models.
Target language for translation. All models support the following translation languages:
- `none` No translation
- `zh` Simplified Chinese
- `en` English
- `ja` Japanese
- `ko` Korean
Additionally, `vosk` and `sosv` models also support the following translations:
- `de` German
- `fr` French
- `ru` Russian
- `es` Spanish
- `it` Italian
The default value is `none`.
This applies to all models.
#### `-s, --source_language`
Source language for recognition. Default value is `auto`, meaning no specific source language.
Specifying the source language can improve recognition accuracy to some extent. You can specify the source language using the language codes above.
This only applies to Gummy and SOSV models.
The Gummy model can use all the languages mentioned above, plus Cantonese (`yue`).
The SOSV model supports specifying the following languages: English, Chinese, Japanese, Korean, and Cantonese.
#### `-k, --api_key`
Specify the Alibaba Cloud API KEY required for the `Gummy` model.
Default value is empty.
This only applies to the Gummy model.
#### `-tm, --translation_model`
Specify the translation method for Vosk and SOSV models. Default is `ollama`.
Supported values are:
- `ollama` Use local Ollama model for translation. Users need to install Ollama software and corresponding models
- `google` Use Google Translate API for translation. No additional configuration needed, but requires network access to Google
This only applies to Vosk and SOSV models.
#### `-omn, --ollama_name`
Specify the Ollama model to call for translation. Default value is empty.
It's recommended to use models with less than 1B parameters, such as: `qwen2.5:0.5b`, `qwen3:0.6b`.
Users need to download the corresponding model in Ollama to use it properly.
This only applies to Vosk and SOSV models.
#### `-vosk, --vosk_model`
Specify the path to the local folder of the Vosk model to call. Default value is empty.
This only applies to the Vosk model.
#### `-sosv, --sosv_model`
Specify the path to the local folder of the SOSV model to call. Default value is empty.
This only applies to the SOSV model.
### Running Caption Engine Using Source Code
> The following content assumes users who use this method have knowledge of Python environment configuration and usage.
First, download the project source code locally. The caption engine source code is located in the `engine` directory of the project. Then configure the Python environment, where the project dependencies are listed in the `requirements.txt` file in the `engine` directory.
After configuration, enter the `engine` directory and execute commands to run the caption engine.
For example, to use the Gummy model, specify audio type as system audio output, source language as English, and target language as Chinese, execute the following command:
> Note: For better visualization, the commands below are written on multiple lines. If execution fails, try removing backslashes and executing as a single line command.
```bash
python main.py \
-e gummy \
-k sk-******************************** \
-a 0 \
-d 1 \
-s en \
-t zh
```
To specify the Vosk model, audio type as system audio output, translate to English, and use Ollama `qwen3:0.6b` model for translation:
```bash
python main.py \
-e vosk \
-vosk D:\Projects\auto-caption\engine\models\vosk-model-small-cn-0.22 \
-a 0 \
-d 1 \
-t en \
```
To specify the SOSV model, audio type as microphone, automatically select source language, and no translation:
```bash
python main.py \
-e sosv \
-sosv D:\\Projects\\auto-caption\\engine\\models\\sosv-int8 \
-a 1 \
-d 1 \
-s auto \
-t none
```
Running result using the Gummy model is shown below:
![](../img/07.png)
### Running Subtitle Engine Executable File
First, download the executable file for your platform from [GitHub Releases](https://github.com/HiMeditator/auto-caption/releases/tag/engine) (currently only Windows and Linux platform executable files are provided).
Then open a terminal in the directory containing the caption engine executable file and execute commands to run the caption engine.
Simply replace `python main.py` in the above commands with the executable file name (for example: `engine-win.exe`).

View File

@@ -1,6 +1,6 @@
# Auto Caption ユーザーマニュアル
対応バージョンv0.4.0
対応バージョンv1.0.0
この文書は大規模モデルを使用して翻訳されていますので、内容に正確でない部分があるかもしれません。
@@ -8,9 +8,17 @@
Auto Caption は、クロスプラットフォームの字幕表示ソフトウェアで、システムの音声入力(録音)または出力(音声再生)のストリーミングデータをリアルタイムで取得し、音声からテキストに変換するモデルを利用して対応する音声の字幕を生成します。このソフトウェアが提供するデフォルトの字幕エンジン(アリババクラウド Gummy モデルを使用は、9つの言語中国語、英語、日本語、韓国語、ドイツ語、フランス語、ロシア語、スペイン語、イタリア語の認識と翻訳をサポートしています。
現在、ソフトウェアのデフォルト字幕エンジンは WindowsmacOS プラットフォームでのみ完全な機能を有しています。macOS でシステムオーディオ出力を取得するには追加設定が必要です。
現在のデフォルト字幕エンジンは WindowsmacOS、Linux プラットフォームで完全な機能を有しています。macOSでシステムオーディオ出力を取得するには追加設定が必要です。
Linux プラットフォームでは、オーディオ入力(マイク)からの字幕生成のみ可能で、現在オーディオ出力(再生音)からの字幕生成はサポートしていません。
以下のオペレーティングシステムバージョンで正常動作を確認しています。記載以外の OS での正常動作は保証できません。
| OS バージョン | アーキテクチャ | オーディオ入力取得 | オーディオ出力取得 |
| ------------------- | ------------- | ------------------ | ------------------ |
| Windows 11 24H2 | x64 | ✅ | ✅ |
| macOS Sequoia 15.5 | arm64 | ✅ 追加設定が必要 | ✅ |
| Ubuntu 24.04.2 | x64 | ✅ | ✅ |
| Kali Linux 2022.3 | x64 | ✅ | ✅ |
| Kylin Server V10 SP3 | x64 | ✅ | ✅ |
![](../../assets/media/main_ja.png)
@@ -35,9 +43,13 @@ macOS プラットフォームでオーディオ出力を取得するには追
## Voskエンジン使用前の準備
Voskローカル字幕エンジンを使用するには、まず[Vosk Models](https://alphacephei.com/vosk/models)ページから必要なモデルをダウンロードしてください。その後、ダウンロードしたモデルパッケージをローカルに解凍し、対応するモデルフォルダのパスをソフトウェア設定に追加します。現在、Vosk字幕エンジンは字幕の翻訳をサポートしていません。
Voskローカル字幕エンジンを使用するには、まず[Vosk Models](https://alphacephei.com/vosk/models)ページから必要なモデルをダウンロードしてください。その後、ダウンロードしたモデルパッケージをローカルに解凍し、対応するモデルフォルダのパスをソフトウェア設定に追加します。
![](../../assets/media/vosk_ja.png)
![](../../assets/media/config_ja.png)
## SOSVモデルの使用
SOSVモデルの使用方法はVoskと同じで、ダウンロードアドレスは以下の通りですhttps://github.com/HiMeditator/auto-caption/releases/tag/sosv-model
## macOS でのシステムオーディオ出力の取得方法
@@ -64,6 +76,30 @@ BlackHoleのインストールが確認できたら、`オーディオ MIDI 設
これで字幕エンジンがシステムオーディオ出力をキャプチャし、字幕を生成できるようになります。
## Linux でシステムオーディオ出力を取得する
まずターミナルで以下を実行してください:
```bash
pactl list short sources
```
以下のような出力が確認できれば追加設定は不要です:
```bash
220 alsa_output.pci-0000_02_02.0.3.analog-stereo.monitor PipeWire s16le 2ch 48000Hz SUSPENDED
221 alsa_input.pci-0000_02_02.0.3.analog-stereo PipeWire s16le 2ch 48000Hz SUSPENDED
```
それ以外の場合は、以下のコマンドで`pulseaudio``pavucontrol`をインストールしてください:
```bash
# Debian/Ubuntu系の場合
sudo apt install pulseaudio pavucontrol
# CentOS系の場合
sudo yum install pulseaudio pavucontrol
```
## ソフトウェアの使い方
### 設定の変更
@@ -82,7 +118,7 @@ BlackHoleのインストールが確認できたら、`オーディオ MIDI 設
### 字幕記録のエクスポート
字幕制御ウィンドウでは、現在収集されたすべての字幕の記録を見ることができます。「字幕記録をエクスポート」ボタンをクリックすると、字幕記録をJSONファイルとしてエクスポートできます。
エクスポート」ボタンをクリックすると、字幕記録を JSON または SRT ファイル形式で出力できます。
## 字幕エンジン

View File

@@ -1,14 +1,22 @@
# Auto Caption 用户手册
对应版本v0.4.0
对应版本v1.0.0
## 软件简介
Auto Caption 是一个跨平台的字幕显示软件,能够实时获取系统音频输入(录音)或输出(播放声音)的流式数据,并调用音频转文字的模型生成对应音频的字幕。软件提供的默认字幕引擎(使用阿里云 Gummy 模型)支持九种语言(中、英、日、韩、德、法、俄、西、意)的识别与翻译。
目前软件默认字幕引擎只有在 Windows macOS 平台下拥有完整功能,在 macOS 要获取系统音频输出需要额外配置。
目前软件默认字幕引擎在 Windows macOS 和 Linux 平台下拥有完整功能,在 macOS 要获取系统音频输出需要额外配置。
在 Linux 平台下只能生成音频输入(麦克风)的字幕,暂不支持音频输出(播放声音)的字幕生成
测试过可正常运行的操作系统信息如下,软件不能保证在非下列版本的操作系统上正常运行
| 操作系统版本 | 处理器架构 | 获取系统音频输入 | 获取系统音频输出 |
| ------------------ | ---------- | ---------------- | ---------------- |
| Windows 11 24H2 | x64 | ✅ | ✅ |
| macOS Sequoia 15.5 | arm64 | ✅需要额外配置 | ✅ |
| Ubuntu 24.04.2 | x64 | ✅ | ✅ |
| Kali Linux 2022.3 | x64 | ✅ | ✅ |
| Kylin Server V10 SP3 | x64 | ✅ | ✅ |
![](../../assets/media/main_zh.png)
@@ -29,14 +37,17 @@ Auto Caption 是一个跨平台的字幕显示软件,能够实时获取系统
这部分阿里云提供了详细的教程,可参考:
- [获取 API KEY](https://help.aliyun.com/zh/model-studio/get-api-key)
- [将 API Key 配置到环境变量](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables)
## Vosk 引擎使用前准备
如果要使用 Vosk 本地字幕引擎,首先需要在 [Vosk Models](https://alphacephei.com/vosk/models) 页面下载你需要的模型。然后将下载的模型安装包解压到本地,并将对应的模型文件夹的路径添加到软件的设置中。目前 Vosk 字幕引擎还不支持翻译字幕内容。
如果要使用 Vosk 本地字幕引擎,首先需要在 [Vosk Models](https://alphacephei.com/vosk/models) 页面下载你需要的模型。然后将下载的模型安装包解压到本地,并将对应的模型文件夹的路径添加到软件的设置中。
![](../../assets/media/vosk_zh.png)
![](../../assets/media/config_zh.png)
## 使用 SOSV 模型
使用 SOSV 模型的方式和 Vosk 一样下载地址如下https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model
## macOS 获取系统音频输出
@@ -62,6 +73,30 @@ brew install blackhole-64ch
现在字幕引擎就能捕获系统的音频输出并生成字幕了。
## Linux 获取系统音频输出
首先在控制台执行:
```bash
pactl list short sources
```
如果有以下类似的输出内容则无需额外配置:
```bash
220 alsa_output.pci-0000_02_02.0.3.analog-stereo.monitor PipeWire s16le 2ch 48000Hz SUSPENDED
221 alsa_input.pci-0000_02_02.0.3.analog-stereo PipeWire s16le 2ch 48000Hz SUSPENDED
```
否则,执行以下命令安装 `pulseaudio``pavucontrol`
```bash
# Debian or Ubuntu, etc.
sudo apt install pulseaudio pavucontrol
# CentOS, etc.
sudo yum install pulseaudio pavucontrol
```
## 软件使用
### 修改设置
@@ -80,7 +115,7 @@ brew install blackhole-64ch
### 字幕记录的导出
在字幕控制窗口中可以看到当前收集的所有字幕的记录,点击“导出字幕记录”按钮,即可将字幕记录导出为 JSON 文件。
在字幕控制窗口中可以看到当前收集的所有字幕的记录,点击“导出字幕”按钮,即可将字幕记录导出为 JSON 或 SRT 文件。
## 字幕引擎
@@ -93,3 +128,175 @@ brew install blackhole-64ch
注意使用自定义字幕引擎时,前面的字幕引擎的设置将全部不起作用,自定义字幕引擎的配置完全通过引擎指令进行配置。
如果你是开发者,想开发自定义字幕引擎,请查看[字幕引擎说明文档](../engine-manual/zh.md)。
## 单独使用字幕引擎
### 运行参数说明
> 以下内容默认用户对使用终端运行程序有一定了解。
字幕引擎可用使用的完整的运行参数如下:
![](../img/06.png)
而在单独使用时其中某些参数并不需要使用,或者不适合进行修改。
下面的运行参数说明仅包含必要的参数。
#### `-e , --caption_engine`
需要选择的字幕引擎模型,目前有三个可用,分别为:`gummy, vosk, sosv`
该项的默认值为 `gummy`
该项适用于所有模型。
#### `-a, --audio_type`
需要识别的音频类型,其中 `0` 表示系统音频输出,`1` 表示麦克风音频输入。
该项的默认值为 `0`
该项适用于所有模型。
#### `-d, --display_caption`
是否在控制台显示字幕,`0` 表示不显示,`1` 表示显示。
该项默认值为 `0`,只使用字幕引擎的话建议选 `1`
该项适用于所有模型。
#### `-t, --target_language`
> 其中 Vosk 和 SOSV 模型分句效果较差,会导致翻译内容难以理解,不太建议这两个模型使用翻译。
需要翻译成的目标语言,所有模型都支持的翻译语言如下:
- `none` 不进行翻译
- `zh` 简体中文
- `en` 英语
- `ja` 日语
- `ko` 韩语
除此之外 `vosk``sosv` 模型还支持如下翻译:
- `de` 德语
- `fr` 法语
- `ru` 俄语
- `es` 西班牙语
- `it` 意大利语
该项的默认值为 `none`
该项适用于所有模型。
#### `-s, --source_language`
需要识别的语言的源语言,默认值为 `auto`,表示不指定源语言。
但是指定源语言能在一定程度上提高识别准确率,可用使用上面的语言代码指定源语言。
该项仅适用于 Gummy 和 SOSV 模型。
其中 Gummy 模型可用使用上述全部的语言,在加上粤语(`yue`)。
而 SOSV 模型支持指定的语言有:英语、中文、日语、韩语、粤语。
#### `-k, --api_key`
指定 `Gummy` 模型需要使用的阿里云 API KEY。
该项默认值为空。
该项仅适用于 Gummy 模型。
#### `-tm, --translation_model`
指定 Vosk 和 SOSV 模型的翻译方式,默认为 `ollama`
该项支持的值有:
- `ollama` 使用本地 Ollama 模型进行翻译,需要用户安装 Ollama 软件和对应的模型
- `google` 使用 Google 翻译 API 进行翻译,无需额外配置,但是需要有能访问 Google 的网络
该项仅适用于 Vosk 和 SOSV 模型。
#### `-omn, --ollama_name`
指定需要调用进行翻译的 Ollama 模型。该项默认值为空。
建议使用参数量小于 1B 的模型,比如: `qwen2.5:0.5b`, `qwen3:0.6b`
用户需要在 Ollama 中下载了对应的模型才能正常使用。
该项仅适用于 Vosk 和 SOSV 模型。
#### `-vosk, --vosk_model`
指定需要调用的 Vosk 模型的本地文件夹的路径。该项默认值为空。
该项仅适用于 Vosk 模型。
#### `-sosv, --sosv_model`
指定需要调用的 SOSV 模型的本地文件夹的路径。该项默认值为空。
该项仅适用于 SOSV 模型。
### 使用源代码运行字幕引擎
> 以下内容默认使用该方式的用户对 Python 环境配置和使用有所了解。
首先下载项目源代码到本地,其中字幕引擎源代码在项目的 `engine` 目录下。然后配置 Python 环境,其中项目依赖的 Python 包在 `engine` 目录下 `requirements.txt` 文件中。
配置好后进入 `engine` 目录,执行命令进行运行字幕引擎。
比如要使用 Gummy 模型,指定音频类型为系统音频输出,源语言为英语,翻译语言为中文,执行的命令如下:
> 注意:为了更直观,下面的命令写在了多行,如果执行失败,尝试去掉反斜杠,并改换单行命令执行。
```bash
python main.py \
-e gummy \
-k sk-******************************** \
-a 0 \
-d 1 \
-s en \
-t zh
```
指定 Vosk 模型,指定音频类型为系统音频输出,翻译语言为英语,使用 Ollama `qwen3:0.6b` 模型进行翻译:
```bash
python main.py \
-e vosk \
-vosk D:\Projects\auto-caption\engine\models\vosk-model-small-cn-0.22 \
-a 0 \
-d 1 \
-t en \
```
指定 SOSV 模型,指定音频类型为麦克风,自动选择源语言,不翻译,执行的命令如下:
```bash
python main.py \
-e sosv \
-sosv D:\\Projects\\auto-caption\\engine\\models\\sosv-int8 \
-a 1 \
-d 1 \
-s auto \
-t none
```
使用 Gummy 模型的运行效果如下:
![](../img/07.png)
### 运行字幕引擎可执行文件
首先在 [GitHub Release](https://github.com/HiMeditator/auto-caption/releases/tag/engine) 中下载对应平台的可执行文件(目前仅提供 Windows 和 Linux 平台的字幕引擎可执行文件)。
然后再字幕引擎可执行文件所在目录打开终端,执行命令进行运行字幕引擎。
只需要将上述指令中的 `python main.py` 替换为可执行文件名称即可(比如:`engine-win.exe`)。

View File

@@ -1,5 +1,5 @@
appId: com.himeditator.autocaption
productName: auto-caption
productName: Auto Caption
directories:
buildResources: build
files:
@@ -10,21 +10,18 @@ files:
- '!{LICENSE,README.md,README_en.md,README_ja.md}'
- '!{.env,.env.*,.npmrc,pnpm-lock.yaml}'
- '!{tsconfig.json,tsconfig.node.json,tsconfig.web.json}'
- '!caption-engine/*'
- '!engine-test/*'
- '!engine/*'
- '!docs/*'
- '!assets/*'
- '!.repomap/*'
- '!.virtualme/*'
extraResources:
# For Windows
- from: ./caption-engine/dist/main-gummy.exe
to: ./caption-engine/main-gummy.exe
- from: ./caption-engine/dist/main-vosk.exe
to: ./caption-engine/main-vosk.exe
- from: ./engine/dist/main.exe
to: ./engine/main.exe
# For macOS and Linux
# - from: ./caption-engine/dist/main-gummy
# to: ./caption-engine/main-gummy
# - from: ./caption-engine/dist/main-vosk
# to: ./caption-engine/main-vosk
- from: ./engine/dist/main
to: ./engine/main
win:
executableName: auto-caption
icon: build/icon.png

View File

@@ -1,221 +0,0 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from dashscope.audio.asr import * # type: ignore\n",
"import pyaudiowpatch as pyaudio\n",
"import numpy as np\n",
"\n",
"\n",
"def getDefaultSpeakers(mic: pyaudio.PyAudio, info = True):\n",
" \"\"\"\n",
" 获取默认的系统音频输出的回环设备\n",
" Args:\n",
" mic (pyaudio.PyAudio): pyaudio对象\n",
" info (bool, optional): 是否打印设备信息. Defaults to True.\n",
"\n",
" Returns:\n",
" dict: 统音频输出的回环设备\n",
" \"\"\"\n",
" try:\n",
" WASAPI_info = mic.get_host_api_info_by_type(pyaudio.paWASAPI)\n",
" except OSError:\n",
" print(\"Looks like WASAPI is not available on the system. Exiting...\")\n",
" exit()\n",
"\n",
" default_speaker = mic.get_device_info_by_index(WASAPI_info[\"defaultOutputDevice\"])\n",
" if(info): print(\"wasapi_info:\\n\", WASAPI_info, \"\\n\")\n",
" if(info): print(\"default_speaker:\\n\", default_speaker, \"\\n\")\n",
"\n",
" if not default_speaker[\"isLoopbackDevice\"]:\n",
" for loopback in mic.get_loopback_device_info_generator():\n",
" if default_speaker[\"name\"] in loopback[\"name\"]:\n",
" default_speaker = loopback\n",
" if(info): print(\"Using loopback device:\\n\", default_speaker, \"\\n\")\n",
" break\n",
" else:\n",
" print(\"Default loopback output device not found.\")\n",
" print(\"Run `python -m pyaudiowpatch` to check available devices.\")\n",
" print(\"Exiting...\")\n",
" exit()\n",
" \n",
" if(info): print(f\"Recording Device: #{default_speaker['index']} {default_speaker['name']}\")\n",
" return default_speaker\n",
"\n",
"\n",
"class Callback(TranslationRecognizerCallback):\n",
" \"\"\"\n",
" 语音大模型流式传输回调对象\n",
" \"\"\"\n",
" def __init__(self):\n",
" super().__init__()\n",
" self.usage = 0\n",
" self.sentences = []\n",
" self.translations = []\n",
" \n",
" def on_open(self) -> None:\n",
" print(\"\\n流式翻译开始...\\n\")\n",
"\n",
" def on_close(self) -> None:\n",
" print(f\"\\nTokens消耗{self.usage}\")\n",
" print(f\"流式翻译结束...\\n\")\n",
" for i in range(len(self.sentences)):\n",
" print(f\"\\n{self.sentences[i]}\\n{self.translations[i]}\\n\")\n",
"\n",
" def on_event(\n",
" self,\n",
" request_id,\n",
" transcription_result: TranscriptionResult,\n",
" translation_result: TranslationResult,\n",
" usage\n",
" ) -> None:\n",
" if transcription_result is not None:\n",
" id = transcription_result.sentence_id\n",
" text = transcription_result.text\n",
" if transcription_result.stash is not None:\n",
" stash = transcription_result.stash.text\n",
" else:\n",
" stash = \"\"\n",
" print(f\"#{id}: {text}{stash}\")\n",
" if usage: self.sentences.append(text)\n",
" \n",
" if translation_result is not None:\n",
" lang = translation_result.get_language_list()[0]\n",
" text = translation_result.get_translation(lang).text\n",
" if translation_result.get_translation(lang).stash is not None:\n",
" stash = translation_result.get_translation(lang).stash.text\n",
" else:\n",
" stash = \"\"\n",
" print(f\"#{lang}: {text}{stash}\")\n",
" if usage: self.translations.append(text)\n",
" \n",
" if usage: self.usage += usage['duration']"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"采样输入设备:\n",
" - 序号26\n",
" - 名称:耳机 (HUAWEI FreeLace 活力版) [Loopback]\n",
" - 最大输入通道数2\n",
" - 默认低输入延迟0.003s\n",
" - 默认高输入延迟0.01s\n",
" - 默认采样率48000.0Hz\n",
" - 是否回环设备True\n",
"\n",
"音频样本块大小4800\n",
"样本位宽2\n",
"音频数据格式8\n",
"音频通道数2\n",
"音频采样率48000\n",
"\n"
]
}
],
"source": [
"mic = pyaudio.PyAudio()\n",
"default_speaker = getDefaultSpeakers(mic, False)\n",
"\n",
"SAMP_WIDTH = pyaudio.get_sample_size(pyaudio.paInt16)\n",
"FORMAT = pyaudio.paInt16\n",
"CHANNELS = default_speaker[\"maxInputChannels\"]\n",
"RATE = int(default_speaker[\"defaultSampleRate\"])\n",
"CHUNK = RATE // 10\n",
"INDEX = default_speaker[\"index\"]\n",
"\n",
"dev_info = f\"\"\"\n",
"采样输入设备:\n",
" - 序号:{default_speaker['index']}\n",
" - 名称:{default_speaker['name']}\n",
" - 最大输入通道数:{default_speaker['maxInputChannels']}\n",
" - 默认低输入延迟:{default_speaker['defaultLowInputLatency']}s\n",
" - 默认高输入延迟:{default_speaker['defaultHighInputLatency']}s\n",
" - 默认采样率:{default_speaker['defaultSampleRate']}Hz\n",
" - 是否回环设备:{default_speaker['isLoopbackDevice']}\n",
"\n",
"音频样本块大小:{CHUNK}\n",
"样本位宽:{SAMP_WIDTH}\n",
"音频数据格式:{FORMAT}\n",
"音频通道数:{CHANNELS}\n",
"音频采样率:{RATE}\n",
"\"\"\"\n",
"print(dev_info)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"RECORD_SECONDS = 20 # 监听时长(s)\n",
"\n",
"stream = mic.open(\n",
" format = FORMAT,\n",
" channels = CHANNELS,\n",
" rate = RATE,\n",
" input = True,\n",
" input_device_index = INDEX\n",
")\n",
"translator = TranslationRecognizerRealtime(\n",
" model = \"gummy-realtime-v1\",\n",
" format = \"pcm\",\n",
" sample_rate = RATE,\n",
" transcription_enabled = True,\n",
" translation_enabled = True,\n",
" source_language = \"ja\",\n",
" translation_target_languages = [\"zh\"],\n",
" callback = Callback()\n",
")\n",
"translator.start()\n",
"\n",
"for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):\n",
" data = stream.read(CHUNK)\n",
" data_np = np.frombuffer(data, dtype=np.int16)\n",
" data_np_r = data_np.reshape(-1, CHANNELS)\n",
" print(data_np_r.shape)\n",
" mono_data = np.mean(data_np_r.astype(np.float32), axis=1)\n",
" mono_data = mono_data.astype(np.int16)\n",
" mono_data_bytes = mono_data.tobytes()\n",
" translator.send_audio_frame(mono_data_bytes)\n",
"\n",
"translator.stop()\n",
"stream.stop_stream()\n",
"stream.close()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "mystd",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@@ -1,189 +0,0 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 7,
"id": "1e12f3ef",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" 采样输入设备:\n",
" - 设备类型:音频输出\n",
" - 序号0\n",
" - 名称BlackHole 2ch\n",
" - 最大输入通道数2\n",
" - 默认低输入延迟0.01s\n",
" - 默认高输入延迟0.1s\n",
" - 默认采样率48000.0Hz\n",
"\n",
" 音频样本块大小2400\n",
" 样本位宽2\n",
" 采样格式8\n",
" 音频通道数2\n",
" 音频采样率48000\n",
" \n"
]
}
],
"source": [
"import sys\n",
"import os\n",
"import wave\n",
"\n",
"current_dir = os.getcwd() \n",
"sys.path.append(os.path.join(current_dir, '../caption-engine'))\n",
"\n",
"from sysaudio.darwin import AudioStream\n",
"from audioprcs import resampleRawChunk, mergeChunkChannels\n",
"\n",
"stream = AudioStream(0)\n",
"stream.printInfo()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "a72914f4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Recording...\n",
"Done\n"
]
}
],
"source": [
"\"\"\"获取系统音频输出5秒然后保存为wav文件\"\"\"\n",
"\n",
"with wave.open('output.wav', 'wb') as wf:\n",
" wf.setnchannels(stream.CHANNELS)\n",
" wf.setsampwidth(stream.SAMP_WIDTH)\n",
" wf.setframerate(stream.RATE)\n",
" stream.openStream()\n",
"\n",
" print('Recording...')\n",
"\n",
" for _ in range(0, 100):\n",
" chunk = stream.read_chunk()\n",
" if isinstance(chunk, bytes):\n",
" wf.writeframes(chunk)\n",
" else:\n",
" raise Exception('Error: chunk is not bytes')\n",
" \n",
" stream.closeStream() \n",
" print('Done')"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "a6e8a098",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Recording...\n",
"Done\n"
]
}
],
"source": [
"\"\"\"获取系统音频输入转换为单通道音频持续5秒然后保存为wav文件\"\"\"\n",
"\n",
"with wave.open('output.wav', 'wb') as wf:\n",
" wf.setnchannels(1)\n",
" wf.setsampwidth(stream.SAMP_WIDTH)\n",
" wf.setframerate(stream.RATE)\n",
" stream.openStream()\n",
"\n",
" print('Recording...')\n",
"\n",
" for _ in range(0, 100):\n",
" chunk = mergeChunkChannels(\n",
" stream.read_chunk(),\n",
" stream.CHANNELS\n",
" )\n",
" if isinstance(chunk, bytes):\n",
" wf.writeframes(chunk)\n",
" else:\n",
" raise Exception('Error: chunk is not bytes')\n",
" \n",
" stream.closeStream() \n",
" print('Done')"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "aaca1465",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Recording...\n",
"Done\n"
]
}
],
"source": [
"\"\"\"获取系统音频输入转换为单通道音频并重采样到16000Hz持续5秒然后保存为wav文件\"\"\"\n",
"\n",
"with wave.open('output.wav', 'wb') as wf:\n",
" wf.setnchannels(1)\n",
" wf.setsampwidth(stream.SAMP_WIDTH)\n",
" wf.setframerate(16000)\n",
" stream.openStream()\n",
"\n",
" print('Recording...')\n",
"\n",
" for _ in range(0, 100):\n",
" chunk = resampleRawChunk(\n",
" stream.read_chunk(),\n",
" stream.CHANNELS,\n",
" stream.RATE,\n",
" 16000,\n",
" mode=\"sinc_best\"\n",
" )\n",
" if isinstance(chunk, bytes):\n",
" wf.writeframes(chunk)\n",
" else:\n",
" raise Exception('Error: chunk is not bytes')\n",
" \n",
" stream.closeStream() \n",
" print('Done')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -1,64 +0,0 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "440d4a07",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"d:\\Projects\\auto-caption\\caption-engine\\subenv\\Lib\\site-packages\\tqdm\\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
" from .autonotebook import tqdm as notebook_tqdm\n",
"None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.\n"
]
},
{
"ename": "ImportError",
"evalue": "\nMarianTokenizer requires the SentencePiece library but it was not found in your environment. Check out the instructions on the\ninstallation page of its repo: https://github.com/google/sentencepiece#installation and follow the ones\nthat match your environment. Please note that you may need to restart your runtime after installation.\n",
"output_type": "error",
"traceback": [
"\u001b[31m---------------------------------------------------------------------------\u001b[39m",
"\u001b[31mImportError\u001b[39m Traceback (most recent call last)",
"\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[1]\u001b[39m\u001b[32m, line 3\u001b[39m\n\u001b[32m 1\u001b[39m \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01mtransformers\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mimport\u001b[39;00m MarianMTModel, MarianTokenizer\n\u001b[32m----> \u001b[39m\u001b[32m3\u001b[39m tokenizer = \u001b[43mMarianTokenizer\u001b[49m\u001b[43m.\u001b[49m\u001b[43mfrom_pretrained\u001b[49m(\u001b[33m\"\u001b[39m\u001b[33mHelsinki-NLP/opus-mt-en-zh\u001b[39m\u001b[33m\"\u001b[39m)\n\u001b[32m 4\u001b[39m model = MarianMTModel.from_pretrained(\u001b[33m\"\u001b[39m\u001b[33mHelsinki-NLP/opus-mt-en-zh\u001b[39m\u001b[33m\"\u001b[39m)\n\u001b[32m 6\u001b[39m tokenizer.save_pretrained(\u001b[33m\"\u001b[39m\u001b[33m./model_en_zh\u001b[39m\u001b[33m\"\u001b[39m)\n",
"\u001b[36mFile \u001b[39m\u001b[32md:\\Projects\\auto-caption\\caption-engine\\subenv\\Lib\\site-packages\\transformers\\utils\\import_utils.py:1994\u001b[39m, in \u001b[36mDummyObject.__getattribute__\u001b[39m\u001b[34m(cls, key)\u001b[39m\n\u001b[32m 1992\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m (key.startswith(\u001b[33m\"\u001b[39m\u001b[33m_\u001b[39m\u001b[33m\"\u001b[39m) \u001b[38;5;129;01mand\u001b[39;00m key != \u001b[33m\"\u001b[39m\u001b[33m_from_config\u001b[39m\u001b[33m\"\u001b[39m) \u001b[38;5;129;01mor\u001b[39;00m key == \u001b[33m\"\u001b[39m\u001b[33mis_dummy\u001b[39m\u001b[33m\"\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m key == \u001b[33m\"\u001b[39m\u001b[33mmro\u001b[39m\u001b[33m\"\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m key == \u001b[33m\"\u001b[39m\u001b[33mcall\u001b[39m\u001b[33m\"\u001b[39m:\n\u001b[32m 1993\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28msuper\u001b[39m().\u001b[34m__getattribute__\u001b[39m(key)\n\u001b[32m-> \u001b[39m\u001b[32m1994\u001b[39m \u001b[43mrequires_backends\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mcls\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mcls\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_backends\u001b[49m\u001b[43m)\u001b[49m\n",
"\u001b[36mFile \u001b[39m\u001b[32md:\\Projects\\auto-caption\\caption-engine\\subenv\\Lib\\site-packages\\transformers\\utils\\import_utils.py:1980\u001b[39m, in \u001b[36mrequires_backends\u001b[39m\u001b[34m(obj, backends)\u001b[39m\n\u001b[32m 1977\u001b[39m failed.append(msg.format(name))\n\u001b[32m 1979\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m failed:\n\u001b[32m-> \u001b[39m\u001b[32m1980\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mImportError\u001b[39;00m(\u001b[33m\"\u001b[39m\u001b[33m\"\u001b[39m.join(failed))\n",
"\u001b[31mImportError\u001b[39m: \nMarianTokenizer requires the SentencePiece library but it was not found in your environment. Check out the instructions on the\ninstallation page of its repo: https://github.com/google/sentencepiece#installation and follow the ones\nthat match your environment. Please note that you may need to restart your runtime after installation.\n"
]
}
],
"source": [
"from transformers import MarianMTModel, MarianTokenizer\n",
"\n",
"tokenizer = MarianTokenizer.from_pretrained(\"Helsinki-NLP/opus-mt-en-zh\")\n",
"model = MarianMTModel.from_pretrained(\"Helsinki-NLP/opus-mt-en-zh\")\n",
"\n",
"tokenizer.save_pretrained(\"./model_en_zh\")\n",
"model.save_pretrained(\"./model_en_zh\")\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "subenv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -1,124 +0,0 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "6fb12704",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"d:\\Projects\\auto-caption\\caption-engine\\subenv\\Lib\\site-packages\\vosk\\__init__.py\n"
]
}
],
"source": [
"import vosk\n",
"print(vosk.__file__)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "63a06f5c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" 采样设备:\n",
" - 设备类型:音频输入\n",
" - 序号1\n",
" - 名称:麦克风阵列 (Realtek(R) Audio)\n",
" - 最大输入通道数2\n",
" - 默认低输入延迟0.09s\n",
" - 默认高输入延迟0.18s\n",
" - 默认采样率44100.0Hz\n",
" - 是否回环设备False\n",
"\n",
" 音频样本块大小2205\n",
" 样本位宽2\n",
" 采样格式8\n",
" 音频通道数2\n",
" 音频采样率44100\n",
" \n"
]
}
],
"source": [
"import sys\n",
"import os\n",
"import json\n",
"from vosk import Model, KaldiRecognizer\n",
"\n",
"current_dir = os.getcwd() \n",
"sys.path.append(os.path.join(current_dir, '../caption-engine'))\n",
"\n",
"from sysaudio.win import AudioStream\n",
"from audioprcs import resampleRawChunk, mergeChunkChannels\n",
"\n",
"stream = AudioStream(1)\n",
"stream.printInfo()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "5d5a0afa",
"metadata": {},
"outputs": [],
"source": [
"model = Model(os.path.join(\n",
" current_dir,\n",
" '../caption-engine/models/vosk-model-small-cn-0.22'\n",
"))\n",
"recognizer = KaldiRecognizer(model, 16000)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7e9d1530",
"metadata": {},
"outputs": [],
"source": [
"stream.openStream()\n",
"\n",
"for i in range(200):\n",
" chunk = stream.read_chunk()\n",
" chunk_mono = resampleRawChunk(chunk, stream.CHANNELS, stream.RATE, 16000)\n",
" if recognizer.AcceptWaveform(chunk_mono):\n",
" result = json.loads(recognizer.Result())\n",
" print(\"acc:\", result.get(\"text\", \"\"))\n",
" else:\n",
" partial = json.loads(recognizer.PartialResult())\n",
" print(\"else:\", partial.get(\"partial\", \"\"))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "subenv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,3 @@
from .gummy import GummyRecognizer
from .vosk import VoskRecognizer
from .sosv import SosvRecognizer

View File

@@ -5,9 +5,10 @@ from dashscope.audio.asr import (
TranslationRecognizerRealtime
)
import dashscope
from dashscope.common.error import InvalidParameter
from datetime import datetime
import json
import sys
from utils import stdout_cmd, stdout_obj, stdout_err
from utils import shared_data
class Callback(TranslationRecognizerCallback):
"""
@@ -15,17 +16,20 @@ class Callback(TranslationRecognizerCallback):
"""
def __init__(self):
super().__init__()
self.index = 0
self.usage = 0
self.cur_id = -1
self.time_str = ''
def on_open(self) -> None:
# print("on_open")
pass
self.usage = 0
self.cur_id = -1
self.time_str = ''
stdout_cmd('info', 'Gummy translator started.')
def on_close(self) -> None:
# print("on_close")
pass
stdout_cmd('info', 'Gummy translator closed.')
stdout_cmd('usage', str(self.usage))
def on_event(
self,
@@ -35,17 +39,17 @@ class Callback(TranslationRecognizerCallback):
usage
) -> None:
caption = {}
if transcription_result is not None:
caption['index'] = transcription_result.sentence_id
caption['text'] = transcription_result.text
if caption['index'] != self.cur_id:
self.cur_id = caption['index']
cur_time = datetime.now().strftime('%H:%M:%S.%f')[:-3]
caption['time_s'] = cur_time
self.time_str = cur_time
else:
caption['time_s'] = self.time_str
if self.cur_id != transcription_result.sentence_id:
self.time_str = datetime.now().strftime('%H:%M:%S.%f')[:-3]
self.cur_id = transcription_result.sentence_id
self.index += 1
caption['command'] = 'caption'
caption['index'] = self.index
caption['time_s'] = self.time_str
caption['time_t'] = datetime.now().strftime('%H:%M:%S.%f')[:-3]
caption['text'] = transcription_result.text
caption['translation'] = ""
if translation_result is not None:
@@ -55,21 +59,11 @@ class Callback(TranslationRecognizerCallback):
if usage:
self.usage += usage['duration']
# print(caption)
self.send_to_node(caption)
if 'text' in caption:
stdout_obj(caption)
def send_to_node(self, data):
"""
将数据发送到 Node.js 进程
"""
try:
json_data = json.dumps(data) + '\n'
sys.stdout.write(json_data)
sys.stdout.flush()
except Exception as e:
print(f"Error sending data to Node.js: {e}", file=sys.stderr)
class GummyTranslator:
class GummyRecognizer:
"""
使用 Gummy 引擎流式处理的音频数据并在标准输出中输出与 Auto Caption 软件可读取的 JSON 字符串数据
@@ -77,8 +71,9 @@ class GummyTranslator:
rate: 音频采样率
source: 源语言代码字符串zh, en, ja
target: 目标语言代码字符串zh, en, ja
api_key: 阿里云百炼平台 API KEY
"""
def __init__(self, rate, source, target, api_key):
def __init__(self, rate: int, source: str, target: str | None, api_key: str | None):
if api_key:
dashscope.api_key = api_key
self.translator = TranslationRecognizerRealtime(
@@ -96,10 +91,27 @@ class GummyTranslator:
"""启动 Gummy 引擎"""
self.translator.start()
def send_audio_frame(self, data):
"""发送音频帧"""
self.translator.send_audio_frame(data)
def translate(self):
"""持续读取共享数据中的音频帧,并进行语音识别,将识别结果输出到标准输出中"""
global shared_data
restart_count = 0
while shared_data.status == 'running':
chunk = shared_data.chunk_queue.get()
try:
self.translator.send_audio_frame(chunk)
except InvalidParameter as e:
restart_count += 1
if restart_count > 5:
stdout_err(str(e))
shared_data.status = "kill"
stdout_cmd('kill')
break
else:
stdout_cmd('info', f'Gummy engine stopped, restart attempt: {restart_count}...')
def stop(self):
"""停止 Gummy 引擎"""
self.translator.stop()
try:
self.translator.stop()
except Exception:
return

176
engine/audio2text/sosv.py Normal file
View File

@@ -0,0 +1,176 @@
"""
Shepra-ONNX SenseVoice Model
This code file references the following:
https://github.com/k2-fsa/sherpa-onnx/blob/master/python-api-examples/simulate-streaming-sense-voice-microphone.py
"""
import time
from datetime import datetime
import sherpa_onnx
import threading
import numpy as np
from utils import shared_data
from utils import stdout_cmd, stdout_obj
from utils import google_translate, ollama_translate
class SosvRecognizer:
"""
使用 Sense Voice 非流式模型处理流式音频数据,并在标准输出中输出 Auto Caption 软件可读取的 JSON 字符串数据
初始化参数:
model_path: Shepra ONNX Sense Voice 识别模型路径
vad_model: Silero VAD 模型路径
source: 识别源语言(auto, zh, en, ja, ko, yue)
target: 翻译目标语言
trans_model: 翻译模型名称
ollama_name: Ollama 模型名称
"""
def __init__(self, model_path: str, source: str, target: str | None, trans_model: str, ollama_name: str):
if model_path.startswith('"'):
model_path = model_path[1:]
if model_path.endswith('"'):
model_path = model_path[:-1]
self.model_path = model_path
self.ext = ""
if self.model_path[-4:] == "int8":
self.ext = ".int8"
self.source = source
self.target = target
if trans_model == 'google':
self.trans_func = google_translate
else:
self.trans_func = ollama_translate
self.ollama_name = ollama_name
self.time_str = ''
self.cur_id = 0
self.prev_content = ''
def start(self):
"""启动 Sense Voice 模型"""
self.recognizer = sherpa_onnx.OfflineRecognizer.from_sense_voice(
model=f"{self.model_path}/sensevoice/model{self.ext}.onnx",
tokens=f"{self.model_path}/sensevoice/tokens.txt",
language=self.source,
num_threads = 2,
)
vad_config = sherpa_onnx.VadModelConfig()
vad_config.silero_vad.model = f"{self.model_path}/silero_vad.onnx"
vad_config.silero_vad.threshold = 0.5
vad_config.silero_vad.min_silence_duration = 0.1
vad_config.silero_vad.min_speech_duration = 0.25
vad_config.silero_vad.max_speech_duration = 5
vad_config.sample_rate = 16000
self.window_size = vad_config.silero_vad.window_size
self.vad = sherpa_onnx.VoiceActivityDetector(vad_config, buffer_size_in_seconds=100)
if self.source == 'en':
model_config = sherpa_onnx.OnlinePunctuationModelConfig(
cnn_bilstm=f"{self.model_path}/punct-en/model{self.ext}.onnx",
bpe_vocab=f"{self.model_path}/punct-en/bpe.vocab"
)
punct_config = sherpa_onnx.OnlinePunctuationConfig(
model_config=model_config,
)
self.punct = sherpa_onnx.OnlinePunctuation(punct_config)
else:
punct_config = sherpa_onnx.OfflinePunctuationConfig(
model=sherpa_onnx.OfflinePunctuationModelConfig(
ct_transformer=f"{self.model_path}/punct/model{self.ext}.onnx"
),
)
self.punct = sherpa_onnx.OfflinePunctuation(punct_config)
self.buffer = []
self.offset = 0
self.started = False
self.started_time = .0
self.time_str = datetime.now().strftime('%H:%M:%S.%f')[:-3]
stdout_cmd('info', 'Shepra ONNX Sense Voice recognizer started.')
def send_audio_frame(self, data: bytes):
"""
发送音频帧给 SOSV 引擎,引擎将自动识别并将识别结果输出到标准输出中
Args:
data: 音频帧数据,采样率必须为 16000Hz
"""
caption = {}
caption['command'] = 'caption'
caption['translation'] = ''
data_np = np.frombuffer(data, dtype=np.int16).astype(np.float32)
self.buffer = np.concatenate([self.buffer, data_np])
while self.offset + self.window_size < len(self.buffer):
self.vad.accept_waveform(self.buffer[self.offset: self.offset + self.window_size])
if not self.started and self.vad.is_speech_detected():
self.started = True
self.started_time = time.time()
self.offset += self.window_size
if not self.started:
if len(self.buffer) > 10 * self.window_size:
self.offset -= len(self.buffer) - 10 * self.window_size
self.buffer = self.buffer[-10 * self.window_size:]
if self.started and time.time() - self.started_time > 0.2:
stream = self.recognizer.create_stream()
stream.accept_waveform(16000, self.buffer)
self.recognizer.decode_stream(stream)
text = stream.result.text.strip()
if text and self.prev_content != text:
caption['index'] = self.cur_id
caption['text'] = text
caption['time_s'] = self.time_str
caption['time_t'] = datetime.now().strftime('%H:%M:%S.%f')[:-3]
self.prev_content = text
stdout_obj(caption)
self.started_time = time.time()
while not self.vad.empty():
stream = self.recognizer.create_stream()
stream.accept_waveform(16000, self.vad.front.samples)
self.vad.pop()
self.recognizer.decode_stream(stream)
text = stream.result.text.strip()
if self.source == 'en':
text_with_punct = self.punct.add_punctuation_with_case(text)
else:
text_with_punct = self.punct.add_punctuation(text)
caption['index'] = self.cur_id
caption['text'] = text_with_punct
caption['time_s'] = self.time_str
caption['time_t'] = datetime.now().strftime('%H:%M:%S.%f')[:-3]
if text:
stdout_obj(caption)
if self.target:
th = threading.Thread(
target=self.trans_func,
args=(self.ollama_name, self.target, caption['text'], self.time_str),
daemon=True
)
th.start()
self.cur_id += 1
self.prev_content = ''
self.time_str = datetime.now().strftime('%H:%M:%S.%f')[:-3]
self.buffer = []
self.offset = 0
self.started = False
self.started_time = .0
def translate(self):
"""持续读取共享数据中的音频帧,并进行语音识别,将识别结果输出到标准输出中"""
global shared_data
while shared_data.status == 'running':
chunk = shared_data.chunk_queue.get()
self.send_audio_frame(chunk)
def stop(self):
"""停止 Sense Voice 模型"""
stdout_cmd('info', 'Shepra ONNX Sense Voice recognizer closed.')

96
engine/audio2text/vosk.py Normal file
View File

@@ -0,0 +1,96 @@
import json
import threading
import time
from datetime import datetime
from vosk import Model, KaldiRecognizer, SetLogLevel
from utils import shared_data
from utils import stdout_cmd, stdout_obj, google_translate, ollama_translate
class VoskRecognizer:
"""
使用 Vosk 引擎流式处理的音频数据,并在标准输出中输出与 Auto Caption 软件可读取的 JSON 字符串数据
初始化参数:
model_path: Vosk 识别模型路径
target: 翻译目标语言
trans_model: 翻译模型名称
ollama_name: Ollama 模型名称
"""
def __init__(self, model_path: str, target: str | None, trans_model: str, ollama_name: str):
SetLogLevel(-1)
if model_path.startswith('"'):
model_path = model_path[1:]
if model_path.endswith('"'):
model_path = model_path[:-1]
self.model_path = model_path
self.target = target
if trans_model == 'google':
self.trans_func = google_translate
else:
self.trans_func = ollama_translate
self.ollama_name = ollama_name
self.time_str = ''
self.cur_id = 0
self.prev_content = ''
self.model = Model(self.model_path)
self.recognizer = KaldiRecognizer(self.model, 16000)
def start(self):
"""启动 Vosk 引擎"""
stdout_cmd('info', 'Vosk recognizer started.')
def send_audio_frame(self, data: bytes):
"""
发送音频帧给 Vosk 引擎,引擎将自动识别并将识别结果输出到标准输出中
Args:
data: 音频帧数据,采样率必须为 16000Hz
"""
caption = {}
caption['command'] = 'caption'
caption['translation'] = ''
if self.recognizer.AcceptWaveform(data):
content = json.loads(self.recognizer.Result()).get('text', '')
caption['index'] = self.cur_id
caption['text'] = content
caption['time_s'] = self.time_str
caption['time_t'] = datetime.now().strftime('%H:%M:%S.%f')[:-3]
self.prev_content = ''
if content == '': return
self.cur_id += 1
if self.target:
th = threading.Thread(
target=self.trans_func,
args=(self.ollama_name, self.target, caption['text'], self.time_str),
daemon=True
)
th.start()
else:
content = json.loads(self.recognizer.PartialResult()).get('partial', '')
if content == '' or content == self.prev_content:
return
if self.prev_content == '':
self.time_str = datetime.now().strftime('%H:%M:%S.%f')[:-3]
caption['index'] = self.cur_id
caption['text'] = content
caption['time_s'] = self.time_str
caption['time_t'] = datetime.now().strftime('%H:%M:%S.%f')[:-3]
self.prev_content = content
stdout_obj(caption)
def translate(self):
"""持续读取共享数据中的音频帧,并进行语音识别,将识别结果输出到标准输出中"""
global shared_data
while shared_data.status == 'running':
chunk = shared_data.chunk_queue.get()
self.send_audio_frame(chunk)
def stop(self):
"""停止 Vosk 引擎"""
stdout_cmd('info', 'Vosk recognizer closed.')

212
engine/main.py Normal file
View File

@@ -0,0 +1,212 @@
import wave
import argparse
import threading
import datetime
from utils import stdout, stdout_cmd, change_caption_display
from utils import shared_data, start_server
from utils import merge_chunk_channels, resample_chunk_mono
from audio2text import GummyRecognizer
from audio2text import VoskRecognizer
from audio2text import SosvRecognizer
from sysaudio import AudioStream
def audio_recording(stream: AudioStream, resample: bool, record = False, path = ''):
global shared_data
stream.open_stream()
wf = None
full_name = ''
if record:
if path != '':
if path.startswith('"') and path.endswith('"'):
path = path[1:-1]
if path[-1] != '/':
path += '/'
cur_dt = datetime.datetime.now()
name = cur_dt.strftime("audio-%Y-%m-%dT%H-%M-%S")
full_name = f'{path}{name}.wav'
wf = wave.open(full_name, 'wb')
wf.setnchannels(stream.CHANNELS)
wf.setsampwidth(stream.SAMP_WIDTH)
wf.setframerate(stream.RATE)
stdout_cmd("info", "Audio recording...")
while shared_data.status == 'running':
raw_chunk = stream.read_chunk()
if record: wf.writeframes(raw_chunk) # type: ignore
if raw_chunk is None: continue
if resample:
chunk = resample_chunk_mono(raw_chunk, stream.CHANNELS, stream.RATE, 16000)
else:
chunk = merge_chunk_channels(raw_chunk, stream.CHANNELS)
shared_data.chunk_queue.put(chunk)
if record:
stdout_cmd("info", f"Audio saved to {full_name}")
wf.close() # type: ignore
stream.close_stream_signal()
def main_gummy(s: str, t: str, a: int, c: int, k: str, r: bool, rp: str):
"""
Parameters:
s: Source language
t: Target language
k: Aliyun Bailian API key
r: Whether to record the audio
rp: Path to save the recorded audio
"""
stream = AudioStream(a, c)
if t == 'none':
engine = GummyRecognizer(stream.RATE, s, None, k)
else:
engine = GummyRecognizer(stream.RATE, s, t, k)
engine.start()
stream_thread = threading.Thread(
target=audio_recording,
args=(stream, False, r, rp),
daemon=True
)
stream_thread.start()
try:
engine.translate()
except KeyboardInterrupt:
stdout("Keyboard interrupt detected. Exiting...")
engine.stop()
def main_vosk(a: int, c: int, vosk: str, t: str, tm: str, omn: str, r: bool, rp: str):
"""
Parameters:
a: Audio source: 0 for output, 1 for input
c: Chunk number in 1 second
vosk: Vosk model path
t: Target language
tm: Translation model type, ollama or google
omn: Ollama model name
r: Whether to record the audio
rp: Path to save the recorded audio
"""
stream = AudioStream(a, c)
if t == 'none':
engine = VoskRecognizer(vosk, None, tm, omn)
else:
engine = VoskRecognizer(vosk, t, tm, omn)
engine.start()
stream_thread = threading.Thread(
target=audio_recording,
args=(stream, True, r, rp),
daemon=True
)
stream_thread.start()
try:
engine.translate()
except KeyboardInterrupt:
stdout("Keyboard interrupt detected. Exiting...")
engine.stop()
def main_sosv(a: int, c: int, sosv: str, s: str, t: str, tm: str, omn: str, r: bool, rp: str):
"""
Parameters:
a: Audio source: 0 for output, 1 for input
c: Chunk number in 1 second
sosv: Sherpa-ONNX SenseVoice model path
s: Source language
t: Target language
tm: Translation model type, ollama or google
omn: Ollama model name
r: Whether to record the audio
rp: Path to save the recorded audio
"""
stream = AudioStream(a, c)
if t == 'none':
engine = SosvRecognizer(sosv, s, None, tm, omn)
else:
engine = SosvRecognizer(sosv, s, t, tm, omn)
engine.start()
stream_thread = threading.Thread(
target=audio_recording,
args=(stream, True, r, rp),
daemon=True
)
stream_thread.start()
try:
engine.translate()
except KeyboardInterrupt:
stdout("Keyboard interrupt detected. Exiting...")
engine.stop()
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='Convert system audio stream to text')
# all
parser.add_argument('-e', '--caption_engine', default='gummy', help='Caption engine: gummy or vosk or sosv')
parser.add_argument('-a', '--audio_type', default=0, help='Audio stream source: 0 for output, 1 for input')
parser.add_argument('-c', '--chunk_rate', default=10, help='Number of audio stream chunks collected per second')
parser.add_argument('-p', '--port', default=0, help='The port to run the server on, 0 for no server')
parser.add_argument('-d', '--display_caption', default=0, help='Display caption on terminal, 0 for no display, 1 for display')
parser.add_argument('-t', '--target_language', default='none', help='Target language code, "none" for no translation')
parser.add_argument('-r', '--record', default=0, help='Whether to record the audio, 0 for no recording, 1 for recording')
parser.add_argument('-rp', '--record_path', default='', help='Path to save the recorded audio')
# gummy and sosv
parser.add_argument('-s', '--source_language', default='auto', help='Source language code')
# gummy only
parser.add_argument('-k', '--api_key', default='', help='API KEY for Gummy model')
# vosk and sosv
parser.add_argument('-tm', '--translation_model', default='ollama', help='Model for translation: ollama or google')
parser.add_argument('-omn', '--ollama_name', default='', help='Ollama model name for translation')
# vosk only
parser.add_argument('-vosk', '--vosk_model', default='', help='The path to the vosk model.')
# sosv only
parser.add_argument('-sosv', '--sosv_model', default=None, help='The SenseVoice model path')
args = parser.parse_args()
if int(args.port) == 0:
shared_data.status = "running"
else:
start_server(int(args.port))
if int(args.display_caption) != 0:
change_caption_display(True)
print("Caption will be displayed on terminal")
if args.caption_engine == 'gummy':
main_gummy(
args.source_language,
args.target_language,
int(args.audio_type),
int(args.chunk_rate),
args.api_key,
True if int(args.record) == 1 else False,
args.record_path
)
elif args.caption_engine == 'vosk':
main_vosk(
int(args.audio_type),
int(args.chunk_rate),
args.vosk_model,
args.target_language,
args.translation_model,
args.ollama_name,
True if int(args.record) == 1 else False,
args.record_path
)
elif args.caption_engine == 'sosv':
main_sosv(
int(args.audio_type),
int(args.chunk_rate),
args.sosv_model,
args.source_language,
args.target_language,
args.translation_model,
args.ollama_name,
True if int(args.record) == 1 else False,
args.record_path
)
else:
raise ValueError('Invalid caption engine specified.')
if shared_data.status == "kill":
stdout_cmd('kill')

View File

@@ -1,11 +1,15 @@
# -*- mode: python ; coding: utf-8 -*-
from pathlib import Path
import sys
vosk_path = str(Path('./subenv/Lib/site-packages/vosk').resolve())
if sys.platform == 'win32':
vosk_path = str(Path('./.venv/Lib/site-packages/vosk').resolve())
else:
vosk_path = str(Path('./.venv/lib/python3.12/site-packages/vosk').resolve())
a = Analysis(
['main-vosk.py'],
['main.py'],
pathex=[],
binaries=[],
datas=[(vosk_path, 'vosk')],
@@ -26,7 +30,7 @@ exe = EXE(
a.binaries,
a.datas,
[],
name='main-vosk',
name='main',
debug=False,
bootloader_ignore_signals=False,
strip=False,
@@ -39,4 +43,5 @@ exe = EXE(
target_arch=None,
codesign_identity=None,
entitlements_file=None,
onefile=True,
)

10
engine/requirements.txt Normal file
View File

@@ -0,0 +1,10 @@
dashscope
numpy
resampy
vosk
pyinstaller
pyaudio; sys_platform == 'darwin'
pyaudiowpatch; sys_platform == 'win32'
googletrans
ollama
sherpa_onnx

View File

@@ -0,0 +1,10 @@
import sys
if sys.platform == "win32":
from .win import AudioStream
elif sys.platform == "darwin":
from .darwin import AudioStream
elif sys.platform == "linux":
from .linux import AudioStream
else:
raise NotImplementedError(f"Unsupported platform: {sys.platform}")

118
engine/sysaudio/darwin.py Normal file
View File

@@ -0,0 +1,118 @@
"""获取 MacOS 系统音频输入/输出流"""
import pyaudio
from textwrap import dedent
def get_blackhole_device(mic: pyaudio.PyAudio):
"""
获取 BlackHole 设备
"""
device_count = mic.get_device_count()
for i in range(device_count):
dev_info = mic.get_device_info_by_index(i)
if 'blackhole' in str(dev_info["name"]).lower():
return dev_info
raise Exception("The device containing BlackHole was not found.")
class AudioStream:
"""
获取系统音频流(如果要捕获输出音频,仅支持 BlackHole 作为系统音频输出捕获)
初始化参数:
audio_type: 0-系统音频输出流(需配合 BlackHole1-系统音频输入流
chunk_rate: 每秒采集音频块的数量默认为10
"""
def __init__(self, audio_type=0, chunk_rate=10):
self.audio_type = audio_type
self.mic = pyaudio.PyAudio()
if self.audio_type == 0:
self.device = get_blackhole_device(self.mic)
else:
self.device = self.mic.get_default_input_device_info()
self.stop_signal = False
self.stream = None
self.INDEX = self.device["index"]
self.FORMAT = pyaudio.paInt16
self.SAMP_WIDTH = pyaudio.get_sample_size(self.FORMAT)
self.CHANNELS = int(self.device["maxInputChannels"])
self.DEFAULT_RATE = int(self.device["defaultSampleRate"])
self.CHUNK_RATE = chunk_rate
self.RATE = 16000
self.CHUNK = self.RATE // self.CHUNK_RATE
self.open_stream()
self.close_stream()
def get_info(self):
dev_info = f"""
采样设备:
- 设备类型:{ "音频输出" if self.audio_type == 0 else "音频输入" }
- 设备序号:{self.device['index']}
- 设备名称:{self.device['name']}
- 最大输入通道数:{self.device['maxInputChannels']}
- 默认低输入延迟:{self.device['defaultLowInputLatency']}s
- 默认高输入延迟:{self.device['defaultHighInputLatency']}s
- 默认采样率:{self.device['defaultSampleRate']}Hz
- 是否回环设备:{self.device['isLoopbackDevice']}
设备序号:{self.INDEX}
样本格式:{self.FORMAT}
样本位宽:{self.SAMP_WIDTH}
样本通道数:{self.CHANNELS}
样本采样率:{self.RATE}
样本块大小:{self.CHUNK}
"""
return dedent(dev_info).strip()
def open_stream(self):
"""
打开并返回系统音频输出流
"""
if self.stream: return self.stream
try:
self.stream = self.mic.open(
format = self.FORMAT,
channels = int(self.CHANNELS),
rate = self.RATE,
input = True,
input_device_index = int(self.INDEX)
)
except OSError:
self.RATE = self.DEFAULT_RATE
self.CHUNK = self.RATE // self.CHUNK_RATE
self.stream = self.mic.open(
format = self.FORMAT,
channels = int(self.CHANNELS),
rate = self.RATE,
input = True,
input_device_index = int(self.INDEX)
)
return self.stream
def read_chunk(self) -> bytes | None:
"""
读取音频数据
"""
if self.stop_signal:
self.close_stream()
return None
if not self.stream: return None
return self.stream.read(self.CHUNK, exception_on_overflow=False)
def close_stream_signal(self):
"""
线程安全的关闭系统音频输入流,不一定会立即关闭
"""
self.stop_signal = True
def close_stream(self):
"""
立即关闭系统音频输入流
"""
if self.stream is not None:
self.stream.stop_stream()
self.stream.close()
self.stream = None
self.stop_signal = False

109
engine/sysaudio/linux.py Normal file
View File

@@ -0,0 +1,109 @@
"""获取 Linux 系统音频输入流"""
import subprocess
from textwrap import dedent
def find_monitor_source():
result = subprocess.run(
["pactl", "list", "short", "sources"],
stdout=subprocess.PIPE, text=True
)
lines = result.stdout.splitlines()
for line in lines:
parts = line.split('\t')
if len(parts) >= 2 and ".monitor" in parts[1]:
return parts[1]
raise RuntimeError("System output monitor device not found")
def find_input_source():
result = subprocess.run(
["pactl", "list", "short", "sources"],
stdout=subprocess.PIPE, text=True
)
lines = result.stdout.splitlines()
for line in lines:
parts = line.split('\t')
name = parts[1]
if ".monitor" not in name:
return name
raise RuntimeError("Microphone input device not found")
class AudioStream:
"""
获取系统音频流
初始化参数:
audio_type: 0-系统音频输出流不支持不会生效1-系统音频输入流(默认)
chunk_rate: 每秒采集音频块的数量默认为10
"""
def __init__(self, audio_type=1, chunk_rate=10):
self.audio_type = audio_type
if self.audio_type == 0:
self.source = find_monitor_source()
else:
self.source = find_input_source()
self.stop_signal = False
self.process = None
self.FORMAT = 16
self.SAMP_WIDTH = 2
self.CHANNELS = 2
self.RATE = 16000
self.CHUNK_RATE = chunk_rate
self.CHUNK = self.RATE // chunk_rate
def get_info(self):
dev_info = f"""
音频捕获进程:
- 捕获类型:{"音频输出" if self.audio_type == 0 else "音频输入"}
- 设备源:{self.source}
- 捕获进程 PID{self.process.pid if self.process else "None"}
样本格式:{self.FORMAT}
样本位宽:{self.SAMP_WIDTH}
样本通道数:{self.CHANNELS}
样本采样率:{self.RATE}
样本块大小:{self.CHUNK}
"""
print(dev_info)
def open_stream(self):
"""
启动音频捕获进程
"""
self.process = subprocess.Popen(
["parec", "-d", self.source, "--format=s16le", "--rate=16000", "--channels=2"],
stdout=subprocess.PIPE
)
def read_chunk(self):
"""
读取音频数据
"""
if self.stop_signal:
self.close_stream()
return None
if self.process and self.process.stdout:
return self.process.stdout.read(self.CHUNK)
return None
def close_stream_signal(self):
"""
线程安全的关闭系统音频输入流,不一定会立即关闭
"""
self.stop_signal = True
def close_stream(self):
"""
关闭系统音频捕获进程
"""
if self.process:
self.process.terminate()
self.stop_signal = False

View File

@@ -1,14 +1,15 @@
"""获取 Windows 系统音频输入/输出流"""
import pyaudiowpatch as pyaudio
from textwrap import dedent
def getDefaultLoopbackDevice(mic: pyaudio.PyAudio, info = True)->dict:
def get_default_loopback_device(mic: pyaudio.PyAudio, info = True)->dict:
"""
获取默认的系统音频输出的回环设备
Args:
mic (pyaudio.PyAudio): pyaudio对象
info (bool, optional): 是否打印设备信息
mic: pyaudio对象
info: 是否打印设备信息
Returns:
dict: 系统音频输出的回环设备
@@ -45,69 +46,97 @@ class AudioStream:
初始化参数
audio_type: 0-系统音频输出流默认1-系统音频输入流
chunk_rate: 每秒采集音频块的数量默认为20
chunk_rate: 每秒采集音频块的数量默认为10
"""
def __init__(self, audio_type=0, chunk_rate=20):
def __init__(self, audio_type=0, chunk_rate=10, chunk_size=-1):
self.audio_type = audio_type
self.mic = pyaudio.PyAudio()
if self.audio_type == 0:
self.device = getDefaultLoopbackDevice(self.mic, False)
self.device = get_default_loopback_device(self.mic, False)
else:
self.device = self.mic.get_default_input_device_info()
self.stop_signal = False
self.stream = None
self.SAMP_WIDTH = pyaudio.get_sample_size(pyaudio.paInt16)
self.FORMAT = pyaudio.paInt16
self.CHANNELS = int(self.device["maxInputChannels"])
self.RATE = int(self.device["defaultSampleRate"])
self.CHUNK = self.RATE // chunk_rate
self.INDEX = self.device["index"]
self.FORMAT = pyaudio.paInt16
self.SAMP_WIDTH = pyaudio.get_sample_size(self.FORMAT)
self.CHANNELS = int(self.device["maxInputChannels"])
self.DEFAULT_RATE = int(self.device["defaultSampleRate"])
self.CHUNK_RATE = chunk_rate
def printInfo(self):
self.RATE = 16000
self.CHUNK = self.RATE // self.CHUNK_RATE
self.open_stream()
self.close_stream()
def get_info(self):
dev_info = f"""
采样设备
- 设备类型{ "音频输出" if self.audio_type == 0 else "音频输入" }
- 序号{self.device['index']}
- 名称{self.device['name']}
- 设备序号{self.device['index']}
- 设备名称{self.device['name']}
- 最大输入通道数{self.device['maxInputChannels']}
- 默认低输入延迟{self.device['defaultLowInputLatency']}s
- 默认高输入延迟{self.device['defaultHighInputLatency']}s
- 默认采样率{self.device['defaultSampleRate']}Hz
- 是否回环设备{self.device['isLoopbackDevice']}
音频样本块大小{self.CHUNK}
设备序号{self.INDEX}
样本格式{self.FORMAT}
样本位宽{self.SAMP_WIDTH}
采样格式{self.FORMAT}
音频通道数{self.CHANNELS}
音频采样率{self.RATE}
样本通道数{self.CHANNELS}
样本采样率{self.RATE}
样本块大小{self.CHUNK}
"""
print(dev_info)
return dedent(dev_info).strip()
def openStream(self):
def open_stream(self):
"""
打开并返回系统音频输出流
"""
if self.stream: return self.stream
self.stream = self.mic.open(
format = self.FORMAT,
channels = self.CHANNELS,
rate = self.RATE,
input = True,
input_device_index = self.INDEX
)
try:
self.stream = self.mic.open(
format = self.FORMAT,
channels = self.CHANNELS,
rate = self.RATE,
input = True,
input_device_index = self.INDEX
)
except OSError:
self.RATE = self.DEFAULT_RATE
self.CHUNK = self.RATE // self.CHUNK_RATE
self.stream = self.mic.open(
format = self.FORMAT,
channels = self.CHANNELS,
rate = self.RATE,
input = True,
input_device_index = self.INDEX
)
return self.stream
def read_chunk(self):
def read_chunk(self) -> bytes | None:
"""
读取音频数据
"""
if self.stop_signal:
self.close_stream()
return None
if not self.stream: return None
return self.stream.read(self.CHUNK, exception_on_overflow=False)
def closeStream(self):
def close_stream_signal(self):
"""
关闭系统音频输
线程安全的关闭系统音频输不一定会立即关闭
"""
if self.stream is None: return
self.stream.stop_stream()
self.stream.close()
self.stream = None
self.stop_signal = True
def close_stream(self):
"""
关闭系统音频输入流
"""
if self.stream is not None:
self.stream.stop_stream()
self.stream.close()
self.stream = None
self.stop_signal = False

6
engine/utils/__init__.py Normal file
View File

@@ -0,0 +1,6 @@
from .audioprcs import merge_chunk_channels, resample_chunk_mono
from .sysout import stdout, stdout_err, stdout_cmd, stdout_obj, stderr
from .sysout import change_caption_display
from .shared import shared_data
from .server import start_server
from .translation import ollama_translate, google_translate

64
engine/utils/audioprcs.py Normal file
View File

@@ -0,0 +1,64 @@
import resampy
import numpy as np
import numpy.core.multiarray # do not remove
def merge_chunk_channels(chunk: bytes, channels: int) -> bytes:
"""
将当前多通道音频数据块转换为单通道音频数据块
Args:
chunk: 多通道音频数据块
channels: 通道数
Returns:
单通道音频数据块
"""
if channels == 1: return chunk
# (length * channels,)
chunk_np = np.frombuffer(chunk, dtype=np.int16)
# (length, channels)
chunk_np = chunk_np.reshape(-1, channels)
# (length,)
chunk_mono_f = np.mean(chunk_np.astype(np.float32), axis=1)
chunk_mono = np.round(chunk_mono_f).astype(np.int16)
return chunk_mono.tobytes()
def resample_chunk_mono(chunk: bytes, channels: int, orig_sr: int, target_sr: int) -> bytes:
"""
将当前多通道音频数据块转换成单通道音频数据块,并进行重采样
Args:
chunk: 多通道音频数据块
channels: 通道数
orig_sr: 原始采样率
target_sr: 目标采样率
Return:
单通道音频数据块
"""
if channels == 1:
chunk_mono = np.frombuffer(chunk, dtype=np.int16)
chunk_mono = chunk_mono.astype(np.float32)
else:
# (length * channels,)
chunk_np = np.frombuffer(chunk, dtype=np.int16)
# (length, channels)
chunk_np = chunk_np.reshape(-1, channels)
# (length,)
chunk_mono = np.mean(chunk_np.astype(np.float32), axis=1)
if orig_sr == target_sr:
return chunk_mono.astype(np.int16).tobytes()
chunk_mono_r = resampy.resample(chunk_mono, orig_sr, target_sr)
chunk_mono_r = np.round(chunk_mono_r).astype(np.int16)
real_len = round(chunk_mono.shape[0] * target_sr / orig_sr)
if(chunk_mono_r.shape[0] != real_len):
print(chunk_mono_r.shape[0], real_len)
if(chunk_mono_r.shape[0] > real_len):
chunk_mono_r = chunk_mono_r[:real_len]
else:
while chunk_mono_r.shape[0] < real_len:
chunk_mono_r = np.append(chunk_mono_r, chunk_mono_r[-1])
return chunk_mono_r.tobytes()

41
engine/utils/server.py Normal file
View File

@@ -0,0 +1,41 @@
import socket
import threading
import json
from utils import shared_data, stdout_cmd, stderr
def handle_client(client_socket):
global shared_data
while shared_data.status == 'running':
try:
data = client_socket.recv(4096).decode('utf-8')
if not data:
break
data = json.loads(data)
if data['command'] == 'stop':
shared_data.status = 'stop'
break
except Exception as e:
stderr(f'Communication error: {e}')
break
shared_data.status = 'stop'
client_socket.close()
def start_server(port: int):
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
server.bind(('localhost', port))
server.listen(1)
except Exception as e:
stderr(str(e))
stdout_cmd('kill')
return
stdout_cmd('connect')
client, addr = server.accept()
client_handler = threading.Thread(target=handle_client, args=(client,))
client_handler.daemon = True
client_handler.start()

8
engine/utils/shared.py Normal file
View File

@@ -0,0 +1,8 @@
import queue
class SharedData:
def __init__(self):
self.status = "running"
self.chunk_queue = queue.Queue()
shared_data = SharedData()

62
engine/utils/sysout.py Normal file
View File

@@ -0,0 +1,62 @@
import sys
import json
import sherpa_onnx
display_caption = False
caption_index = -1
display = sherpa_onnx.Display()
def stdout(text: str):
stdout_cmd("print", text)
def stdout_err(text: str):
stdout_cmd("error", text)
def stdout_cmd(command: str, content = ""):
msg = { "command": command, "content": content }
sys.stdout.write(json.dumps(msg) + "\n")
sys.stdout.flush()
def change_caption_display(val: bool):
global display_caption
display_caption = val
def caption_display(obj):
global display_caption
global caption_index
global display
if caption_index >=0 and caption_index != int(obj['index']):
display.finalize_current_sentence()
caption_index = int(obj['index'])
full_text = f"{obj['text']}\n{obj['translation']}"
if obj['translation']:
full_text += "\n"
display.update_text(full_text)
display.display()
def translation_display(obj):
global original_caption
global display
full_text = f"{obj['text']}\n{obj['translation']}"
if obj['translation']:
full_text += "\n"
display.update_text(full_text)
display.display()
display.finalize_current_sentence()
def stdout_obj(obj):
global display_caption
print(obj['command'], display_caption)
if obj['command'] == 'caption' and display_caption:
caption_display(obj)
return
if obj['command'] == 'translation' and display_caption:
translation_display(obj)
return
sys.stdout.write(json.dumps(obj) + "\n")
sys.stdout.flush()
def stderr(text: str):
sys.stderr.write(text + "\n")
sys.stderr.flush()

View File

@@ -0,0 +1,51 @@
from ollama import chat
from ollama import ChatResponse
import asyncio
from googletrans import Translator
from .sysout import stdout_cmd, stdout_obj
lang_map = {
'en': 'English',
'es': 'Spanish',
'fr': 'French',
'de': 'German',
'it': 'Italian',
'ru': 'Russian',
'ja': 'Japanese',
'ko': 'Korean',
'zh': 'Chinese',
'zh-cn': 'Chinese'
}
def ollama_translate(model: str, target: str, text: str, time_s: str):
response: ChatResponse = chat(
model=model,
messages=[
{"role": "system", "content": f"/no_think Translate the following content into {lang_map[target]}, and do not output any additional information."},
{"role": "user", "content": text}
]
)
content = response.message.content or ""
if content.startswith('<think>'):
index = content.find('</think>')
if index != -1:
content = content[index+8:]
stdout_obj({
"command": "translation",
"time_s": time_s,
"text": text,
"translation": content.strip()
})
def google_translate(model: str, target: str, text: str, time_s: str):
translator = Translator()
try:
res = asyncio.run(translator.translate(text, dest=target))
stdout_obj({
"command": "translation",
"time_s": time_s,
"text": text,
"translation": res.text
})
except Exception as e:
stdout_cmd("warn", f"Google translation request failed, please check your network connection...")

893
package-lock.json generated

File diff suppressed because it is too large Load Diff

View File

@@ -1,7 +1,7 @@
{
"name": "auto-caption",
"productName": "Auto Caption",
"version": "0.4.0",
"version": "1.0.0",
"description": "A cross-platform subtitle display software.",
"main": "./out/main/index.js",
"author": "himeditator",
@@ -25,6 +25,7 @@
"@electron-toolkit/preload": "^3.0.1",
"@electron-toolkit/utils": "^4.0.0",
"ant-design-vue": "^4.2.6",
"pidusage": "^4.0.1",
"pinia": "^3.0.2",
"vue-i18n": "^11.1.9",
"vue-router": "^4.5.1"
@@ -34,6 +35,7 @@
"@electron-toolkit/eslint-config-ts": "^3.0.0",
"@electron-toolkit/tsconfig": "^1.0.1",
"@types/node": "^22.14.1",
"@types/pidusage": "^2.0.5",
"@vitejs/plugin-vue": "^5.2.3",
"electron": "^35.1.5",
"electron-builder": "^25.1.8",

View File

@@ -3,6 +3,7 @@ import path from 'path'
import { is } from '@electron-toolkit/utils'
import icon from '../../build/icon.png?asset'
import { controlWindow } from './ControlWindow'
import { allConfig } from './utils/AllConfig'
class CaptionWindow {
window: BrowserWindow | undefined;
@@ -10,7 +11,7 @@ class CaptionWindow {
public createWindow(): void {
this.window = new BrowserWindow({
icon: icon,
width: 900,
width: allConfig.captionWindowWidth,
height: 100,
minWidth: 480,
show: false,
@@ -30,6 +31,12 @@ class CaptionWindow {
this.window?.show()
})
this.window.on('close', () => {
if(this.window) {
allConfig.captionWindowWidth = this.window?.getBounds().width;
}
})
this.window.on('closed', () => {
this.window = undefined
})

View File

@@ -1,12 +1,16 @@
import { shell, BrowserWindow, ipcMain, nativeTheme, dialog } from 'electron'
import path from 'path'
import { EngineInfo } from './types'
import pidusage from 'pidusage'
import { is } from '@electron-toolkit/utils'
import icon from '../../build/icon.png?asset'
import { captionWindow } from './CaptionWindow'
import { allConfig } from './utils/AllConfig'
import { captionEngine } from './utils/CaptionEngine'
import { Log } from './utils/Log'
class ControlWindow {
mounted: boolean = false;
window: BrowserWindow | undefined;
public createWindow(): void {
@@ -32,6 +36,7 @@ class ControlWindow {
})
this.window.on('closed', () => {
this.mounted = false
this.window = undefined
allConfig.writeConfig()
})
@@ -61,7 +66,8 @@ class ControlWindow {
})
ipcMain.handle('both.window.mounted', () => {
return allConfig.getFullConfig()
this.mounted = true
return allConfig.getFullConfig(Log.getAndClearLogQueue())
})
ipcMain.handle('control.nativeTheme.get', () => {
@@ -81,6 +87,21 @@ class ControlWindow {
return result.filePaths[0];
})
ipcMain.handle('control.engine.info', async () => {
const info: EngineInfo = {
pid: 0, ppid: 0, port: 0, cpu: 0, mem: 0, elapsed: 0
}
if(captionEngine.status !== 'running') return info
const stats = await pidusage(captionEngine.process.pid)
info.pid = stats.pid
info.ppid = stats.ppid
info.port = captionEngine.port
info.cpu = stats.cpu
info.mem = stats.memory
info.elapsed = stats.elapsed
return info
})
ipcMain.on('control.uiLanguage.change', (_, args) => {
allConfig.uiLanguage = args
if(captionWindow.window){
@@ -92,6 +113,10 @@ class ControlWindow {
allConfig.uiTheme = args
})
ipcMain.on('control.uiColor.change', (_, args) => {
allConfig.uiColor = args
})
ipcMain.on('control.leftBarWidth.change', (_, args) => {
allConfig.leftBarWidth = args
})
@@ -134,6 +159,10 @@ class ControlWindow {
captionEngine.stop()
})
ipcMain.on('control.engine.forceKill', () => {
captionEngine.kill()
})
ipcMain.on('control.captionLog.clear', () => {
allConfig.captionLog.splice(0)
})

View File

@@ -4,5 +4,6 @@ export default {
"engine.start.error": "Caption engine failed to start: ",
"engine.output.parse.error": "Unable to parse caption engine output as a JSON object: ",
"engine.error": "Caption engine error: ",
"engine.shutdown.error": "Failed to shut down the caption engine process: "
"engine.shutdown.error": "Failed to shut down the caption engine process: ",
"engine.start.timeout": "Caption engine startup timeout, automatically force stopped"
}

View File

@@ -4,5 +4,6 @@ export default {
"engine.start.error": "字幕エンジンの起動に失敗しました: ",
"engine.output.parse.error": "字幕エンジンの出力を JSON オブジェクトとして解析できませんでした: ",
"engine.error": "字幕エンジンエラー: ",
"engine.shutdown.error": "字幕エンジンプロセスの終了に失敗しました: "
"engine.shutdown.error": "字幕エンジンプロセスの終了に失敗しました: ",
"engine.start.timeout": "字幕エンジンの起動がタイムアウトしました。自動的に強制停止しました"
}

View File

@@ -4,5 +4,6 @@ export default {
"engine.start.error": "字幕引擎启动失败:",
"engine.output.parse.error": "字幕引擎输出内容无法解析为 JSON 对象:",
"engine.error": "字幕引擎错误:",
"engine.shutdown.error": "字幕引擎进程关闭失败:"
"engine.shutdown.error": "字幕引擎进程关闭失败:",
"engine.start.timeout": "字幕引擎启动超时,已自动强制停止"
}

View File

@@ -25,7 +25,7 @@ app.whenReady().then(() => {
})
app.on('will-quit', async () => {
captionEngine.stop()
captionEngine.kill()
allConfig.writeConfig()
});

View File

@@ -6,17 +6,24 @@ export interface Controls {
engineEnabled: boolean,
sourceLang: string,
targetLang: string,
transModel: string,
ollamaName: string,
engine: string,
audio: 0 | 1,
translation: boolean,
recording: boolean,
API_KEY: string,
modelPath: string,
voskModelPath: string,
sosvModelPath: string,
recordingPath: string,
customized: boolean,
customizedApp: string,
customizedCommand: string
customizedCommand: string,
startTimeoutSeconds: number
}
export interface Styles {
lineNumber: number,
lineBreak: number,
fontFamily: string,
fontSize: number,
@@ -45,12 +52,30 @@ export interface CaptionItem {
translation: string
}
export interface SoftwareLogItem {
type: "INFO" | "WARN" | "ERROR",
index: number,
time: string,
text: string
}
export interface FullConfig {
platform: string,
uiLanguage: UILanguage,
uiTheme: UITheme,
uiColor: string,
leftBarWidth: number,
styles: Styles,
controls: Controls,
captionLog: CaptionItem[]
captionLog: CaptionItem[],
softwareLog: SoftwareLogItem[]
}
export interface EngineInfo {
pid: number,
ppid: number,
port:number,
cpu: number,
mem: number,
elapsed: number
}

View File

@@ -1,12 +1,25 @@
import {
UILanguage, UITheme, Styles, Controls,
CaptionItem, FullConfig
CaptionItem, FullConfig, SoftwareLogItem
} from '../types'
import { Log } from './Log'
import { app, BrowserWindow } from 'electron'
import * as path from 'path'
import * as fs from 'fs'
import os from 'os'
interface CaptionTranslation {
time_s: string,
translation: string
}
function getDesktopPath() {
const homeDir = os.homedir()
return path.join(homeDir, 'Desktop')
}
const defaultStyles: Styles = {
lineNumber: 1,
lineBreak: 1,
fontFamily: 'sans-serif',
fontSize: 24,
@@ -30,24 +43,35 @@ const defaultStyles: Styles = {
const defaultControls: Controls = {
sourceLang: 'en',
targetLang: 'zh',
transModel: 'ollama',
ollamaName: '',
engine: 'gummy',
audio: 0,
engineEnabled: false,
API_KEY: '',
modelPath: '',
voskModelPath: '',
sosvModelPath: '',
recordingPath: getDesktopPath(),
translation: true,
recording: false,
customized: false,
customizedApp: '',
customizedCommand: ''
customizedCommand: '',
startTimeoutSeconds: 30
};
class AllConfig {
captionWindowWidth: number = 900;
uiLanguage: UILanguage = 'zh';
leftBarWidth: number = 8;
uiTheme: UITheme = 'system';
uiColor: string = '#1677ff';
styles: Styles = {...defaultStyles};
controls: Controls = {...defaultControls};
lastLogIndex: number = -1;
captionLog: CaptionItem[] = [];
constructor() {}
@@ -55,39 +79,44 @@ class AllConfig {
public readConfig() {
const configPath = path.join(app.getPath('userData'), 'config.json')
if(fs.existsSync(configPath)){
Log.info('Read Config from:', configPath)
const config = JSON.parse(fs.readFileSync(configPath, 'utf-8'))
if(config.captionWindowWidth) this.captionWindowWidth = config.captionWindowWidth
if(config.uiLanguage) this.uiLanguage = config.uiLanguage
if(config.uiTheme) this.uiTheme = config.uiTheme
if(config.uiColor) this.uiColor = config.uiColor
if(config.leftBarWidth) this.leftBarWidth = config.leftBarWidth
if(config.styles) this.setStyles(config.styles)
if(process.platform !== 'win32' && process.platform !== 'darwin') config.controls.audio = 1
if(config.controls) this.setControls(config.controls)
console.log('[INFO] Read Config from:', configPath)
}
}
public writeConfig() {
const config = {
captionWindowWidth: this.captionWindowWidth,
uiLanguage: this.uiLanguage,
uiTheme: this.uiTheme,
uiColor: this.uiColor,
leftBarWidth: this.leftBarWidth,
controls: this.controls,
styles: this.styles
}
const configPath = path.join(app.getPath('userData'), 'config.json')
fs.writeFileSync(configPath, JSON.stringify(config, null, 2))
console.log('[INFO] Write Config to:', configPath)
Log.info('Write Config to:', configPath)
}
public getFullConfig(): FullConfig {
public getFullConfig(softwareLog: SoftwareLogItem[]): FullConfig {
return {
platform: process.platform,
uiLanguage: this.uiLanguage,
uiTheme: this.uiTheme,
uiColor: this.uiColor,
leftBarWidth: this.leftBarWidth,
styles: this.styles,
controls: this.controls,
captionLog: this.captionLog
captionLog: this.captionLog,
softwareLog: softwareLog
}
}
@@ -97,7 +126,7 @@ class AllConfig {
this.styles[key] = args[key]
}
}
console.log('[INFO] Set Styles:', this.styles)
Log.info('Set Styles:', this.styles)
}
public resetStyles() {
@@ -106,7 +135,7 @@ class AllConfig {
public sendStyles(window: BrowserWindow) {
window.webContents.send('both.styles.set', this.styles)
console.log(`[INFO] Send Styles to #${window.id}:`, this.styles)
Log.info(`Send Styles to #${window.id}:`, this.styles)
}
public setControls(args: Object) {
@@ -117,38 +146,57 @@ class AllConfig {
}
}
this.controls.engineEnabled = engineEnabled
console.log('[INFO] Set Controls:', this.controls)
let _controls = {...this.controls}
_controls.API_KEY = _controls.API_KEY.replace(/./g, '*')
Log.info('Set Controls:', _controls)
}
public sendControls(window: BrowserWindow) {
public sendControls(window: BrowserWindow, info = true) {
window.webContents.send('control.controls.set', this.controls)
console.log(`[INFO] Send Controls to #${window.id}:`, this.controls)
if(info) Log.info(`Send Controls to #${window.id}:`, this.controls)
}
public updateCaptionLog(log: CaptionItem) {
let command: 'add' | 'upd' = 'add'
if(
this.captionLog.length &&
this.captionLog[this.captionLog.length - 1].index === log.index &&
this.captionLog[this.captionLog.length - 1].time_s === log.time_s
this.lastLogIndex === log.index
) {
this.captionLog.splice(this.captionLog.length - 1, 1, log)
command = 'upd'
}
else {
this.captionLog.push(log)
this.lastLogIndex = log.index
}
this.captionLog[this.captionLog.length - 1].index = this.captionLog.length
for(const window of BrowserWindow.getAllWindows()){
this.sendCaptionLog(window, command)
}
}
public sendCaptionLog(window: BrowserWindow, command: 'add' | 'upd' | 'set') {
public updateCaptionTranslation(trans: CaptionTranslation){
for(let i = this.captionLog.length - 1; i >= 0; i--){
if(this.captionLog[i].time_s === trans.time_s){
this.captionLog[i].translation = trans.translation
for(const window of BrowserWindow.getAllWindows()){
this.sendCaptionLog(window, 'upd', i)
}
break
}
}
}
public sendCaptionLog(
window: BrowserWindow,
command: 'add' | 'upd' | 'set',
index: number | undefined = undefined
) {
if(command === 'add'){
window.webContents.send(`both.captionLog.add`, this.captionLog[this.captionLog.length - 1])
window.webContents.send(`both.captionLog.add`, this.captionLog.at(-1))
}
else if(command === 'upd'){
window.webContents.send(`both.captionLog.upd`, this.captionLog[this.captionLog.length - 1])
if(index !== undefined) window.webContents.send(`both.captionLog.upd`, this.captionLog[index])
else window.webContents.send(`both.captionLog.upd`, this.captionLog.at(-1))
}
else if(command === 'set'){
window.webContents.send(`both.captionLog.set`, this.captionLog)

View File

@@ -1,168 +1,286 @@
import { spawn, exec } from 'child_process'
import { exec, spawn } from 'child_process'
import { app } from 'electron'
import { is } from '@electron-toolkit/utils'
import path from 'path'
import net from 'net'
import { controlWindow } from '../ControlWindow'
import { allConfig } from './AllConfig'
import { i18n } from '../i18n'
import { Log } from './Log'
export class CaptionEngine {
appPath: string = ''
command: string[] = []
process: any | undefined
processStatus: 'running' | 'stopping' | 'stopped' = 'stopped'
client: net.Socket | undefined
port: number = 8080
status: 'running' | 'starting' | 'stopping' | 'stopped' | 'starting-timeout' = 'stopped'
timerID: NodeJS.Timeout | undefined
startTimeoutID: NodeJS.Timeout | undefined
private getApp(): boolean {
allConfig.controls.customized = false
if (allConfig.controls.customized && allConfig.controls.customizedApp) {
if (allConfig.controls.customized) {
Log.info('Using customized caption engine')
this.appPath = allConfig.controls.customizedApp
this.command = [allConfig.controls.customizedCommand]
allConfig.controls.customized = true
this.command = allConfig.controls.customizedCommand.split(' ')
this.port = Math.floor(Math.random() * (65535 - 1024 + 1)) + 1024
this.command.push('-p', this.port.toString())
}
else if (allConfig.controls.engine === 'gummy') {
if(!allConfig.controls.API_KEY && !process.env.DASHSCOPE_API_KEY) {
else {
if(allConfig.controls.engine === 'gummy' &&
!allConfig.controls.API_KEY && !process.env.DASHSCOPE_API_KEY
) {
controlWindow.sendErrorMessage(i18n('gummy.key.missing'))
return false
}
let gummyName = 'main-gummy'
if (process.platform === 'win32') {
gummyName += '.exe'
}
this.command = []
if (is.dev) {
this.appPath = path.join(
app.getAppPath(),
'caption-engine', 'dist', gummyName
)
if(process.platform === "win32") {
this.appPath = path.join(
app.getAppPath(), 'engine',
'.venv', 'Scripts', 'python.exe'
)
this.command.push(path.join(
app.getAppPath(), 'engine', 'main.py'
))
// this.appPath = path.join(app.getAppPath(), 'engine', 'dist', 'main.exe')
}
else {
this.appPath = path.join(
app.getAppPath(), 'engine',
'.venv', 'bin', 'python3'
)
this.command.push(path.join(
app.getAppPath(), 'engine', 'main.py'
))
}
}
else {
this.appPath = path.join(
process.resourcesPath, 'caption-engine', gummyName
)
if(process.platform === 'win32') {
this.appPath = path.join(process.resourcesPath, 'engine', 'main.exe')
}
else {
this.appPath = path.join(process.resourcesPath, 'engine', 'main')
}
}
this.command = []
this.command.push('-s', allConfig.controls.sourceLang)
this.command.push('-a', allConfig.controls.audio ? '1' : '0')
if(allConfig.controls.recording) {
this.command.push('-r', '1')
this.command.push('-rp', `"${allConfig.controls.recordingPath}"`)
}
this.port = Math.floor(Math.random() * (65535 - 1024 + 1)) + 1024
this.command.push('-p', this.port.toString())
this.command.push(
'-t', allConfig.controls.translation ?
allConfig.controls.targetLang : 'none'
)
this.command.push('-a', allConfig.controls.audio ? '1' : '0')
if(allConfig.controls.API_KEY) {
this.command.push('-k', allConfig.controls.API_KEY)
if(allConfig.controls.engine === 'gummy') {
this.command.push('-e', 'gummy')
this.command.push('-s', allConfig.controls.sourceLang)
if(allConfig.controls.API_KEY) {
this.command.push('-k', allConfig.controls.API_KEY)
}
}
else if(allConfig.controls.engine === 'vosk'){
this.command.push('-e', 'vosk')
this.command.push('-vosk', `"${allConfig.controls.voskModelPath}"`)
this.command.push('-tm', allConfig.controls.transModel)
this.command.push('-omn', allConfig.controls.ollamaName)
}
else if(allConfig.controls.engine === 'sosv'){
this.command.push('-e', 'sosv')
this.command.push('-s', allConfig.controls.sourceLang)
this.command.push('-sosv', `"${allConfig.controls.sosvModelPath}"`)
this.command.push('-tm', allConfig.controls.transModel)
this.command.push('-omn', allConfig.controls.ollamaName)
}
}
else if(allConfig.controls.engine === 'vosk'){
let voskName = 'main-vosk'
if (process.platform === 'win32') {
voskName += '.exe'
}
if (is.dev) {
this.appPath = path.join(
app.getAppPath(),
'caption-engine', 'dist', voskName
)
}
else {
this.appPath = path.join(
process.resourcesPath, 'caption-engine', voskName
)
}
this.command = []
this.command.push('-a', allConfig.controls.audio ? '1' : '0')
this.command.push('-m', `"${allConfig.controls.modelPath}"`)
Log.info('Engine Path:', this.appPath)
if(this.command.length > 2 && this.command.at(-2) === '-k') {
const _command = [...this.command]
_command[_command.length -1] = _command[_command.length -1].replace(/./g, '*')
Log.info('Engine Command:', _command)
}
console.log('[INFO] Engine Path:', this.appPath)
console.log('[INFO] Engine Command:', this.command)
else Log.info('Engine Command:', this.command)
return true
}
public start() {
if (this.processStatus !== 'stopped') {
return
public connect() {
if(this.client) { Log.warn('Client already exists, ignoring...') }
if (this.startTimeoutID) {
clearTimeout(this.startTimeoutID)
this.startTimeoutID = undefined
}
if(!this.getApp()){ return }
try {
this.process = spawn(this.appPath, this.command)
}
catch (e) {
controlWindow.sendErrorMessage(i18n('engine.start.error') + e)
console.error('[ERROR] Error starting subprocess:', e)
return
}
this.processStatus = 'running'
console.log('[INFO] Caption Engine Started, PID:', this.process.pid)
this.client = net.createConnection({ port: this.port }, () => {
Log.info('Connected to caption engine server');
});
this.status = 'running'
allConfig.controls.engineEnabled = true
if(controlWindow.window){
allConfig.sendControls(controlWindow.window)
allConfig.sendControls(controlWindow.window, false)
controlWindow.window.webContents.send(
'control.engine.started',
this.process.pid
)
}
}
public sendCommand(command: string, content: string = "") {
if(this.client === undefined) {
Log.error('Client not initialized yet')
return
}
const data = JSON.stringify({command, content})
this.client.write(data);
Log.info(`Send data to python server: ${data}`);
}
public start() {
if (this.status !== 'stopped') {
Log.warn('Caption engine is not stopped, current status:', this.status)
return
}
if(!this.getApp()){ return }
this.process = spawn(this.appPath, this.command)
this.status = 'starting'
Log.info('Caption Engine Starting, PID:', this.process.pid)
const timeoutMs = allConfig.controls.startTimeoutSeconds * 1000
this.startTimeoutID = setTimeout(() => {
if (this.status === 'starting') {
Log.warn(`Engine start timeout after ${allConfig.controls.startTimeoutSeconds} seconds, forcing kill...`)
this.status = 'starting-timeout'
controlWindow.sendErrorMessage(i18n('engine.start.timeout'))
this.kill()
}
}, timeoutMs)
this.process.stdout.on('data', (data: any) => {
const lines = data.toString().split('\n');
const lines = data.toString().split('\n')
lines.forEach((line: string) => {
if (line.trim()) {
try {
const caption = JSON.parse(line);
allConfig.updateCaptionLog(caption);
const data_obj = JSON.parse(line)
handleEngineData(data_obj)
} catch (e) {
controlWindow.sendErrorMessage(i18n('engine.output.parse.error') + e)
console.error('[ERROR] Error parsing JSON:', e);
Log.error('Error parsing JSON:', e)
}
}
});
});
this.process.stderr.on('data', (data) => {
controlWindow.sendErrorMessage(i18n('engine.error') + data)
console.error(`[ERROR] Subprocess Error: ${data}`);
this.process.stderr.on('data', (data: any) => {
const lines = data.toString().split('\n')
lines.forEach((line: string) => {
if(line.trim()){
Log.error(line)
}
})
});
this.process.on('close', (code: any) => {
console.log(`[INFO] Subprocess exited with code ${code}`);
this.process = undefined;
this.client = undefined
allConfig.controls.engineEnabled = false
if(controlWindow.window){
allConfig.sendControls(controlWindow.window)
allConfig.sendControls(controlWindow.window, false)
controlWindow.window.webContents.send('control.engine.stopped')
}
this.processStatus = 'stopped'
console.log('[INFO] Caption engine process stopped')
this.status = 'stopped'
clearInterval(this.timerID)
if (this.startTimeoutID) {
clearTimeout(this.startTimeoutID)
this.startTimeoutID = undefined
}
Log.info(`Engine exited with code ${code}`)
});
}
public stop() {
if(this.processStatus !== 'running') return
if(this.status !== 'running'){
Log.warn('Trying to stop engine which is not running, current status:', this.status)
}
this.sendCommand('stop')
if(this.client){
this.client.destroy()
this.client = undefined
}
this.status = 'stopping'
this.timerID = setTimeout(() => {
if(this.status !== 'stopping') return
Log.warn('Engine process still not stopped, trying to kill...')
this.kill()
}, 4000);
}
public kill(){
if(!this.process || !this.process.pid) return
if(this.status !== 'running'){
Log.warn('Trying to kill engine which is not running, current status:', this.status)
}
Log.warn('Killing engine process, PID:', this.process.pid)
if (this.startTimeoutID) {
clearTimeout(this.startTimeoutID)
this.startTimeoutID = undefined
}
if(this.client){
this.client.destroy()
this.client = undefined
}
if (this.process.pid) {
console.log('[INFO] Trying to stop process, PID:', this.process.pid)
let cmd = `kill ${this.process.pid}`;
let cmd = `kill -9 ${this.process.pid}`;
if (process.platform === "win32") {
cmd = `taskkill /pid ${this.process.pid} /t /f`
}
exec(cmd, (error) => {
if (error) {
controlWindow.sendErrorMessage(i18n('engine.shutdown.error') + error)
console.error(`[ERROR] Failed to kill process: ${error}`)
Log.error('Failed to kill process:', error)
} else {
Log.info('Process killed successfully')
}
})
}
else {
this.process = undefined;
allConfig.controls.engineEnabled = false
if(controlWindow.window){
allConfig.sendControls(controlWindow.window)
controlWindow.window.webContents.send('control.engine.stopped')
}
this.processStatus = 'stopped'
console.log('[INFO] Process PID undefined, caption engine process stopped')
return
}
}
function handleEngineData(data: any) {
if(data.command === 'connect'){
captionEngine.connect()
}
else if(data.command === 'kill') {
if(captionEngine.status !== 'stopped') {
Log.warn('Error occurred, trying to kill caption engine...')
captionEngine.kill()
}
this.processStatus = 'stopping'
console.log('[INFO] Caption engine process stopping')
}
else if(data.command === 'caption') {
allConfig.updateCaptionLog(data);
}
else if(data.command === 'translation') {
allConfig.updateCaptionTranslation(data);
}
else if(data.command === 'print') {
console.log(data.content)
}
else if(data.command === 'info') {
Log.info('Engine Info:', data.content)
}
else if(data.command === 'warn') {
Log.warn('Engine Warn:', data.content)
}
else if(data.command === 'error') {
Log.error('Engine Error:', data.content)
controlWindow.sendErrorMessage(/*i18n('engine.error') +*/ data.content)
}
else if(data.command === 'usage') {
Log.info('Engine Token Usage: ', data.content)
}
else {
Log.warn('Unknown command:', data)
}
}

58
src/main/utils/Log.ts Normal file
View File

@@ -0,0 +1,58 @@
import { controlWindow } from "../ControlWindow"
import { type SoftwareLogItem } from "../types"
let logIndex = 0
const logQueue: SoftwareLogItem[] = []
function getTimeString() {
const now = new Date()
const HH = String(now.getHours()).padStart(2, '0')
const MM = String(now.getMinutes()).padStart(2, '0')
const SS = String(now.getSeconds()).padStart(2, '0')
const MS = String(now.getMilliseconds()).padStart(3, '0')
return `${HH}:${MM}:${SS}.${MS}`
}
export class Log {
static getAndClearLogQueue() {
const copiedQueue = structuredClone(logQueue)
logQueue.length = 0
return copiedQueue
}
static handleLog(logType: "INFO" | "WARN" | "ERROR", ...msg: any[]) {
const timeStr = getTimeString()
const logPre = `[${logType} ${timeStr}]`
let logStr = ""
for(let i = 0; i < msg.length; i++) {
logStr += i ? " " : ""
if(typeof msg[i] === "string") logStr += msg[i]
else logStr += JSON.stringify(msg[i], undefined, 2)
}
console.log(logPre, logStr)
const logItem: SoftwareLogItem = {
type: logType,
index: ++logIndex,
time: timeStr,
text: logStr
}
if(controlWindow.mounted && controlWindow.window) {
controlWindow.window.webContents.send('control.softwareLog.add', logItem)
}
else {
logQueue.push(logItem)
}
}
static info(...msg: any[]){
this.handleLog("INFO", ...msg)
}
static warn(...msg: any[]){
this.handleLog("WARN", ...msg)
}
static error(...msg: any[]){
this.handleLog("ERROR", ...msg)
}
}

View File

@@ -2,7 +2,7 @@
<html>
<head>
<meta charset="UTF-8" />
<title>Auto Caption</title>
<title>Auto Caption v1.0.0</title>
<!-- https://developer.mozilla.org/en-US/docs/Web/HTTP/CSP -->
<meta
http-equiv="Content-Security-Policy"

View File

@@ -6,6 +6,7 @@
import { onMounted } from 'vue'
import { FullConfig } from './types'
import { useCaptionLogStore } from './stores/captionLog'
import { useSoftwareLogStore } from './stores/softwareLog'
import { useCaptionStyleStore } from './stores/captionStyle'
import { useEngineControlStore } from './stores/engineControl'
import { useGeneralSettingStore } from './stores/generalSetting'
@@ -14,11 +15,13 @@ onMounted(() => {
window.electron.ipcRenderer.invoke('both.window.mounted').then((data: FullConfig) => {
useGeneralSettingStore().uiLanguage = data.uiLanguage
useGeneralSettingStore().uiTheme = data.uiTheme
useGeneralSettingStore().uiColor = data.uiColor
useGeneralSettingStore().leftBarWidth = data.leftBarWidth
useCaptionStyleStore().setStyles(data.styles)
useEngineControlStore().platform = data.platform
useEngineControlStore().setControls(data.controls)
useCaptionLogStore().captionData = data.captionLog
useSoftwareLogStore().softwareLogs = data.softwareLog
})
})
</script>

View File

@@ -17,6 +17,7 @@
}
.input-area {
display: inline-block;
width: calc(100% - 100px);
min-width: 100px;
}

View File

@@ -1,16 +1,78 @@
<template>
<div class="caption-list">
<div>
<a-app class="caption-title">
<span style="margin-right: 30px;">{{ $t('log.title') }}</span>
</a-app>
<div>
<div class="caption-title">
<span style="margin-right: 30px;">{{ $t('log.title') }}</span>
</div>
<a-popover :title="$t('log.baseTime')">
<template #content>
<div class="base-time">
<div class="base-time-container">
<a-input
type="number" min="0"
v-model:value="baseHH"
></a-input>
<span class="base-time-label">{{ $t('log.hour') }}</span>
</div>
</div><span style="margin: 0 4px;">:</span>
<div class="base-time">
<div class="base-time-container">
<a-input
type="number" min="0" max="59"
v-model:value="baseMM"
></a-input>
<span class="base-time-label">{{ $t('log.min') }}</span>
</div>
</div><span style="margin: 0 4px;">:</span>
<div class="base-time">
<div class="base-time-container">
<a-input
type="number" min="0" max="59"
v-model:value="baseSS"
></a-input>
<span class="base-time-label">{{ $t('log.sec') }}</span>
</div>
</div><span style="margin: 0 4px;">.</span>
<div class="base-time">
<div class="base-time-container">
<a-input
type="number" min="0" max="999"
v-model:value="baseMS"
></a-input>
<span class="base-time-label">{{ $t('log.ms') }}</span>
</div>
</div>
</template>
<a-button
type="primary"
style="margin-right: 20px;"
@click="changeBaseTime"
:disabled="captionData.length === 0"
>{{ $t('log.changeTime') }}</a-button>
</a-popover>
<a-popover :title="$t('log.exportOptions')">
<template #content>
<div class="input-item">
<span class="input-label">{{ $t('log.exportFormat') }}</span>
<a-radio-group v-model:value="exportFormat">
<a-radio-button value="srt"><code>.srt</code></a-radio-button>
<a-radio-button value="json"><code>.json</code></a-radio-button>
</a-radio-group>
</div>
<div class="input-item">
<span class="input-label">{{ $t('log.exportContent') }}</span>
<a-radio-group v-model:value="contentOption">
<a-radio-button value="both">{{ $t('log.both') }}</a-radio-button>
<a-radio-button value="source">{{ $t('log.source') }}</a-radio-button>
<a-radio-button value="target">{{ $t('log.translation') }}</a-radio-button>
</a-radio-group>
</div>
</template>
<a-button
style="margin-right: 20px;"
@click="exportCaptions"
:disabled="captionData.length === 0"
>{{ $t('log.export') }}</a-button>
</a-popover>
<a-popover :title="$t('log.copyOptions')">
<template #content>
<div class="input-item">
@@ -21,72 +83,98 @@
</div>
<div class="input-item">
<span class="input-label">{{ $t('log.copyContent') }}</span>
<a-radio-group v-model:value="copyOption">
<a-radio-group v-model:value="contentOption">
<a-radio-button value="both">{{ $t('log.both') }}</a-radio-button>
<a-radio-button value="source">{{ $t('log.source') }}</a-radio-button>
<a-radio-button value="target">{{ $t('log.translation') }}</a-radio-button>
</a-radio-group>
</div>
<div class="input-item">
<span class="input-label">{{ $t('log.copyNum') }}</span>
<a-radio-group v-model:value="copyNum">
<a-radio-button :value="0"><code>[:]</code></a-radio-button>
<a-radio-button :value="1"><code>[-1:]</code></a-radio-button>
<a-radio-button :value="2"><code>[-2:]</code></a-radio-button>
<a-radio-button :value="3"><code>[-3:]</code></a-radio-button>
</a-radio-group>
</div>
</template>
<a-button
style="margin-right: 20px;"
@click="copyCaptions"
:disabled="captionData.length === 0"
>{{ $t('log.copy') }}</a-button>
</a-popover>
<a-button
danger
@click="clearCaptions"
>{{ $t('log.clear') }}</a-button>
</div>
<a-table
:columns="columns"
:data-source="captionData"
v-model:pagination="pagination"
>
<template #bodyCell="{ column, record }">
<template v-if="column.key === 'index'">
{{ record.index }}
</template>
<template v-if="column.key === 'time'">
<div class="time-cell">
<div class="time-start">{{ record.time_s }}</div>
<div class="time-end">{{ record.time_t }}</div>
</div>
</template>
<template v-if="column.key === 'content'">
<div class="caption-content">
<div class="caption-text">{{ record.text }}</div>
<div class="caption-translation">{{ record.translation }}</div>
</div>
</template>
</template>
</a-table>
</a-popover>
<a-button
danger
@click="clearCaptions"
>{{ $t('log.clear') }}</a-button>
</div>
<a-table
:columns="columns"
:data-source="captionData"
v-model:pagination="pagination"
style="margin-top: 10px;"
>
<template #bodyCell="{ column, record }">
<template v-if="column.key === 'index'">
{{ record.index }}
</template>
<template v-if="column.key === 'time'">
<div class="time-cell">
<code class="time-start"
:style="`color: ${uiColor}`"
>{{ record.time_s }}</code>
<code class="time-end">{{ record.time_t }}</code>
</div>
</template>
<template v-if="column.key === 'content'">
<div class="caption-content">
<div class="caption-text">{{ record.text }}</div>
<div
class="caption-translation"
:style="`border-left: 3px solid ${uiColor};`"
>{{ record.translation }}</div>
</div>
</template>
</template>
</a-table>
</template>
<script setup lang="ts">
import { ref } from 'vue'
import { storeToRefs } from 'pinia'
import { useCaptionLogStore } from '@renderer/stores/captionLog'
import { useGeneralSettingStore } from '@renderer/stores/generalSetting'
import { message } from 'ant-design-vue'
import { useI18n } from 'vue-i18n'
import * as tc from '../utils/timeCalc'
import { CaptionItem } from '../types'
const { t } = useI18n()
const captionLog = useCaptionLogStore()
const { captionData } = storeToRefs(captionLog)
const generalSetting = useGeneralSettingStore()
const { uiColor } = storeToRefs(generalSetting)
const exportFormat = ref('srt')
const showIndex = ref(true)
const copyTime = ref(true)
const copyOption = ref('both')
const contentOption = ref('both')
const copyNum = ref(0)
const baseHH = ref<number>(0)
const baseMM = ref<number>(0)
const baseSS = ref<number>(0)
const baseMS = ref<number>(0)
const pagination = ref({
current: 1,
pageSize: 10,
pageSize: 20,
showSizeChanger: true,
pageSizeOptions: ['10', '20', '50'],
showTotal: (total: number) => `Total: ${total}`,
pageSizeOptions: ['10', '20', '50', '100'],
onChange: (page: number, pageSize: number) => {
pagination.value.current = page
pagination.value.pageSize = pageSize
@@ -103,12 +191,23 @@ const columns = [
dataIndex: 'index',
key: 'index',
width: 80,
sorter: (a: CaptionItem, b: CaptionItem) => {
if(a.index <= b.index) return -1
return 1
},
sortDirections: ['descend'],
defaultSortOrder: 'descend',
},
{
title: 'time',
dataIndex: 'time',
key: 'time',
width: 160,
width: 150,
sorter: (a: CaptionItem, b: CaptionItem) => {
if(a.time_s <= b.time_s) return -1
return 1
},
sortDirections: ['descend', 'ascend'],
},
{
title: 'content',
@@ -117,28 +216,73 @@ const columns = [
},
]
function changeBaseTime() {
if(baseHH.value < 0) baseHH.value = 0
if(baseMM.value < 0) baseMM.value = 0
if(baseMM.value > 59) baseMM.value = 59
if(baseSS.value < 0) baseSS.value = 0
if(baseSS.value > 59) baseSS.value = 59
if(baseMS.value < 0) baseMS.value = 0
if(baseMS.value > 999) baseMS.value = 999
const newBase: tc.Time = {
hh: Number(baseHH.value),
mm: Number(baseMM.value),
ss: Number(baseSS.value),
ms: Number(baseMS.value)
}
const oldBase = tc.getTimeFromStr(captionData.value[0].time_s)
const deltaMs = tc.getMsFromTime(newBase) - tc.getMsFromTime(oldBase)
for(let i = 0; i < captionData.value.length; i++){
captionData.value[i].time_s =
tc.getNewTimeStr(captionData.value[i].time_s, deltaMs)
captionData.value[i].time_t =
tc.getNewTimeStr(captionData.value[i].time_t, deltaMs)
}
}
function exportCaptions() {
const jsonData = JSON.stringify(captionData.value, null, 2)
const blob = new Blob([jsonData], { type: 'application/json' })
const exportData = getExportData()
const blob = new Blob([exportData], {
type: exportFormat.value === 'json' ? 'application/json' : 'text/plain'
})
const url = URL.createObjectURL(blob)
const a = document.createElement('a')
a.href = url
const timestamp = new Date().toISOString().replace(/[:.]/g, '-')
a.download = `captions-${timestamp}.json`
a.download = `captions-${timestamp}.${exportFormat.value}`
document.body.appendChild(a)
a.click()
document.body.removeChild(a)
URL.revokeObjectURL(url)
}
function copyCaptions() {
function getExportData() {
if(exportFormat.value === 'json') return JSON.stringify(captionData.value, null, 2)
let content = ''
for(let i = 0; i < captionData.value.length; i++){
const item = captionData.value[i]
content += `${i+1}\n`
content += `${item.time_s} --> ${item.time_t}\n`.replace(/\./g, ',')
if(contentOption.value === 'both') content += `${item.text}\n${item.translation}\n\n`
else if(contentOption.value === 'source') content += `${item.text}\n\n`
else content += `${item.translation}\n\n`
}
return content
}
function copyCaptions() {
let content = ''
let start = 0
if(copyNum.value > 0) {
start = captionData.value.length - copyNum.value
if(start < 0) start = 0
}
for(let i = start; i < captionData.value.length; i++){
const item = captionData.value[i]
if(showIndex.value) content += `${i+1}\n`
if(copyTime.value) content += `${item.time_s} --> ${item.time_t}\n`.replace(/\./g, ',')
if(copyOption.value === 'both') content += `${item.text}\n${item.translation}\n\n`
else if(copyOption.value === 'source') content += `${item.text}\n\n`
if(contentOption.value === 'both') content += `${item.text}\n${item.translation}\n\n`
else if(contentOption.value === 'source') content += `${item.text}\n\n`
else content += `${item.translation}\n\n`
}
navigator.clipboard.writeText(content)
@@ -160,10 +304,28 @@ function clearCaptions() {
}
.caption-title {
color: var(--icon-color);
display: inline-block;
font-size: 24px;
font-weight: bold;
margin-bottom: 10px;
margin: 10px 0;
}
.base-time {
width: 64px;
display: inline-block;
}
.base-time-container {
display: flex;
flex-direction: column;
align-items: center;
gap: 4px;
}
.base-time-label {
font-size: 12px;
color: var(--tag-color);
}
.time-cell {
@@ -174,10 +336,13 @@ function clearCaptions() {
}
.time-start {
color: #1677ff;
display: block;
font-weight: bold;
}
.time-end {
display: block;
font-weight: bold;
color: #ff4d4f;
}
@@ -193,6 +358,5 @@ function clearCaptions() {
.caption-translation {
font-size: 14px;
padding-left: 16px;
border-left: 3px solid #1890ff;
}
</style>

View File

@@ -6,6 +6,16 @@
<a @click="resetStyle">{{ $t('style.resetStyle') }}</a>
</template>
<div class="input-item">
<span class="input-label">{{ $t('style.lineNumber') }}</span>
<a-radio-group v-model:value="currentLineNumber">
<a-radio-button :value="1">1</a-radio-button>
<a-radio-button :value="2">2</a-radio-button>
<a-radio-button :value="3">3</a-radio-button>
<a-radio-button :value="4">4</a-radio-button>
</a-radio-group>
</div>
<div class="input-item">
<span class="input-label">{{ $t('style.longCaption') }}</span>
<a-select
@@ -34,20 +44,18 @@
</div>
<div class="input-item">
<span class="input-label">{{ $t('style.fontSize') }}</span>
<a-input
<a-slider
class="input-area"
type="range"
min="0" max="64"
:min="0" :max="72"
v-model:value="currentFontSize"
/>
<div class="input-item-value">{{ currentFontSize }}px</div>
</div>
<div class="input-item">
<span class="input-label">{{ $t('style.fontWeight') }}</span>
<a-input
<a-slider
class="input-area"
type="range"
min="1" max="9"
:min="1" :max="9"
v-model:value="currentFontWeight"
/>
<div class="input-item-value">{{ currentFontWeight*100 }}</div>
@@ -63,11 +71,10 @@
</div>
<div class="input-item">
<span class="input-label">{{ $t('style.opacity') }}</span>
<a-input
<a-slider
class="input-area"
type="range"
min="0"
max="100"
:min="0"
:max="100"
v-model:value="currentOpacity"
/>
<div class="input-item-value">{{ currentOpacity }}%</div>
@@ -76,12 +83,12 @@
<div class="input-item">
<span class="input-label">{{ $t('style.preview') }}</span>
<a-switch v-model:checked="currentPreview" />
<span style="display:inline-block;width:20px;"></span>
<span style="display:inline-block;width:10px;"></span>
<div style="display: inline-block;">
<span class="switch-label">{{ $t('style.translation') }}</span>
<a-switch v-model:checked="currentTransDisplay" />
</div>
<span style="display:inline-block;width:20px;"></span>
<span style="display:inline-block;width:10px;"></span>
<div style="display: inline-block;">
<span class="switch-label">{{ $t('style.textShadow') }}</span>
<a-switch v-model:checked="currentTextShadow" />
@@ -111,20 +118,18 @@
</div>
<div class="input-item">
<span class="input-label">{{ $t('style.fontSize') }}</span>
<a-input
<a-slider
class="input-area"
type="range"
min="0" max="64"
:min="0" :max="72"
v-model:value="currentTransFontSize"
/>
<div class="input-item-value">{{ currentTransFontSize }}px</div>
</div>
<div class="input-item">
<span class="input-label">{{ $t('style.fontWeight') }}</span>
<a-input
<a-slider
class="input-area"
type="range"
min="1" max="9"
:min="1" :max="9"
v-model:value="currentTransFontWeight"
/>
<div class="input-item-value">{{ currentTransFontWeight*100 }}</div>
@@ -136,30 +141,27 @@
<a-card size="small" :title="$t('style.shadow.title')">
<div class="input-item">
<span class="input-label">{{ $t('style.shadow.offsetX') }}</span>
<a-input
<a-slider
class="input-area"
type="range"
min="-10" max="10"
:min="-10" :max="10"
v-model:value="currentOffsetX"
/>
<div class="input-item-value">{{ currentOffsetX }}px</div>
</div>
<div class="input-item">
<span class="input-label">{{ $t('style.shadow.offsetY') }}</span>
<a-input
<a-slider
class="input-area"
type="range"
min="-10" max="10"
:min="-10" :max="10"
v-model:value="currentOffsetY"
/>
<div class="input-item-value">{{ currentOffsetY }}px</div>
</div>
<div class="input-item">
<span class="input-label">{{ $t('style.shadow.blur') }}</span>
<a-input
<a-slider
class="input-area"
type="range"
min="0" max="10"
:min="0" :max="12"
v-model:value="currentBlur"
/>
<div class="input-item-value">{{ currentBlur }}px</div>
@@ -186,28 +188,57 @@
textShadow: currentTextShadow ? `${currentOffsetX}px ${currentOffsetY}px ${currentBlur}px ${currentTextShadowColor}` : 'none'
}"
>
<p :class="[currentLineBreak?'':'left-ellipsis']"
:style="{
fontFamily: currentFontFamily,
fontSize: currentFontSize + 'px',
color: currentFontColor,
fontWeight: currentFontWeight * 100
}">
<span v-if="captionData.length">{{ captionData[captionData.length-1].text }}</span>
<span v-else>{{ $t('example.original') }}</span>
</p>
<p :class="[currentLineBreak?'':'left-ellipsis']"
v-if="currentTransDisplay"
:style="{
fontFamily: currentTransFontFamily,
fontSize: currentTransFontSize + 'px',
color: currentTransFontColor,
fontWeight: currentTransFontWeight * 100
}"
>
<span v-if="captionData.length">{{ captionData[captionData.length-1].translation }}</span>
<span v-else>{{ $t('example.translation') }}</span>
</p>
<template v-if="captionData.length">
<template
v-for="val in revArr[Math.min(currentLineNumber, captionData.length)]"
:key="captionData[captionData.length - val].time_s"
>
<p :class="[currentLineBreak?'':'left-ellipsis']"
:style="{
fontFamily: currentFontFamily,
fontSize: currentFontSize + 'px',
color: currentFontColor,
fontWeight: currentFontWeight * 100
}">
<span>{{ captionData[captionData.length - val].text }}</span>
</p>
<p :class="[currentLineBreak?'':'left-ellipsis']"
v-if="currentTransDisplay && captionData[captionData.length - val].translation"
:style="{
fontFamily: currentTransFontFamily,
fontSize: currentTransFontSize + 'px',
color: currentTransFontColor,
fontWeight: currentTransFontWeight * 100
}"
>
<span>{{ captionData[captionData.length - val].translation }}</span>
</p>
</template>
</template>
<template v-else>
<template v-for="val in currentLineNumber" :key="val">
<p :class="[currentLineBreak?'':'left-ellipsis']"
:style="{
fontFamily: currentFontFamily,
fontSize: currentFontSize + 'px',
color: currentFontColor,
fontWeight: currentFontWeight * 100
}">
<span>{{ $t('example.original') }}</span>
</p>
<p :class="[currentLineBreak?'':'left-ellipsis']"
v-if="currentTransDisplay"
:style="{
fontFamily: currentTransFontFamily,
fontSize: currentTransFontSize + 'px',
color: currentTransFontColor,
fontWeight: currentTransFontWeight * 100
}"
>
<span>{{ $t('example.translation') }}</span>
</p>
</template>
</template>
</div>
</Teleport>
@@ -220,6 +251,14 @@ import { storeToRefs } from 'pinia'
import { notification } from 'ant-design-vue'
import { useI18n } from 'vue-i18n'
import { useCaptionLogStore } from '@renderer/stores/captionLog';
const revArr = {
1: [1],
2: [2, 1],
3: [3, 2, 1],
4: [4, 3, 2, 1],
}
const captionLog = useCaptionLogStore();
const { captionData } = storeToRefs(captionLog);
@@ -228,6 +267,7 @@ const { t } = useI18n()
const captionStyle = useCaptionStyleStore()
const { changeSignal } = storeToRefs(captionStyle)
const currentLineNumber = ref<number>(1)
const currentLineBreak = ref<number>(0)
const currentFontFamily = ref<string>('sans-serif')
const currentFontSize = ref<number>(24)
@@ -261,6 +301,7 @@ function useSameStyle(){
}
function applyStyle(){
captionStyle.lineNumber = currentLineNumber.value;
captionStyle.lineBreak = currentLineBreak.value;
captionStyle.fontFamily = currentFontFamily.value;
captionStyle.fontSize = currentFontSize.value;
@@ -282,13 +323,15 @@ function applyStyle(){
captionStyle.sendStylesChange();
notification.open({
notification.open({
placement: 'topLeft',
message: t('noti.styleChange'),
description: t('noti.styleInfo')
});
}
function backStyle(){
currentLineNumber.value = captionStyle.lineNumber;
currentLineBreak.value = captionStyle.lineBreak;
currentFontFamily.value = captionStyle.fontFamily;
currentFontSize.value = captionStyle.fontSize;
@@ -314,7 +357,7 @@ function resetStyle() {
}
watch(changeSignal, (val) => {
if(val == true) {
if(val === true) {
backStyle();
captionStyle.changeSignal = false;
}
@@ -335,13 +378,12 @@ watch(changeSignal, (val) => {
}
.preview-container {
line-height: 2em;
width: 60%;
text-align: center;
position: absolute;
padding: 20px;
padding: 10px;
border-radius: 10px;
left: 50%;
left: 64%;
transform: translateX(-50%);
bottom: 20px;
}
@@ -349,7 +391,7 @@ watch(changeSignal, (val) => {
.preview-container p {
text-align: center;
margin: 0;
line-height: 1.5em;
line-height: 1.6em;
}
.left-ellipsis {

View File

@@ -5,23 +5,6 @@
<a @click="applyChange">{{ $t('engine.applyChange') }}</a> |
<a @click="cancelChange">{{ $t('engine.cancelChange') }}</a>
</template>
<div class="input-item">
<span class="input-label">{{ $t('engine.sourceLang') }}</span>
<a-select
class="input-area"
v-model:value="currentSourceLang"
:options="langList"
></a-select>
</div>
<div class="input-item">
<span class="input-label">{{ $t('engine.transLang') }}</span>
<a-select
:disabled="currentEngine === 'vosk'"
class="input-area"
v-model:value="currentTargetLang"
:options="langList.filter((item) => item.value !== 'auto')"
></a-select>
</div>
<div class="input-item">
<span class="input-label">{{ $t('engine.captionEngine') }}</span>
<a-select
@@ -30,10 +13,48 @@
:options="captionEngine"
></a-select>
</div>
<div class="input-item">
<span class="input-label">{{ $t('engine.sourceLang') }}</span>
<a-select
:disabled="currentEngine === 'vosk'"
class="input-area"
v-model:value="currentSourceLang"
:options="sLangList"
></a-select>
</div>
<div class="input-item">
<span class="input-label">{{ $t('engine.transLang') }}</span>
<a-select
class="input-area"
v-model:value="currentTargetLang"
:options="tLangList"
></a-select>
</div>
<div class="input-item" v-if="transModel">
<span class="input-label">{{ $t('engine.transModel') }}</span>
<a-select
class="input-area"
v-model:value="currentTransModel"
:options="transModel"
></a-select>
</div>
<div class="input-item" v-if="transModel && currentTransModel === 'ollama'">
<a-popover placement="right">
<template #content>
<p class="label-hover-info">{{ $t('engine.ollamaNote') }}</p>
</template>
<span class="input-label info-label"
:style="{color: uiColor}"
>{{ $t('engine.ollama') }}</span>
</a-popover>
<a-input
class="input-area"
v-model:value="currentOllamaName"
></a-input>
</div>
<div class="input-item">
<span class="input-label">{{ $t('engine.audioType') }}</span>
<a-select
:disabled="platform !== 'win32' && platform !== 'darwin'"
class="input-area"
v-model:value="currentAudio"
:options="audioType"
@@ -42,20 +63,59 @@
<div class="input-item">
<span class="input-label">{{ $t('engine.enableTranslation') }}</span>
<a-switch v-model:checked="currentTranslation" />
<span style="display:inline-block;width:20px;"></span>
<span style="display:inline-block;width:10px;"></span>
<div style="display: inline-block;">
<span class="switch-label">{{ $t('engine.enableRecording') }}</span>
<a-switch v-model:checked="currentRecording" />
</div>
</div>
<div class="input-item">
<span class="input-label">{{ $t('engine.customEngine') }}</span>
<a-switch v-model:checked="currentCustomized" />
<span style="display:inline-block;width:10px;"></span>
<div style="display: inline-block;">
<span class="switch-label">{{ $t('engine.showMore') }}</span>
<a-switch v-model:checked="showMore" />
</div>
</div>
<a-card size="small" :title="$t('engine.showMore')" v-show="showMore">
<div class="input-item">
<a-card size="small" :title="$t('engine.custom.title')" v-show="currentCustomized">
<template #extra>
<a-popover>
<template #content>
<p class="label-hover-info">{{ $t('engine.apikeyInfo') }}</p>
<p class="customize-note">{{ $t('engine.custom.note') }}</p>
</template>
<span class="input-label info-label">{{ $t('engine.apikey') }}</span>
<a><InfoCircleOutlined />{{ $t('engine.custom.attention') }}</a>
</a-popover>
</template>
<div class="input-item">
<span class="input-label">{{ $t('engine.custom.app') }}</span>
<a-input
class="input-area"
v-model:value="currentCustomizedApp"
></a-input>
</div>
<div class="input-item">
<span class="input-label">{{ $t('engine.custom.command') }}</span>
<a-input
class="input-area"
v-model:value="currentCustomizedCommand"
></a-input>
</div>
</a-card>
<a-card size="small" :title="$t('engine.showMore')" v-show="showMore" style="margin-top:10px;">
<div class="input-item">
<a-popover placement="right">
<template #content>
<p class="label-hover-info">{{ $t('engine.apikeyInfo') }}</p>
<p><a href="https://bailian.console.aliyun.com" target="_blank">
https://bailian.console.aliyun.com
</a></p>
</template>
<span class="input-label info-label"
:style="{color: uiColor}"
>{{ $t('engine.apikey') }}</span>
</a-popover>
<a-input
class="input-area"
@@ -64,51 +124,89 @@
/>
</div>
<div class="input-item">
<a-popover>
<a-popover placement="right">
<template #content>
<p class="label-hover-info">{{ $t('engine.modelPathInfo') }}</p>
<p class="label-hover-info">{{ $t('engine.voskModelPathInfo') }}</p>
<p class="label-hover-info">
<a href="https://alphacephei.com/vosk/models" target="_blank">Vosk {{ $t('engine.modelDownload') }}</a>
</p>
</template>
<span class="input-label info-label">{{ $t('engine.modelPath') }}</span>
<span class="input-label info-label"
:style="{color: uiColor}"
>{{ $t('engine.voskModelPath') }}</span>
</a-popover>
<span
class="input-folder"
@click="selectFolderPath"
:style="{color: uiColor}"
@click="selectFolderPath('vosk')"
><span><FolderOpenOutlined /></span></span>
<a-input
class="input-area"
style="width:calc(100% - 140px);"
v-model:value="currentModelPath"
v-model:value="currentVoskModelPath"
/>
</div>
<div class="input-item">
<span style="margin-right:5px;">{{ $t('engine.customEngine') }}</span>
<a-switch v-model:checked="currentCustomized" />
</div>
<div v-show="currentCustomized">
<a-card size="small" :title="$t('engine.custom.title')">
<template #extra>
<a-popover>
<template #content>
<p class="customize-note">{{ $t('engine.custom.note') }}</p>
</template>
<a><InfoCircleOutlined />{{ $t('engine.custom.attention') }}</a>
</a-popover>
<a-popover placement="right">
<template #content>
<p class="label-hover-info">{{ $t('engine.sosvModelPathInfo') }}</p>
<p class="label-hover-info">
<a href="https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model" target="_blank">SOSV {{ $t('engine.modelDownload') }}</a>
</p>
</template>
<div class="input-item">
<span class="input-label">{{ $t('engine.custom.app') }}</span>
<a-input
class="input-area"
v-model:value="currentCustomizedApp"
></a-input>
</div>
<div class="input-item">
<span class="input-label">{{ $t('engine.custom.command') }}</span>
<a-input
class="input-area"
v-model:value="currentCustomizedCommand"
></a-input>
</div>
</a-card>
<span class="input-label info-label"
:style="{color: uiColor}"
>{{ $t('engine.sosvModelPath') }}</span>
</a-popover>
<span
class="input-folder"
:style="{color: uiColor}"
@click="selectFolderPath('sosv')"
><span><FolderOpenOutlined /></span></span>
<a-input
class="input-area"
style="width:calc(100% - 140px);"
v-model:value="currentSosvModelPath"
/>
</div>
<div class="input-item">
<a-popover placement="right">
<template #content>
<p class="label-hover-info">{{ $t('engine.recordingPathInfo') }}</p>
</template>
<span class="input-label info-label"
:style="{color: uiColor}"
>{{ $t('engine.recordingPath') }}</span>
</a-popover>
<span
class="input-folder"
:style="{color: uiColor}"
@click="selectFolderPath('rec')"
><span><FolderOpenOutlined /></span></span>
<a-input
class="input-area"
style="width:calc(100% - 140px);"
v-model:value="currentRecordingPath"
/>
</div>
<div class="input-item">
<a-popover placement="right">
<template #content>
<p class="label-hover-info">{{ $t('engine.startTimeoutInfo') }}</p>
</template>
<span
class="input-label info-label"
:style="{color: uiColor, verticalAlign: 'middle'}"
>{{ $t('engine.startTimeout') }}</span>
</a-popover>
<a-input-number
class="input-area"
v-model:value="currentStartTimeoutSeconds"
:min="10"
:max="120"
:step="5"
:addon-after="$t('engine.seconds')"
/>
</div>
</a-card>
</a-card>
@@ -116,55 +214,102 @@
</template>
<script setup lang="ts">
import { ref, computed, watch } from 'vue'
import { ref, computed, watch, h } from 'vue'
import { storeToRefs } from 'pinia'
import { useGeneralSettingStore } from '@renderer/stores/generalSetting'
import { useEngineControlStore } from '@renderer/stores/engineControl'
import { notification } from 'ant-design-vue'
import { FolderOpenOutlined ,InfoCircleOutlined } from '@ant-design/icons-vue';
import { ExclamationCircleOutlined, FolderOpenOutlined ,InfoCircleOutlined } from '@ant-design/icons-vue';
import { useI18n } from 'vue-i18n'
const { t } = useI18n()
const showMore = ref(false)
const engineControl = useEngineControlStore()
const { platform, captionEngine, audioType, changeSignal } = storeToRefs(engineControl)
const { captionEngine, audioType, changeSignal } = storeToRefs(engineControl)
const generalSetting = useGeneralSettingStore()
const { uiColor } = storeToRefs(generalSetting)
const currentSourceLang = ref('auto')
const currentTargetLang = ref('zh')
const currentEngine = ref<string>('gummy')
const currentAudio = ref<0 | 1>(0)
const currentTranslation = ref<boolean>(false)
const currentTranslation = ref<boolean>(true)
const currentRecording = ref<boolean>(false)
const currentTransModel = ref('ollama')
const currentOllamaName = ref('')
const currentAPI_KEY = ref<string>('')
const currentModelPath = ref<string>('')
const currentVoskModelPath = ref<string>('')
const currentSosvModelPath = ref<string>('')
const currentRecordingPath = ref<string>('')
const currentCustomized = ref<boolean>(false)
const currentCustomizedApp = ref('')
const currentCustomizedCommand = ref('')
const currentStartTimeoutSeconds = ref<number>(30)
const langList = computed(() => {
const sLangList = computed(() => {
for(let item of captionEngine.value){
if(item.value === currentEngine.value) {
return item.languages
return item.languages.filter(item => item.type <= 0)
}
}
return []
})
const tLangList = computed(() => {
for(let item of captionEngine.value){
if(item.value === currentEngine.value) {
return item.languages.filter(item => item.type >= 0)
}
}
return []
})
const transModel = computed(() => {
for(let item of captionEngine.value){
if(item.value === currentEngine.value) {
return item.transModel
}
}
return []
})
function applyChange(){
if(
currentTranslation.value && transModel.value &&
currentTransModel.value === 'ollama' && !currentOllamaName.value.trim()
) {
notification.open({
message: t('noti.ollamaNameNull'),
description: t('noti.ollamaNameNullNote'),
duration: null,
icon: () => h(ExclamationCircleOutlined, { style: 'color: #ff4d4f' })
})
return
}
engineControl.sourceLang = currentSourceLang.value
engineControl.targetLang = currentTargetLang.value
engineControl.transModel = currentTransModel.value
engineControl.ollamaName = currentOllamaName.value
engineControl.engine = currentEngine.value
engineControl.audio = currentAudio.value
engineControl.translation = currentTranslation.value
engineControl.recording = currentRecording.value
engineControl.API_KEY = currentAPI_KEY.value
engineControl.modelPath = currentModelPath.value
engineControl.voskModelPath = currentVoskModelPath.value
engineControl.sosvModelPath = currentSosvModelPath.value
engineControl.recordingPath = currentRecordingPath.value
engineControl.customized = currentCustomized.value
engineControl.customizedApp = currentCustomizedApp.value
engineControl.customizedCommand = currentCustomizedCommand.value
engineControl.startTimeoutSeconds = currentStartTimeoutSeconds.value
engineControl.sendControlsChange()
notification.open({
placement: 'topLeft',
message: t('noti.engineChange'),
description: t('noti.changeInfo')
});
@@ -173,20 +318,31 @@ function applyChange(){
function cancelChange(){
currentSourceLang.value = engineControl.sourceLang
currentTargetLang.value = engineControl.targetLang
currentTransModel.value = engineControl.transModel
currentOllamaName.value = engineControl.ollamaName
currentEngine.value = engineControl.engine
currentAudio.value = engineControl.audio
currentTranslation.value = engineControl.translation
currentRecording.value = engineControl.recording
currentAPI_KEY.value = engineControl.API_KEY
currentModelPath.value = engineControl.modelPath
currentVoskModelPath.value = engineControl.voskModelPath
currentSosvModelPath.value = engineControl.sosvModelPath
currentRecordingPath.value = engineControl.recordingPath
currentCustomized.value = engineControl.customized
currentCustomizedApp.value = engineControl.customizedApp
currentCustomizedCommand.value = engineControl.customizedCommand
currentStartTimeoutSeconds.value = engineControl.startTimeoutSeconds
}
function selectFolderPath() {
function selectFolderPath(type: 'vosk' | 'sosv' | 'rec') {
window.electron.ipcRenderer.invoke('control.folder.select').then((folderPath) => {
if(!folderPath) return
currentModelPath.value = folderPath
if(type == 'vosk')
currentVoskModelPath.value = folderPath
else if(type == 'sosv')
currentSosvModelPath.value = folderPath
else if(type == 'rec')
currentRecordingPath.value = folderPath
})
}
@@ -200,9 +356,12 @@ watch(changeSignal, (val) => {
watch(currentEngine, (val) => {
if(val == 'vosk'){
currentSourceLang.value = 'auto'
currentTargetLang.value = ''
currentTargetLang.value = useGeneralSettingStore().uiLanguage
if(currentTargetLang.value === 'zh') {
currentTargetLang.value = 'zh-cn'
}
}
else if(val == 'gummy'){
else{
currentSourceLang.value = 'auto'
currentTargetLang.value = useGeneralSettingStore().uiLanguage
}
@@ -218,8 +377,8 @@ watch(currentEngine, (val) => {
}
.info-label {
color: #1677ff;
cursor: pointer;
font-style: italic;
}
.input-folder {
@@ -230,20 +389,12 @@ watch(currentEngine, (val) => {
transition: all 0.25s;
}
.input-folder>span {
padding: 0 4px;
border: 2px solid #1677ff;
color: #1677ff;
border-radius: 30%;
}
.input-folder:hover {
transform: scale(1.1);
}
.customize-note {
padding: 10px 10px 0;
color: red;
max-width: min(40vw, 480px);
}
</style>

View File

@@ -1,22 +1,59 @@
<template>
<div class="caption-stat">
<a-row>
<a-col :span="6">
<a-col :span="5">
<a-statistic
:title="$t('status.engine')"
:value="(customized && customizedApp)?$t('status.customized'):engine"
:value="customized?$t('status.customized'):engine"
/>
</a-col>
<a-col :span="6">
<a-statistic
:title="$t('status.status')"
:value="engineEnabled?$t('status.started'):$t('status.stopped')"
/>
</a-col>
<a-col :span="6">
<a-popover :title="$t('status.engineStatus')">
<template #content>
<a-row class="engine-status">
<a-col :flex="1" :title="$t('status.pid')" style="cursor:pointer;">
<div class="engine-status-title">pid</div>
<div>{{ pid }}</div>
</a-col>
<a-col :flex="1" :title="$t('status.ppid')" style="cursor:pointer;">
<div class="engine-status-title">ppid</div>
<div>{{ ppid }}</div>
</a-col>
<a-col :flex="1" :title="$t('status.port')" style="cursor:pointer;">
<div class="engine-status-title">port</div>
<div>{{ port }}</div>
</a-col>
<a-col :flex="1" :title="$t('status.cpu')" style="cursor:pointer;">
<div class="engine-status-title">cpu</div>
<div>{{ cpu.toFixed(1) }}%</div>
</a-col>
<a-col :flex="1" :title="$t('status.mem')" style="cursor:pointer;">
<div class="engine-status-title">mem</div>
<div>{{ (mem/1024/1024).toFixed(2) }}MB</div>
</a-col>
<a-col :flex="1" :title="$t('status.elapsed')" style="cursor:pointer;">
<div class="engine-status-title">elapsed</div>
<div>{{ (elapsed/1000).toFixed(0) }}s</div>
</a-col>
</a-row>
</template>
<a-col :span="5" @mouseenter="getEngineInfo" style="cursor: pointer;">
<a-statistic
:title="$t('status.status')"
:value="engineEnabled?$t('status.started'):$t('status.stopped')"
>
<template #suffix v-if="engineEnabled">
<InfoCircleOutlined style="font-size:18px;color:#1677ff"/>
</template>
</a-statistic>
</a-col>
</a-popover>
<a-col :span="5">
<a-statistic :title="$t('status.logNumber')" :value="captionData.length" />
</a-col>
<a-col :span="6">
<a-col :span="5">
<a-statistic :title="$t('status.logNumber2')" :value="softwareLogs.length" />
</a-col>
<a-col :span="4">
<div class="about-tag">{{ $t('status.aboutProj') }}</div>
<GithubOutlined class="proj-info" @click="showAbout = true"/>
</a-col>
@@ -30,13 +67,30 @@
@click="openCaptionWindow"
>{{ $t('status.openCaption') }}</a-button>
<a-button
v-if="!isStarting"
class="control-button"
:disabled="engineEnabled"
:loading="pending && !engineEnabled"
:disabled="pending || engineEnabled"
@click="startEngine"
>{{ $t('status.startEngine') }}</a-button>
<a-popconfirm
v-if="isStarting"
:title="$t('status.forceKillConfirm')"
:ok-text="$t('status.confirm')"
:cancel-text="$t('status.cancel')"
@confirm="forceKillEngine"
>
<a-button
danger
class="control-button"
type="primary"
:icon="h(LoadingOutlined)"
>{{ $t('status.forceKillStarting') }}</a-button>
</a-popconfirm>
<a-button
danger class="control-button"
:disabled="!engineEnabled"
:loading="pending && engineEnabled"
:disabled="pending || !engineEnabled"
@click="stopEngine"
>{{ $t('status.stopEngine') }}</a-button>
</div>
@@ -47,7 +101,7 @@
<p class="about-desc">{{ $t('status.about.desc') }}</p>
<a-divider />
<div class="about-info">
<p><b>{{ $t('status.about.version') }}</b><a-tag color="green">v0.4.0</a-tag></p>
<p><b>{{ $t('status.about.version') }}</b><a-tag color="green">v1.0.0</a-tag></p>
<p>
<b>{{ $t('status.about.author') }}</b>
<a
@@ -88,38 +142,103 @@
</template>
<script setup lang="ts">
import { ref } from 'vue'
import { EngineInfo } from '@renderer/types'
import { ref, watch, h } from 'vue'
import { storeToRefs } from 'pinia'
import { useCaptionLogStore } from '@renderer/stores/captionLog'
import { useSoftwareLogStore } from '@renderer/stores/softwareLog'
import { useEngineControlStore } from '@renderer/stores/engineControl'
import { GithubOutlined } from '@ant-design/icons-vue';
import { GithubOutlined, InfoCircleOutlined, LoadingOutlined } from '@ant-design/icons-vue'
const showAbout = ref(false)
const pending = ref(false)
const isStarting = ref(false)
const captionLog = useCaptionLogStore()
const { captionData } = storeToRefs(captionLog)
const softwareLog = useSoftwareLogStore()
const { softwareLogs } = storeToRefs(softwareLog)
const engineControl = useEngineControlStore()
const { engineEnabled, engine, customized, customizedApp } = storeToRefs(engineControl)
const { engineEnabled, engine, customized, errorSignal } = storeToRefs(engineControl)
const pid = ref(0)
const ppid = ref(0)
const port = ref(0)
const cpu = ref(0)
const mem = ref(0)
const elapsed = ref(0)
function openCaptionWindow() {
window.electron.ipcRenderer.send('control.captionWindow.activate')
}
function startEngine() {
console.log(`@@${engineControl.modelPath}##`)
if(engineControl.engine === 'vosk' && engineControl.modelPath.trim() === '') {
pending.value = true
isStarting.value = true
if(engineControl.engine === 'vosk' && engineControl.voskModelPath.trim() === '') {
engineControl.emptyModelPathErr()
pending.value = false
isStarting.value = false
return
}
if(engineControl.engine === 'sosv' && engineControl.sosvModelPath.trim() === '') {
engineControl.emptyModelPathErr()
pending.value = false
isStarting.value = false
return
}
window.electron.ipcRenderer.send('control.engine.start')
}
function stopEngine() {
pending.value = true
window.electron.ipcRenderer.send('control.engine.stop')
}
function forceKillEngine() {
pending.value = true
isStarting.value = false
window.electron.ipcRenderer.send('control.engine.forceKill')
}
function getEngineInfo() {
window.electron.ipcRenderer.invoke('control.engine.info').then((data: EngineInfo) => {
pid.value = data.pid
ppid.value = data.ppid
port.value = data.port
cpu.value = data.cpu
mem.value = data.mem
elapsed.value = data.elapsed
})
}
watch(engineEnabled, (enabled) => {
pending.value = false
if (enabled) {
isStarting.value = false
}
})
watch(errorSignal, () => {
pending.value = false
isStarting.value = false
errorSignal.value = false
})
</script>
<style scoped>
.engine-status {
width: max(420px, 36vw);
display: flex;
align-items: center;
padding: 5px 10px;
}
.engine-status-title {
font-size: 12px;
color: var(--tag-color);
}
.about-tag {
color: var(--tag-color);
margin-bottom: 16px;

View File

@@ -28,11 +28,24 @@
</a-radio-group>
</div>
<div class="input-item">
<span class="input-label">{{ $t('general.color') }}</span>
<a-radio-group v-model:value="uiColor">
<template v-for="color in colorList" :key="color">
<a-radio-button :value="color"
:style="{backgroundColor: color}"
>
<CheckOutlined style="color: white;" v-if="color === uiColor" />
<span v-else>&nbsp;</span>
</a-radio-button>
</template>
</a-radio-group>
</div>
<div class="input-item">
<span class="input-label">{{ $t('general.barWidth') }}</span>
<a-input
type="range" class="span-input"
min="6" max="12" v-model:value="leftBarWidth"
<a-slider class="span-input"
:min="6" :max="12" v-model:value="leftBarWidth"
/>
<div class="input-item-value">{{ (leftBarWidth * 100 / 24).toFixed(0) }}%</div>
</div>
@@ -41,19 +54,55 @@
</template>
<script setup lang="ts">
import { ref, watch } from 'vue'
import { storeToRefs } from 'pinia'
import { useGeneralSettingStore } from '@renderer/stores/generalSetting'
import { InfoCircleOutlined } from '@ant-design/icons-vue';
import { InfoCircleOutlined, CheckOutlined } from '@ant-design/icons-vue'
const generalSettingStore = useGeneralSettingStore()
const { uiLanguage, uiTheme, leftBarWidth } = storeToRefs(generalSettingStore)
const { uiLanguage, realTheme, uiTheme, uiColor, leftBarWidth } = storeToRefs(generalSettingStore)
const colorListLight = [
'#1677ff',
'#00b96b',
'#fa8c16',
'#9254de',
'#eb2f96',
'#000000'
]
const colorListDark = [
'#1677ff',
'#00b96b',
'#fa8c16',
'#9254de',
'#eb2f96',
'#b9d7ea'
]
const colorList = ref(colorListLight)
watch(realTheme, (val) => {
if(val === 'dark') {
colorList.value = colorListDark
} else {
colorList.value = colorListLight
}
console.log(val)
})
watch(uiTheme, (val) => {
console.log(val)
})
</script>
<style scoped>
@import url(../assets/input.css);
.span-input {
display: inline-block;
width: 100px;
margin: 0;
}
.general-note {

View File

@@ -0,0 +1,115 @@
<template>
<div>
<div class="log-title">
<span style="margin-right: 30px;">{{ $t('log.title2') }}</span>
</div>
<a-button
danger
@click="softwareLog.clear()"
>{{ $t('log.clear') }}</a-button>
</div>
<a-table
:columns="columns"
:data-source="softwareLogs"
v-model:pagination="pagination"
style="margin-top: 10px;"
>
<template #bodyCell="{ column, record }">
<template v-if="column.key === 'index'">
{{ record.index }}
</template>
<template v-if="column.key === 'type'">
<code :class="record.type">{{ record.type }}</code>
</template>
<template v-if="column.key === 'time'">
<code>{{ record.time }}</code>
</template>
<template v-if="column.key === 'content'">
<code>{{ record.text }}</code>
</template>
</template>
</a-table>
</template>
<script setup lang="ts">
import { ref } from 'vue'
import { storeToRefs } from 'pinia'
import { useSoftwareLogStore } from '@renderer/stores/softwareLog'
import { type SoftwareLogItem } from '../types'
const softwareLog = useSoftwareLogStore()
const { softwareLogs } = storeToRefs(softwareLog)
const pagination = ref({
current: 1,
pageSize: 20,
showSizeChanger: true,
pageSizeOptions: ['10', '20', '50', '100'],
onChange: (page: number, pageSize: number) => {
pagination.value.current = page
pagination.value.pageSize = pageSize
},
onShowSizeChange: (current: number, size: number) => {
pagination.value.current = current
pagination.value.pageSize = size
}
})
const columns = [
{
title: 'index',
dataIndex: 'index',
key: 'index',
width: 80,
sorter: (a: SoftwareLogItem, b: SoftwareLogItem) => {
if(a.index <= b.index) return -1
return 1
},
sortDirections: ['descend'],
defaultSortOrder: 'descend',
},
{
title: 'type',
dataIndex: 'type',
key: 'type',
width: 80,
sorter: (a: SoftwareLogItem, b: SoftwareLogItem) => {
if(a.type <= b.type) return -1
return 1
},
},
{
title: 'time',
dataIndex: 'time',
key: 'time',
width: 135,
sortDirections: ['descend'],
},
{
title: 'content',
dataIndex: 'content',
key: 'content',
},
]
</script>
<style scoped>
.log-title {
color: var(--icon-color);
display: inline-block;
font-size: 24px;
font-weight: bold;
margin: 10px 0;
}
.WARN {
color: #ff7c05;
font-weight: bold;
}
.ERROR {
color: #ff0000;
font-weight: bold;
}
</style>

View File

@@ -1,78 +1,183 @@
// type: -1 仅为源语言0 两者皆可, 1 仅为翻译语言
export const engines = {
zh: [
{
value: 'gummy',
label: '云端 - 阿里云 - Gummy',
label: '云端 / 阿里云 / Gummy',
languages: [
{ value: 'auto', label: '自动检测' },
{ value: 'en', label: '英语' },
{ value: 'zh', label: '中文' },
{ value: 'ja', label: '日语' },
{ value: 'ko', label: '韩语' },
{ value: 'de', label: '德语' },
{ value: 'fr', label: '法语' },
{ value: 'ru', label: '俄语' },
{ value: 'es', label: '西班牙语' },
{ value: 'it', label: '意大利语' },
{ value: 'auto', type: -1, label: '自动检测' },
{ value: 'en', type: 0, label: '英语' },
{ value: 'zh', type: 0, label: '中文' },
{ value: 'ja', type: 0, label: '日语' },
{ value: 'ko', type: 0, label: '韩语' },
{ value: 'de', type: -1, label: '德语' },
{ value: 'fr', type: -1, label: '法语' },
{ value: 'ru', type: -1, label: '俄语' },
{ value: 'es', type: -1, label: '西班牙语' },
{ value: 'it', type: -1, label: '意大利语' },
{ value: 'yue', type: -1, label: '粤语' },
]
},
{
value: 'vosk',
label: '本地 - Vosk',
label: '本地 / Vosk',
languages: [
{ value: 'auto', label: '需要自行配置模型' },
{ value: 'auto', type: -1, label: '需要自行配置模型' },
{ value: 'en', type: 1, label: '英语' },
{ value: 'zh-cn', type: 1, label: '中文' },
{ value: 'ja', type: 1, label: '日语' },
{ value: 'ko', type: 1, label: '韩语' },
{ value: 'de', type: 1, label: '德语' },
{ value: 'fr', type: 1, label: '法语' },
{ value: 'ru', type: 1, label: '俄语' },
{ value: 'es', type: 1, label: '西班牙语' },
{ value: 'it', type: 1, label: '意大利语' },
],
transModel: [
{ value: 'ollama', label: 'Ollama 本地模型' },
{ value: 'google', label: 'Google API 调用' },
]
},
{
value: 'sosv',
label: '本地 / SOSV',
languages: [
{ value: 'auto', type: -1, label: '自动检测' },
{ value: 'en', type: 0, label: '英语' },
{ value: 'zh', type: 0, label: '中文' },
{ value: 'ja', type: 0, label: '日语' },
{ value: 'ko', type: 0, label: '韩语' },
{ value: 'yue', type: -1, label: '粤语' },
{ value: 'de', type: 1, label: '德语' },
{ value: 'fr', type: 1, label: '法语' },
{ value: 'ru', type: 1, label: '俄语' },
{ value: 'es', type: 1, label: '西班牙语' },
{ value: 'it', type: 1, label: '意大利语' },
],
transModel: [
{ value: 'ollama', label: 'Ollama 本地模型' },
{ value: 'google', label: 'Google API 调用' },
]
}
],
en: [
{
value: 'gummy',
label: 'Cloud - Alibaba Cloud - Gummy',
label: 'Cloud / Alibaba Cloud / Gummy',
languages: [
{ value: 'auto', label: 'Auto Detect' },
{ value: 'en', label: 'English' },
{ value: 'zh', label: 'Chinese' },
{ value: 'ja', label: 'Japanese' },
{ value: 'ko', label: 'Korean' },
{ value: 'de', label: 'German' },
{ value: 'fr', label: 'French' },
{ value: 'ru', label: 'Russian' },
{ value: 'es', label: 'Spanish' },
{ value: 'it', label: 'Italian' },
{ value: 'auto', type: -1, label: 'Auto Detect' },
{ value: 'en', type: 0, label: 'English' },
{ value: 'zh', type: 0, label: 'Chinese' },
{ value: 'ja', type: 0, label: 'Japanese' },
{ value: 'ko', type: 0, label: 'Korean' },
{ value: 'de', type: -1, label: 'German' },
{ value: 'fr', type: -1, label: 'French' },
{ value: 'ru', type: -1, label: 'Russian' },
{ value: 'es', type: -1, label: 'Spanish' },
{ value: 'it', type: -1, label: 'Italian' },
{ value: 'yue', type: -1, label: 'Cantonese' },
]
},
{
value: 'vosk',
label: 'Local - Vosk',
label: 'Local / Vosk',
languages: [
{ value: 'auto', label: 'Model needs to be configured manually' },
{ value: 'auto', type: -1, label: 'Model needs to be configured manually' },
{ value: 'en', type: 1, label: 'English' },
{ value: 'zh-cn', type: 1, label: 'Chinese' },
{ value: 'ja', type: 1, label: 'Japanese' },
{ value: 'ko', type: 1, label: 'Korean' },
{ value: 'de', type: 1, label: 'German' },
{ value: 'fr', type: 1, label: 'French' },
{ value: 'ru', type: 1, label: 'Russian' },
{ value: 'es', type: 1, label: 'Spanish' },
{ value: 'it', type: 1, label: 'Italian' },
],
transModel: [
{ value: 'ollama', label: 'Ollama Local Model' },
{ value: 'google', label: 'Google API Call' },
]
},
{
value: 'sosv',
label: 'Local / SOSV',
languages: [
{ value: 'auto', type: -1, label: 'Auto Detect' },
{ value: 'en', type: 0, label: 'English' },
{ value: 'zh-cn', type: 0, label: 'Chinese' },
{ value: 'ja', type: 0, label: 'Japanese' },
{ value: 'ko', type: 0, label: 'Korean' },
{ value: 'yue', type: -1, label: 'Cantonese' },
{ value: 'de', type: 1, label: 'German' },
{ value: 'fr', type: 1, label: 'French' },
{ value: 'ru', type: 1, label: 'Russian' },
{ value: 'es', type: 1, label: 'Spanish' },
{ value: 'it', type: 1, label: 'Italian' },
],
transModel: [
{ value: 'ollama', label: 'Ollama Local Model' },
{ value: 'google', label: 'Google API Call' },
]
}
],
ja: [
{
value: 'gummy',
label: 'クラウド - アリババクラウド - Gummy',
label: 'クラウド / アリババクラウド / Gummy',
languages: [
{ value: 'auto', label: '自動検出' },
{ value: 'en', label: '英語' },
{ value: 'zh', label: '中国語' },
{ value: 'ja', label: '日本語' },
{ value: 'ko', label: '韓国語' },
{ value: 'de', label: 'ドイツ語' },
{ value: 'fr', label: 'フランス語' },
{ value: 'ru', label: 'ロシア語' },
{ value: 'es', label: 'スペイン語' },
{ value: 'it', label: 'イタリア語' },
{ value: 'auto', type: -1, label: '自動検出' },
{ value: 'en', type: 0, label: '英語' },
{ value: 'zh', type: 0, label: '中国語' },
{ value: 'ja', type: 0, label: '日本語' },
{ value: 'ko', type: 0, label: '韓国語' },
{ value: 'de', type: -1, label: 'ドイツ語' },
{ value: 'fr', type: -1, label: 'フランス語' },
{ value: 'ru', type: -1, label: 'ロシア語' },
{ value: 'es', type: -1, label: 'スペイン語' },
{ value: 'it', type: -1, label: 'イタリア語' },
{ value: 'yue', type: -1, label: '広東語' },
]
},
{
value: 'vosk',
label: 'ローカル - Vosk',
label: 'ローカル / Vosk',
languages: [
{ value: 'auto', label: 'モデルを手動で設定する必要があります' },
{ value: 'auto', type: -1, label: 'モデルを手動で設定する必要があります' },
{ value: 'en', type: 1, label: '英語' },
{ value: 'zh-cn', type: 1, label: '中国語' },
{ value: 'ja', type: 1, label: '日本語' },
{ value: 'ko', type: 1, label: '韓国語' },
{ value: 'de', type: 1, label: 'ドイツ語' },
{ value: 'fr', type: 1, label: 'フランス語' },
{ value: 'ru', type: 1, label: 'ロシア語' },
{ value: 'es', type: 1, label: 'スペイン語' },
{ value: 'it', type: 1, label: 'イタリア語' },
],
transModel: [
{ value: 'ollama', label: 'Ollama ローカルモデル' },
{ value: 'google', label: 'Google API 呼び出し' },
]
},
{
value: 'sosv',
label: 'ローカル / SOSV',
languages: [
{ value: 'auto', type: -1, label: '自動検出' },
{ value: 'en', type: 0, label: '英語' },
{ value: 'zh-cn', type: 0, label: '中国語' },
{ value: 'ja', type: 0, label: '日本語' },
{ value: 'ko', type: 0, label: '韓国語' },
{ value: 'yue', type: -1, label: '広東語' },
{ value: 'de', type: 1, label: 'ドイツ語' },
{ value: 'fr', type: 1, label: 'フランス語' },
{ value: 'ru', type: 1, label: 'ロシア語' },
{ value: 'es', type: 1, label: 'スペイン語' },
{ value: 'it', type: 1, label: 'イタリア語' },
],
transModel: [
{ value: 'ollama', label: 'Ollama ローカルモデル' },
{ value: 'google', label: 'Google API 呼び出し' },
]
}
]
}

View File

@@ -0,0 +1,41 @@
import { h } from 'vue';
import { OrderedListOutlined, FileTextOutlined } from '@ant-design/icons-vue'
export const logMenu = {
zh: [
{
key: 'captionLog',
icon: () => h(OrderedListOutlined),
label: '字幕记录',
},
{
key: 'projLog',
icon: () => h(FileTextOutlined),
label: '日志记录',
},
],
en: [
{
key: 'captionLog',
icon: () => h(OrderedListOutlined),
label: 'Caption Log',
},
{
key: 'projLog',
icon: () => h(FileTextOutlined),
label: 'Software Log',
},
],
ja: [
{
key: 'captionLog',
icon: () => h(OrderedListOutlined),
label: '字幕記録',
},
{
key: 'projLog',
icon: () => h(FileTextOutlined),
label: 'ログ記録',
},
]
}

View File

@@ -1,10 +1,29 @@
import { theme } from 'ant-design-vue';
export const antDesignTheme = {
light: {
token: {}
},
dark: {
algorithm: theme.darkAlgorithm,
}
let isLight = true
let themeColor = '#1677ff'
export function setThemeColor(color: string) {
themeColor = color
}
export function getTheme(curIsLight?: boolean) {
const lightTheme = {
token: {
colorPrimary: themeColor,
colorInfo: themeColor
}
}
const darkTheme = {
algorithm: theme.darkAlgorithm,
token: {
colorPrimary: themeColor,
colorInfo: themeColor
}
}
if(curIsLight !== undefined){
isLight = curIsLight
}
return isLight ? lightTheme : darkTheme
}

View File

@@ -18,3 +18,4 @@ export * from './config/engine'
export * from './config/audio'
export * from './config/theme'
export * from './config/linebreak'
export * from './config/logMenu'

View File

@@ -18,14 +18,19 @@ export default {
"args": ", command arguments: ",
"pidInfo": ", caption engine process PID: ",
"empty": "Model Path is Empty",
"emptyInfo": "The Vosk model path is empty. Please set the Vosk model path in the additional settings of the subtitle engine settings.",
"emptyInfo": "The model path for the selected local model is empty. Please download the corresponding model first, and then configure the model path in [Caption Engine Settings > More Settings].",
"stopped": "Caption Engine Stopped",
"stoppedInfo": "The caption engine has stopped. You can click the 'Start Caption Engine' button to restart it.",
"error": "An error occurred",
"engineError": "The subtitle engine encountered an error and requested a forced exit.",
"socketError": "The Socket connection between the main program and the caption engine failed",
"engineChange": "Cpation Engine Configuration Changed",
"changeInfo": "If the caption engine is already running, you need to restart it for the changes to take effect.",
"styleChange": "Caption Style Changed",
"styleInfo": "Caption style changes have been saved and applied."
"styleInfo": "Caption style changes have been saved and applied.",
"engineStartTimeout": "Caption engine startup timeout, automatically force stopped",
"ollamaNameNull": "'Ollama' Field is Empty",
"ollamaNameNullNote": "When selecting Ollama model as the translation model, the 'Ollama' field cannot be empty and must be filled with the name of a locally configured Ollama model."
},
general: {
"title": "General Settings",
@@ -35,7 +40,8 @@ export default {
"theme": "Theme",
"light": "light",
"dark": "dark",
"system": "system"
"system": "system",
"color": "Color"
},
engine: {
"title": "Caption Engine Settings",
@@ -43,17 +49,30 @@ export default {
"cancelChange": "Cancel Changes",
"sourceLang": "Source",
"transLang": "Translation",
"transModel": "Model",
"ollama": "Ollama",
"ollamaNote": "To use for translation, the name of the local Ollama model that will call the service on the default port. It is recommended to use a non-inference model with less than 1B parameters.",
"captionEngine": "Engine",
"audioType": "Audio Type",
"systemOutput": "System Audio Output (Speaker)",
"systemInput": "System Audio Input (Microphone)",
"enableTranslation": "Translation",
"enableRecording": "Enable Recording",
"showMore": "More Settings",
"apikey": "API KEY",
"modelPath": "Model Path",
"voskModelPath": "Vosk Path",
"sosvModelPath": "SOSV Path",
"recordingPath": "Save Path",
"startTimeout": "Timeout",
"seconds": "seconds",
"apikeyInfo": "API KEY required for the Gummy subtitle engine, which needs to be obtained from the Alibaba Cloud Bailing platform. For more details, see the project user manual.",
"modelPathInfo": "The folder path of the model required by the Vosk subtitle engine. You need to download the required model to your local machine in advance. For more details, see the project user manual.",
"customEngine": "Custom Engine",
"voskModelPathInfo": "The folder path of the model required by the Vosk subtitle engine. You need to download the required model to your local machine in advance. For more details, see the project user manual.",
"sosvModelPathInfo": "The folder path of the model required by the SOSV subtitle engine. You need to download the required model to your local machine in advance. For more details, see the project user manual.",
"recordingPathInfo": "The path to save recording files, requiring a folder path. The software will automatically name the recording file and save it as .wav file.",
"modelDownload": "Model Download Link",
"startTimeoutInfo": "Caption engine startup timeout duration. Engine will be forcefully stopped if startup exceeds this time. Recommended range: 10-120 seconds.",
"customEngine": "Custom Eng",
custom: {
"title": "Custom Caption Engine",
"attention": "Attention",
@@ -66,6 +85,7 @@ export default {
"title": "Caption Style Settings",
"applyStyle": "Apply",
"cancelChange": "Cancel",
"lineNumber": "CaptionLines",
"resetStyle": "Reset",
"longCaption": "LongCaption",
"fontFamily": "Font Family",
@@ -91,16 +111,29 @@ export default {
},
status: {
"engine": "Caption Engine",
"engineStatus": "Caption Engine Status",
"pid": "Process ID",
"ppid": "Parent Process ID",
"cpu": "CPU Usage",
"port": "Socket Port Number",
"mem": "Memory Usage",
"elapsed": "Running Time",
"customized": "Customized",
"status": "Engine Status",
"started": "Started",
"stopped": "Not Started",
"logNumber": "Caption Count",
"logNumber2": "Log Count",
"aboutProj": "About Project",
"openCaption": "Open Caption Window",
"startEngine": "Start Caption Engine",
"restartEngine": "Restart Caption Engine",
"stopEngine": "Stop Caption Engine",
"forceKill": "Force Stop",
"forceKillStarting": "Starting Engine... (Force Stop)",
"forceKillConfirm": "Are you sure you want to force stop the caption engine? This will terminate the process immediately.",
"confirm": "Confirm",
"cancel": "Cancel",
about: {
"title": "About This Project",
"proj": "Auto Caption Project",
@@ -110,21 +143,33 @@ export default {
"projLink": "Project Link",
"manual": "User Manual",
"engineDoc": "Caption Engine Manual",
"date": "July 11, 2026"
"date": "September 8th, 2025"
}
},
log: {
"title": "Caption Log",
"copy": "Copy to Clipboard",
"changeTime": "Modify Time",
"baseTime": "First Caption Start Time",
"hour": "Hour",
"min": "Minute",
"sec": "Second",
"ms": "Millisecond",
"export": "Export Log",
"copy": "Copy Log",
"exportOptions": "Export Options",
"exportFormat": "Format",
"exportContent": "Content",
"copyOptions": "Copy Options",
"addIndex": "Add Index",
"copyTime": "Copy Time",
"copyContent": "Content",
"both": "Original and Translation",
"source": "Original Only",
"translation": "Translation Only",
"both": "Both",
"source": "Original",
"translation": "Translation",
"copyNum": "Copy Count",
"all": "All",
"copySuccess": "Subtitle copied to clipboard",
"export": "Export Caption Log",
"clear": "Clear Caption Log"
"clear": "Clear Log",
"title2": "Software Log"
}
}

View File

@@ -18,14 +18,19 @@ export default {
"args": "、コマンド引数:",
"pidInfo": "、字幕エンジンプロセス PID",
"empty": "モデルパスが空です",
"emptyInfo": "Vosk モデルのパスが空です。字幕エンジン設定の追加設定で Vosk モデルのパスを設定してください。",
"emptyInfo": "選択されたローカルモデルに対応するモデルパスが空です。まず対応するモデルをダウンロードし、次に【字幕エンジン設定 > 詳細設定】でモデルのパスを設定してください。",
"stopped": "字幕エンジンが停止しました",
"stoppedInfo": "字幕エンジンが停止しました。再起動するには「字幕エンジンを開始」ボタンをクリックしてください。",
"error": "エラーが発生しました",
"engineError": "字幕エンジンにエラーが発生し、強制終了が要求されました。",
"socketError": "メインプログラムと字幕エンジンの Socket 接続に失敗しました",
"engineChange": "字幕エンジンの設定が変更されました",
"changeInfo": "字幕エンジンがすでに起動している場合、変更を有効にするには再起動が必要です。",
"styleChange": "字幕のスタイルが変更されました",
"styleInfo": "字幕のスタイル変更が保存され、適用されました"
"styleInfo": "字幕のスタイル変更が保存され、適用されました",
"engineStartTimeout": "字幕エンジンの起動がタイムアウトしました。自動的に強制停止しました",
"ollamaNameNull": "Ollama フィールドが空です",
"ollamaNameNullNote": "Ollama モデルを翻訳モデルとして選択する場合、Ollama フィールドは空にできません。ローカルで設定された Ollama モデルの名前を入力してください。"
},
general: {
"title": "一般設定",
@@ -35,7 +40,8 @@ export default {
"theme": "テーマ",
"light": "明るい",
"dark": "暗い",
"system": "システム"
"system": "システム",
"color": "カラー"
},
engine: {
"title": "字幕エンジン設定",
@@ -43,17 +49,29 @@ export default {
"cancelChange": "変更をキャンセル",
"sourceLang": "ソース言語",
"transLang": "翻訳言語",
"transModel": "翻訳モデル",
"ollama": "Ollama",
"ollamaNote": "翻訳に使用する、デフォルトポートでサービスを呼び出すローカルOllamaモデルの名前。1B 未満のパラメータを持つ非推論モデルの使用を推奨します。",
"captionEngine": "エンジン",
"audioType": "オーディオ",
"systemOutput": "システムオーディオ出力(スピーカー)",
"systemInput": "システムオーディオ入力(マイク)",
"enableTranslation": "翻訳",
"enableRecording": "録音",
"showMore": "詳細設定",
"apikey": "API KEY",
"modelPath": "モデルパス",
"voskModelPath": "Voskパス",
"sosvModelPath": "SOSVパス",
"recordingPath": "保存パス",
"startTimeout": "時間制限",
"seconds": "秒",
"apikeyInfo": "Gummy 字幕エンジンに必要な API KEY は、アリババクラウド百煉プラットフォームから取得する必要があります。詳細情報はプロジェクトのユーザーマニュアルをご覧ください。",
"modelPathInfo": "Vosk 字幕エンジンに必要なモデルのフォルダパスです。必要なモデルを事前にローカルマシンにダウンロードする必要があります。詳細情報はプロジェクトのユーザーマニュアルをご覧ください。",
"customEngine": "カスタムエンジン",
"voskModelPathInfo": "Vosk 字幕エンジンに必要なモデルのフォルダパスです。必要なモデルを事前にローカルマシンにダウンロードする必要があります。詳細情報はプロジェクトのユーザーマニュアルをご覧ください。",
"sosvModelPathInfo": "SOSV 字幕エンジンに必要なモデルのフォルダパスです。必要なモデルを事前にローカルマシンにダウンロードする必要があります。詳細情報はプロジェクトのユーザーマニュアルをご覧ください。",
"recordingPathInfo": "録音ファイルの保存パスで、フォルダパスを指定する必要があります。ソフトウェアが自動的に録音ファイルに名前を付けて .wav ファイルとして保存します。",
"modelDownload": "モデルダウンロードリンク",
"startTimeoutInfo": "字幕エンジンの起動タイムアウト時間です。この時間を超えると自動的に強制停止されます。10-120秒の範囲で設定することを推奨します。",
"customEngine": "カスタム",
custom: {
"title": "カスタムキャプションエンジン",
"attention": "注意事項",
@@ -67,6 +85,7 @@ export default {
"applyStyle": "適用",
"cancelChange": "キャンセル",
"resetStyle": "リセット",
"lineNumber": "字幕行数",
"longCaption": "長い字幕",
"fontFamily": "フォント",
"fontColor": "カラー",
@@ -91,16 +110,29 @@ export default {
},
status: {
"engine": "字幕エンジン",
"customized": "カスタマイズ済み",
"engineStatus": "字幕エンジンの状態",
"pid": "プロセス ID",
"ppid": "親プロセス ID",
"port": "Socket ポート番号",
"cpu": "CPU 使用率",
"mem": "メモリ使用量",
"elapsed": "稼働時間",
"customized": "カスタム",
"status": "エンジン状態",
"started": "開始済み",
"stopped": "未開始",
"logNumber": "字幕数",
"logNumber2": "ログ数",
"aboutProj": "プロジェクト情報",
"openCaption": "字幕ウィンドウを開く",
"startEngine": "字幕エンジンを開始",
"restartEngine": "字幕エンジンを再起動",
"stopEngine": "字幕エンジンを停止",
"forceKill": "強制停止",
"forceKillStarting": "エンジン起動中... (強制停止)",
"forceKillConfirm": "字幕エンジンを強制停止しますか?プロセスが直ちに終了されます。",
"confirm": "確認",
"cancel": "キャンセル",
about: {
"title": "このプロジェクトについて",
"proj": "Auto Caption プロジェクト",
@@ -110,21 +142,33 @@ export default {
"projLink": "プロジェクトリンク",
"manual": "ユーザーマニュアル",
"engineDoc": "字幕エンジンマニュアル",
"date": "2025 年 711 日"
"date": "2025 年 98 日"
}
},
log: {
"title": "字幕ログ",
"copy": "クリップボードにコピー",
"title": "字幕記録",
"changeTime": "時間を変更",
"baseTime": "最初の字幕開始時間",
"hour": "時",
"min": "分",
"sec": "秒",
"ms": "ミリ秒",
"export": "エクスポート",
"copy": "記録をコピー",
"exportOptions": "エクスポートオプション",
"exportFormat": "形式",
"exportContent": "内容",
"copyOptions": "コピー設定",
"addIndex": "順序番号",
"copyTime": "時間",
"copyContent": "内容",
"both": "原文と翻訳",
"source": "原文のみ",
"translation": "翻訳のみ",
"both": "すべて",
"source": "原文",
"translation": "翻訳",
"copyNum": "コピー数",
"all": "すべて",
"copySuccess": "字幕がクリップボードにコピーされました",
"export": "エクスポート",
"clear": "字幕ログをクリア"
"clear": "記録をクリア",
"title2": "ログ記録"
}
}

View File

@@ -18,14 +18,19 @@ export default {
"args": ",命令参数:",
"pidInfo": ",字幕引擎进程 PID",
"empty": "模型路径为空",
"emptyInfo": "Vosk 模型模型路径为空,请在字幕引擎设置更多设置中设置 Vosk 模型的路径。",
"emptyInfo": "选择的本地模型对应的模型路径为空。请先下载对应模型,然后在【字幕引擎设置 > 更多设置】中配置对应模型的路径。",
"stopped": "字幕引擎停止",
"stoppedInfo": "字幕引擎已经停止,可点击“启动字幕引擎”按钮重新启动",
"error": "发生错误",
"engineError": "字幕引擎发生错误并请求强制退出",
"socketError": "主程序与字幕引擎的 Socket 连接未成功",
"engineChange": "字幕引擎配置已更改",
"changeInfo": "如果字幕引擎已经启动,需要重启字幕引擎修改才会生效",
"styleChange": "字幕样式已修改",
"styleInfo": "字幕样式修改已经保存并生效"
"styleInfo": "字幕样式修改已经保存并生效",
"engineStartTimeout": "字幕引擎启动超时,已自动强制停止",
"ollamaNameNull": "Ollama 字段为空",
"ollamaNameNullNote": "选择 Ollama 模型作为翻译模型时Ollama 字段不能为空,需要填写本地已经配置好的 Ollama 模型的名称。"
},
general: {
"title": "通用设置",
@@ -35,7 +40,8 @@ export default {
"theme": "主题",
"light": "浅色",
"dark": "深色",
"system": "系统"
"system": "系统",
"color": "颜色"
},
engine: {
"title": "字幕引擎设置",
@@ -43,16 +49,28 @@ export default {
"cancelChange": "取消更改",
"sourceLang": "源语言",
"transLang": "翻译语言",
"transModel": "翻译模型",
"ollama": "Ollama",
"ollamaNote": "要使用的进行翻译的本地 Ollama 模型的名称,将调用默认端口的服务,建议使用参数量小于 1B 的非推理模型。",
"captionEngine": "字幕引擎",
"audioType": "音频类型",
"systemOutput": "系统音频输出(扬声器)",
"systemInput": "系统音频输入(麦克风)",
"enableTranslation": "启用翻译",
"enableRecording": "启用录制",
"showMore": "更多设置",
"apikey": "API KEY",
"modelPath": "模型路径",
"voskModelPath": "Vosk路径",
"sosvModelPath": "SOSV路径",
"recordingPath": "保存路径",
"startTimeout": "启动超时",
"seconds": "秒",
"apikeyInfo": "Gummy 字幕引擎需要的 API KEY需要在阿里云百炼平台获取。详细信息见项目用户手册。",
"modelPathInfo": "Vosk 字幕引擎需要的模型的文件夹路径,需要提前下载需要的模型到本地。信息详情见项目用户手册。",
"voskModelPathInfo": "Vosk 字幕引擎需要的模型的文件夹路径,需要提前下载需要的模型到本地。信息详情见项目用户手册。",
"sosvModelPathInfo": "SOSV 字幕引擎需要的模型的文件夹路径,需要提前下载需要的模型到本地。信息详情见项目用户手册。",
"recordingPathInfo": "录音文件保存路径,需要提供文件夹路径。软件会自动命名录音文件并保存为 .wav 文件。",
"modelDownload": "模型下载地址",
"startTimeoutInfo": "字幕引擎启动超时时间,超过此时间将自动强制停止。建议设置为 10-120 秒之间。",
"customEngine": "自定义引擎",
custom: {
"title": "自定义字幕引擎",
@@ -67,6 +85,7 @@ export default {
"applyStyle": "应用样式",
"cancelChange": "取消更改",
"resetStyle": "恢复默认",
"lineNumber": "字幕行数",
"longCaption": "长字幕",
"fontFamily": "字体族",
"fontColor": "字体颜色",
@@ -91,16 +110,29 @@ export default {
},
status: {
"engine": "字幕引擎",
"engineStatus": "字幕引擎状态",
"pid": "进程ID",
"ppid": "父进程ID",
"port": "Socket 端口号",
"cpu": "CPU使用率",
"mem": "内存使用量",
"elapsed": "运行时间",
"customized": "自定义",
"status": "引擎状态",
"started": "已启动",
"stopped": "未启动",
"logNumber": "字幕数量",
"logNumber2": "日志数量",
"aboutProj": "项目关于",
"openCaption": "打开字幕窗口",
"startEngine": "启动字幕引擎",
"restartEngine": "重启字幕引擎",
"stopEngine": "关闭字幕引擎",
"forceKill": "强行停止",
"forceKillStarting": "正在启动引擎... (强行停止)",
"forceKillConfirm": "确定要强行停止字幕引擎吗?这将立即终止进程。",
"confirm": "确定",
"cancel": "取消",
about: {
"title": "关于本项目",
"proj": "Auto Caption 项目",
@@ -110,21 +142,33 @@ export default {
"projLink": "项目链接",
"manual": "用户手册",
"engineDoc": "字幕引擎手册",
"date": "2025 年 711 日"
"date": "2025 年 98 日"
}
},
log: {
"title": "字幕记录",
"export": "导出字幕记录",
"copy": "复制到剪贴板",
"changeTime": "修改时间",
"baseTime": "首条字幕起始时间",
"hour": "时",
"min": "分",
"sec": "秒",
"ms": "毫秒",
"export": "导出字幕",
"copy": "复制内容",
"exportOptions": "导出选项",
"exportFormat": "导出格式",
"exportContent": "导出内容",
"copyOptions": "复制选项",
"addIndex": "添加序号",
"copyTime": "复制时间",
"copyContent": "复制内容",
"both": "原文与翻译",
"source": "原文",
"translation": "翻译",
"both": "全部",
"source": "原文",
"translation": "翻译",
"copyNum": "复制数量",
"all": "全部",
"copySuccess": "字幕已复制到剪贴板",
"clear": "清空字幕记录"
"clear": "清空记录",
"title2": "日志记录"
}
}

View File

@@ -15,7 +15,12 @@ export const useCaptionLogStore = defineStore('captionLog', () => {
})
window.electron.ipcRenderer.on('both.captionLog.upd', (_, log) => {
captionData.value.splice(captionData.value.length - 1, 1, log)
for(let i = captionData.value.length - 1; i >= 0; i--) {
if(captionData.value[i].time_s === log.time_s){
captionData.value.splice(i, 1, log)
break
}
}
})
window.electron.ipcRenderer.on('both.captionLog.set', (_, logs) => {

View File

@@ -4,6 +4,7 @@ import { Styles } from '@renderer/types'
import { breakOptions } from '@renderer/i18n'
export const useCaptionStyleStore = defineStore('captionStyle', () => {
const lineNumber = ref<number>(1)
const lineBreak = ref<number>(1)
const fontFamily = ref<string>('sans-serif')
const fontSize = ref<number>(24)
@@ -38,6 +39,7 @@ export const useCaptionStyleStore = defineStore('captionStyle', () => {
function sendStylesChange() {
const styles: Styles = {
lineNumber: lineNumber.value,
lineBreak: lineBreak.value,
fontFamily: fontFamily.value,
fontSize: fontSize.value,
@@ -65,6 +67,7 @@ export const useCaptionStyleStore = defineStore('captionStyle', () => {
}
function setStyles(args: Styles){
lineNumber.value = args.lineNumber
lineBreak.value = args.lineBreak
fontFamily.value = args.fontFamily
fontSize.value = args.fontSize
@@ -91,6 +94,7 @@ export const useCaptionStyleStore = defineStore('captionStyle', () => {
})
return {
lineNumber, // 显示字幕行数
lineBreak, // 换行方式
fontFamily, // 字体族
fontSize, // 字体大小

Some files were not shown because too many files have changed in this diff Show More