7 Commits
engine ... main

Author SHA1 Message Date
himeditator
564954a834 release v1.1.1 2026-01-31 13:45:25 +08:00
himeditator
aed15af386 feat(window): make the caption window always stay on top; change the always-on-top option to a mouse pass-through option (#26) 2026-01-31 13:38:53 +08:00
himeditator
4f9d33abc1 feat(README): add preview video 2026-01-17 21:47:17 +08:00
himeditator
0dc70d491e release v1.1.0 2026-01-10 22:50:57 +08:00
HS RedWoods
086ea90a5f Merge pull request #25 from nocmt/dev_glmasr
feat(engine): add GLM-ASR speech recognition engine support
2026-01-10 20:17:35 +08:00
himeditator
3324b630d1 feat(app): adapt to the latest version
- Revise some in-app prompt text
- Mask API KEYs to prevent the API KEY content from being printed directly
2026-01-10 20:15:32 +08:00
nocmt
0825e48902 feat(engine): add GLM-ASR speech recognition engine support
- Add a cloud-based GLM-ASR speech recognition engine implementation
- Extend the settings UI with GLM-related parameter options
- Allow Ollama to use a custom domain and API Key so cloud and other LLMs can be used
- Adjust the audio processing logic to support the new engine
- Update dependencies and build configuration
- Fix issues related to the Ollama translation feature
2026-01-10 16:02:24 +08:00
47 changed files with 842 additions and 233 deletions

5
.gitignore vendored
View File

@@ -7,8 +7,13 @@ out
__pycache__
.venv
test.py
engine/build
engine/portaudio
engine/pyinstaller_cache
engine/models
engine/notebook
# engine/main.spec
.repomap
.virtualme

View File

@@ -3,7 +3,7 @@
<h1 align="center">auto-caption</h1>
<p>Auto Caption 是一个跨平台的实时字幕显示软件。</p>
<p>
<a href="https://github.com/HiMeditator/auto-caption/releases"><img src="https://img.shields.io/badge/release-1.0.0-blue"></a>
<a href="https://github.com/HiMeditator/auto-caption/releases"><img src="https://img.shields.io/badge/release-1.1.1-blue"></a>
<a href="https://github.com/HiMeditator/auto-caption/issues"><img src="https://img.shields.io/github/issues/HiMeditator/auto-caption?color=orange"></a>
<img src="https://img.shields.io/github/languages/top/HiMeditator/auto-caption?color=royalblue">
<img src="https://img.shields.io/github/repo-size/HiMeditator/auto-caption?color=green">
@@ -14,7 +14,7 @@
| <a href="./README_en.md">English</a>
| <a href="./README_ja.md">日本語</a> |
</p>
<p><i>v1.0.0 版本已经发布,新增 SOSV 本地字幕模型。当前功能已经基本完整,暂无继续开发计划...</i></p>
<p><i>v1.1.1 版本已经发布,新增 GLM-ASR 云端字幕模型和 OpenAI 兼容模型翻译...</i></p>
</div>
![](./assets/media/main_zh.png)
@@ -35,18 +35,24 @@ SOSV 模型下载:[ Shepra-ONNX SenseVoice Model](https://github.com/HiMeditat
[更新日志](./docs/CHANGELOG.md)
## 👁️‍🗨️ 预览
https://github.com/user-attachments/assets/9c188d78-9520-4397-bacf-4c8fdcc54874
## ✨ 特性
- 生成音频输出或麦克风输入的字幕
- 支持调用本地 Ollama 模型或云端 Google 翻译 API 进行翻译
- 支持调用本地 Ollama 模型、云端 OpenAI 兼容模型、或云端 Google 翻译 API 进行翻译
- 跨平台(Windows、macOS、Linux)、多界面语言(中文、英语、日语)支持
- 丰富的字幕样式设置(字体、字体大小、字体粗细、字体颜色、背景颜色等)
- 灵活的字幕引擎选择(阿里云 Gummy 云端模型、本地 Vosk 模型、本地 SOSV 模型、还可以自己开发模型)
- 灵活的字幕引擎选择(阿里云 Gummy 云端模型、GLM-ASR 云端模型、本地 Vosk 模型、本地 SOSV 模型、还可以自己开发模型)
- 多语言识别与翻译(见下文“⚙️ 自带字幕引擎说明”)
- 字幕记录展示与导出(支持导出 `.srt``.json` 格式)
## 📖 基本使用
> ⚠️ 注意:目前只维护了 Windows 平台的软件的最新版本,其他平台的最后版本停留在 v1.0.0。
软件已经适配了 Windows、macOS 和 Linux 平台。测试过的主流平台信息如下:
| 操作系统版本 | 处理器架构 | 获取系统音频输入 | 获取系统音频输出 |
@@ -59,14 +65,15 @@ macOS 平台和 Linux 平台获取系统音频输出需要进行额外设置,
下载软件后,需要根据自己的需求选择对应的模型,然后配置模型。
| | 识别效果 | 部署类型 | 支持语言 | 翻译 | 备注 |
| ------------------------------------------------------------ | -------- | ------------- | ---------- | ---------- | ---------------------------------------------------------- |
| [Gummy](https://help.aliyun.com/zh/model-studio/gummy-speech-recognition-translation) | 很好😊 | 云端 / 阿里云 | 10 种 | 自带翻译 | 收费0.54CNY / 小时 |
| [Vosk](https://alphacephei.com/vosk) | 较差😞 | 本地 / CPU | 超过 30 种 | 需额外配置 | 支持的语言非常多 |
| [SOSV](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html) | 一般😐 | 本地 / CPU | 5 种 | 需额外配置 | 仅有一个模型 |
| 自己开发 | 🤔 | 自定义 | 自定义 | 自定义 | 根据[文档](./docs/engine-manual/zh.md)使用 Python 自己开发 |
| | 准确率 | 实时性 | 部署类型 | 支持语言 | 翻译 | 备注 |
| ------------------------------------------------------------ | -------- | ------------- | ---------- | ---------- | ---------------------------------------------------------- | ---------------------------------------------------------- |
| [Gummy](https://help.aliyun.com/zh/model-studio/gummy-speech-recognition-translation) | 很好😊 | 很好😊 | 云端 / 阿里云 | 10 种 | 自带翻译 | 收费0.54CNY / 小时 |
| [glm-asr-2512](https://docs.bigmodel.cn/cn/guide/models/sound-and-video/glm-asr-2512) | 很好😊 | 较差😞 | 云端 / 智谱 AI | 4 种 | 需额外配置 | 收费,约 0.72CNY / 小时 |
| [Vosk](https://alphacephei.com/vosk) | 较差😞 | 很好😊 | 本地 / CPU | 超过 30 种 | 需额外配置 | 支持的语言非常多 |
| [SOSV](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html) | 一般😐 | 一般😐 | 本地 / CPU | 5 种 | 需额外配置 | 仅有一个模型 |
| 自己开发 | 🤔 | 🤔 | 自定义 | 自定义 | 自定义 | 根据[文档](./docs/engine-manual/zh.md)使用 Python 自己开发 |
如果你选择使用 Vosk 或 SOSV 模型,你还需要配置自己的翻译模型。
如果你选择的不是 Gummy 模型,你还需要配置自己的翻译模型。
### 配置翻译模型
@@ -78,11 +85,22 @@ macOS 平台和 Linux 平台获取系统音频输出需要进行额外设置,
> 注意:使用参数量过大的模型会导致资源消耗和翻译延迟较大。建议使用参数量小于 1B 的模型,比如: `qwen2.5:0.5b`, `qwen3:0.6b`。
使用该模型之前你需要确定本机安装了 [Ollama](https://ollama.com/) 软件,并已经下载了需要的大语言模型。只需要将需要调用的大模型名称添加到设置中的 `Ollama` 字段中
使用该模型之前你需要确定本机安装了 [Ollama](https://ollama.com/) 软件,并已经下载了需要的大语言模型。只需要将需要调用的大模型名称添加到设置中的 `模型名称` 字段中,并保证 `Base URL` 字段为空
#### OpenAI 兼容模型
如果觉得本地 Ollama 模型的翻译效果不佳,或者不想在本地安装 Ollama 模型,那么可以使用云端的 OpenAI 兼容模型。
以下是一些模型提供商的 `Base URL`
- OpenAI: https://api.openai.com/v1
- DeepSeekhttps://api.deepseek.com
- 阿里云https://dashscope.aliyuncs.com/compatible-mode/v1
API Key 需要在对应的模型提供商处获取。
#### Google 翻译 API
> 注意:Google 翻译 API 在部分地区无法使用。
> 注意:Google 翻译 API 在无法访问国际网络的地区无法使用。
无需任何配置,联网即可使用。
@@ -90,11 +108,17 @@ macOS 平台和 Linux 平台获取系统音频输出需要进行额外设置,
> 国际版的阿里云服务似乎并没有提供 Gummy 模型,因此目前非中国用户可能无法使用 Gummy 字幕引擎。
如果要使用默认的 Gummy 字幕引擎(使用云端模型进行语音识别和翻译),首先需要获取阿里云百炼平台的 API KEY,然后将 API KEY 添加到软件设置中或者配置到环境变量中(仅 Windows 平台支持读取环境变量中的 API KEY),这样才能正常使用该模型。相关教程:
如果要使用默认的 Gummy 字幕引擎(使用云端模型进行语音识别和翻译),首先需要获取阿里云百炼平台的 API KEY,然后将 API KEY 添加到软件设置中(在字幕引擎设置的更多设置中)或者配置到环境变量中(仅 Windows 平台支持读取环境变量中的 API KEY),这样才能正常使用该模型。相关教程:
- [获取 API KEY](https://help.aliyun.com/zh/model-studio/get-api-key)
- [将 API Key 配置到环境变量](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables)
### 使用 GLM-ASR 模型
使用前需要获取智谱 AI 平台的 API KEY并添加到软件设置中。
API KEY 获取相关链接:[快速开始](https://docs.bigmodel.cn/cn/guide/start/quick-start)。
### 使用 Vosk 模型
> Vosk 模型的识别效果较差,请谨慎使用。
@@ -132,7 +156,7 @@ python main.py \
## ⚙️ 自带字幕引擎说明
目前软件自带 3 个字幕引擎,正在规划新的引擎。它们的详细信息如下。
目前软件自带 4 个字幕引擎。它们的详细信息如下。
### Gummy 字幕引擎(云端)
@@ -159,6 +183,10 @@ $$
而且引擎只会获取到音频流的时候才会上传数据,因此实际上传速率可能更小。模型结果回传流量消耗较小,没有纳入考虑。
### GLM-ASR 字幕引擎(云端)
https://docs.bigmodel.cn/cn/guide/models/sound-and-video/glm-asr-2512
### Vosk 字幕引擎(本地)
基于 [vosk-api](https://github.com/alphacep/vosk-api) 开发。该字幕引擎的优点是可选的语言模型非常多(超过 30 种),缺点是识别效果比较差,且生成内容没有标点符号。
@@ -168,16 +196,6 @@ $$
[SOSV](https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model) 是一个整合包,该整合包主要基于 [Shepra-ONNX SenseVoice](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html),并添加了端点检测模型和标点恢复模型。该模型支持识别的语言有:英语、中文、日语、韩语、粤语。
### 新规划字幕引擎
以下为备选模型,将根据模型效果和集成难易程度选择。
- [faster-whisper](https://github.com/SYSTRAN/faster-whisper)
- [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx)
- [SenseVoice](https://github.com/FunAudioLLM/SenseVoice)
- [FunASR](https://github.com/modelscope/FunASR)
- [WhisperLiveKit](https://github.com/QuentinFuxa/WhisperLiveKit)
## 🚀 项目运行
![](./assets/media/structure_zh.png)

View File

@@ -3,7 +3,7 @@
<h1 align="center">auto-caption</h1>
<p>Auto Caption is a cross-platform real-time caption display software.</p>
<p>
<a href="https://github.com/HiMeditator/auto-caption/releases"><img src="https://img.shields.io/badge/release-1.0.0-blue"></a>
<a href="https://github.com/HiMeditator/auto-caption/releases"><img src="https://img.shields.io/badge/release-1.1.1-blue"></a>
<a href="https://github.com/HiMeditator/auto-caption/issues"><img src="https://img.shields.io/github/issues/HiMeditator/auto-caption?color=orange"></a>
<img src="https://img.shields.io/github/languages/top/HiMeditator/auto-caption?color=royalblue">
<img src="https://img.shields.io/github/repo-size/HiMeditator/auto-caption?color=green">
@@ -14,7 +14,7 @@
| <b>English</b>
| <a href="./README_ja.md">日本語</a> |
</p>
<p><i>Version 1.0.0 has been released, with the addition of the SOSV local caption model. The current features are basically complete, and there are no further development plans...</i></p>
<p><i>v1.1.1 has been released, adding the GLM-ASR cloud caption model and OpenAI compatible model translation...</i></p>
</div>
![](./assets/media/main_en.png)
@@ -35,18 +35,24 @@ SOSV Model Download: [Shepra-ONNX SenseVoice Model](https://github.com/HiMeditat
[Changelog](./docs/CHANGELOG.md)
## 👁️‍🗨️ Preview
https://github.com/user-attachments/assets/9c188d78-9520-4397-bacf-4c8fdcc54874
## ✨ Features
- Generate captions from audio output or microphone input
- Supports translation by calling local Ollama models or cloud-based Google Translate API
- Supports calling local Ollama models, cloud-based OpenAI-compatible models, or the cloud-based Google Translate API for translation
- Cross-platform (Windows, macOS, Linux) and multi-language interface (Chinese, English, Japanese) support
- Rich caption style settings (font, font size, font weight, font color, background color, etc.)
- Flexible caption engine selection (Alibaba Cloud Gummy cloud model, local Vosk model, local SOSV model, or you can develop your own model)
- Flexible caption engine selection (Alibaba Cloud Gummy cloud model, GLM-ASR cloud model, local Vosk model, local SOSV model, or you can develop your own model)
- Multi-language recognition and translation (see below "⚙️ Built-in Subtitle Engines")
- Subtitle record display and export (supports exporting `.srt` and `.json` formats)
## 📖 Basic Usage
> ⚠️ Note: Currently only the Windows build is kept up to date; the last release for the other platforms remains v1.0.0.
The software has been adapted for Windows, macOS, and Linux platforms. The tested platform information is as follows:
| OS Version | Architecture | System Audio Input | System Audio Output |
@@ -60,14 +66,15 @@ Additional configuration is required to capture system audio output on macOS and
After downloading the software, you need to select the corresponding model according to your needs and then configure the model.
| | Recognition Quality | Deployment Type | Supported Languages | Translation | Notes |
| ------------------------------------------------------------ | ------------------- | ------------------ | ------------------- | ------------- | ---------------------------------------------------------- |
| [Gummy](https://help.aliyun.com/zh/model-studio/gummy-speech-recognition-translation) | Excellent 😊 | Alibaba Cloud | 10 languages | Built-in | Paid, 0.54 CNY/hour |
| [Vosk](https://alphacephei.com/vosk) | Poor 😞 | Local / CPU | Over 30 languages | Requires setup | Supports many languages |
| [SOSV](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html) | Fair 😐 | Local / CPU | 5 languages | Requires setup | Only one model available |
| DIY Development | 🤔 | Custom | Custom | Custom | Develop your own using Python according to [documentation](./docs/engine-manual/zh.md) |
| | Accuracy | Real-time | Deployment Type | Supported Languages | Translation | Notes |
| ------------------------------------------------------------ | -------- | --------- | --------------- | ------------------- | ----------- | ----- |
| [Gummy](https://help.aliyun.com/zh/model-studio/gummy-speech-recognition-translation) | Very good 😊 | Very good 😊 | Cloud / Alibaba Cloud | 10 languages | Built-in translation | Paid, 0.54 CNY/hour |
| [glm-asr-2512](https://docs.bigmodel.cn/cn/guide/models/sound-and-video/glm-asr-2512) | Very good 😊 | Poor 😞 | Cloud / Zhipu AI | 4 languages | Requires additional configuration | Paid, approximately 0.72 CNY/hour |
| [Vosk](https://alphacephei.com/vosk) | Poor 😞 | Very good 😊 | Local / CPU | Over 30 languages | Requires additional configuration | Supports many languages |
| [SOSV](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html) | Average 😐 | Average 😐 | Local / CPU | 5 languages | Requires additional configuration | Only one model |
| Self-developed | 🤔 | 🤔 | Custom | Custom | Custom | Develop your own using Python according to the [documentation](./docs/engine-manual/en.md) |
If you choose to use the Vosk or SOSV model, you also need to configure your own translation model.
If you choose a model other than Gummy, you also need to configure your own translation model.
### Configuring Translation Models
@@ -79,7 +86,18 @@ If you choose to use the Vosk or SOSV model, you also need to configure your own
> Note: Using models with too many parameters will lead to high resource consumption and translation delays. It is recommended to use models with less than 1B parameters, such as: `qwen2.5:0.5b`, `qwen3:0.6b`.
Before using this model, you need to ensure that [Ollama](https://ollama.com/) software is installed on your machine and the required large language model has been downloaded. Simply add the name of the large model you want to call to the `Ollama` field in the settings.
Before using this model, you need to confirm that the [Ollama](https://ollama.com/) software is installed on your local machine and that you have downloaded the required large language model. Simply add the name of the large model you want to call to the `Model Name` field in the settings, and ensure that the `Base URL` field is empty.
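For example, a suitable model can be pulled in advance with the Ollama CLI (the model name below is just one of the suggestions above):

```
ollama pull qwen3:0.6b
```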
#### OpenAI Compatible Model
If the local Ollama model's translation quality is not good enough, or you do not want to install Ollama locally, you can use a cloud-based OpenAI-compatible model instead.
Here are some model provider `Base URL`s:
- OpenAI: https://api.openai.com/v1
- DeepSeek: https://api.deepseek.com
- Alibaba Cloud: https://dashscope.aliyuncs.com/compatible-mode/v1
The API Key needs to be obtained from the corresponding model provider.
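As a rough illustration, a translation request through an OpenAI-compatible endpoint looks like the sketch below. It assumes the `openai` Python package; the Base URL, model name, and API Key are placeholders to be replaced with your provider's values.

```python
# Sketch of an OpenAI-compatible chat-completions call used for translation.
# Base URL, model name, and API key below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")
response = client.chat.completions.create(
    model="deepseek-chat",  # example model name; use one offered by your provider
    messages=[
        {"role": "system", "content": "Translate the following content into Chinese, and do not output any additional information."},
        {"role": "user", "content": "Real-time captions make meetings easier to follow."},
    ],
)
print(response.choices[0].message.content)
```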
#### Google Translate API
@@ -96,6 +114,12 @@ To use the default Gummy caption engine (using cloud models for speech recogniti
- [Get API KEY](https://help.aliyun.com/zh/model-studio/get-api-key)
- [Configure API Key through Environment Variables](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables)
### Using the GLM-ASR Model
Before using it, you need to obtain an API KEY from the Zhipu AI platform and add it to the software settings.
For API KEY acquisition, see: [Quick Start](https://docs.bigmodel.cn/en/guide/start/quick-start).
### Using Vosk Model
> The recognition effect of the Vosk model is poor, please use it with caution.
@@ -133,7 +157,7 @@ python main.py \
## ⚙️ Built-in Subtitle Engines
Currently, the software comes with 3 caption engines, with new engines under development. Their detailed information is as follows.
Currently, the software comes with 4 caption engines. Their details are as follows.
### Gummy Subtitle Engine (Cloud)
@@ -160,6 +184,10 @@ $$
The engine only uploads data when receiving audio streams, so the actual upload rate may be lower. The return traffic consumption of model results is small and not considered here.
### GLM-ASR Subtitle Engine (Cloud)
https://docs.bigmodel.cn/en/guide/models/sound-and-video/glm-asr-2512
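For reference, the built-in engine essentially posts short WAV segments to this endpoint. A minimal sketch of such a request (the audio file path is a placeholder; the endpoint and model name are the engine defaults) might look like this:

```python
# Minimal sketch of a GLM-ASR transcription request, mirroring what the built-in engine sends.
# The audio path is a placeholder; the endpoint and model name are the engine defaults.
import requests

url = "https://open.bigmodel.cn/api/paas/v4/audio/transcriptions"
headers = {"Authorization": "Bearer YOUR_GLM_API_KEY"}
with open("sample.wav", "rb") as f:
    files = {"file": ("audio.wav", f, "audio/wav")}
    data = {"model": "glm-asr-2512", "stream": "false"}
    response = requests.post(url, headers=headers, data=data, files=files, timeout=15)
print(response.json().get("text", ""))
```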
### Vosk Subtitle Engine (Local)
Developed based on [vosk-api](https://github.com/alphacep/vosk-api). The advantage of this caption engine is that there are many optional language models (over 30 languages), but the disadvantage is that the recognition effect is relatively poor, and the generated content has no punctuation.
@@ -168,16 +196,6 @@ Developed based on [vosk-api](https://github.com/alphacep/vosk-api). The advanta
[SOSV](https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model) is an integrated package, mainly based on [Shepra-ONNX SenseVoice](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html), with added endpoint detection model and punctuation restoration model. The languages supported by this model for recognition are: English, Chinese, Japanese, Korean, and Cantonese.
### Planned New Subtitle Engines
The following are candidate models that will be selected based on model performance and ease of integration.
- [faster-whisper](https://github.com/SYSTRAN/faster-whisper)
- [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx)
- [SenseVoice](https://github.com/FunAudioLLM/SenseVoice)
- [FunASR](https://github.com/modelscope/FunASR)
- [WhisperLiveKit](https://github.com/QuentinFuxa/WhisperLiveKit)
## 🚀 Project Setup
![](./assets/media/structure_en.png)

View File

@@ -3,7 +3,7 @@
<h1 align="center">auto-caption</h1>
<p>Auto Caption はクロスプラットフォームのリアルタイム字幕表示ソフトウェアです。</p>
<p>
<a href="https://github.com/HiMeditator/auto-caption/releases"><img src="https://img.shields.io/badge/release-1.0.0-blue"></a>
<a href="https://github.com/HiMeditator/auto-caption/releases"><img src="https://img.shields.io/badge/release-1.1.1-blue"></a>
<a href="https://github.com/HiMeditator/auto-caption/issues"><img src="https://img.shields.io/github/issues/HiMeditator/auto-caption?color=orange"></a>
<img src="https://img.shields.io/github/languages/top/HiMeditator/auto-caption?color=royalblue">
<img src="https://img.shields.io/github/repo-size/HiMeditator/auto-caption?color=green">
@@ -14,7 +14,7 @@
| <a href="./README_en.md">English</a>
| <b>日本語</b> |
</p>
<p><i>v1.0.0 バージョンがリリースされ、SOSV ローカル字幕モデルが追加されました。現在の機能は基本的に完了しており、今後の開発計画はありません...</i></p>
<p><i>v1.1.1 バージョンがリリースされました。GLM-ASR クラウド字幕モデルと OpenAI 互換モデル翻訳が追加されました...</i></p>
</div>
![](./assets/media/main_ja.png)
@@ -35,18 +35,24 @@ SOSV モデルダウンロード: [Shepra-ONNX SenseVoice Model](https://github.
[更新履歴](./docs/CHANGELOG.md)
## 👁️‍🗨️ プレビュー
https://github.com/user-attachments/assets/9c188d78-9520-4397-bacf-4c8fdcc54874
## ✨ 特徴
- 音声出力またはマイク入力からの字幕生成
- ローカルのOllamaモデルまたはクラウドベースのGoogle翻訳APIを呼び出して翻訳をサポート
- ローカルのOllamaモデル、クラウド上のOpenAI互換モデル、またはクラウドのGoogle翻訳APIを呼び出して翻訳を行うことをサポートしています
- クロスプラットフォーム(Windows、macOS、Linux)、多言語インターフェース(中国語、英語、日本語)対応
- 豊富な字幕スタイル設定(フォント、フォントサイズ、フォント太さ、フォント色、背景色など)
- 柔軟な字幕エンジン選択阿里云Gummyクラウドモデル、ローカルVoskモデル、ローカルSOSVモデル、または独自にモデルを開発可能
- 柔軟な字幕エンジン選択阿里云Gummyクラウドモデル、GLM-ASRクラウドモデル、ローカルVoskモデル、ローカルSOSVモデル、または独自にモデルを開発可能
- 多言語認識と翻訳(下記「⚙️ 字幕エンジン説明」参照)
- 字幕記録表示とエクスポート(`.srt` および `.json` 形式のエクスポートに対応)
## 📖 基本使い方
> ⚠️ 注意現在、Windowsプラットフォームのソフトウェアの最新バージョンのみがメンテナンスされており、他のプラットフォームの最終バージョンはv1.0.0のままです。
このソフトウェアは Windows、macOS、Linux プラットフォームに対応しています。テスト済みのプラットフォーム情報は以下の通りです:
| OS バージョン | アーキテクチャ | システムオーディオ入力 | システムオーディオ出力 |
@@ -61,14 +67,15 @@ macOS および Linux プラットフォームでシステムオーディオ出
ソフトウェアをダウンロードした後、自分のニーズに応じて対応するモデルを選択し、モデルを設定する必要があります。
| | 認識効果 | デプロイタイプ | 対応言語 | 翻訳 | 備考 |
| ------------------------------------------------------------ | -------- | ----------------- | ---------- | ---------- | ---------------------------------------------------------- |
| [Gummy](https://help.aliyun.com/zh/model-studio/gummy-speech-recognition-translation) | 良好😊 | クラウド / 阿里云 | 10種 | 内蔵翻訳 | 有料、0.54CNY / 時間 |
| [Vosk](https://alphacephei.com/vosk) | 不良😞 | ローカル / CPU | 30種以上 | 追加設定必要 | 対応言語が非常に多い |
| [SOSV](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html) | 一般😐 | ローカル / CPU | 5種 | 追加設定必要 | モデルは一つのみ |
| 自前開発 | 🤔 | カスタム | カスタム | カスタム | [ドキュメント](./docs/engine-manual/zh.md)に従ってPythonで自前開発 |
| | 正確性 | 実時間性 | デプロイタイプ | 対応言語 | 翻訳 | 備考 |
| ------------------------------------------------------------ | -------- | --------- | -------------- | -------- | ---- | ---- |
| [Gummy](https://help.aliyun.com/zh/model-studio/gummy-speech-recognition-translation) | とても良い😊 | とても良い😊 | クラウド / アリババクラウド | 10言語 | 内蔵翻訳 | 有料、0.54元/時間 |
| [glm-asr-2512](https://docs.bigmodel.cn/cn/guide/models/sound-and-video/glm-asr-2512) | とても良い😊 | 悪い😞 | クラウド / Zhipu AI | 4言語 | 追加設定必要 | 有料、約0.72元/時間 |
| [Vosk](https://alphacephei.com/vosk) | 悪い😞 | とても良い😊 | ローカル / CPU | 30言語以上 | 追加設定必要 | 多くの言語に対応 |
| [SOSV](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html) | 普通😐 | 普通😐 | ローカル / CPU | 5言語 | 追加設定が必要 | 1つのモデルのみ |
| 自分で開発 | 🤔 | 🤔 | カスタム | カスタム | カスタム | [ドキュメント](./docs/engine-manual/ja.md)に従ってPythonを使用して自分で開発 |
VoskまたはSOSVモデルを使用する場合、独自の翻訳モデル設定する必要があります。
Gummyモデル以外を選択した場合、独自の翻訳モデル設定する必要があります。
### 翻訳モデルの設定
@@ -80,7 +87,18 @@ VoskまたはSOSVモデルを使用する場合、独自の翻訳モデルも設
> 注意パラメータ数が多すぎるモデルを使用すると、リソース消費と翻訳遅延が大きくなります。1B未満のパラメータ数のモデルを使用することを推奨します。例`qwen2.5:0.5b`、`qwen3:0.6b`。
このモデルを使用する前に、ローカルマシンに[Ollama](https://ollama.com/)ソフトウェアがインストールされ、必要な大規模言語モデルダウンロードされていることを確認してください。必要な大規模モデル名を設定の`Ollama`フィールドに追加するだけでOKです
このモデルを使用する前に、ローカルマシンに[Ollama](https://ollama.com/)ソフトウェアがインストールされており、必要な大規模言語モデルダウンロード済みであることを確認してください。設定で呼び出す必要がある大規模モデル名を「モデル名」フィールドに入力し、「Base URL」フィールドが空であることを確認してください
#### OpenAI互換モデル
ローカルのOllamaモデルの翻訳効果が良くないと感じる場合や、ローカルにOllamaモデルをインストールしたくない場合は、クラウド上のOpenAI互換モデルを使用できます。
いくつかのモデルプロバイダの「Base URL」
- OpenAI: https://api.openai.com/v1
- DeepSeek: https://api.deepseek.com
- アリババクラウド: https://dashscope.aliyuncs.com/compatible-mode/v1
API Keyは対応するモデルプロバイダから取得する必要があります。
#### Google翻訳API
@@ -97,6 +115,12 @@ VoskまたはSOSVモデルを使用する場合、独自の翻訳モデルも設
- [API KEYの取得](https://help.aliyun.com/zh/model-studio/get-api-key)
- [環境変数へのAPI Keyの設定](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables)
### GLM-ASR モデルの使用
使用前に、Zhipu AI プラットフォームから API キーを取得し、それをソフトウェアの設定に追加する必要があります。
API キーの取得についてはこちらをご覧ください:[クイックスタート](https://docs.bigmodel.cn/ja/guide/start/quick-start)。
### Voskモデルの使用
> Voskモデルの認識効果は不良のため、注意して使用してください。
@@ -134,7 +158,7 @@ python main.py \
## ⚙️ 字幕エンジン説明
現在、ソフトウェアには3つの字幕エンジンが搭載されており、新しいエンジンが計画されています。それらの詳細情報は以下の通りです。
現在、ソフトウェアには4つの字幕エンジンが搭載されています。それらの詳細情報は以下の通りです。
### Gummy 字幕エンジン(クラウド)
@@ -161,6 +185,10 @@ $$
また、エンジンはオーディオストームを取得したときのみデータをアップロードするため、実際のアップロードレートはさらに小さくなる可能性があります。モデル結果の返信トラフィック消費量は小さく、ここでは考慮していません。
### GLM-ASR 字幕エンジン(クラウド)
https://docs.bigmodel.cn/ja/guide/models/sound-and-video/glm-asr-2512
### Vosk字幕エンジンローカル
[vosk-api](https://github.com/alphacep/vosk-api)をベースに開発。この字幕エンジンの利点は選択可能な言語モデルが非常に多く30言語以上、欠点は認識効果が比較的悪く、生成内容に句読点がないことです。
@@ -169,16 +197,6 @@ $$
[SOSV](https://github.com/HiMeditator/auto-caption/releases/tag/sosv-model)は統合パッケージで、主に[Shepra-ONNX SenseVoice](https://k2-fsa.github.io/sherpa/onnx/sense-voice/index.html)をベースにし、エンドポイント検出モデルと句読点復元モデルを追加しています。このモデルが認識をサポートする言語は:英語、中国語、日本語、韓国語、広東語です。
### 新規計画字幕エンジン
以下は候補モデルであり、モデルの性能と統合の容易さに基づいて選択されます。
- [faster-whisper](https://github.com/SYSTRAN/faster-whisper)
- [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx)
- [SenseVoice](https://github.com/FunAudioLLM/SenseVoice)
- [FunASR](https://github.com/modelscope/FunASR)
- [WhisperLiveKit](https://github.com/QuentinFuxa/WhisperLiveKit)
## 🚀 プロジェクト実行
![](./assets/media/structure_ja.png)

9 binary image files changed (previews not shown); sizes before → after: 68→52 KiB, 69→52 KiB, 72→54 KiB, 60→79 KiB, 62→82 KiB, 81→87 KiB, 404→476 KiB, 417→488 KiB, 417→486 KiB.

View File

@@ -8,5 +8,9 @@
<true/>
<key>com.apple.security.cs.allow-dyld-environment-variables</key>
<true/>
<key>com.apple.security.cs.disable-library-validation</key>
<true/>
<key>com.apple.security.device.audio-input</key>
<true/>
</dict>
</plist>
</plist>

View File

@@ -172,4 +172,19 @@
- 优化部分提示信息显示位置
- 替换重采样模型,提高音频重采样质量
- 带有额外信息的标签颜色改为与主题色一致
- 带有额外信息的标签颜色改为与主题色一致
## v1.1.0
### 新增功能
- 添加基于 GLM-ASR 的字幕引擎
- 添加 OpenAI API 兼容模型作为新的翻译模型
## v1.1.1
### 优化体验
- 取消字幕窗口的顶置选项,字幕窗口将始终处于顶置状态
- 将字幕窗口顶置选项改为鼠标穿透选项,当图钉图标为实心时,表示启用鼠标穿透

View File

@@ -23,17 +23,8 @@
- [x] 前端页面添加日志内容展示 *2025/08/19*
- [x] 添加 Ollama 模型用于本地字幕引擎的翻译 *2025/09/04*
- [x] 验证 / 添加基于 sherpa-onnx 的字幕引擎 *2025/09/06*
- [x] 添加 GLM-ASR 模型 *2026/01/10*
## 待完成
## TODO
- [ ] 调研更多的云端模型火山、OpenAI、Google等
- [ ] 验证 / 添加基于 sherpa-onnx 的字幕引擎
## 后续计划
- [ ] 验证 / 添加基于 FunASR 的字幕引擎
- [ ] 减小软件不必要的体积
## 遥远的未来
- [ ] 使用 Tauri 框架重新开发
暂无

View File

@@ -202,9 +202,9 @@
**数据类型:** `number`
### `caption.pin.set`
### `caption.mouseEvents.ignore`
**介绍:** 是否将窗口置顶
**介绍:** 是否设置鼠标穿透
**发起方:** 前端字幕窗口

Binary image file changed (preview not shown); size 118 KiB → 148 KiB.

View File

@@ -1,6 +1,6 @@
# Auto Caption User Manual
Corresponding Version: v1.0.0
Corresponding Version: v1.1.1
**Note: Due to limited personal resources, the English and Japanese documentation files for this project (except for the README document) will no longer be maintained. The content of this document may not be consistent with the latest version of the project. If you are willing to help with translation, please submit relevant Pull Requests.**
@@ -41,6 +41,11 @@ Alibaba Cloud provides detailed tutorials for this part, which can be referenced
- [Obtaining API KEY (Chinese)](https://help.aliyun.com/zh/model-studio/get-api-key)
- [Configuring API Key through Environment Variables (Chinese)](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables)
## Preparation for GLM Engine
You need to obtain an API KEY first; see [Quick Start](https://docs.bigmodel.cn/en/guide/start/quick-start).
## Preparation for Using Vosk Engine
To use the Vosk local caption engine, first download your required model from the [Vosk Models](https://alphacephei.com/vosk/models) page. Then extract the downloaded model package locally and add the corresponding model folder path to the software settings.
@@ -111,7 +116,7 @@ After completing all configurations, click the "Start Caption Engine" button on
### Adjusting the Caption Display Window
The following image shows the caption display window, which displays the latest captions in real-time. The three buttons in the upper right corner of the window have the following functions: pin the window to the front, open the caption control window, and close the caption display window. The width of the window can be adjusted by moving the mouse to the left or right edge of the window and dragging the mouse.
The following image shows the caption display window, which displays the latest captions in real-time. The functions of the three buttons in the upper right corner of the window are: to close the caption display window, to open the caption control window, and to enable mouse pass-through. The width of the window can be adjusted by moving the mouse to the left or right edge of the window and dragging the mouse.
![](../img/01.png)
@@ -147,7 +152,7 @@ The following parameter descriptions only include necessary parameters.
#### `-e , --caption_engine`
The caption engine model to select, currently three options are available: `gummy, vosk, sosv`.
The caption engine model to select; currently four options are available: `gummy, glm, vosk, sosv`.
The default value is `gummy`.
@@ -199,10 +204,12 @@ Source language for recognition. Default value is `auto`, meaning no specific so
Specifying the source language can improve recognition accuracy to some extent. You can specify the source language using the language codes above.
This only applies to Gummy and SOSV models.
This applies to Gummy, GLM and SOSV models.
The Gummy model can use all the languages mentioned above, plus Cantonese (`yue`).
The GLM model supports specifying the following languages: English, Chinese, Japanese, Korean.
The SOSV model supports specifying the following languages: English, Chinese, Japanese, Korean, and Cantonese.
#### `-k, --api_key`
@@ -213,6 +220,18 @@ Default value is empty.
This only applies to the Gummy model.
#### `-gkey, --glm_api_key`
Specifies the API KEY required for the `glm` model. The default value is empty.
#### `-gmodel, --glm_model`
Specifies the model name to be used for the `glm` model. The default value is `glm-asr-2512`.
#### `-gurl, --glm_url`
Specifies the API URL required for the `glm` model. The default value is: `https://open.bigmodel.cn/api/paas/v4/audio/transcriptions`.
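For example, a GLM engine run started from the command line might look like the following (argument values are illustrative):

```
python main.py \
    -e glm \
    -a 0 \
    -t zh \
    -gkey YOUR_GLM_API_KEY \
    -gmodel glm-asr-2512 \
    -tm ollama \
    -omn qwen3:0.6b
```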
#### `-tm, --translation_model`
Specify the translation method for Vosk and SOSV models. Default is `ollama`.
@@ -226,13 +245,23 @@ This only applies to Vosk and SOSV models.
#### `-omn, --ollama_name`
Specify the Ollama model to call for translation. Default value is empty.
Specifies the name of the translation model to be used, which can be either a local Ollama model or a cloud model compatible with the OpenAI API. If the Base URL field is not filled in, the local Ollama service will be called by default; otherwise, the API service at the specified address will be invoked via the Python OpenAI library.
It's recommended to use models with less than 1B parameters, such as: `qwen2.5:0.5b`, `qwen3:0.6b`.
If using an Ollama model, it is recommended to use a model with fewer than 1B parameters, such as `qwen2.5:0.5b` or `qwen3:0.6b`. The corresponding model must be downloaded in Ollama for normal use.
Users need to download the corresponding model in Ollama to use it properly.
The default value is empty and applies to models other than Gummy.
This only applies to Vosk and SOSV models.
#### `-ourl, --ollama_url`
The base request URL for calling the OpenAI API. If left blank, the local Ollama model on the default port will be called.
The default value is empty and applies to models other than Gummy.
#### `-okey, --ollama_api_key`
Specifies the API KEY for calling OpenAI-compatible models.
The default value is empty and applies to models other than Gummy.
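For example, a SOSV run that translates through a cloud OpenAI-compatible service instead of local Ollama might look like this (the model path, model name, and key are placeholders):

```
python main.py \
    -e sosv \
    -sosv /path/to/sosv/model \
    -t en \
    -tm ollama \
    -omn deepseek-chat \
    -ourl https://api.deepseek.com \
    -okey YOUR_API_KEY
```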
#### `-vosk, --vosk_model`

View File

@@ -1,6 +1,6 @@
# Auto Caption ユーザーマニュアル
対応バージョンv1.0.0
対応バージョンv1.1.1
この文書は大規模モデルを使用して翻訳されていますので、内容に正確でない部分があるかもしれません。
@@ -41,6 +41,10 @@ macOS プラットフォームでオーディオ出力を取得するには追
- [API KEY の取得(中国語)](https://help.aliyun.com/zh/model-studio/get-api-key)
- [環境変数を通じて API Key を設定(中国語)](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables)
## GLM エンジン使用前の準備
まずAPI KEYを取得する必要があります。参考[クイックスタート](https://docs.bigmodel.cn/en/guide/start/quick-start)。
## Voskエンジン使用前の準備
Voskローカル字幕エンジンを使用するには、まず[Vosk Models](https://alphacephei.com/vosk/models)ページから必要なモデルをダウンロードしてください。その後、ダウンロードしたモデルパッケージをローカルに解凍し、対応するモデルフォルダのパスをソフトウェア設定に追加します。
@@ -112,7 +116,7 @@ sudo yum install pulseaudio pavucontrol
### 字幕表示ウィンドウの調整
下の図は字幕表示ウィンドウです。このウィンドウは現在の最新の字幕をリアルタイムで表示します。ウィンドウの右上にある3つのボタンの機能はそれぞれ次の通りです:ウィンドウを最前面に固定する、字幕制御ウィンドウを開く、字幕表示ウィンドウを閉じる。このウィンドウの幅は調整可能です。マウスをウィンドウの左右の端に移動し、ドラッグして幅を調整します。
下の図は字幕表示ウィンドウです。このウィンドウは現在の最新の字幕をリアルタイムで表示します。ウィンドウの右上にある3つのボタンの機能はそれぞれ字幕表示ウィンドウを閉じる、字幕制御ウィンドウを開く、マウス透過を有効化することです。このウィンドウの幅は調整可能です。マウスをウィンドウの左右の端に移動し、ドラッグして幅を調整します。
![](../img/01.png)

View File

@@ -1,6 +1,6 @@
# Auto Caption 用户手册
对应版本v1.0.0
对应版本v1.1.1
## 软件简介
@@ -39,6 +39,10 @@ Auto Caption 是一个跨平台的字幕显示软件,能够实时获取系统
- [获取 API KEY](https://help.aliyun.com/zh/model-studio/get-api-key)
- [将 API Key 配置到环境变量](https://help.aliyun.com/zh/model-studio/configure-api-key-through-environment-variables)
## GLM 引擎使用前准备
需要先获取 API KEY参考[Quick Start](https://docs.bigmodel.cn/en/guide/start/quick-start)。
## Vosk 引擎使用前准备
如果要使用 Vosk 本地字幕引擎,首先需要在 [Vosk Models](https://alphacephei.com/vosk/models) 页面下载你需要的模型。然后将下载的模型安装包解压到本地,并将对应的模型文件夹的路径添加到软件的设置中。
@@ -109,7 +113,7 @@ sudo yum install pulseaudio pavucontrol
### 调整字幕展示窗口
如下图为字幕展示窗口,该窗口实时展示当前最新字幕。窗口右上角三个按钮的功能分别是:将窗口固定在最前面、打开字幕控制窗口、关闭字幕展示窗口。该窗口宽度可以调整,将鼠标移动至窗口的左右边缘,拖动鼠标即可调整宽度。
如下图为字幕展示窗口,该窗口实时展示当前最新字幕。窗口右上角三个按钮的功能分别是:关闭字幕展示窗口、打开字幕控制窗口、启用鼠标穿透。该窗口宽度可以调整,将鼠标移动至窗口的左右边缘,拖动鼠标即可调整宽度。
![](../img/01.png)
@@ -145,7 +149,7 @@ sudo yum install pulseaudio pavucontrol
#### `-e , --caption_engine`
需要选择的字幕引擎模型,目前有三个可用,分别为:`gummy, vosk, sosv`
需要选择的字幕引擎模型,目前有四个可用,分别为:`gummy, glm, vosk, sosv`
该项的默认值为 `gummy`
@@ -197,11 +201,13 @@ sudo yum install pulseaudio pavucontrol
但是指定源语言能在一定程度上提高识别准确率,可用使用上面的语言代码指定源语言。
该项适用于 Gummy 和 SOSV 模型。
该项适用于 Gummy、GLM 和 SOSV 模型。
其中 Gummy 模型可用使用上述全部的语言,在加上粤语(`yue`)。
而 SOSV 模型支持指定的语言有:英语、中文、日语、韩语、粤语
GLM 模型支持指定的语言有:英语、中文、日语、韩语。
SOSV 模型支持指定的语言有:英语、中文、日语、韩语、粤语。
#### `-k, --api_key`
@@ -211,6 +217,18 @@ sudo yum install pulseaudio pavucontrol
该项仅适用于 Gummy 模型。
#### `-gkey, --glm_api_key`
指定 `glm` 模型需要使用的 API KEY默认为空。
#### `-gmodel, --glm_model`
指定 `glm` 模型需要使用的模型名称,默认为 `glm-asr-2512`
#### `-gurl, --glm_url`
指定 `glm` 模型需要使用的 API URL默认值为`https://open.bigmodel.cn/api/paas/v4/audio/transcriptions`
#### `-tm, --translation_model`
指定 Vosk 和 SOSV 模型的翻译方式,默认为 `ollama`
@@ -224,13 +242,23 @@ sudo yum install pulseaudio pavucontrol
#### `-omn, --ollama_name`
指定需要调用进行翻译的 Ollama 模型。该项默认值为空
指定要使用的翻译模型名称,可以是 Ollama 本地模型,也可以是 OpenAI API 兼容的云端模型。若未填写 Base URL 字段,则默认调用本地 Ollama 服务,否则会通过 Python OpenAI 库调用该地址指向的 API 服务
建议使用参数量小于 1B 的模型,比如: `qwen2.5:0.5b`, `qwen3:0.6b`
如果使用 Ollama 模型,建议使用参数量小于 1B 的模型,比如: `qwen2.5:0.5b`, `qwen3:0.6b`需要在 Ollama 中下载了对应的模型才能正常使用。
用户需要在 Ollama 中下载了对应的模型才能正常使用
默认值为空,适用于除了 Gummy 外的其他模型
该项仅适用于 Vosk 和 SOSV 模型。
#### `-ourl, --ollama_url`
调用 OpenAI API 的基础请求地址,如果不填写则调用本地默认端口的 Ollama 模型。
默认值为空,适用于除了 Gummy 外的其他模型。
#### `-okey, --ollama_api_key`
指定调用 OpenAI 兼容模型的 API KEY。
默认值为空,适用于除了 Gummy 外的其他模型。
#### `-vosk, --vosk_model`

View File

@@ -1,3 +1,4 @@
from .gummy import GummyRecognizer
from .vosk import VoskRecognizer
from .sosv import SosvRecognizer
from .sosv import SosvRecognizer
from .glm import GlmRecognizer

163
engine/audio2text/glm.py Normal file
View File

@@ -0,0 +1,163 @@
import threading
import io
import wave
import struct
import math
import audioop
import requests
from datetime import datetime

from utils import shared_data
from utils import stdout_cmd, stdout_obj, google_translate, ollama_translate


class GlmRecognizer:
    """
    Process audio data with the GLM-ASR engine and print JSON strings that the
    Auto Caption app can read to standard output.

    Init parameters:
        url: GLM-ASR API URL
        model: GLM-ASR model name
        api_key: GLM-ASR API Key
        source: source language
        target: target language
        trans_model: translation model name
        ollama_name: Ollama model name
    """
    def __init__(self, url: str, model: str, api_key: str, source: str, target: str | None, trans_model: str, ollama_name: str, ollama_url: str = '', ollama_api_key: str = ''):
        self.url = url
        self.model = model
        self.api_key = api_key
        self.source = source
        self.target = target
        if trans_model == 'google':
            self.trans_func = google_translate
        else:
            self.trans_func = ollama_translate
        self.ollama_name = ollama_name
        self.ollama_url = ollama_url
        self.ollama_api_key = ollama_api_key
        self.audio_buffer = []
        self.is_speech = False
        self.silence_frames = 0
        self.speech_start_time = None
        self.time_str = ''
        self.cur_id = 0
        # VAD settings (assumes 16 kHz, 16-bit audio with chunk sizes around 1024 samples)
        # 16-bit = 2 bytes per sample.
        # The RMS threshold needs tuning; 500 is a conservative guess for silence.
        self.threshold = 500
        self.silence_limit = 15  # frames (approx. 0.5-1 s depending on chunk size)
        self.min_speech_frames = 10  # frames

    def start(self):
        """Start the GLM engine"""
        stdout_cmd('info', 'GLM-ASR recognizer started.')

    def stop(self):
        """Stop the GLM engine"""
        stdout_cmd('info', 'GLM-ASR recognizer stopped.')

    def process_audio(self, chunk):
        # chunk is bytes (int16)
        # Simple energy-based VAD: buffer chunks while the RMS stays above the
        # threshold, then transcribe the utterance after enough trailing silence.
        rms = audioop.rms(chunk, 2)
        if rms > self.threshold:
            if not self.is_speech:
                self.is_speech = True
                self.time_str = datetime.now().strftime('%H:%M:%S.%f')[:-3]
                self.audio_buffer = []
            self.audio_buffer.append(chunk)
            self.silence_frames = 0
        else:
            if self.is_speech:
                self.audio_buffer.append(chunk)
                self.silence_frames += 1
                if self.silence_frames > self.silence_limit:
                    # Speech ended
                    if len(self.audio_buffer) > self.min_speech_frames:
                        self.recognize(self.audio_buffer, self.time_str)
                    self.is_speech = False
                    self.audio_buffer = []
                    self.silence_frames = 0

    def recognize(self, audio_frames, time_s):
        # Pack the buffered PCM frames into an in-memory WAV file and send it
        # to the GLM-ASR API in a background thread.
        audio_bytes = b''.join(audio_frames)
        wav_io = io.BytesIO()
        with wave.open(wav_io, 'wb') as wav_file:
            wav_file.setnchannels(1)
            wav_file.setsampwidth(2)
            wav_file.setframerate(16000)
            wav_file.writeframes(audio_bytes)
        wav_io.seek(0)
        threading.Thread(
            target=self._do_request,
            args=(wav_io.read(), time_s, self.cur_id)
        ).start()
        self.cur_id += 1

    def _do_request(self, audio_content, time_s, index):
        try:
            files = {
                'file': ('audio.wav', audio_content, 'audio/wav')
            }
            data = {
                'model': self.model,
                'stream': 'false'
            }
            headers = {
                'Authorization': f'Bearer {self.api_key}'
            }
            response = requests.post(self.url, headers=headers, data=data, files=files, timeout=15)
            if response.status_code == 200:
                res_json = response.json()
                text = res_json.get('text', '')
                if text:
                    self.output_caption(text, time_s, index)
            else:
                try:
                    err_msg = response.json()
                    stdout_cmd('error', f"GLM API Error: {err_msg}")
                except:
                    stdout_cmd('error', f"GLM API Error: {response.text}")
        except Exception as e:
            stdout_cmd('error', f"GLM Request Failed: {str(e)}")

    def output_caption(self, text, time_s, index):
        caption = {
            'command': 'caption',
            'index': index,
            'time_s': time_s,
            'time_t': datetime.now().strftime('%H:%M:%S.%f')[:-3],
            'text': text,
            'translation': ''
        }
        if self.target:
            if self.trans_func == ollama_translate:
                th = threading.Thread(
                    target=self.trans_func,
                    args=(self.ollama_name, self.target, caption['text'], time_s, self.ollama_url, self.ollama_api_key),
                    daemon=True
                )
            else:
                th = threading.Thread(
                    target=self.trans_func,
                    args=(self.ollama_name, self.target, caption['text'], time_s),
                    daemon=True
                )
            th.start()
        stdout_obj(caption)

    def translate(self):
        # Pull audio chunks from the shared queue while the engine is running.
        global shared_data
        while shared_data.status == 'running':
            chunk = shared_data.chunk_queue.get()
            self.process_audio(chunk)

View File

@@ -29,7 +29,7 @@ class SosvRecognizer:
trans_model: 翻译模型名称
ollama_name: Ollama 模型名称
"""
def __init__(self, model_path: str, source: str, target: str | None, trans_model: str, ollama_name: str):
def __init__(self, model_path: str, source: str, target: str | None, trans_model: str, ollama_name: str, ollama_url: str = '', ollama_api_key: str = ''):
if model_path.startswith('"'):
model_path = model_path[1:]
if model_path.endswith('"'):
@@ -45,6 +45,8 @@ class SosvRecognizer:
else:
self.trans_func = ollama_translate
self.ollama_name = ollama_name
self.ollama_url = ollama_url
self.ollama_api_key = ollama_api_key
self.time_str = ''
self.cur_id = 0
self.prev_content = ''
@@ -152,7 +154,7 @@ class SosvRecognizer:
if self.target:
th = threading.Thread(
target=self.trans_func,
args=(self.ollama_name, self.target, caption['text'], self.time_str),
args=(self.ollama_name, self.target, caption['text'], self.time_str, self.ollama_url, self.ollama_api_key),
daemon=True
)
th.start()

View File

@@ -18,7 +18,7 @@ class VoskRecognizer:
trans_model: 翻译模型名称
ollama_name: Ollama 模型名称
"""
def __init__(self, model_path: str, target: str | None, trans_model: str, ollama_name: str):
def __init__(self, model_path: str, target: str | None, trans_model: str, ollama_name: str, ollama_url: str = '', ollama_api_key: str = ''):
SetLogLevel(-1)
if model_path.startswith('"'):
model_path = model_path[1:]
@@ -31,6 +31,8 @@ class VoskRecognizer:
else:
self.trans_func = ollama_translate
self.ollama_name = ollama_name
self.ollama_url = ollama_url
self.ollama_api_key = ollama_api_key
self.time_str = ''
self.cur_id = 0
self.prev_content = ''
@@ -66,7 +68,7 @@ class VoskRecognizer:
if self.target:
th = threading.Thread(
target=self.trans_func,
args=(self.ollama_name, self.target, caption['text'], self.time_str),
args=(self.ollama_name, self.target, caption['text'], self.time_str, self.ollama_url, self.ollama_api_key),
daemon=True
)
th.start()

View File

@@ -8,6 +8,7 @@ from utils import merge_chunk_channels, resample_chunk_mono
from audio2text import GummyRecognizer
from audio2text import VoskRecognizer
from audio2text import SosvRecognizer
from audio2text import GlmRecognizer
from sysaudio import AudioStream
@@ -74,7 +75,7 @@ def main_gummy(s: str, t: str, a: int, c: int, k: str, r: bool, rp: str):
engine.stop()
def main_vosk(a: int, c: int, vosk: str, t: str, tm: str, omn: str, r: bool, rp: str):
def main_vosk(a: int, c: int, vosk: str, t: str, tm: str, omn: str, ourl: str, okey: str, r: bool, rp: str):
"""
Parameters:
a: Audio source: 0 for output, 1 for input
@@ -83,14 +84,16 @@ def main_vosk(a: int, c: int, vosk: str, t: str, tm: str, omn: str, r: bool, rp:
t: Target language
tm: Translation model type, ollama or google
omn: Ollama model name
ourl: Ollama Base URL
okey: Ollama API Key
r: Whether to record the audio
rp: Path to save the recorded audio
"""
stream = AudioStream(a, c)
if t == 'none':
engine = VoskRecognizer(vosk, None, tm, omn)
engine = VoskRecognizer(vosk, None, tm, omn, ourl, okey)
else:
engine = VoskRecognizer(vosk, t, tm, omn)
engine = VoskRecognizer(vosk, t, tm, omn, ourl, okey)
engine.start()
stream_thread = threading.Thread(
@@ -106,7 +109,7 @@ def main_vosk(a: int, c: int, vosk: str, t: str, tm: str, omn: str, r: bool, rp:
engine.stop()
def main_sosv(a: int, c: int, sosv: str, s: str, t: str, tm: str, omn: str, r: bool, rp: str):
def main_sosv(a: int, c: int, sosv: str, s: str, t: str, tm: str, omn: str, ourl: str, okey: str, r: bool, rp: str):
"""
Parameters:
a: Audio source: 0 for output, 1 for input
@@ -116,14 +119,16 @@ def main_sosv(a: int, c: int, sosv: str, s: str, t: str, tm: str, omn: str, r: b
t: Target language
tm: Translation model type, ollama or google
omn: Ollama model name
ourl: Ollama API URL
okey: Ollama API Key
r: Whether to record the audio
rp: Path to save the recorded audio
"""
stream = AudioStream(a, c)
if t == 'none':
engine = SosvRecognizer(sosv, s, None, tm, omn)
engine = SosvRecognizer(sosv, s, None, tm, omn, ourl, okey)
else:
engine = SosvRecognizer(sosv, s, t, tm, omn)
engine = SosvRecognizer(sosv, s, t, tm, omn, ourl, okey)
engine.start()
stream_thread = threading.Thread(
@@ -139,38 +144,80 @@ def main_sosv(a: int, c: int, sosv: str, s: str, t: str, tm: str, omn: str, r: b
engine.stop()
def main_glm(a: int, c: int, url: str, model: str, key: str, s: str, t: str, tm: str, omn: str, ourl: str, okey: str, r: bool, rp: str):
"""
Parameters:
a: Audio source
c: Chunk rate
url: GLM API URL
model: GLM Model Name
key: GLM API Key
s: Source language
t: Target language
tm: Translation model
omn: Ollama model name
ourl: Ollama API URL
okey: Ollama API Key
r: Record
rp: Record path
"""
stream = AudioStream(a, c)
if t == 'none':
engine = GlmRecognizer(url, model, key, s, None, tm, omn, ourl, okey)
else:
engine = GlmRecognizer(url, model, key, s, t, tm, omn, ourl, okey)
engine.start()
stream_thread = threading.Thread(
target=audio_recording,
args=(stream, True, r, rp),
daemon=True
)
stream_thread.start()
try:
engine.translate()
except KeyboardInterrupt:
stdout("Keyboard interrupt detected. Exiting...")
engine.stop()
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='Convert system audio stream to text')
# all
parser.add_argument('-e', '--caption_engine', default='gummy', help='Caption engine: gummy or vosk or sosv')
parser.add_argument('-a', '--audio_type', default=0, help='Audio stream source: 0 for output, 1 for input')
parser.add_argument('-c', '--chunk_rate', default=10, help='Number of audio stream chunks collected per second')
parser.add_argument('-p', '--port', default=0, help='The port to run the server on, 0 for no server')
parser.add_argument('-d', '--display_caption', default=0, help='Display caption on terminal, 0 for no display, 1 for display')
parser.add_argument('-e', '--caption_engine', default='gummy', help='Caption engine: gummy, glm, vosk or sosv')
parser.add_argument('-a', '--audio_type', type=int, default=0, help='Audio stream source: 0 for output, 1 for input')
parser.add_argument('-c', '--chunk_rate', type=int, default=10, help='Number of audio stream chunks collected per second')
parser.add_argument('-p', '--port', type=int, default=0, help='The port to run the server on, 0 for no server')
parser.add_argument('-d', '--display_caption', type=int, default=0, help='Display caption on terminal, 0 for no display, 1 for display')
parser.add_argument('-t', '--target_language', default='none', help='Target language code, "none" for no translation')
parser.add_argument('-r', '--record', default=0, help='Whether to record the audio, 0 for no recording, 1 for recording')
parser.add_argument('-r', '--record', type=int, default=0, help='Whether to record the audio, 0 for no recording, 1 for recording')
parser.add_argument('-rp', '--record_path', default='', help='Path to save the recorded audio')
# gummy and sosv
# gummy and sosv and glm
parser.add_argument('-s', '--source_language', default='auto', help='Source language code')
# gummy only
parser.add_argument('-k', '--api_key', default='', help='API KEY for Gummy model')
# vosk and sosv
parser.add_argument('-tm', '--translation_model', default='ollama', help='Model for translation: ollama or google')
parser.add_argument('-omn', '--ollama_name', default='', help='Ollama model name for translation')
parser.add_argument('-ourl', '--ollama_url', default='', help='Ollama API URL')
parser.add_argument('-okey', '--ollama_api_key', default='', help='Ollama API Key')
# vosk only
parser.add_argument('-vosk', '--vosk_model', default='', help='The path to the vosk model.')
# sosv only
parser.add_argument('-sosv', '--sosv_model', default=None, help='The SenseVoice model path')
# glm only
parser.add_argument('-gurl', '--glm_url', default='https://open.bigmodel.cn/api/paas/v4/audio/transcriptions', help='GLM API URL')
parser.add_argument('-gmodel', '--glm_model', default='glm-asr-2512', help='GLM Model Name')
parser.add_argument('-gkey', '--glm_api_key', default='', help='GLM API Key')
args = parser.parse_args()
if int(args.port) == 0:
shared_data.status = "running"
else:
start_server(int(args.port))
if int(args.display_caption) != 0:
if args.port != 0:
threading.Thread(target=start_server, args=(args.port,), daemon=True).start()
if args.display_caption == '1':
change_caption_display(True)
print("Caption will be displayed on terminal")
if args.caption_engine == 'gummy':
main_gummy(
@@ -179,7 +226,7 @@ if __name__ == "__main__":
int(args.audio_type),
int(args.chunk_rate),
args.api_key,
True if int(args.record) == 1 else False,
bool(int(args.record)),
args.record_path
)
elif args.caption_engine == 'vosk':
@@ -190,7 +237,9 @@ if __name__ == "__main__":
args.target_language,
args.translation_model,
args.ollama_name,
True if int(args.record) == 1 else False,
args.ollama_url,
args.ollama_api_key,
bool(int(args.record)),
args.record_path
)
elif args.caption_engine == 'sosv':
@@ -202,7 +251,25 @@ if __name__ == "__main__":
args.target_language,
args.translation_model,
args.ollama_name,
True if int(args.record) == 1 else False,
args.ollama_url,
args.ollama_api_key,
bool(int(args.record)),
args.record_path
)
elif args.caption_engine == 'glm':
main_glm(
int(args.audio_type),
int(args.chunk_rate),
args.glm_url,
args.glm_model,
args.glm_api_key,
args.source_language,
args.target_language,
args.translation_model,
args.ollama_name,
args.ollama_url,
args.ollama_api_key,
bool(int(args.record)),
args.record_path
)
else:

View File

@@ -6,7 +6,12 @@ import sys
if sys.platform == 'win32':
vosk_path = str(Path('./.venv/Lib/site-packages/vosk').resolve())
else:
vosk_path = str(Path('./.venv/lib/python3.12/site-packages/vosk').resolve())
venv_lib = Path('./.venv/lib')
python_dirs = list(venv_lib.glob('python*'))
if python_dirs:
vosk_path = str((python_dirs[0] / 'site-packages' / 'vosk').resolve())
else:
vosk_path = str(Path('./.venv/lib/python3.12/site-packages/vosk').resolve())
a = Analysis(
['main.py'],

View File

@@ -7,4 +7,6 @@ pyaudio; sys_platform == 'darwin'
pyaudiowpatch; sys_platform == 'win32'
googletrans
ollama
sherpa_onnx
sherpa_onnx
requests
openai

View File

@@ -47,7 +47,6 @@ def translation_display(obj):
def stdout_obj(obj):
global display_caption
print(obj['command'], display_caption)
if obj['command'] == 'caption' and display_caption:
caption_display(obj)
return

View File

@@ -1,5 +1,9 @@
from ollama import chat
from ollama import chat, Client
from ollama import ChatResponse
try:
from openai import OpenAI
except ImportError:
OpenAI = None
import asyncio
from googletrans import Translator
from .sysout import stdout_cmd, stdout_obj
@@ -17,15 +21,43 @@ lang_map = {
'zh-cn': 'Chinese'
}
def ollama_translate(model: str, target: str, text: str, time_s: str):
response: ChatResponse = chat(
model=model,
messages=[
{"role": "system", "content": f"/no_think Translate the following content into {lang_map[target]}, and do not output any additional information."},
{"role": "user", "content": text}
]
)
content = response.message.content or ""
def ollama_translate(model: str, target: str, text: str, time_s: str, url: str = '', key: str = ''):
content = ""
try:
if url:
if OpenAI:
client = OpenAI(base_url=url, api_key=key if key else "ollama")
openai_response = client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": f"/no_think Translate the following content into {lang_map[target]}, and do not output any additional information."},
{"role": "user", "content": text}
]
)
content = openai_response.choices[0].message.content or ""
else:
client = Client(host=url)
response: ChatResponse = client.chat(
model=model,
messages=[
{"role": "system", "content": f"/no_think Translate the following content into {lang_map[target]}, and do not output any additional information."},
{"role": "user", "content": text}
]
)
content = response.message.content or ""
else:
response: ChatResponse = chat(
model=model,
messages=[
{"role": "system", "content": f"/no_think Translate the following content into {lang_map[target]}, and do not output any additional information."},
{"role": "user", "content": text}
]
)
content = response.message.content or ""
except Exception as e:
stdout_cmd("warn", f"Translation failed: {str(e)}")
return
if content.startswith('<think>'):
index = content.find('</think>')
if index != -1:

67
package-lock.json generated
View File

@@ -110,6 +110,7 @@
"integrity": "sha512-IaaGWsQqfsQWVLqMn9OB92MNN7zukfVA4s7KKAI0KfrrDsZ0yhi5uV4baBuLuN7n3vsZpwP8asPPcVwApxvjBQ==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"@ampproject/remapping": "^2.2.0",
"@babel/code-frame": "^7.27.1",
@@ -2274,6 +2275,7 @@
"resolved": "https://registry.npmmirror.com/@types/node/-/node-22.15.17.tgz",
"integrity": "sha512-wIX2aSZL5FE+MR0JlvF87BNVrtFWf6AE6rxSE9X7OwnVvoyCQjpzSRJ+M87se/4QCkCiebQAqrJ0y6fwIyi7nw==",
"license": "MIT",
"peer": true,
"dependencies": {
"undici-types": "~6.21.0"
}
@@ -2360,6 +2362,7 @@
"integrity": "sha512-B2MdzyWxCE2+SqiZHAjPphft+/2x2FlO9YBx7eKE1BCb+rqBlQdhtAEhzIEdozHd55DXPmxBdpMygFJjfjjA9A==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"@typescript-eslint/scope-manager": "8.32.0",
"@typescript-eslint/types": "8.32.0",
@@ -2791,6 +2794,7 @@
"integrity": "sha512-NZyJarBfL7nWwIq+FDL6Zp/yHEhePMNnnJ0y3qfieCrmNvYct8uvtiV41UvlSe6apAfk0fY1FbWx+NwfmpvtTg==",
"dev": true,
"license": "MIT",
"peer": true,
"bin": {
"acorn": "bin/acorn"
},
@@ -2851,6 +2855,7 @@
"integrity": "sha512-j3fVLgvTo527anyYyJOGTYJbG+vnnQYvE0m5mmkc1TK+nxAppkCLMIL0aZ4dblVCNoGShhm+kzE4ZUykBoMg4g==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"fast-deep-equal": "^3.1.1",
"fast-json-stable-stringify": "^2.0.0",
@@ -3064,7 +3069,6 @@
"integrity": "sha512-+25nxyyznAXF7Nef3y0EbBeqmGZgeN/BxHX29Rs39djAfaFalmQ89SE6CWyDCHzGL0yt/ycBtNOmGTW0FyGWNw==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"archiver-utils": "^2.1.0",
"async": "^3.2.4",
@@ -3084,7 +3088,6 @@
"integrity": "sha512-bEL/yUb/fNNiNTuUz979Z0Yg5L+LzLxGJz8x79lYmR54fmTIb6ob/hNQgkQnIUDWIFjZVQwl9Xs356I6BAMHfw==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"glob": "^7.1.4",
"graceful-fs": "^4.2.0",
@@ -3107,7 +3110,6 @@
"integrity": "sha512-8p0AUk4XODgIewSi0l8Epjs+EVnWiK7NoDIEGU0HhE7+ZyY8D1IMY7odu5lRrFXGg71L15KG8QrPmum45RTtdA==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"core-util-is": "~1.0.0",
"inherits": "~2.0.3",
@@ -3123,8 +3125,7 @@
"resolved": "https://registry.npmmirror.com/safe-buffer/-/safe-buffer-5.1.2.tgz",
"integrity": "sha512-Gd2UZBJDkXlY7GbJxfsE8/nvKkUEU1G38c1siN6QP6a9PT9MmHB8GnpscSmMJSoF8LOIrt8ud/wPtojys4G6+g==",
"dev": true,
"license": "MIT",
"peer": true
"license": "MIT"
},
"node_modules/archiver-utils/node_modules/string_decoder": {
"version": "1.1.1",
@@ -3132,7 +3133,6 @@
"integrity": "sha512-n/ShnvDi6FHbbVfviro+WojiFzv+s8MPMHBczVePfUpDJLwoLT0ht1l4YwBCbi8pJAveEEdnkHyPyTP/mzRfwg==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"safe-buffer": "~5.1.0"
}
@@ -3351,6 +3351,7 @@
}
],
"license": "MIT",
"peer": true,
"dependencies": {
"caniuse-lite": "^1.0.30001716",
"electron-to-chromium": "^1.5.149",
@@ -3848,7 +3849,6 @@
"integrity": "sha512-D3uMHtGc/fcO1Gt1/L7i1e33VOvD4A9hfQLP+6ewd+BvG/gQ84Yh4oftEhAdjSMgBgwGL+jsppT7JYNpo6MHHg==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"buffer-crc32": "^0.2.13",
"crc32-stream": "^4.0.2",
@@ -3994,7 +3994,6 @@
"integrity": "sha512-ROmzCKrTnOwybPcJApAA6WBWij23HVfGVNKqqrZpuyZOHqK2CwHSvpGuyt/UNNvaIjEd8X5IFGp4Mh+Ie1IHJQ==",
"dev": true,
"license": "Apache-2.0",
"peer": true,
"bin": {
"crc32": "bin/crc32.njs"
},
@@ -4008,7 +4007,6 @@
"integrity": "sha512-NT7w2JVU7DFroFdYkeq8cywxrgjPHWkdX1wjpRQXPX5Asews3tA+Ght6lddQO5Mkumffp3X7GEqku3epj2toIw==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"crc-32": "^1.2.0",
"readable-stream": "^3.4.0"
@@ -4248,6 +4246,7 @@
"integrity": "sha512-NoXo6Liy2heSklTI5OIZbCgXC1RzrDQsZkeEwXhdOro3FT1VBOvbubvscdPnjVuQ4AMwwv61oaH96AbiYg9EnQ==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"app-builder-lib": "25.1.8",
"builder-util": "25.1.7",
@@ -4410,6 +4409,7 @@
"integrity": "sha512-6dLslJrQYB1qvqVPYRv1PhAA/uytC66nUeiTcq2JXiBzrmTWCHppqtGUjZhvnSRVatBCT5/SFdizdzcBiEiYUg==",
"hasInstallScript": true,
"license": "MIT",
"peer": true,
"dependencies": {
"@electron/get": "^2.0.0",
"@types/node": "^22.7.7",
@@ -4454,7 +4454,6 @@
"integrity": "sha512-2ntkJ+9+0GFP6nAISiMabKt6eqBB0kX1QqHNWFWAXgi0VULKGisM46luRFpIBiU3u/TDmhZMM8tzvo2Abn3ayg==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"app-builder-lib": "25.1.8",
"archiver": "^5.3.1",
@@ -4468,7 +4467,6 @@
"integrity": "sha512-oRXApq54ETRj4eMiFzGnHWGy+zo5raudjuxN0b8H7s/RU2oW0Wvsx9O0ACRN/kRq9E8Vu/ReskGB5o3ji+FzHQ==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"graceful-fs": "^4.2.0",
"jsonfile": "^6.0.1",
@@ -4484,7 +4482,6 @@
"integrity": "sha512-5dgndWOriYSm5cnYaJNhalLNDKOqFwyDB/rr1E9ZsGciGvKPs8R2xYGCacuf3z6K1YKDz182fd+fY3cn3pMqXQ==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"universalify": "^2.0.0"
},
@@ -4498,7 +4495,6 @@
"integrity": "sha512-gptHNQghINnc/vTGIk0SOFGFNXw7JVrlRUtConJRlvaw6DuX0wO5Jeko9sWrMBhh+PsYAZ7oXAiOnf/UKogyiw==",
"dev": true,
"license": "MIT",
"peer": true,
"engines": {
"node": ">= 10.0.0"
}
@@ -4813,6 +4809,7 @@
"integrity": "sha512-LSehfdpgMeWcTZkWZVIJl+tkZ2nuSkyyB9C27MZqFWXuph7DvaowgcTvKqxvpLW1JZIk8PN7hFY3Rj9LQ7m7lg==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"@eslint-community/eslint-utils": "^4.2.0",
"@eslint-community/regexpp": "^4.12.1",
@@ -4874,6 +4871,7 @@
"integrity": "sha512-zc1UmCpNltmVY34vuLRV61r1K27sWuX39E+uyUnY8xS2Bex88VV9cugG+UZbRSRGtGyFboj+D8JODyme1plMpw==",
"dev": true,
"license": "MIT",
"peer": true,
"bin": {
"eslint-config-prettier": "bin/cli.js"
},
@@ -5351,8 +5349,7 @@
"resolved": "https://registry.npmmirror.com/fs-constants/-/fs-constants-1.0.0.tgz",
"integrity": "sha512-y6OAwoSIf7FyjMIv94u+b5rdheZEjzR63GTyZJm5qh4Bi+2YgwLCcI/fPFZkL5PSixOt6ZNKm+w+Hfp/Bciwow==",
"dev": true,
"license": "MIT",
"peer": true
"license": "MIT"
},
"node_modules/fs-extra": {
"version": "8.1.0",
@@ -6108,8 +6105,7 @@
"resolved": "https://registry.npmmirror.com/isarray/-/isarray-1.0.0.tgz",
"integrity": "sha512-VLghIWNM6ELQzo7zwmcg0NmTVyWKYjvIeM83yjp0wRDTmUnrM678fQbcKBo6n2CJEF0szoG//ytg+TKla89ALQ==",
"dev": true,
"license": "MIT",
"peer": true
"license": "MIT"
},
"node_modules/isbinaryfile": {
"version": "5.0.4",
@@ -6300,7 +6296,6 @@
"integrity": "sha512-b94GiNHQNy6JNTrt5w6zNyffMrNkXZb3KTkCZJb2V1xaEGCk093vkZ2jk3tpaeP33/OiXC+WvK9AxUebnf5nbw==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"readable-stream": "^2.0.5"
},
@@ -6314,7 +6309,6 @@
"integrity": "sha512-8p0AUk4XODgIewSi0l8Epjs+EVnWiK7NoDIEGU0HhE7+ZyY8D1IMY7odu5lRrFXGg71L15KG8QrPmum45RTtdA==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"core-util-is": "~1.0.0",
"inherits": "~2.0.3",
@@ -6330,8 +6324,7 @@
"resolved": "https://registry.npmmirror.com/safe-buffer/-/safe-buffer-5.1.2.tgz",
"integrity": "sha512-Gd2UZBJDkXlY7GbJxfsE8/nvKkUEU1G38c1siN6QP6a9PT9MmHB8GnpscSmMJSoF8LOIrt8ud/wPtojys4G6+g==",
"dev": true,
"license": "MIT",
"peer": true
"license": "MIT"
},
"node_modules/lazystream/node_modules/string_decoder": {
"version": "1.1.1",
@@ -6339,7 +6332,6 @@
"integrity": "sha512-n/ShnvDi6FHbbVfviro+WojiFzv+s8MPMHBczVePfUpDJLwoLT0ht1l4YwBCbi8pJAveEEdnkHyPyTP/mzRfwg==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"safe-buffer": "~5.1.0"
}
@@ -6391,32 +6383,28 @@
"resolved": "https://registry.npmmirror.com/lodash.defaults/-/lodash.defaults-4.2.0.tgz",
"integrity": "sha512-qjxPLHd3r5DnsdGacqOMU6pb/avJzdh9tFX2ymgoZE27BmjXrNy/y4LoaiTeAb+O3gL8AfpJGtqfX/ae2leYYQ==",
"dev": true,
"license": "MIT",
"peer": true
"license": "MIT"
},
"node_modules/lodash.difference": {
"version": "4.5.0",
"resolved": "https://registry.npmmirror.com/lodash.difference/-/lodash.difference-4.5.0.tgz",
"integrity": "sha512-dS2j+W26TQ7taQBGN8Lbbq04ssV3emRw4NY58WErlTO29pIqS0HmoT5aJ9+TUQ1N3G+JOZSji4eugsWwGp9yPA==",
"dev": true,
"license": "MIT",
"peer": true
"license": "MIT"
},
"node_modules/lodash.flatten": {
"version": "4.4.0",
"resolved": "https://registry.npmmirror.com/lodash.flatten/-/lodash.flatten-4.4.0.tgz",
"integrity": "sha512-C5N2Z3DgnnKr0LOpv/hKCgKdb7ZZwafIrsesve6lmzvZIRZRGaZ/l6Q8+2W7NaT+ZwO3fFlSCzCzrDCFdJfZ4g==",
"dev": true,
"license": "MIT",
"peer": true
"license": "MIT"
},
"node_modules/lodash.isplainobject": {
"version": "4.0.6",
"resolved": "https://registry.npmmirror.com/lodash.isplainobject/-/lodash.isplainobject-4.0.6.tgz",
"integrity": "sha512-oSXzaWypCMHkPC3NvBEaPHf0KsA5mvPrOPgQWDsbg8n7orZ290M0BmC/jgRZ4vcJ6DTAhjrsSYgdsW/F+MFOBA==",
"dev": true,
"license": "MIT",
"peer": true
"license": "MIT"
},
"node_modules/lodash.merge": {
"version": "4.6.2",
@@ -6430,8 +6418,7 @@
"resolved": "https://registry.npmmirror.com/lodash.union/-/lodash.union-4.6.0.tgz",
"integrity": "sha512-c4pB2CdGrGdjMKYLA+XiRDO7Y0PRQbm/Gzg8qMj+QH+pFVAoTp5sBpO0odL3FjoPCGjK96p6qsP+yQoiLoOBcw==",
"dev": true,
"license": "MIT",
"peer": true
"license": "MIT"
},
"node_modules/log-symbols": {
"version": "4.1.0",
@@ -6984,7 +6971,6 @@
"integrity": "sha512-6eZs5Ls3WtCisHWp9S2GUy8dqkpGi4BVSz3GaqiE6ezub0512ESztXUwUB6C6IKbQkY2Pnb/mD4WYojCRwcwLA==",
"dev": true,
"license": "MIT",
"peer": true,
"engines": {
"node": ">=0.10.0"
}
@@ -7408,6 +7394,7 @@
"integrity": "sha512-QQtaxnoDJeAkDvDKWCLiwIXkTgRhwYDEQCghU9Z6q03iyek/rxRh/2lC3HB7P8sWT2xC/y5JDctPLBIGzHKbhw==",
"dev": true,
"license": "MIT",
"peer": true,
"bin": {
"prettier": "bin/prettier.cjs"
},
@@ -7436,8 +7423,7 @@
"resolved": "https://registry.npmmirror.com/process-nextick-args/-/process-nextick-args-2.0.1.tgz",
"integrity": "sha512-3ouUOpQhtgrbOa17J7+uxOTpITYWaGP7/AhoR3+A+/1e9skrzelGi/dXzEYyvbxubEF6Wn2ypscTKiKJFFn1ag==",
"dev": true,
"license": "MIT",
"peer": true
"license": "MIT"
},
"node_modules/progress": {
"version": "2.0.3",
@@ -7556,7 +7542,6 @@
"integrity": "sha512-v05I2k7xN8zXvPD9N+z/uhXPaj0sUFCe2rcWZIpBsqxfP7xXFQ0tipAd/wjj1YxWyWtUS5IDJpOG82JKt2EAVA==",
"dev": true,
"license": "Apache-2.0",
"peer": true,
"dependencies": {
"minimatch": "^5.1.0"
}
@@ -7567,7 +7552,6 @@
"integrity": "sha512-lKwV/1brpG6mBUFHtb7NUmtABCb2WZZmm2wNiOA5hAb8VdCS4B3dtMWyvcoViccwAW/COERjXLt0zP1zXUN26g==",
"dev": true,
"license": "ISC",
"peer": true,
"dependencies": {
"brace-expansion": "^2.0.1"
},
@@ -8235,7 +8219,6 @@
"integrity": "sha512-ujeqbceABgwMZxEJnk2HDY2DlnUZ+9oEcb1KzTVfYHio0UE6dG71n60d8D2I4qNvleWrrXpmjpt7vZeF1LnMZQ==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"bl": "^4.0.3",
"end-of-stream": "^1.4.1",
@@ -8360,6 +8343,7 @@
"integrity": "sha512-M7BAV6Rlcy5u+m6oPhAPFgJTzAioX/6B0DxyvDlo9l8+T3nLKbrczg2WLUyzd45L8RqfUMyGPzekbMvX2Ldkwg==",
"dev": true,
"license": "MIT",
"peer": true,
"engines": {
"node": ">=12"
},
@@ -8462,6 +8446,7 @@
"integrity": "sha512-p1diW6TqL9L07nNxvRMM7hMMw4c5XOo/1ibL4aAIGmSAt9slTE1Xgw5KWuof2uTOvCg9BY7ZRi+GaF+7sfgPeQ==",
"devOptional": true,
"license": "Apache-2.0",
"peer": true,
"bin": {
"tsc": "bin/tsc",
"tsserver": "bin/tsserver"
@@ -8611,6 +8596,7 @@
"integrity": "sha512-cZn6NDFE7wdTpINgs++ZJ4N49W2vRp8LCKrn3Ob1kYNtOo21vfDoaV5GzBfLU4MovSAB8uNRm4jgzVQZ+mBzPQ==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"esbuild": "^0.25.0",
"fdir": "^6.4.4",
@@ -8701,6 +8687,7 @@
"integrity": "sha512-M7BAV6Rlcy5u+m6oPhAPFgJTzAioX/6B0DxyvDlo9l8+T3nLKbrczg2WLUyzd45L8RqfUMyGPzekbMvX2Ldkwg==",
"dev": true,
"license": "MIT",
"peer": true,
"engines": {
"node": ">=12"
},
@@ -8720,6 +8707,7 @@
"resolved": "https://registry.npmmirror.com/vue/-/vue-3.5.13.tgz",
"integrity": "sha512-wmeiSMxkZCSc+PM2w2VRsOYAZC8GdipNFRTsLSfodVqI9mbejKeXEGr8SckuLnrQPGe3oJN5c3K0vpoU9q/wCQ==",
"license": "MIT",
"peer": true,
"dependencies": {
"@vue/compiler-dom": "3.5.13",
"@vue/compiler-sfc": "3.5.13",
@@ -8742,6 +8730,7 @@
"integrity": "sha512-dbCBnd2e02dYWsXoqX5yKUZlOt+ExIpq7hmHKPb5ZqKcjf++Eo0hMseFTZMLKThrUk61m+Uv6A2YSBve6ZvuDQ==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"debug": "^4.4.0",
"eslint-scope": "^8.2.0",
@@ -9046,7 +9035,6 @@
"integrity": "sha512-9qv4rlDiopXg4E69k+vMHjNN63YFMe9sZMrdlvKnCjlCRWeCBswPPMPUfx+ipsAWq1LXHe70RcbaHdJJpS6hyQ==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"archiver-utils": "^3.0.4",
"compress-commons": "^4.1.2",
@@ -9062,7 +9050,6 @@
"integrity": "sha512-KVgf4XQVrTjhyWmx6cte4RxonPLR9onExufI1jhvw/MQ4BB6IsZD5gT8Lq+u/+pRkWna/6JoHpiQioaqFP5Rzw==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"glob": "^7.2.3",
"graceful-fs": "^4.2.0",

View File

@@ -1,7 +1,7 @@
{
"name": "auto-caption",
"productName": "Auto Caption",
"version": "1.0.0",
"version": "1.1.1",
"description": "A cross-platform subtitle display software.",
"main": "./out/main/index.js",
"author": "himeditator",

View File

@@ -77,10 +77,9 @@ class CaptionWindow {
}
})
ipcMain.on('caption.pin.set', (_, pinned) => {
ipcMain.on('caption.mouseEvents.ignore', (_, ignore: boolean) => {
if(this.window){
if(pinned) this.window.setAlwaysOnTop(true, 'screen-saver')
else this.window.setAlwaysOnTop(false)
this.window.setIgnoreMouseEvents(ignore, { forward: ignore })
}
})
}
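For context, the caption window is now kept always-on-top unconditionally, and the renamed IPC channel only toggles mouse pass-through. A minimal illustrative sketch (not part of the diff; it assumes an Electron `BrowserWindow` created elsewhere in `CaptionWindow`):

```typescript
import { BrowserWindow, ipcMain } from 'electron'

// Sketch of the new behavior: the window stays on top, and the renderer can
// toggle click-through. With { forward: ignore }, mouse-move events are still
// forwarded to the page, so hover handlers (e.g. on the title bar) keep firing.
function setupCaptionWindow(win: BrowserWindow): void {
  win.setAlwaysOnTop(true, 'screen-saver')
  ipcMain.on('caption.mouseEvents.ignore', (_event, ignore: boolean) => {
    win.setIgnoreMouseEvents(ignore, { forward: ignore })
  })
}
```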

View File

@@ -8,6 +8,8 @@ export interface Controls {
targetLang: string,
transModel: string,
ollamaName: string,
ollamaUrl: string,
ollamaApiKey: string,
engine: string,
audio: 0 | 1,
translation: boolean,
@@ -15,6 +17,9 @@ export interface Controls {
API_KEY: string,
voskModelPath: string,
sosvModelPath: string,
glmUrl: string,
glmModel: string,
glmApiKey: string,
recordingPath: string,
customized: boolean,
customizedApp: string,

View File

@@ -4,9 +4,10 @@ import {
} from '../types'
import { Log } from './Log'
import { app, BrowserWindow } from 'electron'
import { passwordMaskingForObject } from './UtilsFunc'
import * as path from 'path'
import * as fs from 'fs'
import os from 'os'
import * as os from 'os'
interface CaptionTranslation {
time_s: string,
@@ -44,13 +45,18 @@ const defaultControls: Controls = {
sourceLang: 'en',
targetLang: 'zh',
transModel: 'ollama',
ollamaName: '',
ollamaName: 'qwen2.5:0.5b',
ollamaUrl: 'http://localhost:11434',
ollamaApiKey: '',
engine: 'gummy',
audio: 0,
engineEnabled: false,
API_KEY: '',
voskModelPath: '',
sosvModelPath: '',
glmUrl: 'https://open.bigmodel.cn/api/paas/v4/audio/transcriptions',
glmModel: 'glm-asr-2512',
glmApiKey: '',
recordingPath: getDesktopPath(),
translation: true,
recording: false,
@@ -146,9 +152,7 @@ class AllConfig {
}
}
this.controls.engineEnabled = engineEnabled
let _controls = {...this.controls}
_controls.API_KEY = _controls.API_KEY.replace(/./g, '*')
Log.info('Set Controls:', _controls)
Log.info('Set Controls:', passwordMaskingForObject(this.controls))
}
public sendControls(window: BrowserWindow, info = true) {

View File

@@ -1,12 +1,13 @@
import { exec, spawn } from 'child_process'
import { app } from 'electron'
import { is } from '@electron-toolkit/utils'
import path from 'path'
import net from 'net'
import * as path from 'path'
import * as net from 'net'
import { controlWindow } from '../ControlWindow'
import { allConfig } from './AllConfig'
import { i18n } from '../i18n'
import { Log } from './Log'
import { passwordMaskingForList } from './UtilsFunc'
export class CaptionEngine {
appPath: string = ''
@@ -60,7 +61,7 @@ export class CaptionEngine {
this.appPath = path.join(process.resourcesPath, 'engine', 'main.exe')
}
else {
this.appPath = path.join(process.resourcesPath, 'engine', 'main')
this.appPath = path.join(process.resourcesPath, 'engine', 'main', 'main')
}
}
this.command.push('-a', allConfig.controls.audio ? '1' : '0')
@@ -87,6 +88,8 @@ export class CaptionEngine {
this.command.push('-vosk', `"${allConfig.controls.voskModelPath}"`)
this.command.push('-tm', allConfig.controls.transModel)
this.command.push('-omn', allConfig.controls.ollamaName)
if(allConfig.controls.ollamaUrl) this.command.push('-ourl', allConfig.controls.ollamaUrl)
if(allConfig.controls.ollamaApiKey) this.command.push('-okey', allConfig.controls.ollamaApiKey)
}
else if(allConfig.controls.engine === 'sosv'){
this.command.push('-e', 'sosv')
@@ -94,15 +97,25 @@ export class CaptionEngine {
this.command.push('-sosv', `"${allConfig.controls.sosvModelPath}"`)
this.command.push('-tm', allConfig.controls.transModel)
this.command.push('-omn', allConfig.controls.ollamaName)
if(allConfig.controls.ollamaUrl) this.command.push('-ourl', allConfig.controls.ollamaUrl)
if(allConfig.controls.ollamaApiKey) this.command.push('-okey', allConfig.controls.ollamaApiKey)
}
else if(allConfig.controls.engine === 'glm'){
this.command.push('-e', 'glm')
this.command.push('-s', allConfig.controls.sourceLang)
this.command.push('-gurl', allConfig.controls.glmUrl)
this.command.push('-gmodel', allConfig.controls.glmModel)
if(allConfig.controls.glmApiKey) {
this.command.push('-gkey', allConfig.controls.glmApiKey)
}
this.command.push('-tm', allConfig.controls.transModel)
this.command.push('-omn', allConfig.controls.ollamaName)
if(allConfig.controls.ollamaUrl) this.command.push('-ourl', allConfig.controls.ollamaUrl)
if(allConfig.controls.ollamaApiKey) this.command.push('-okey', allConfig.controls.ollamaApiKey)
}
}
Log.info('Engine Path:', this.appPath)
if(this.command.length > 2 && this.command.at(-2) === '-k') {
const _command = [...this.command]
_command[_command.length -1] = _command[_command.length -1].replace(/./g, '*')
Log.info('Engine Command:', _command)
}
else Log.info('Engine Command:', this.command)
Log.info('Engine Command:', passwordMaskingForList(this.command))
return true
}
@@ -165,7 +178,7 @@ export class CaptionEngine {
const data_obj = JSON.parse(line)
handleEngineData(data_obj)
} catch (e) {
controlWindow.sendErrorMessage(i18n('engine.output.parse.error') + e)
// controlWindow.sendErrorMessage(i18n('engine.output.parse.error') + e)
Log.error('Error parsing JSON:', e)
}
}
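For reference, with the flags added above, the assembled invocation for the GLM engine would look roughly like the list below. This is a hypothetical example with placeholder values; the real command is built from `allConfig.controls`, and key values are replaced with `*` by `passwordMaskingForList` before logging:

```typescript
// Hypothetical argv for the GLM engine; every value here is a placeholder.
const exampleCommand: string[] = [
  '-a', '0',                 // audio source flag
  '-e', 'glm',               // caption engine
  '-s', 'auto',              // source language
  '-gurl', 'https://open.bigmodel.cn/api/paas/v4/audio/transcriptions',
  '-gmodel', 'glm-asr-2512',
  '-gkey', '<GLM_API_KEY>',  // logged as '*************'
  '-tm', 'ollama',           // translation backend
  '-omn', 'qwen2.5:0.5b',    // translation model name
  '-ourl', 'http://localhost:11434'
]
```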

View File

@@ -0,0 +1,24 @@
function passwordMasking(pwd: string) {
return pwd.replace(/./g, '*')
}
export function passwordMaskingForList(args: string[]) {
const maskedArgs = [...args]
for(let i = 1; i < maskedArgs.length; i++) {
if(maskedArgs[i-1] === '-k' || maskedArgs[i-1] === '-okey' || maskedArgs[i-1] === '-gkey') {
maskedArgs[i] = passwordMasking(maskedArgs[i])
}
}
return maskedArgs
}
export function passwordMaskingForObject(args: Record<string, any>) {
const maskedArgs = {...args}
for(const key in maskedArgs) {
const lKey = key.toLowerCase()
if(lKey.includes('api') && lKey.includes('key')) {
maskedArgs[key] = passwordMasking(maskedArgs[key])
}
}
return maskedArgs
}
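A brief usage sketch of the helpers above (not part of the diff; values are placeholders):

```typescript
import { passwordMaskingForList, passwordMaskingForObject } from './UtilsFunc'

// Arguments following -k, -okey, or -gkey are replaced with '*' characters.
passwordMaskingForList(['-e', 'glm', '-gkey', 'secret123'])
// -> ['-e', 'glm', '-gkey', '*********']

// Any key whose name contains both "api" and "key" is masked.
passwordMaskingForObject({ API_KEY: 'abc', ollamaApiKey: 'def', engine: 'glm' })
// -> { API_KEY: '***', ollamaApiKey: '***', engine: 'glm' }
```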

View File

@@ -2,7 +2,7 @@
<html>
<head>
<meta charset="UTF-8" />
<title>Auto Caption v1.0.0</title>
<title>Auto Caption v1.1.1</title>
<!-- https://developer.mozilla.org/en-US/docs/Web/HTTP/CSP -->
<meta
http-equiv="Content-Security-Policy"

View File

@@ -41,17 +41,63 @@
<div class="input-item" v-if="transModel && currentTransModel === 'ollama'">
<a-popover placement="right">
<template #content>
<p class="label-hover-info">{{ $t('engine.ollamaNote') }}</p>
<p class="label-hover-info">{{ $t('engine.modelNameNote') }}</p>
</template>
<span class="input-label info-label"
:style="{color: uiColor}"
>{{ $t('engine.ollama') }}</span>
>{{ $t('engine.modelName') }}</span>
</a-popover>
<a-input
class="input-area"
v-model:value="currentOllamaName"
></a-input>
</div>
<div class="input-item" v-if="transModel && currentTransModel === 'ollama'">
<a-popover placement="right">
<template #content>
<p class="label-hover-info">{{ $t('engine.baseURL') }}</p>
</template>
<span class="input-label info-label"
:style="{color: uiColor}"
>Base URL</span>
</a-popover>
<a-input
class="input-area"
v-model:value="currentOllamaUrl"
placeholder="http://localhost:11434"
></a-input>
</div>
<div class="input-item" v-if="transModel && currentTransModel === 'ollama'">
<a-popover placement="right">
<template #content>
<p class="label-hover-info">{{ $t('engine.apiKey') }}</p>
</template>
<span class="input-label info-label"
:style="{color: uiColor}"
>API Key</span>
</a-popover>
<a-input
class="input-area"
type="password"
v-model:value="currentOllamaApiKey"
/>
</div>
<div class="input-item" v-if="currentEngine === 'glm'">
<span class="input-label">GLM API URL</span>
<a-input
class="input-area"
v-model:value="currentGlmUrl"
placeholder="https://open.bigmodel.cn/api/paas/v4/audio/transcriptions"
></a-input>
</div>
<div class="input-item" v-if="currentEngine === 'glm'">
<span class="input-label">GLM Model Name</span>
<a-input
class="input-area"
v-model:value="currentGlmModel"
placeholder="glm-asr-2512"
></a-input>
</div>
<div class="input-item">
<span class="input-label">{{ $t('engine.audioType') }}</span>
<a-select
@@ -115,7 +161,7 @@
</template>
<span class="input-label info-label"
:style="{color: uiColor}"
>{{ $t('engine.apikey') }}</span>
>ALI {{ $t('engine.apikey') }}</span>
</a-popover>
<a-input
class="input-area"
@@ -123,6 +169,24 @@
v-model:value="currentAPI_KEY"
/>
</div>
<div class="input-item">
<a-popover placement="right">
<template #content>
<p class="label-hover-info">{{ $t('engine.glmApikeyInfo') }}</p>
<p><a href="https://open.bigmodel.cn/" target="_blank">
https://open.bigmodel.cn
</a></p>
</template>
<span class="input-label info-label"
:style="{color: uiColor}"
>GLM {{ $t('engine.apikey') }}</span>
</a-popover>
<a-input
class="input-area"
type="password"
v-model:value="currentGlmApiKey"
/>
</div>
<div class="input-item">
<a-popover placement="right">
<template #content>
@@ -239,9 +303,14 @@ const currentTranslation = ref<boolean>(true)
const currentRecording = ref<boolean>(false)
const currentTransModel = ref('ollama')
const currentOllamaName = ref('')
const currentOllamaUrl = ref('')
const currentOllamaApiKey = ref('')
const currentAPI_KEY = ref<string>('')
const currentVoskModelPath = ref<string>('')
const currentSosvModelPath = ref<string>('')
const currentGlmUrl = ref<string>('')
const currentGlmModel = ref<string>('')
const currentGlmApiKey = ref<string>('')
const currentRecordingPath = ref<string>('')
const currentCustomized = ref<boolean>(false)
const currentCustomizedApp = ref('')
@@ -294,12 +363,17 @@ function applyChange(){
engineControl.transModel = currentTransModel.value
engineControl.ollamaName = currentOllamaName.value
engineControl.engine = currentEngine.value
engineControl.ollamaUrl = currentOllamaUrl.value ?? "http://localhost:11434"
engineControl.ollamaApiKey = currentOllamaApiKey.value
engineControl.audio = currentAudio.value
engineControl.translation = currentTranslation.value
engineControl.recording = currentRecording.value
engineControl.API_KEY = currentAPI_KEY.value
engineControl.voskModelPath = currentVoskModelPath.value
engineControl.sosvModelPath = currentSosvModelPath.value
engineControl.glmUrl = currentGlmUrl.value ?? "https://open.bigmodel.cn/api/paas/v4/audio/transcriptions"
engineControl.glmModel = currentGlmModel.value ?? "glm-asr-2512"
engineControl.glmApiKey = currentGlmApiKey.value
engineControl.recordingPath = currentRecordingPath.value
engineControl.customized = currentCustomized.value
engineControl.customizedApp = currentCustomizedApp.value
@@ -320,6 +394,8 @@ function cancelChange(){
currentTargetLang.value = engineControl.targetLang
currentTransModel.value = engineControl.transModel
currentOllamaName.value = engineControl.ollamaName
currentOllamaUrl.value = engineControl.ollamaUrl
currentOllamaApiKey.value = engineControl.ollamaApiKey
currentEngine.value = engineControl.engine
currentAudio.value = engineControl.audio
currentTranslation.value = engineControl.translation
@@ -327,6 +403,9 @@ function cancelChange(){
currentAPI_KEY.value = engineControl.API_KEY
currentVoskModelPath.value = engineControl.voskModelPath
currentSosvModelPath.value = engineControl.sosvModelPath
currentGlmUrl.value = engineControl.glmUrl
currentGlmModel.value = engineControl.glmModel
currentGlmApiKey.value = engineControl.glmApiKey
currentRecordingPath.value = engineControl.recordingPath
currentCustomized.value = engineControl.customized
currentCustomizedApp.value = engineControl.customizedApp

View File

@@ -101,7 +101,7 @@
<p class="about-desc">{{ $t('status.about.desc') }}</p>
<a-divider />
<div class="about-info">
<p><b>{{ $t('status.about.version') }}</b><a-tag color="green">v1.0.0</a-tag></p>
<p><b>{{ $t('status.about.version') }}</b><a-tag color="green">v1.1.1</a-tag></p>
<p>
<b>{{ $t('status.about.author') }}</b>
<a

View File

@@ -34,7 +34,7 @@ export const engines = {
{ value: 'it', type: 1, label: '意大利语' },
],
transModel: [
{ value: 'ollama', label: 'Ollama 本地模型' },
{ value: 'ollama', label: 'Ollama 模型或 OpenAI 兼容模型' },
{ value: 'google', label: 'Google API 调用' },
]
},
@@ -55,7 +55,22 @@ export const engines = {
{ value: 'it', type: 1, label: '意大利语' },
],
transModel: [
{ value: 'ollama', label: 'Ollama 本地模型' },
{ value: 'ollama', label: 'Ollama 模型或 OpenAI 兼容模型' },
{ value: 'google', label: 'Google API 调用' },
]
},
{
value: 'glm',
label: '云端 / 智谱AI / GLM-ASR',
languages: [
{ value: 'auto', type: -1, label: '自动检测' },
{ value: 'en', type: 0, label: '英语' },
{ value: 'zh', type: 0, label: '中文' },
{ value: 'ja', type: 0, label: '日语' },
{ value: 'ko', type: 0, label: '韩语' },
],
transModel: [
{ value: 'ollama', label: 'Ollama 模型或 OpenAI 兼容模型' },
{ value: 'google', label: 'Google API 调用' },
]
}
@@ -94,7 +109,7 @@ export const engines = {
{ value: 'it', type: 1, label: 'Italian' },
],
transModel: [
{ value: 'ollama', label: 'Ollama Local Model' },
{ value: 'ollama', label: 'Ollama Model or OpenAI-compatible Model' },
{ value: 'google', label: 'Google API Call' },
]
},
@@ -115,7 +130,22 @@ export const engines = {
{ value: 'it', type: 1, label: 'Italian' },
],
transModel: [
{ value: 'ollama', label: 'Ollama Local Model' },
{ value: 'ollama', label: 'Ollama Model or OpenAI-compatible Model' },
{ value: 'google', label: 'Google API Call' },
]
},
{
value: 'glm',
label: 'Cloud / Zhipu AI / GLM-ASR',
languages: [
{ value: 'auto', type: -1, label: 'Auto Detect' },
{ value: 'en', type: 0, label: 'English' },
{ value: 'zh', type: 0, label: 'Chinese' },
{ value: 'ja', type: 0, label: 'Japanese' },
{ value: 'ko', type: 0, label: 'Korean' },
],
transModel: [
{ value: 'ollama', label: 'Ollama Model or OpenAI-compatible Model' },
{ value: 'google', label: 'Google API Call' },
]
}
@@ -154,7 +184,7 @@ export const engines = {
{ value: 'it', type: 1, label: 'イタリア語' },
],
transModel: [
{ value: 'ollama', label: 'Ollama ローカルモデル' },
{ value: 'ollama', label: 'Ollama モデルまたは OpenAI 互換モデル' },
{ value: 'google', label: 'Google API 呼び出し' },
]
},
@@ -175,7 +205,22 @@ export const engines = {
{ value: 'it', type: 1, label: 'イタリア語' },
],
transModel: [
{ value: 'ollama', label: 'Ollama ローカルモデル' },
{ value: 'ollama', label: 'Ollama モデルまたは OpenAI 互換モデル' },
{ value: 'google', label: 'Google API 呼び出し' },
]
},
{
value: 'glm',
label: 'クラウド / 智譜AI / GLM-ASR',
languages: [
{ value: 'auto', type: -1, label: '自動検出' },
{ value: 'en', type: 0, label: '英語' },
{ value: 'zh', type: 0, label: '中国語' },
{ value: 'ja', type: 0, label: '日本語' },
{ value: 'ko', type: 0, label: '韓国語' },
],
transModel: [
{ value: 'ollama', label: 'Ollama モデルまたは OpenAI 互換モデル' },
{ value: 'google', label: 'Google API 呼び出し' },
]
}

View File

@@ -22,7 +22,7 @@ export default {
"stopped": "Caption Engine Stopped",
"stoppedInfo": "The caption engine has stopped. You can click the 'Start Caption Engine' button to restart it.",
"error": "An error occurred",
"engineError": "The subtitle engine encountered an error and requested a forced exit.",
"engineError": "The caption engine encountered an error and requested a forced exit.",
"socketError": "The Socket connection between the main program and the caption engine failed",
"engineChange": "Cpation Engine Configuration Changed",
"changeInfo": "If the caption engine is already running, you need to restart it for the changes to take effect.",
@@ -50,8 +50,10 @@ export default {
"sourceLang": "Source",
"transLang": "Translation",
"transModel": "Model",
"ollama": "Ollama",
"ollamaNote": "To use for translation, the name of the local Ollama model that will call the service on the default port. It is recommended to use a non-inference model with less than 1B parameters.",
"modelName": "Model Name",
"modelNameNote": "Please enter the translation model name you wish to use, which can be either a local Ollama model or an OpenAI API compatible cloud model. If the Base URL field is left blank, the local Ollama service will be called by default; otherwise, the API service at the specified address will be called via the Python OpenAI library.",
"baseURL": "The base request URL for calling OpenAI API. If left empty, the local default port Ollama model will be used.",
"apiKey": "The API KEY required for the model corresponding to OpenAI API.",
"captionEngine": "Engine",
"audioType": "Audio Type",
"systemOutput": "System Audio Output (Speaker)",
@@ -65,9 +67,10 @@ export default {
"recordingPath": "Save Path",
"startTimeout": "Timeout",
"seconds": "seconds",
"apikeyInfo": "API KEY required for the Gummy subtitle engine, which needs to be obtained from the Alibaba Cloud Bailing platform. For more details, see the project user manual.",
"voskModelPathInfo": "The folder path of the model required by the Vosk subtitle engine. You need to download the required model to your local machine in advance. For more details, see the project user manual.",
"sosvModelPathInfo": "The folder path of the model required by the SOSV subtitle engine. You need to download the required model to your local machine in advance. For more details, see the project user manual.",
"apikeyInfo": "API KEY required for the Gummy caption engine, which needs to be obtained from the Alibaba Cloud Bailing platform. For more details, see the project user manual.",
"glmApikeyInfo": "API KEY required for GLM caption engine, which needs to be obtained from the Zhipu AI platform.",
"voskModelPathInfo": "The folder path of the model required by the Vosk caption engine. You need to download the required model to your local machine in advance. For more details, see the project user manual.",
"sosvModelPathInfo": "The folder path of the model required by the SOSV caption engine. You need to download the required model to your local machine in advance. For more details, see the project user manual.",
"recordingPathInfo": "The path to save recording files, requiring a folder path. The software will automatically name the recording file and save it as .wav file.",
"modelDownload": "Model Download Link",
"startTimeoutInfo": "Caption engine startup timeout duration. Engine will be forcefully stopped if startup exceeds this time. Recommended range: 10-120 seconds.",
@@ -143,7 +146,7 @@ export default {
"projLink": "Project Link",
"manual": "User Manual",
"engineDoc": "Caption Engine Manual",
"date": "September 8th, 2025"
"date": "January 31, 2026"
}
},
log: {

View File

@@ -50,8 +50,10 @@ export default {
"sourceLang": "ソース言語",
"transLang": "翻訳言語",
"transModel": "翻訳モデル",
"ollama": "Ollama",
"ollamaNote": "翻訳に使用する、デフォルトポートでサービスを呼び出すローカルOllamaモデルの名前。1B 未満のパラメータを持つ非推論モデルの使用を推奨します。",
"modelName": "モデル名",
"modelNameNote": "使用する翻訳モデル名を入力してください。Ollama のローカルモデルでも OpenAI API 互換のクラウドモデルでも可能です。Base URL フィールドが未入力の場合、デフォルトでローカルOllama サービスが呼び出され、それ以外の場合は Python OpenAI ライブラリ経由で指定されたアドレスの API サービスが呼び出されます。",
"baseURL": "OpenAI API を呼び出すための基本リクエスト URL です。未記入の場合、ローカルのデフォルトポートの Ollama モデルが呼び出されます。",
"apiKey": "OpenAI API に対応するモデルを使用するために必要な API キーです。",
"captionEngine": "エンジン",
"audioType": "オーディオ",
"systemOutput": "システムオーディオ出力(スピーカー)",
@@ -66,6 +68,7 @@ export default {
"startTimeout": "時間制限",
"seconds": "秒",
"apikeyInfo": "Gummy 字幕エンジンに必要な API KEY は、アリババクラウド百煉プラットフォームから取得する必要があります。詳細情報はプロジェクトのユーザーマニュアルをご覧ください。",
"glmApikeyInfo": "GLM 字幕エンジンに必要な API KEY で、智譜 AI プラットフォームから取得する必要があります。",
"voskModelPathInfo": "Vosk 字幕エンジンに必要なモデルのフォルダパスです。必要なモデルを事前にローカルマシンにダウンロードする必要があります。詳細情報はプロジェクトのユーザーマニュアルをご覧ください。",
"sosvModelPathInfo": "SOSV 字幕エンジンに必要なモデルのフォルダパスです。必要なモデルを事前にローカルマシンにダウンロードする必要があります。詳細情報はプロジェクトのユーザーマニュアルをご覧ください。",
"recordingPathInfo": "録音ファイルの保存パスで、フォルダパスを指定する必要があります。ソフトウェアが自動的に録音ファイルに名前を付けて .wav ファイルとして保存します。",
@@ -142,7 +145,7 @@ export default {
"projLink": "プロジェクトリンク",
"manual": "ユーザーマニュアル",
"engineDoc": "字幕エンジンマニュアル",
"date": "202598 日"
"date": "2026131 日"
}
},
log: {

View File

@@ -50,8 +50,10 @@ export default {
"sourceLang": "源语言",
"transLang": "翻译语言",
"transModel": "翻译模型",
"ollama": "Ollama",
"ollamaNote": "要使用的进行翻译的本地 Ollama 模型的名称,将调用默认端口的服务,建议使用参数量小于 1B 的非推理模型。",
"modelName": "模型名称",
"modelNameNote": "请输入要使用的翻译模型名称,可以是 Ollama 本地模型,也可以是 OpenAI API 兼容的云端模型。若未填写 Base URL 字段,则默认调用本地 Ollama 服务,否则会通过 Python OpenAI 库调用该地址指向的 API 服务。",
"baseURL": "调用 OpenAI API 的基础请求地址,如果不填写则调用本地默认端口的 Ollama 模型。",
"apiKey": "调用 OpenAI API 对应的模型需要使用的 API KEY。",
"captionEngine": "字幕引擎",
"audioType": "音频类型",
"systemOutput": "系统音频输出(扬声器)",
@@ -66,6 +68,7 @@ export default {
"startTimeout": "启动超时",
"seconds": "秒",
"apikeyInfo": "Gummy 字幕引擎需要的 API KEY需要在阿里云百炼平台获取。详细信息见项目用户手册。",
"glmApikeyInfo": "GLM 字幕引擎需要的 API KEY需要在智谱 AI 平台获取。",
"voskModelPathInfo": "Vosk 字幕引擎需要的模型的文件夹路径,需要提前下载需要的模型到本地。信息详情见项目用户手册。",
"sosvModelPathInfo": "SOSV 字幕引擎需要的模型的文件夹路径,需要提前下载需要的模型到本地。信息详情见项目用户手册。",
"recordingPathInfo": "录音文件保存路径,需要提供文件夹路径。软件会自动命名录音文件并保存为 .wav 文件。",
@@ -142,7 +145,7 @@ export default {
"projLink": "项目链接",
"manual": "用户手册",
"engineDoc": "字幕引擎手册",
"date": "202598 日"
"date": "2026131 日"
}
},
log: {

View File

@@ -21,6 +21,8 @@ export const useEngineControlStore = defineStore('engineControl', () => {
const targetLang = ref<string>('zh')
const transModel = ref<string>('ollama')
const ollamaName = ref<string>('')
const ollamaUrl = ref<string>('')
const ollamaApiKey = ref<string>('')
const engine = ref<string>('gummy')
const audio = ref<0 | 1>(0)
const translation = ref<boolean>(true)
@@ -28,6 +30,9 @@ export const useEngineControlStore = defineStore('engineControl', () => {
const API_KEY = ref<string>('')
const voskModelPath = ref<string>('')
const sosvModelPath = ref<string>('')
const glmUrl = ref<string>('https://open.bigmodel.cn/api/paas/v4/audio/transcriptions')
const glmModel = ref<string>('glm-asr-2512')
const glmApiKey = ref<string>('')
const recordingPath = ref<string>('')
const customized = ref<boolean>(false)
const customizedApp = ref<string>('')
@@ -44,6 +49,8 @@ export const useEngineControlStore = defineStore('engineControl', () => {
targetLang: targetLang.value,
transModel: transModel.value,
ollamaName: ollamaName.value,
ollamaUrl: ollamaUrl.value,
ollamaApiKey: ollamaApiKey.value,
engine: engine.value,
audio: audio.value,
translation: translation.value,
@@ -51,6 +58,9 @@ export const useEngineControlStore = defineStore('engineControl', () => {
API_KEY: API_KEY.value,
voskModelPath: voskModelPath.value,
sosvModelPath: sosvModelPath.value,
glmUrl: glmUrl.value,
glmModel: glmModel.value,
glmApiKey: glmApiKey.value,
recordingPath: recordingPath.value,
customized: customized.value,
customizedApp: customizedApp.value,
@@ -80,6 +90,8 @@ export const useEngineControlStore = defineStore('engineControl', () => {
targetLang.value = controls.targetLang
transModel.value = controls.transModel
ollamaName.value = controls.ollamaName
ollamaUrl.value = controls.ollamaUrl
ollamaApiKey.value = controls.ollamaApiKey
engine.value = controls.engine
audio.value = controls.audio
engineEnabled.value = controls.engineEnabled
@@ -88,6 +100,9 @@ export const useEngineControlStore = defineStore('engineControl', () => {
API_KEY.value = controls.API_KEY
voskModelPath.value = controls.voskModelPath
sosvModelPath.value = controls.sosvModelPath
glmUrl.value = controls.glmUrl || 'https://open.bigmodel.cn/api/paas/v4/audio/transcriptions'
glmModel.value = controls.glmModel || 'glm-asr-2512'
glmApiKey.value = controls.glmApiKey
recordingPath.value = controls.recordingPath
customized.value = controls.customized
customizedApp.value = controls.customizedApp
@@ -150,6 +165,8 @@ export const useEngineControlStore = defineStore('engineControl', () => {
targetLang, // 目标语言
transModel, // 翻译模型
ollamaName, // Ollama 模型
ollamaUrl,
ollamaApiKey,
engine, // 字幕引擎
audio, // 选择音频
translation, // 是否启用翻译
@@ -157,6 +174,9 @@ export const useEngineControlStore = defineStore('engineControl', () => {
API_KEY, // API KEY
voskModelPath, // vosk 模型路径
sosvModelPath, // sosv 模型路径
glmUrl, // GLM API URL
glmModel, // GLM 模型名称
glmApiKey, // GLM API Key
recordingPath, // 录音保存路径
customized, // 是否使用自定义字幕引擎
customizedApp, // 自定义字幕引擎的应用程序

View File

@@ -8,6 +8,8 @@ export interface Controls {
targetLang: string,
transModel: string,
ollamaName: string,
ollamaUrl: string,
ollamaApiKey: string,
engine: string,
audio: 0 | 1,
translation: boolean,
@@ -15,6 +17,9 @@ export interface Controls {
API_KEY: string,
voskModelPath: string,
sosvModelPath: string,
glmUrl: string,
glmModel: string,
glmApiKey: string,
recordingPath: string,
customized: boolean,
customizedApp: string,

View File

@@ -61,7 +61,11 @@
</template>
</div>
<div class="title-bar" :style="{color: captionStyle.fontColor}">
<div class="title-bar"
:style="{color: captionStyle.fontColor}"
@mouseenter="onTitleBarEnter()"
@mouseleave="onTitleBarLeave()"
>
<div class="option-item" @click="closeCaptionWindow">
<CloseOutlined />
</div>
@@ -96,7 +100,7 @@ const captionLog = useCaptionLogStore();
const { captionData } = storeToRefs(captionLog);
const caption = ref();
const windowHeight = ref(100);
const pinned = ref(true);
const pinned = ref(false);
onMounted(() => {
const resizeObserver = new ResizeObserver(entries => {
@@ -114,7 +118,7 @@ onMounted(() => {
function pinCaptionWindow() {
pinned.value = !pinned.value;
window.electron.ipcRenderer.send('caption.pin.set', pinned.value)
window.electron.ipcRenderer.send('caption.mouseEvents.ignore', pinned.value)
}
function openControlWindow() {
@@ -124,6 +128,18 @@ function openControlWindow() {
function closeCaptionWindow() {
window.electron.ipcRenderer.send('caption.window.close')
}
function onTitleBarEnter() {
if(pinned.value) {
window.electron.ipcRenderer.send('caption.mouseEvents.ignore', false)
}
}
function onTitleBarLeave() {
if(pinned.value) {
window.electron.ipcRenderer.send('caption.mouseEvents.ignore', true)
}
}
</script>
<style scoped>