refactor(engine): rename the caption engine folder, add descending sort for caption records

- The caption record table can now be sorted by time in descending order (see the sketch below)
- Renamed `caption-engine` to `engine`
- Updated the affected file and folder paths
- Updated the related content in the README and TODO documents
- Updated the Electron build configuration
Author: himeditator
Date: 2025-07-26 21:29:16 +08:00
Parent: 697488ce84
Commit: 8e575a9ba3
32 changed files with 82 additions and 789 deletions
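
For reference, a minimal sketch of the table column this change adds for the caption log. The `sorter`, `sortDirections`, and `defaultSortOrder` fields mirror the Vue diff further down; the `CaptionItem` fields other than `time` and `time_s` are assumptions for illustration.

```ts
// Hedged sketch of the Ant Design Vue column definition that sorts caption
// records by capture time and shows the newest entries first by default.
interface CaptionItem {
  time: string   // formatted time displayed in the table (per the diff)
  time_s: number // numeric timestamp used as the sort key (per the diff)
  text?: string  // assumed: recognized caption text
}

const timeColumn = {
  title: 'time',
  dataIndex: 'time',
  key: 'time',
  width: 160,
  // Compare on the numeric timestamp so string formatting never affects ordering.
  sorter: (a: CaptionItem, b: CaptionItem) => (a.time_s <= b.time_s ? -1 : 1),
  // Only descending order is offered, and it is applied by default.
  sortDirections: ['descend'],
  defaultSortOrder: 'descend'
}
```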

.gitignore vendored
View File

@@ -5,8 +5,8 @@ out
.eslintcache
*.log*
__pycache__
subenv
caption-engine/build
caption-engine/models
output.wav
.venv
subenv
engine/build
engine/models
engine/test

View File

@@ -9,6 +9,6 @@
"editor.defaultFormatter": "esbenp.prettier-vscode"
},
"python.analysis.extraPaths": [
"./caption-engine"
"./engine"
]
}

View File

@@ -122,10 +122,10 @@ npm install
### 构建字幕引擎
首先进入 `caption-engine` 文件夹,执行如下指令创建虚拟环境:
首先进入 `engine` 文件夹,执行如下指令创建虚拟环境:
```bash
# in ./caption-engine folder
# in ./engine folder
python -m venv subenv
# or
python3 -m venv subenv
@@ -173,7 +173,7 @@ vosk_path = str(Path('./subenv/Lib/site-packages/vosk').resolve())
vosk_path = str(Path('./subenv/lib/python3.x/site-packages/vosk').resolve())
```
此时项目构建完成,在进入 `caption-engine/dist` 文件夹可见对应的可执行文件。即可进行后续操作。
此时项目构建完成,在进入 `engine/dist` 文件夹可见对应的可执行文件。即可进行后续操作。
### 运行项目
@@ -197,13 +197,13 @@ npm run build:linux
```yml
extraResources:
# For Windows
- from: ./caption-engine/dist/main-gummy.exe
to: ./caption-engine/main-gummy.exe
- from: ./caption-engine/dist/main-vosk.exe
to: ./caption-engine/main-vosk.exe
- from: ./engine/dist/main-gummy.exe
to: ./engine/main-gummy.exe
- from: ./engine/dist/main-vosk.exe
to: ./engine/main-vosk.exe
# For macOS and Linux
# - from: ./caption-engine/dist/main-gummy
# to: ./caption-engine/main-gummy
# - from: ./caption-engine/dist/main-vosk
# to: ./caption-engine/main-vosk
# - from: ./engine/dist/main-gummy
# to: ./engine/main-gummy
# - from: ./engine/dist/main-vosk
# to: ./engine/main-vosk
```

View File

@@ -122,10 +122,10 @@ npm install
### Build Subtitle Engine
First enter the `caption-engine` folder and execute the following commands to create a virtual environment:
First enter the `engine` folder and execute the following commands to create a virtual environment:
```bash
# in ./caption-engine folder
# in ./engine folder
python -m venv subenv
# or
python3 -m venv subenv
@@ -173,7 +173,7 @@ vosk_path = str(Path('./subenv/Lib/site-packages/vosk').resolve())
vosk_path = str(Path('./subenv/lib/python3.x/site-packages/vosk').resolve())
```
After the build completes, you can find the executable file in the `caption-engine/dist` folder. Then proceed with subsequent operations.
After the build completes, you can find the executable file in the `engine/dist` folder. Then proceed with subsequent operations.
### Run Project
@@ -197,13 +197,13 @@ Note: You need to modify the configuration content in the `electron-builder.yml`
```yml
extraResources:
# For Windows
- from: ./caption-engine/dist/main-gummy.exe
to: ./caption-engine/main-gummy.exe
- from: ./caption-engine/dist/main-vosk.exe
to: ./caption-engine/main-vosk.exe
- from: ./engine/dist/main-gummy.exe
to: ./engine/main-gummy.exe
- from: ./engine/dist/main-vosk.exe
to: ./engine/main-vosk.exe
# For macOS and Linux
# - from: ./caption-engine/dist/main-gummy
# to: ./caption-engine/main-gummy
# - from: ./caption-engine/dist/main-vosk
# to: ./caption-engine/main-vosk
# - from: ./engine/dist/main-gummy
# to: ./engine/main-gummy
# - from: ./engine/dist/main-vosk
# to: ./engine/main-vosk
```

View File

@@ -122,10 +122,10 @@ npm install
### 字幕エンジンの構築
まず `caption-engine` フォルダに入り、以下のコマンドを実行して仮想環境を作成します:
まず `engine` フォルダに入り、以下のコマンドを実行して仮想環境を作成します:
```bash
# ./caption-engine フォルダ内
# ./engine フォルダ内
python -m venv subenv
# または
python3 -m venv subenv
@@ -173,7 +173,7 @@ vosk_path = str(Path('./subenv/Lib/site-packages/vosk').resolve())
vosk_path = str(Path('./subenv/lib/python3.x/site-packages/vosk').resolve())
```
これでプロジェクトのビルドが完了し、`caption-engine/dist` フォルダ内に対応する実行可能ファイルが確認できます。その後、次の操作に進むことができます。
これでプロジェクトのビルドが完了し、`engine/dist` フォルダ内に対応する実行可能ファイルが確認できます。その後、次の操作に進むことができます。
### プロジェクト実行
@@ -197,13 +197,13 @@ npm run build:linux
```yml
extraResources:
# Windows用
- from: ./caption-engine/dist/main-gummy.exe
to: ./caption-engine/main-gummy.exe
- from: ./caption-engine/dist/main-vosk.exe
to: ./caption-engine/main-vosk.exe
- from: ./engine/dist/main-gummy.exe
to: ./engine/main-gummy.exe
- from: ./engine/dist/main-vosk.exe
to: ./engine/main-vosk.exe
# macOSとLinux用
# - from: ./caption-engine/dist/main-gummy
# to: ./caption-engine/main-gummy
# - from: ./caption-engine/dist/main-vosk
# to: ./caption-engine/main-vosk
# - from: ./engine/dist/main-gummy
# to: ./engine/main-gummy
# - from: ./engine/dist/main-vosk
# to: ./engine/main-vosk
```

View File

@@ -18,6 +18,8 @@
## 待完成
- [ ] 修改字幕记录展示逻辑
- [ ] 重构字幕引擎
- [ ] 验证 / 添加基于 sherpa-onnx 的字幕引擎
## 后续计划

View File

@@ -20,7 +20,7 @@ Generally, the captured audio stream data consists of short audio chunks, and th
The acquired audio stream may need preprocessing before being converted to text. For instance, Alibaba Cloud's Gummy model can only recognize single-channel audio streams, while the collected audio streams are typically dual-channel, thus requiring conversion from dual-channel to single-channel. Channel conversion can be achieved using methods in the NumPy library.
You can directly use the audio acquisition (`caption-engine/sysaudio`) and audio processing (`caption-engine/audioprcs`) modules I have developed.
You can directly use the audio acquisition (`engine/sysaudio`) and audio processing (`engine/audioprcs`) modules I have developed.
### Audio to Text Conversion
@@ -105,10 +105,10 @@ export interface CaptionItem {
If using Python, you can refer to the following method to pass data to the main program:
```python
# caption-engine\main-gummy.py
# engine\main-gummy.py
sys.stdout.reconfigure(line_buffering=True)
# caption-engine\audio2text\gummy.py
# engine\audio2text\gummy.py
...
def send_to_node(self, data):
"""
@@ -198,4 +198,4 @@ With a working caption engine, specify its path and runtime parameters in the ca
## Reference Code
The `main-gummy.py` file under the `caption-engine` folder in this project serves as the entry point for the default caption engine. The `src\main\utils\engine.ts` file contains the server-side code for acquiring and processing data from the caption engine. You can read and understand the implementation details and the complete execution process of the caption engine as needed.
The `main-gummy.py` file under the `engine` folder in this project serves as the entry point for the default caption engine. The `src\main\utils\engine.ts` file contains the server-side code for acquiring and processing data from the caption engine. You can read and understand the implementation details and the complete execution process of the caption engine as needed.

View File

@@ -22,7 +22,7 @@
取得した音声ストリームは、テキストに変換する前に前処理が必要な場合があります。例えば、アリババクラウドのGummyモデルは単一チャンネルの音声ストリームしか認識できませんが、収集された音声ストリームは通常二重チャンネルであるため、二重チャンネルの音声ストリームを単一チャンネルに変換する必要があります。チャンネル数の変換はNumPyライブラリのメソッドを使って行うことができます。
あなたは私によって開発された音声の取得(`caption-engine/sysaudio`)と音声の処理(`caption-engine/audioprcs`)モジュールを直接使用することができます。
あなたは私によって開発された音声の取得(`engine/sysaudio`)と音声の処理(`engine/audioprcs`)モジュールを直接使用することができます。
### 音声からテキストへの変換
@@ -107,10 +107,10 @@ export interface CaptionItem {
Python言語を使用する場合、以下の方法でデータをメインプログラムに渡すことができます
```python
# caption-engine\main-gummy.py
# engine\main-gummy.py
sys.stdout.reconfigure(line_buffering=True)
# caption-engine\audio2text\gummy.py
# engine\audio2text\gummy.py
...
def send_to_node(self, data):
"""
@@ -198,4 +198,4 @@ python main-gummy.py -s ja -t zh -a 0 -c 10 -k <your-api-key>
## 参考コード
本プロジェクトの`caption-engine`フォルダにある`main-gummy.py`ファイルはデフォルトの字幕エンジンのエントリーコードです。`src\main\utils\engine.ts`はサーバー側で字幕エンジンのデータを取得・処理するコードです。必要に応じて字幕エンジンの実装詳細と完全な実行プロセスを理解するために参照してください。
本プロジェクトの`engine`フォルダにある`main-gummy.py`ファイルはデフォルトの字幕エンジンのエントリーコードです。`src\main\utils\engine.ts`はサーバー側で字幕エンジンのデータを取得・処理するコードです。必要に応じて字幕エンジンの実装詳細と完全な実行プロセスを理解するために参照してください。

View File

@@ -20,7 +20,7 @@
获取到的音频流在转文字之前可能需要进行预处理。比如阿里云的 Gummy 模型只能识别单通道的音频流,而收集的音频流一般是双通道的,因此要将双通道音频流转换为单通道。通道数的转换可以使用 NumPy 库中的方法实现。
你可以直接使用我开发好的音频获取(`caption-engine/sysaudio`)和音频处理(`caption-engine/audioprcs`)模块。
你可以直接使用我开发好的音频获取(`engine/sysaudio`)和音频处理(`engine/audioprcs`)模块。
### 音频转文字
@@ -105,10 +105,10 @@ export interface CaptionItem {
如果使用 python 语言,可以参考以下方式将数据传递给主程序:
```python
# caption-engine\main-gummy.py
# engine\main-gummy.py
sys.stdout.reconfigure(line_buffering=True)
# caption-engine\audio2text\gummy.py
# engine\audio2text\gummy.py
...
def send_to_node(self, data):
"""
@@ -198,4 +198,4 @@ python main-gummy.py -s ja -t zh -a 0 -c 10 -k <your-api-key>
## 参考代码
本项目 `caption-engine` 文件夹下的 `main-gummy.py` 文件为默认字幕引擎的入口代码。`src\main\utils\engine.ts` 为服务端获取字幕引擎数据和进行处理的代码。可以根据需要阅读了解字幕引擎的实现细节和完整运行过程。
本项目 `engine` 文件夹下的 `main-gummy.py` 文件为默认字幕引擎的入口代码。`src\main\utils\engine.ts` 为服务端获取字幕引擎数据和进行处理的代码。可以根据需要阅读了解字幕引擎的实现细节和完整运行过程。

View File

@@ -10,21 +10,21 @@ files:
- '!{LICENSE,README.md,README_en.md,README_ja.md}'
- '!{.env,.env.*,.npmrc,pnpm-lock.yaml}'
- '!{tsconfig.json,tsconfig.node.json,tsconfig.web.json}'
- '!caption-engine/*'
- '!engine/*'
- '!engine-test/*'
- '!docs/*'
- '!assets/*'
extraResources:
# For Windows
- from: ./caption-engine/dist/main-gummy.exe
to: ./caption-engine/main-gummy.exe
- from: ./caption-engine/dist/main-vosk.exe
to: ./caption-engine/main-vosk.exe
- from: ./engine/dist/main-gummy.exe
to: ./engine/main-gummy.exe
- from: ./engine/dist/main-vosk.exe
to: ./engine/main-vosk.exe
# For macOS and Linux
# - from: ./caption-engine/dist/main-gummy
# to: ./caption-engine/main-gummy
# - from: ./caption-engine/dist/main-vosk
# to: ./caption-engine/main-vosk
# - from: ./engine/dist/main-gummy
# to: ./engine/main-gummy
# - from: ./engine/dist/main-vosk
# to: ./engine/main-vosk
win:
executableName: auto-caption
icon: build/icon.png

View File

@@ -1,221 +0,0 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from dashscope.audio.asr import * # type: ignore\n",
"import pyaudiowpatch as pyaudio\n",
"import numpy as np\n",
"\n",
"\n",
"def getDefaultSpeakers(mic: pyaudio.PyAudio, info = True):\n",
" \"\"\"\n",
" 获取默认的系统音频输出的回环设备\n",
" Args:\n",
" mic (pyaudio.PyAudio): pyaudio对象\n",
" info (bool, optional): 是否打印设备信息. Defaults to True.\n",
"\n",
" Returns:\n",
" dict: 统音频输出的回环设备\n",
" \"\"\"\n",
" try:\n",
" WASAPI_info = mic.get_host_api_info_by_type(pyaudio.paWASAPI)\n",
" except OSError:\n",
" print(\"Looks like WASAPI is not available on the system. Exiting...\")\n",
" exit()\n",
"\n",
" default_speaker = mic.get_device_info_by_index(WASAPI_info[\"defaultOutputDevice\"])\n",
" if(info): print(\"wasapi_info:\\n\", WASAPI_info, \"\\n\")\n",
" if(info): print(\"default_speaker:\\n\", default_speaker, \"\\n\")\n",
"\n",
" if not default_speaker[\"isLoopbackDevice\"]:\n",
" for loopback in mic.get_loopback_device_info_generator():\n",
" if default_speaker[\"name\"] in loopback[\"name\"]:\n",
" default_speaker = loopback\n",
" if(info): print(\"Using loopback device:\\n\", default_speaker, \"\\n\")\n",
" break\n",
" else:\n",
" print(\"Default loopback output device not found.\")\n",
" print(\"Run `python -m pyaudiowpatch` to check available devices.\")\n",
" print(\"Exiting...\")\n",
" exit()\n",
" \n",
" if(info): print(f\"Recording Device: #{default_speaker['index']} {default_speaker['name']}\")\n",
" return default_speaker\n",
"\n",
"\n",
"class Callback(TranslationRecognizerCallback):\n",
" \"\"\"\n",
" 语音大模型流式传输回调对象\n",
" \"\"\"\n",
" def __init__(self):\n",
" super().__init__()\n",
" self.usage = 0\n",
" self.sentences = []\n",
" self.translations = []\n",
" \n",
" def on_open(self) -> None:\n",
" print(\"\\n流式翻译开始...\\n\")\n",
"\n",
" def on_close(self) -> None:\n",
" print(f\"\\nTokens消耗{self.usage}\")\n",
" print(f\"流式翻译结束...\\n\")\n",
" for i in range(len(self.sentences)):\n",
" print(f\"\\n{self.sentences[i]}\\n{self.translations[i]}\\n\")\n",
"\n",
" def on_event(\n",
" self,\n",
" request_id,\n",
" transcription_result: TranscriptionResult,\n",
" translation_result: TranslationResult,\n",
" usage\n",
" ) -> None:\n",
" if transcription_result is not None:\n",
" id = transcription_result.sentence_id\n",
" text = transcription_result.text\n",
" if transcription_result.stash is not None:\n",
" stash = transcription_result.stash.text\n",
" else:\n",
" stash = \"\"\n",
" print(f\"#{id}: {text}{stash}\")\n",
" if usage: self.sentences.append(text)\n",
" \n",
" if translation_result is not None:\n",
" lang = translation_result.get_language_list()[0]\n",
" text = translation_result.get_translation(lang).text\n",
" if translation_result.get_translation(lang).stash is not None:\n",
" stash = translation_result.get_translation(lang).stash.text\n",
" else:\n",
" stash = \"\"\n",
" print(f\"#{lang}: {text}{stash}\")\n",
" if usage: self.translations.append(text)\n",
" \n",
" if usage: self.usage += usage['duration']"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"采样输入设备:\n",
" - 序号26\n",
" - 名称:耳机 (HUAWEI FreeLace 活力版) [Loopback]\n",
" - 最大输入通道数2\n",
" - 默认低输入延迟0.003s\n",
" - 默认高输入延迟0.01s\n",
" - 默认采样率48000.0Hz\n",
" - 是否回环设备True\n",
"\n",
"音频样本块大小4800\n",
"样本位宽2\n",
"音频数据格式8\n",
"音频通道数2\n",
"音频采样率48000\n",
"\n"
]
}
],
"source": [
"mic = pyaudio.PyAudio()\n",
"default_speaker = getDefaultSpeakers(mic, False)\n",
"\n",
"SAMP_WIDTH = pyaudio.get_sample_size(pyaudio.paInt16)\n",
"FORMAT = pyaudio.paInt16\n",
"CHANNELS = default_speaker[\"maxInputChannels\"]\n",
"RATE = int(default_speaker[\"defaultSampleRate\"])\n",
"CHUNK = RATE // 10\n",
"INDEX = default_speaker[\"index\"]\n",
"\n",
"dev_info = f\"\"\"\n",
"采样输入设备:\n",
" - 序号:{default_speaker['index']}\n",
" - 名称:{default_speaker['name']}\n",
" - 最大输入通道数:{default_speaker['maxInputChannels']}\n",
" - 默认低输入延迟:{default_speaker['defaultLowInputLatency']}s\n",
" - 默认高输入延迟:{default_speaker['defaultHighInputLatency']}s\n",
" - 默认采样率:{default_speaker['defaultSampleRate']}Hz\n",
" - 是否回环设备:{default_speaker['isLoopbackDevice']}\n",
"\n",
"音频样本块大小:{CHUNK}\n",
"样本位宽:{SAMP_WIDTH}\n",
"音频数据格式:{FORMAT}\n",
"音频通道数:{CHANNELS}\n",
"音频采样率:{RATE}\n",
"\"\"\"\n",
"print(dev_info)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"RECORD_SECONDS = 20 # 监听时长(s)\n",
"\n",
"stream = mic.open(\n",
" format = FORMAT,\n",
" channels = CHANNELS,\n",
" rate = RATE,\n",
" input = True,\n",
" input_device_index = INDEX\n",
")\n",
"translator = TranslationRecognizerRealtime(\n",
" model = \"gummy-realtime-v1\",\n",
" format = \"pcm\",\n",
" sample_rate = RATE,\n",
" transcription_enabled = True,\n",
" translation_enabled = True,\n",
" source_language = \"ja\",\n",
" translation_target_languages = [\"zh\"],\n",
" callback = Callback()\n",
")\n",
"translator.start()\n",
"\n",
"for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):\n",
" data = stream.read(CHUNK)\n",
" data_np = np.frombuffer(data, dtype=np.int16)\n",
" data_np_r = data_np.reshape(-1, CHANNELS)\n",
" print(data_np_r.shape)\n",
" mono_data = np.mean(data_np_r.astype(np.float32), axis=1)\n",
" mono_data = mono_data.astype(np.int16)\n",
" mono_data_bytes = mono_data.tobytes()\n",
" translator.send_audio_frame(mono_data_bytes)\n",
"\n",
"translator.stop()\n",
"stream.stop_stream()\n",
"stream.close()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "mystd",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@@ -1,189 +0,0 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 7,
"id": "1e12f3ef",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" 采样输入设备:\n",
" - 设备类型:音频输出\n",
" - 序号0\n",
" - 名称BlackHole 2ch\n",
" - 最大输入通道数2\n",
" - 默认低输入延迟0.01s\n",
" - 默认高输入延迟0.1s\n",
" - 默认采样率48000.0Hz\n",
"\n",
" 音频样本块大小2400\n",
" 样本位宽2\n",
" 采样格式8\n",
" 音频通道数2\n",
" 音频采样率48000\n",
" \n"
]
}
],
"source": [
"import sys\n",
"import os\n",
"import wave\n",
"\n",
"current_dir = os.getcwd() \n",
"sys.path.append(os.path.join(current_dir, '../caption-engine'))\n",
"\n",
"from sysaudio.darwin import AudioStream\n",
"from audioprcs import resampleRawChunk, mergeChunkChannels\n",
"\n",
"stream = AudioStream(0)\n",
"stream.printInfo()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "a72914f4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Recording...\n",
"Done\n"
]
}
],
"source": [
"\"\"\"获取系统音频输出5秒然后保存为wav文件\"\"\"\n",
"\n",
"with wave.open('output.wav', 'wb') as wf:\n",
" wf.setnchannels(stream.CHANNELS)\n",
" wf.setsampwidth(stream.SAMP_WIDTH)\n",
" wf.setframerate(stream.RATE)\n",
" stream.openStream()\n",
"\n",
" print('Recording...')\n",
"\n",
" for _ in range(0, 100):\n",
" chunk = stream.read_chunk()\n",
" if isinstance(chunk, bytes):\n",
" wf.writeframes(chunk)\n",
" else:\n",
" raise Exception('Error: chunk is not bytes')\n",
" \n",
" stream.closeStream() \n",
" print('Done')"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "a6e8a098",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Recording...\n",
"Done\n"
]
}
],
"source": [
"\"\"\"获取系统音频输入转换为单通道音频持续5秒然后保存为wav文件\"\"\"\n",
"\n",
"with wave.open('output.wav', 'wb') as wf:\n",
" wf.setnchannels(1)\n",
" wf.setsampwidth(stream.SAMP_WIDTH)\n",
" wf.setframerate(stream.RATE)\n",
" stream.openStream()\n",
"\n",
" print('Recording...')\n",
"\n",
" for _ in range(0, 100):\n",
" chunk = mergeChunkChannels(\n",
" stream.read_chunk(),\n",
" stream.CHANNELS\n",
" )\n",
" if isinstance(chunk, bytes):\n",
" wf.writeframes(chunk)\n",
" else:\n",
" raise Exception('Error: chunk is not bytes')\n",
" \n",
" stream.closeStream() \n",
" print('Done')"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "aaca1465",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Recording...\n",
"Done\n"
]
}
],
"source": [
"\"\"\"获取系统音频输入转换为单通道音频并重采样到16000Hz持续5秒然后保存为wav文件\"\"\"\n",
"\n",
"with wave.open('output.wav', 'wb') as wf:\n",
" wf.setnchannels(1)\n",
" wf.setsampwidth(stream.SAMP_WIDTH)\n",
" wf.setframerate(16000)\n",
" stream.openStream()\n",
"\n",
" print('Recording...')\n",
"\n",
" for _ in range(0, 100):\n",
" chunk = resampleRawChunk(\n",
" stream.read_chunk(),\n",
" stream.CHANNELS,\n",
" stream.RATE,\n",
" 16000,\n",
" mode=\"sinc_best\"\n",
" )\n",
" if isinstance(chunk, bytes):\n",
" wf.writeframes(chunk)\n",
" else:\n",
" raise Exception('Error: chunk is not bytes')\n",
" \n",
" stream.closeStream() \n",
" print('Done')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -1,183 +0,0 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 8,
"id": "eff7155c",
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"import os\n",
"import numpy as np\n",
"import sounddevice as sd\n",
"from sherpa_onnx import OnlineRecognizer\n",
"\n",
"current_dir = os.getcwd() \n",
"sys.path.append(os.path.join(current_dir, '../caption-engine'))\n",
"\n",
"from sysaudio.win import AudioStream\n",
"from audioprcs import resampleRawChunk, mergeChunkChannels"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "9447e927",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Started! Please speak\n",
"木のデップハートへいてかいものしましたNIさんはまいばんなにちを兄さんは毎朝七時に家を出かけます机の上にねこがいます测试"
]
},
{
"ename": "KeyboardInterrupt",
"evalue": "",
"output_type": "error",
"traceback": [
"\u001b[31m---------------------------------------------------------------------------\u001b[39m",
"\u001b[31mKeyboardInterrupt\u001b[39m Traceback (most recent call last)",
"\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[10]\u001b[39m\u001b[32m, line 35\u001b[39m\n\u001b[32m 33\u001b[39m \u001b[38;5;28;01mwith\u001b[39;00m sd.InputStream(channels=\u001b[32m1\u001b[39m, dtype=\u001b[33m\"\u001b[39m\u001b[33mfloat32\u001b[39m\u001b[33m\"\u001b[39m, samplerate=sample_rate) \u001b[38;5;28;01mas\u001b[39;00m s:\n\u001b[32m 34\u001b[39m \u001b[38;5;28;01mwhile\u001b[39;00m \u001b[38;5;28;01mTrue\u001b[39;00m:\n\u001b[32m---> \u001b[39m\u001b[32m35\u001b[39m samples, _ = \u001b[43ms\u001b[49m\u001b[43m.\u001b[49m\u001b[43mread\u001b[49m\u001b[43m(\u001b[49m\u001b[43msamples_per_read\u001b[49m\u001b[43m)\u001b[49m \u001b[38;5;66;03m# a blocking read\u001b[39;00m\n\u001b[32m 36\u001b[39m samples = samples.reshape(-\u001b[32m1\u001b[39m)\n\u001b[32m 37\u001b[39m stream.accept_waveform(sample_rate, samples)\n",
"\u001b[36mFile \u001b[39m\u001b[32md:\\Projects\\auto-caption\\caption-engine\\subenv\\Lib\\site-packages\\sounddevice.py:1475\u001b[39m, in \u001b[36mInputStream.read\u001b[39m\u001b[34m(self, frames)\u001b[39m\n\u001b[32m 1473\u001b[39m dtype, _ = _split(\u001b[38;5;28mself\u001b[39m._dtype)\n\u001b[32m 1474\u001b[39m channels, _ = _split(\u001b[38;5;28mself\u001b[39m._channels)\n\u001b[32m-> \u001b[39m\u001b[32m1475\u001b[39m data, overflowed = \u001b[43mRawInputStream\u001b[49m\u001b[43m.\u001b[49m\u001b[43mread\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mframes\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1476\u001b[39m data = _array(data, channels, dtype)\n\u001b[32m 1477\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m data, overflowed\n",
"\u001b[36mFile \u001b[39m\u001b[32md:\\Projects\\auto-caption\\caption-engine\\subenv\\Lib\\site-packages\\sounddevice.py:1245\u001b[39m, in \u001b[36mRawInputStream.read\u001b[39m\u001b[34m(self, frames)\u001b[39m\n\u001b[32m 1243\u001b[39m samplesize, _ = _split(\u001b[38;5;28mself\u001b[39m._samplesize)\n\u001b[32m 1244\u001b[39m data = _ffi.new(\u001b[33m'\u001b[39m\u001b[33msigned char[]\u001b[39m\u001b[33m'\u001b[39m, channels * samplesize * frames)\n\u001b[32m-> \u001b[39m\u001b[32m1245\u001b[39m err = \u001b[43m_lib\u001b[49m\u001b[43m.\u001b[49m\u001b[43mPa_ReadStream\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_ptr\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdata\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mframes\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1246\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m err == _lib.paInputOverflowed:\n\u001b[32m 1247\u001b[39m overflowed = \u001b[38;5;28;01mTrue\u001b[39;00m\n",
"\u001b[31mKeyboardInterrupt\u001b[39m: "
]
}
],
"source": [
"devices = sd.query_devices()\n",
"if len(devices) == 0:\n",
" print(\"No microphone devices found\")\n",
" sys.exit(0)\n",
"\n",
"# print(devices)\n",
"default_input_device_idx = sd.default.device[0]\n",
"# print(f'Use default device: {devices[default_input_device_idx][\"name\"]}') # type: ignore\n",
"\n",
"m_path = \"D:/Projects/auto-caption/caption-engine/models/sherpa-onnx-streaming-zipformer-ar_en_id_ja_ru_th_vi_zh-2025-02-10\"\n",
"recognizer = OnlineRecognizer.from_transducer(\n",
" tokens=f\"{m_path}/tokens.txt\",\n",
" encoder=f\"{m_path}/encoder-epoch-75-avg-11-chunk-16-left-128.int8.onnx\",\n",
" decoder=f\"{m_path}/decoder-epoch-75-avg-11-chunk-16-left-128.onnx\",\n",
" joiner=f\"{m_path}/joiner-epoch-75-avg-11-chunk-16-left-128.int8.onnx\",\n",
" num_threads=2,\n",
" sample_rate=16000,\n",
" feature_dim=80,\n",
" enable_endpoint_detection=True,\n",
" rule1_min_trailing_silence=2.4,\n",
" rule2_min_trailing_silence=1.2,\n",
" rule3_min_utterance_length=300, # it essentially disables this rule\n",
")\n",
"\n",
"print(\"Started! Please speak\")\n",
"\n",
"# The model is using 16 kHz, we use 48 kHz here to demonstrate that\n",
"# sherpa-onnx will do resampling inside.\n",
"sample_rate = 48000\n",
"samples_per_read = int(0.1 * sample_rate) # 0.1 second = 100 ms\n",
"last_result = \"\"\n",
"stream = recognizer.create_stream()\n",
"with sd.InputStream(channels=1, dtype=\"float32\", samplerate=sample_rate) as s:\n",
" while True:\n",
" samples, _ = s.read(samples_per_read) # a blocking read\n",
" samples = samples.reshape(-1)\n",
" stream.accept_waveform(sample_rate, samples)\n",
" while recognizer.is_ready(stream):\n",
" recognizer.decode_stream(stream)\n",
" result = recognizer.get_result(stream)\n",
" if last_result != result:\n",
" last_result = result\n",
" print(\"\\r{}\".format(result), end=\"\", flush=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "abb254f8",
"metadata": {},
"outputs": [],
"source": [
"# m_path = \"D:/Projects/auto-caption/caption-engine/models/sherpa-onnx-nemo-streaming-fast-conformer-transducer-en-80ms-int8\"\n",
"# recognizer = OnlineRecognizer.from_transducer(\n",
"# tokens=f\"{m_path}/tokens.txt\",\n",
"# encoder=f\"{m_path}/encoder.int8.onnx\",\n",
"# decoder=f\"{m_path}/decoder.int8.onnx\",\n",
"# joiner=f\"{m_path}/joiner.int8.onnx\",\n",
"# enable_endpoint_detection=True,\n",
"# )\n",
"\n",
"m_path = \"D:/Projects/auto-caption/caption-engine/models/sherpa-onnx-streaming-zipformer-ar_en_id_ja_ru_th_vi_zh-2025-02-10\"\n",
"recognizer = OnlineRecognizer.from_transducer(\n",
" tokens=f\"{m_path}/tokens.txt\",\n",
" encoder=f\"{m_path}/encoder-epoch-75-avg-11-chunk-16-left-128.int8.onnx\",\n",
" decoder=f\"{m_path}/decoder-epoch-75-avg-11-chunk-16-left-128.onnx\",\n",
" joiner=f\"{m_path}/joiner-epoch-75-avg-11-chunk-16-left-128.int8.onnx\",\n",
" num_threads=1,\n",
" sample_rate=16000,\n",
" feature_dim=80,\n",
" enable_endpoint_detection=True,\n",
" rule1_min_trailing_silence=2.4,\n",
" rule2_min_trailing_silence=1.2,\n",
" rule3_min_utterance_length=300, # it essentially disables this rule\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6f10fc82",
"metadata": {},
"outputs": [],
"source": [
"rec_stream = recognizer.create_stream()\n",
"\n",
"stream = AudioStream(0, 1)\n",
"stream.printInfo()\n",
"\n",
"stream.openStream()\n",
"\n",
"\n",
"for i in range(300):\n",
" chunk = stream.read_chunk()\n",
" chunk_mono = resampleRawChunk(chunk, stream.CHANNELS, stream.RATE, 16000)\n",
" chunk_mono = np.frombuffer(chunk_mono, dtype=np.int16)\n",
" chunk_mono = chunk_mono.astype(np.float32)\n",
" print(i, chunk_mono.shape)\n",
" # print(type(chunk_mono), chunk_mono.shape)\n",
" rec_stream.accept_waveform(16000, chunk_mono)\n",
" while recognizer.is_ready(rec_stream):\n",
" recognizer.decode_stream(rec_stream)\n",
" result = recognizer.get_result(rec_stream)\n",
" if result:\n",
" print(result)\n",
" if recognizer.is_endpoint(rec_stream):\n",
" recognizer.reset(rec_stream)\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "subenv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -1,124 +0,0 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "6fb12704",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"d:\\Projects\\auto-caption\\caption-engine\\subenv\\Lib\\site-packages\\vosk\\__init__.py\n"
]
}
],
"source": [
"import vosk\n",
"print(vosk.__file__)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "63a06f5c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" 采样设备:\n",
" - 设备类型:音频输入\n",
" - 序号1\n",
" - 名称:麦克风阵列 (Realtek(R) Audio)\n",
" - 最大输入通道数2\n",
" - 默认低输入延迟0.09s\n",
" - 默认高输入延迟0.18s\n",
" - 默认采样率44100.0Hz\n",
" - 是否回环设备False\n",
"\n",
" 音频样本块大小2205\n",
" 样本位宽2\n",
" 采样格式8\n",
" 音频通道数2\n",
" 音频采样率44100\n",
" \n"
]
}
],
"source": [
"import sys\n",
"import os\n",
"import json\n",
"from vosk import Model, KaldiRecognizer\n",
"\n",
"current_dir = os.getcwd() \n",
"sys.path.append(os.path.join(current_dir, '../caption-engine'))\n",
"\n",
"from sysaudio.win import AudioStream\n",
"from audioprcs import resampleRawChunk, mergeChunkChannels\n",
"\n",
"stream = AudioStream(1)\n",
"stream.printInfo()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "5d5a0afa",
"metadata": {},
"outputs": [],
"source": [
"model = Model(os.path.join(\n",
" current_dir,\n",
" '../caption-engine/models/vosk-model-small-cn-0.22'\n",
"))\n",
"recognizer = KaldiRecognizer(model, 16000)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7e9d1530",
"metadata": {},
"outputs": [],
"source": [
"stream.openStream()\n",
"\n",
"for i in range(200):\n",
" chunk = stream.read_chunk()\n",
" chunk_mono = resampleRawChunk(chunk, stream.CHANNELS, stream.RATE, 16000)\n",
" if recognizer.AcceptWaveform(chunk_mono):\n",
" result = json.loads(recognizer.Result())\n",
" print(\"acc:\", result.get(\"text\", \"\"))\n",
" else:\n",
" partial = json.loads(recognizer.PartialResult())\n",
" print(\"else:\", partial.get(\"partial\", \"\"))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "subenv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -1,7 +1,6 @@
dashscope
numpy
samplerate
PyAudio
PyAudioWPatch
vosk
pyinstaller

View File

@@ -28,18 +28,21 @@ export class CaptionEngine {
if (process.platform === 'win32') {
gummyName += '.exe'
}
this.command = []
if (is.dev) {
this.appPath = path.join(
app.getAppPath(),
'caption-engine', 'dist', gummyName
app.getAppPath(), 'engine',
'subenv', 'Scripts', 'python.exe'
)
this.command.push(path.join(
app.getAppPath(), 'engine', 'main-gummy.py'
))
}
else {
this.appPath = path.join(
process.resourcesPath, 'caption-engine', gummyName
process.resourcesPath, 'engine', gummyName
)
}
this.command = []
this.command.push('-s', allConfig.controls.sourceLang)
this.command.push(
'-t', allConfig.controls.translation ?
@@ -59,12 +62,12 @@ export class CaptionEngine {
if (is.dev) {
this.appPath = path.join(
app.getAppPath(),
'caption-engine', 'dist', voskName
'engine', 'dist', voskName
)
}
else {
this.appPath = path.join(
process.resourcesPath, 'caption-engine', voskName
process.resourcesPath, 'engine', voskName
)
}
this.command = []
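
In development mode the Gummy engine is now launched through the project's virtual-environment Python interpreter instead of a prebuilt executable. A minimal sketch of how that launch might be wired up, assuming Node's `child_process.spawn` and line-buffered output on stdout; the event handling here is an illustration, not code from this commit.

```ts
import { spawn } from 'child_process'
import path from 'path'
import { app } from 'electron'

// Dev-mode paths as shown in the diff above: the venv interpreter plus the
// engine entry script, followed by the engine's CLI arguments.
const appPath = path.join(app.getAppPath(), 'engine', 'subenv', 'Scripts', 'python.exe')
const command = [
  path.join(app.getAppPath(), 'engine', 'main-gummy.py'),
  '-s', 'en', // source language (example value)
  '-t', 'zh'  // translation target (example value)
]

// Assumed wiring: read caption items from the engine's stdout line by line.
const engineProcess = spawn(appPath, command)
engineProcess.stdout.on('data', (chunk: Buffer) => {
  for (const line of chunk.toString().split('\n')) {
    if (line.trim()) console.log('caption item:', line)
  }
})
```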

View File

@@ -136,6 +136,7 @@ import { useCaptionLogStore } from '@renderer/stores/captionLog'
import { message } from 'ant-design-vue'
import { useI18n } from 'vue-i18n'
import * as tc from '../utils/timeCalc'
import { CaptionItem } from '../types'
const { t } = useI18n()
@@ -154,10 +155,9 @@ const baseMS = ref<number>(0)
const pagination = ref({
current: 1,
pageSize: 10,
pageSize: 20,
showSizeChanger: true,
pageSizeOptions: ['10', '20', '50'],
showTotal: (total: number) => `Total: ${total}`,
pageSizeOptions: ['10', '20', '50', '100'],
onChange: (page: number, pageSize: number) => {
pagination.value.current = page
pagination.value.pageSize = pageSize
@@ -180,6 +180,12 @@ const columns = [
dataIndex: 'time',
key: 'time',
width: 160,
sorter: (a: CaptionItem, b: CaptionItem) => {
if(a.time_s <= b.time_s) return -1
return 1
},
sortDirections: ['descend'],
defaultSortOrder: 'descend',
},
{
title: 'content',

View File

@@ -37,7 +37,7 @@
<a-input
class="input-area"
type="range"
min="0" max="64"
min="0" max="72"
v-model:value="currentFontSize"
/>
<div class="input-item-value">{{ currentFontSize }}px</div>
@@ -114,7 +114,7 @@
<a-input
class="input-area"
type="range"
min="0" max="64"
min="0" max="72"
v-model:value="currentTransFontSize"
/>
<div class="input-item-value">{{ currentTransFontSize }}px</div>
@@ -159,7 +159,7 @@
<a-input
class="input-area"
type="range"
min="0" max="10"
min="0" max="12"
v-model:value="currentBlur"
/>
<div class="input-item-value">{{ currentBlur }}px</div>