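# Official Documentation Analysis and Correct Implementation

## 📚 Key Points from the Official Documentation

### 1. Correct usage of send_audio_frame

According to the official documentation:

> **Each pushed audio packet should be neither too large nor too small. A duration of about 100 ms per packet and a size between 1 KB and 16 KB are recommended.**
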
### 2. Official sample code

#### Recognizing a local file

```python
# Excerpt: assumes `import time` and an already-created `recognition` object
recognition.start()

try:
    f = open("asr_example.wav", 'rb')
    while True:
        audio_data = f.read(3200)  # read 3200 bytes (about 3 KB) at a time
        if not audio_data:
            break
        else:
            recognition.send_audio_frame(audio_data)  # send a small chunk
        time.sleep(0.1)  # wait 100 ms
    f.close()
except Exception as e:
    raise e

recognition.stop()
```

**Key points**:
- ✅ Reads 3200 bytes (about 3 KB) at a time
- ✅ Waits 100 ms (0.1 s) between packets
- ✅ Sends in a loop to simulate a real-time stream

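The pacing loop above can also be wrapped in a reusable helper. The following is a minimal sketch rather than part of the official SDK: `stream_file`, `chunk_bytes`, and `interval_s` are hypothetical names, and `recognition` is assumed to be any object exposing the `start()`, `send_audio_frame()`, and `stop()` methods used in the sample.

```python
import time


def stream_file(recognition, path, chunk_bytes=3200, interval_s=0.1):
    """Feed a local audio file to `recognition` in small, paced chunks.

    Hypothetical helper: `recognition` only needs start(), send_audio_frame()
    and stop(), as in the sample above. Returns the number of chunks sent.
    """
    sent = 0
    recognition.start()
    try:
        # The context manager closes the file even if sending raises.
        with open(path, "rb") as f:
            while True:
                audio_data = f.read(chunk_bytes)
                if not audio_data:
                    break
                recognition.send_audio_frame(audio_data)
                sent += 1
                time.sleep(interval_s)  # pace the upload like a live stream
    finally:
        # Always stop the recognizer so the session is not left open.
        recognition.stop()
    return sent
```

Compared with the sample, the `with` block and the `finally` clause ensure that the file is closed and the recognizer stopped even when a send fails partway through.
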
#### Recognizing microphone input

```python
recognition.start()

while True:
    if stream:
        # PyAudio's read() takes a frame count: 3200 frames = 6400 bytes of 16-bit mono
        data = stream.read(3200, exception_on_overflow=False)
        recognition.send_audio_frame(data)  # send immediately
    else:
        break

recognition.stop()
```

**Key points**:
- ✅ Reads 3200 frames per call (6400 bytes of 16-bit mono audio if `stream` is a PyAudio stream, as the `exception_on_overflow` argument suggests)
- ✅ Sends immediately with no added delay, because the microphone already delivers data in real time

## 🔍 Our Problem

### Current implementation (incorrect)

```python
# Server side: lover/routers/voice_call.py
async def feed_audio(self, data: bytes):
    if self.recognition:
        self.recognition.send_audio_frame(data)  # forwards the whole file in one call
```

```javascript
// Client side
fs.readFile({
  filePath: res.tempFilePath,
  success: (fileRes) => {
    // Sends all 260 KB at once ❌
    socketTask.send({ data: fileRes.data })
  }
})
```

**Problems**:
- ❌ The client sends 260 KB in a single message
- ❌ The server feeds it straight to the ASR
- ❌ This violates the official recommendation (1 KB to 16 KB per packet)

### Correct implementation (fixed)

```javascript
// Client side: send in chunks
async sendAudioInChunks(audioData) {
  const chunkSize = 8192 // 8 KB (within the official range)
  const totalSize = audioData.byteLength // assumes audioData is an ArrayBuffer
  const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

  for (let offset = 0; offset < totalSize; offset += chunkSize) {
    const chunk = audioData.slice(offset, offset + chunkSize)
    socketTask.send({ data: chunk })
    await sleep(50) // wait 50 ms (20 chunks per second)
  }

  socketTask.send({ data: 'end' }) // send the end marker
}
```

**Improvements**:
- ✅ Sends 8 KB per message (within the 1 KB to 16 KB range)
- ✅ Waits 50 ms between chunks (shorter than the officially suggested 100 ms)
- ✅ Sends an end marker

## 📊 Data Size Calculation

### PCM audio data size

```
Sample rate: 16000 Hz
Bit depth:   16 bit = 2 bytes
Channels:    1 (mono)

Data per second = 16000 × 2 × 1 = 32000 bytes = 31.25 KB/s
```

### Official recommendation

```
Packet duration: 100 ms
Packet size:     31.25 KB/s × 0.1 s = 3.125 KB = 3200 bytes
```

**That is why the official sample uses 3200 bytes!**

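The arithmetic above generalizes to other formats. As a small sketch (the function names are illustrative, not from any SDK), the packet size for a given duration and the duration of a given packet can be computed for uncompressed PCM like this:

```python
def pcm_bytes_per_second(sample_rate_hz: int = 16000,
                         bit_depth: int = 16,
                         channels: int = 1) -> int:
    """Raw PCM data rate in bytes per second."""
    return sample_rate_hz * (bit_depth // 8) * channels


def packet_bytes(duration_ms: int, **fmt) -> int:
    """Size in bytes of a packet holding `duration_ms` of audio."""
    return pcm_bytes_per_second(**fmt) * duration_ms // 1000


def packet_duration_ms(size_bytes: int, **fmt) -> float:
    """Duration in milliseconds of the audio held in `size_bytes`."""
    return size_bytes * 1000 / pcm_bytes_per_second(**fmt)


print(packet_bytes(100))         # 3200
print(packet_duration_ms(8192))  # 256.0
```

With the defaults (16 kHz, 16-bit, mono), 100 ms of audio is 3200 bytes, and an 8192-byte packet holds 256 ms of audio.
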
### Our implementation

```
Packet size:     8192 bytes = 8 KB
Packet duration: 8192 / 32000 = 0.256 s = 256 ms
Send interval:   50 ms

Actual transfer rate: 8192 / 0.05 = 163840 bytes/s = 160 KB/s
Actual audio rate:    32000 bytes/s = 31.25 KB/s

Rate ratio: 160 / 31.25 = 5.12×
```

**Conclusion**: we send about 5 times faster than the audio plays in real time, which is more than enough.

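The same ratio can be checked for any chunk size and interval. This is a rough sketch with a hypothetical `realtime_ratio` helper; it assumes the interval is the only delay between sends and ignores the time the network call itself takes:

```python
def realtime_ratio(chunk_bytes: int, interval_ms: int,
                   audio_bytes_per_second: int = 32000) -> float:
    """How many times faster than real time the audio is being sent.

    Assumes one chunk is sent every `interval_ms` and that the send
    itself takes negligible time.
    """
    send_rate = chunk_bytes * 1000 / interval_ms  # bytes per second
    return send_rate / audio_bytes_per_second


print(realtime_ratio(8192, 50))   # 5.12
print(realtime_ratio(3200, 100))  # 1.0
```

A ratio of 1.0, as with the officially recommended 3200 bytes every 100 ms, means the upload takes as long as the recording itself. A ratio below 1.0 would fall behind a live stream.
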
## 🔧 Optimization Suggestions

### Option 1: use the officially recommended parameters (recommended)

```javascript
async sendAudioInChunks(audioData) {
  const chunkSize = 3200 // 3200 bytes (official recommendation)
  const delay = 100 // 100 ms (official recommendation)
  const totalSize = audioData.byteLength // assumes audioData is an ArrayBuffer
  const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

  for (let offset = 0; offset < totalSize; offset += chunkSize) {
    const chunk = audioData.slice(offset, offset + chunkSize)
    socketTask.send({ data: chunk })
    await sleep(delay)
  }

  socketTask.send({ data: 'end' })
}
```

**Advantages**:
- Fully matches the official recommendation
- Closer to a real-time audio stream
- Lower per-packet latency (each packet carries 100 ms of audio instead of 256 ms)

### Option 2: keep the current implementation

```javascript
const chunkSize = 8192 // 8 KB
const delay = 50 // 50 ms
```

**Advantages**:
- Faster sending
- Fewer network messages
- Still within the official range (1 KB to 16 KB)

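To compare the two options on a concrete recording, the chunk count, final chunk size, and total time spent waiting can be estimated up front. The sketch below uses hypothetical names (`ChunkPlan`, `plan_chunks`) and assumes, as in the client code, that the sender waits once after every chunk, including the last one:

```python
from typing import NamedTuple


class ChunkPlan(NamedTuple):
    chunk_count: int
    last_chunk_bytes: int
    total_delay_ms: int


def plan_chunks(total_bytes: int, chunk_bytes: int, interval_ms: int) -> ChunkPlan:
    """Estimate how a buffer of `total_bytes` is split and paced."""
    if total_bytes <= 0:
        return ChunkPlan(0, 0, 0)
    count = (total_bytes + chunk_bytes - 1) // chunk_bytes  # ceiling division
    last = total_bytes - (count - 1) * chunk_bytes
    return ChunkPlan(count, last, count * interval_ms)


# 260000 bytes is the recording size used elsewhere in this document.
print(plan_chunks(260000, 3200, 100))
# ChunkPlan(chunk_count=82, last_chunk_bytes=800, total_delay_ms=8200)
print(plan_chunks(260000, 8192, 50))
# ChunkPlan(chunk_count=32, last_chunk_bytes=6048, total_delay_ms=1600)
```

Under these assumptions, Option 1 spends about 8.2 s waiting for a 260000-byte recording and Option 2 about 1.6 s. Option 1's final chunk is 800 bytes, which is below the 1 KB lower bound; the quoted documentation does not say whether that matters for the last packet.
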
## 🎯 Required Server-Side Changes

### Current code

```python
async def feed_audio(self, data: bytes):
    if self.recognition:
        self.recognition.send_audio_frame(data)
```

**Problem**: the "end" marker is not handled.

### Suggested change

```python
from typing import Union


async def feed_audio(self, data: Union[bytes, str]):
    # Check for the end marker
    if isinstance(data, str) and data == 'end':
        # Stop the ASR to trigger final recognition
        self.finalize_asr()
        return

    # Regular audio data
    if self.recognition:
        self.recognition.send_audio_frame(data)
```

Alternatively, handle it in the WebSocket message loop:

```python
async def voice_call(websocket: WebSocket):
    # ...
    while True:
        msg = await websocket.receive()
        if "bytes" in msg and msg["bytes"] is not None:
            await session.feed_audio(msg["bytes"])
        elif "text" in msg and msg["text"]:
            text = msg["text"].strip()
            if text == "end":
                session.finalize_asr()  # trigger final recognition
        # ...
```

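The dispatch logic can be exercised without a WebSocket or the real SDK. The following is a sketch under stated assumptions, not the project's actual code: `VoiceSession`, `handle_message`, `split_frames`, and `MAX_FRAME_BYTES` are hypothetical names, `finalize_asr` is assumed to do nothing more than call `recognition.stop()`, and the re-splitting of oversized payloads is an extra safeguard that the changes above do not include.

```python
from typing import Iterator, Protocol

MAX_FRAME_BYTES = 16 * 1024  # upper bound from the official recommendation


class Recognizer(Protocol):
    """Minimal interface assumed for the ASR object."""

    def send_audio_frame(self, data: bytes) -> None: ...
    def stop(self) -> None: ...


def split_frames(data: bytes, max_bytes: int = MAX_FRAME_BYTES) -> Iterator[bytes]:
    """Yield slices of `data` no larger than `max_bytes`."""
    for offset in range(0, len(data), max_bytes):
        yield data[offset:offset + max_bytes]


class VoiceSession:
    def __init__(self, recognition: Recognizer) -> None:
        self.recognition = recognition
        self.finished = False

    def handle_message(self, msg: dict) -> None:
        """Route one received message: audio bytes or the text end marker."""
        if self.finished:
            return  # ignore anything that arrives after the end marker
        payload = msg.get("bytes")
        text = msg.get("text")
        if payload is not None:
            # Guard against a client that sends an oversized message.
            for frame in split_frames(payload):
                self.recognition.send_audio_frame(frame)
        elif text and text.strip() == "end":
            self.finalize_asr()

    def finalize_asr(self) -> None:
        # Assumption: stopping the recognizer triggers final recognition.
        self.recognition.stop()
        self.finished = True
```

Passing in any object with `send_audio_frame()` and `stop()` makes the handler easy to unit test with a stub recognizer before wiring it into the real message loop.
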
## 📋 Complete Workflow

### Correct flow

```
1. Client finishes recording
   ↓
2. Read the PCM file (260 KB)
   ↓
3. Send in chunks (8 KB each, 50 ms apart)
   ├─ Send chunk 1 (8 KB)
   ├─ Wait 50 ms
   ├─ Send chunk 2 (8 KB)
   ├─ Wait 50 ms
   ├─ ...
   └─ Send chunk 32 (6 KB)
   ↓
4. Send the "end" marker
   ↓
5. Server receives each chunk
   ├─ Chunk 1 → recognition.send_audio_frame()
   ├─ Chunk 2 → recognition.send_audio_frame()
   ├─ ...
   └─ Chunk 32 → recognition.send_audio_frame()
   ↓
6. Server receives the "end" marker
   ↓
7. Call recognition.stop()
   ↓
8. ASR finishes recognition and fires the callback
   ↓
9. LLM generates the reply
   ↓
10. TTS synthesizes speech
   ↓
11. Return the audio to the client
```

## ✅ Verification Checklist

Check the following logs when testing:

### Client logs

```
✅ 📦 Starting chunked send, total size: 260000 bytes, chunk size: 8192 bytes
✅ 📤 Sending chunk 1, range: 0-8192, size: 8192 bytes
✅ ✅ Chunk 1 sent successfully
✅ 📤 Sending chunk 2, range: 8192-16384, size: 8192 bytes
✅ ✅ Chunk 2 sent successfully
...
✅ ✅ All audio chunks sent, 32 in total
✅ ✅ End marker sent
```

### Server logs

```
✅ ASR connection opened
✅ ASR event end=False sentence=...
✅ ASR event end=True sentence=...
✅ ASR complete
✅ LLM generates the reply
✅ TTS synthesizes speech
```

### Client receives the response

```
✅ 📋 Received control message, type: reply_text
✅ 🎵 Received audio data stream
✅ 📋 Received control message, type: reply_end
```

## 🎓 Lessons Learned

### Key lessons

1. **RTFM (Read The F***ing Manual)**
   - The official documentation states the parameter requirements explicitly
   - Read it carefully before implementing

2. **Understand the model's characteristics**
   - Paraformer-realtime-v2 is a real-time streaming model
   - It must be fed data in a streaming fashion

3. **Parameter ranges matter**
   - The 1 KB to 16 KB range is not arbitrary
   - Going outside it can cause recognition to fail

### Best practices

1. **Follow the official recommendations**
   - 3200 bytes per packet (100 ms of audio)
   - 100 ms between packets

2. **Add an end marker**
   - Tells the server that all data has been sent
   - Triggers final processing

3. **Log thoroughly**
   - Record every step
   - Makes troubleshooting easier

## 🔗 References

- [Paraformer real-time speech recognition Python SDK](https://help.aliyun.com/zh/model-studio/paraformer-real-time-speech-recognition-python-sdk)
- [Real-time speech recognition](https://help.aliyun.com/zh/model-studio/real-time-speech-recognition)