222 lines
5.1 KiB
Markdown
222 lines
5.1 KiB
Markdown
|
|
# 语音通话技术栈说明
|
|||
|
|
|
|||
|
|
## 🤖 使用的大模型和服务
|
|||
|
|
|
|||
|
|
### 1. 语音识别(ASR)
|
|||
|
|
|
|||
|
|
**服务商**:阿里云 DashScope
|
|||
|
|
**模型**:`paraformer-realtime-v2`
|
|||
|
|
**配置**:
|
|||
|
|
```python
|
|||
|
|
VOICE_CALL_ASR_MODEL = "paraformer-realtime-v2"
|
|||
|
|
VOICE_CALL_ASR_SAMPLE_RATE = 16000 # 16kHz 采样率
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**特点**:
|
|||
|
|
- 实时语音识别
|
|||
|
|
- 支持流式输入
|
|||
|
|
- 中文识别准确率高
|
|||
|
|
- 低延迟
|
|||
|
|
|
|||
|
|
### 2. 大语言模型(LLM)
|
|||
|
|
|
|||
|
|
**服务商**:阿里云 DashScope(通义千问)
|
|||
|
|
**默认模型**:`qwen-flash`
|
|||
|
|
**配置**:
|
|||
|
|
```python
|
|||
|
|
LLM_MODEL = "gpt-3.5-turbo" # 默认配置
|
|||
|
|
# 实际使用:qwen-flash(通义千问快速版)
|
|||
|
|
LLM_TEMPERATURE = 0.8
|
|||
|
|
LLM_MAX_TOKENS = 2000
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**可选模型**:
|
|||
|
|
- `qwen-flash` - 快速版,低延迟(推荐用于语音通话)
|
|||
|
|
- `qwen-turbo` - 标准版
|
|||
|
|
- `qwen-plus` - 增强版
|
|||
|
|
- `qwen-max` - 旗舰版
|
|||
|
|
|
|||
|
|
**特点**:
|
|||
|
|
- 支持流式输出
|
|||
|
|
- 中文理解能力强
|
|||
|
|
- 响应速度快
|
|||
|
|
- 支持多轮对话
|
|||
|
|
|
|||
|
|
### 3. 语音合成(TTS)
|
|||
|
|
|
|||
|
|
**服务商**:阿里云 DashScope
|
|||
|
|
**模型**:`cosyvoice-v2`
|
|||
|
|
**默认音色**:`longxiaochun_v2`
|
|||
|
|
**配置**:
|
|||
|
|
```python
|
|||
|
|
VOICE_CALL_TTS_MODEL = "cosyvoice-v2"
|
|||
|
|
VOICE_CALL_TTS_VOICE = "longxiaochun_v2"
|
|||
|
|
VOICE_CALL_TTS_FORMAT = "mp3" # 或 pcm
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**支持的音色**:
|
|||
|
|
- 可以在数据库 `voice_library` 表中配置
|
|||
|
|
- 支持自定义音色克隆
|
|||
|
|
|
|||
|
|
**特点**:
|
|||
|
|
- 高质量语音合成
|
|||
|
|
- 支持多种音色
|
|||
|
|
- 支持情感控制
|
|||
|
|
- 低延迟
|
|||
|
|
|
|||
|
|
## 📊 完整的技术栈
|
|||
|
|
|
|||
|
|
### 后端框架
|
|||
|
|
- **FastAPI** - Python 异步 Web 框架
|
|||
|
|
- **SQLAlchemy** - ORM 数据库操作
|
|||
|
|
- **MySQL** - 数据库
|
|||
|
|
|
|||
|
|
### AI 服务
|
|||
|
|
- **阿里云 DashScope** - 统一的 AI 服务平台
|
|||
|
|
- ASR:Paraformer 实时语音识别
|
|||
|
|
- LLM:通义千问系列模型
|
|||
|
|
- TTS:CosyVoice 语音合成
|
|||
|
|
|
|||
|
|
### 前端
|
|||
|
|
- **uni-app** - 跨平台开发框架
|
|||
|
|
- **Vue.js** - 前端框架
|
|||
|
|
- **WebSocket** - 实时通信
|
|||
|
|
|
|||
|
|
## 🔄 语音通话流程
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
用户说话
|
|||
|
|
↓
|
|||
|
|
[客户端] 录音(PCM 16kHz)
|
|||
|
|
↓
|
|||
|
|
[WebSocket] 发送音频数据
|
|||
|
|
↓
|
|||
|
|
[服务器] ASR 识别(Paraformer)
|
|||
|
|
↓
|
|||
|
|
[服务器] LLM 生成回复(通义千问)
|
|||
|
|
↓
|
|||
|
|
[服务器] TTS 合成语音(CosyVoice)
|
|||
|
|
↓
|
|||
|
|
[WebSocket] 返回音频数据
|
|||
|
|
↓
|
|||
|
|
[客户端] 播放语音
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
## ⚙️ 配置说明
|
|||
|
|
|
|||
|
|
### 必需的环境变量
|
|||
|
|
|
|||
|
|
在 `lover/.env` 文件中配置:
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
# 阿里云 DashScope API Key(必需)
|
|||
|
|
DASHSCOPE_API_KEY=sk-xxxxxxxxxxxxx
|
|||
|
|
|
|||
|
|
# LLM 模型配置
|
|||
|
|
LLM_MODEL=qwen-flash
|
|||
|
|
LLM_TEMPERATURE=0.8
|
|||
|
|
LLM_MAX_TOKENS=2000
|
|||
|
|
|
|||
|
|
# 语音通话配置
|
|||
|
|
VOICE_CALL_ASR_MODEL=paraformer-realtime-v2
|
|||
|
|
VOICE_CALL_ASR_SAMPLE_RATE=16000
|
|||
|
|
VOICE_CALL_TTS_MODEL=cosyvoice-v2
|
|||
|
|
VOICE_CALL_TTS_VOICE=longxiaochun_v2
|
|||
|
|
VOICE_CALL_TTS_FORMAT=mp3
|
|||
|
|
VOICE_CALL_IDLE_TIMEOUT=60
|
|||
|
|
VOICE_CALL_MAX_HISTORY=20
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### 获取 API Key
|
|||
|
|
|
|||
|
|
1. 访问 [阿里云 DashScope 控制台](https://dashscope.console.aliyun.com/)
|
|||
|
|
2. 注册/登录账号
|
|||
|
|
3. 创建 API Key
|
|||
|
|
4. 配置到 `.env` 文件
|
|||
|
|
|
|||
|
|
## 💰 成本估算
|
|||
|
|
|
|||
|
|
### 阿里云 DashScope 定价(参考)
|
|||
|
|
|
|||
|
|
1. **ASR(语音识别)**
|
|||
|
|
- 约 ¥0.0004/秒
|
|||
|
|
- 5 秒语音 ≈ ¥0.002
|
|||
|
|
|
|||
|
|
2. **LLM(通义千问 qwen-flash)**
|
|||
|
|
- 约 ¥0.0004/1000 tokens
|
|||
|
|
- 一次对话(200 tokens)≈ ¥0.00008
|
|||
|
|
|
|||
|
|
3. **TTS(语音合成)**
|
|||
|
|
- 约 ¥0.002/100 字符
|
|||
|
|
- 50 字回复 ≈ ¥0.001
|
|||
|
|
|
|||
|
|
**单次对话成本**:约 ¥0.003-0.005(不到 1 分钱)
|
|||
|
|
|
|||
|
|
## 🔧 性能优化建议
|
|||
|
|
|
|||
|
|
### 1. 使用更快的模型
|
|||
|
|
```python
|
|||
|
|
LLM_MODEL = "qwen-flash" # 最快
|
|||
|
|
# 而不是 qwen-max(最慢但最准确)
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### 2. 减少历史消息数量
|
|||
|
|
```python
|
|||
|
|
VOICE_CALL_MAX_HISTORY = 10 # 从 20 降到 10
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### 3. 降低 LLM 输出长度
|
|||
|
|
```python
|
|||
|
|
LLM_MAX_TOKENS = 1000 # 从 2000 降到 1000
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### 4. 使用流式输出
|
|||
|
|
```python
|
|||
|
|
# 已实现,无需修改
|
|||
|
|
stream = chat_completion_stream(messages)
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### 5. 优化 TTS 分段
|
|||
|
|
```python
|
|||
|
|
# 在 voice_call.py 中已优化
|
|||
|
|
threshold = 8 if self.tts_first_chunk else 18
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
## 🆚 模型对比
|
|||
|
|
|
|||
|
|
| 模型 | 速度 | 质量 | 成本 | 推荐场景 |
|
|||
|
|
|------|------|------|------|----------|
|
|||
|
|
| qwen-flash | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 语音通话(推荐) |
|
|||
|
|
| qwen-turbo | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 文字聊天 |
|
|||
|
|
| qwen-plus | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | 复杂任务 |
|
|||
|
|
| qwen-max | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐ | 专业场景 |
|
|||
|
|
|
|||
|
|
## 📝 代码位置
|
|||
|
|
|
|||
|
|
- **LLM 封装**:`lover/llm.py`
|
|||
|
|
- **TTS 封装**:`lover/tts.py`
|
|||
|
|
- **语音通话路由**:`lover/routers/voice_call.py`
|
|||
|
|
- **配置文件**:`lover/config.py`
|
|||
|
|
- **环境变量**:`lover/.env`
|
|||
|
|
|
|||
|
|
## 🔗 相关文档
|
|||
|
|
|
|||
|
|
- [阿里云 DashScope 文档](https://help.aliyun.com/zh/dashscope/)
|
|||
|
|
- [通义千问 API 文档](https://help.aliyun.com/zh/dashscope/developer-reference/api-details)
|
|||
|
|
- [Paraformer ASR 文档](https://help.aliyun.com/zh/dashscope/developer-reference/paraformer-realtime-v2)
|
|||
|
|
- [CosyVoice TTS 文档](https://help.aliyun.com/zh/dashscope/developer-reference/cosyvoice-v2)
|
|||
|
|
|
|||
|
|
## 🎯 总结
|
|||
|
|
|
|||
|
|
语音通话使用的是**阿里云 DashScope 全家桶**:
|
|||
|
|
- ASR:Paraformer 实时语音识别
|
|||
|
|
- LLM:通义千问 qwen-flash
|
|||
|
|
- TTS:CosyVoice v2
|
|||
|
|
|
|||
|
|
这套方案的优势:
|
|||
|
|
- ✅ 全中文支持
|
|||
|
|
- ✅ 低延迟
|
|||
|
|
- ✅ 高质量
|
|||
|
|
- ✅ 成本低
|
|||
|
|
- ✅ 易于集成
|