# Singing Video Issues: Complete Solution

## Issue Summary

### Issue 1: Task failed with "文件下载失败" (file download failed)
- **Tasks**: 382, 384
- **Cause**: Incorrect OSS configuration; the wrong bucket was used
- **Status**: ✅ Fixed

### Issue 2: Task stuck in `running`
- **Task**: 385
- **Cause**: Processing timed out and the task was never marked as failed
- **Status**: ⚠️ Requires a manual fix

## Complete Fix Procedure

### Step 1: Fix the OSS configuration (done)

The `.env` file has been updated to:
```env
ALIYUN_OSS_BUCKET_NAME=nvlovers
ALIYUN_OSS_ENDPOINT=https://oss-cn-qingdao.aliyuncs.com
ALIYUN_OSS_CDN_DOMAIN=https://nvlovers.oss-cn-qingdao.aliyuncs.com
```

### Step 2: Clean up stuck tasks

**Option A: Use the one-click script (recommended)**
```bash
# Double-click to run: 修复卡住的任务.bat
```

**Option B: Run the SQL manually**
```sql
-- Connect to the database first (run this from a shell, not inside the SQL client):
--   mysql -u root -prootx77 fastadmin

-- Apply the fix: mark tasks that have not been updated for more than 10 minutes as failed.
-- Raise the threshold if tasks can legitimately run longer than that.
UPDATE nf_generation_tasks
SET
  status = 'failed',
  error_msg = 'Task timed out; automatically marked as failed',
  updated_at = NOW()
WHERE status = 'running'
  AND TIMESTAMPDIFF(MINUTE, updated_at, NOW()) > 10;
```

### Step 3: Restart the service

**Option A: Use the restart script**
```bash
# Double-click to run: 重启服务.bat
```

**Option B: Restart manually**
1. Press `Ctrl+C` in the Python terminal to stop the service
2. Or, if the terminal is not available, run `杀死端口30101.bat` to kill the process on port 30101
3. Run `启动项目.bat` again to start the service

### Step 4: Verify the fix

1. **Check task status**
   ```sql
   -- Show the most recent tasks
   SELECT id, status, error_msg, created_at
   FROM nf_generation_tasks
   ORDER BY id DESC
   LIMIT 10;

   -- Confirm that no task has been running for a long time
   SELECT id, status,
          TIMESTAMPDIFF(MINUTE, updated_at, NOW()) AS stuck_minutes
   FROM nf_generation_tasks
   WHERE status = 'running';
   ```

2. **Test a new task**
   - Generate a singing video again in the app
   - Check that the task completes normally
   - Check that the video plays correctly

## Root Cause Analysis

### 1. Inconsistent OSS configuration

**Problem**:
- The configuration file pointed to the `hello12312312` bucket (Hangzhou)
- The song audio is stored in the `nvlovers` bucket (Qingdao)
- Downloads therefore failed

**Impact**:
- Task 382: the audio request returned 404 and the task failed
- Task 384: file download failed

**Fix**:
- Use the `nvlovers` bucket everywhere
- Read all resources from the same bucket

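A mismatch like this can be caught early by comparing the bucket in a resource URL with the configured one. The following is a minimal sketch, assuming resources use virtual-hosted-style OSS URLs of the form `<bucket>.oss-<region>.aliyuncs.com`. The helper names and the example paths are hypothetical and not part of the project.

```python
from typing import Optional
from urllib.parse import urlparse


def bucket_of(url: str) -> Optional[str]:
    """Return the bucket name of a virtual-hosted-style OSS URL, else None."""
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    # Expected shape: <bucket>.oss-<region>.aliyuncs.com
    if len(parts) >= 4 and parts[1].startswith("oss-") and host.endswith(".aliyuncs.com"):
        return parts[0]
    return None


def uses_configured_bucket(url: str, configured_bucket: str) -> bool:
    """True if the URL points at the bucket named in the configuration."""
    return bucket_of(url) == configured_bucket


if __name__ == "__main__":
    # Hypothetical object path; only the host matters for the check.
    audio_url = "https://nvlovers.oss-cn-qingdao.aliyuncs.com/songs/example.mp3"
    print(uses_configured_bucket(audio_url, "nvlovers"))
```

A check of this kind only inspects the URL; it does not confirm that the object actually exists in the bucket.
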
### 2. Task timeouts not handled

**Problem**:
- Task processing took too long (possibly because of slow API calls)
- There was no timeout mechanism
- The task stayed in the `running` state indefinitely

**Impact**:
- Task 385: stuck
- Consumes system resources
- Affects subsequent tasks

**Fix**:
- Manually mark timed-out tasks as failed
- Add timeout monitoring (long term)

## Preventive Measures

### 1. Unified configuration management

**Checklist**:
- [ ] The OSS settings in `.env` are correct
- [ ] `lover/.env` does not override them
- [ ] All environments use the same configuration

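The second checklist item can be automated. The sketch below compares two env files and reports OSS keys whose values conflict. It assumes plain `KEY=VALUE` lines (no quoting or `export` prefixes), and the function names are hypothetical.

```python
from typing import Dict, Tuple


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def find_overrides(base_text: str, override_text: str,
                   prefix: str = "ALIYUN_OSS_") -> Dict[str, Tuple[str, str]]:
    """Return {key: (base_value, override_value)} for keys that conflict."""
    base, override = parse_env(base_text), parse_env(override_text)
    return {
        key: (base[key], value)
        for key, value in override.items()
        if key.startswith(prefix) and key in base and base[key] != value
    }
```

In practice the two arguments would be the contents of `.env` and `lover/.env`; an empty result means the second file does not override any OSS setting.
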
### 2. Add resource checks

Validate the resources before a task starts:
```python
import requests
# Assumption: the service uses FastAPI; adjust the import if HTTPException comes from elsewhere.
from fastapi import HTTPException


def validate_resources(image_url, audio_url):
    """Verify that each resource URL is reachable."""
    for url in [image_url, audio_url]:
        try:
            response = requests.head(url, timeout=5)
        except requests.RequestException as e:
            # Network-level failure (DNS, connection, timeout)
            raise HTTPException(
                status_code=400,
                detail=f"Resource validation failed: {e}"
            ) from e
        # Checked outside the try block so this error is not caught and re-wrapped
        if response.status_code != 200:
            raise HTTPException(
                status_code=400,
                detail=f"Resource not accessible: {url}"
            )
```

### 3. Add timeout handling

Set reasonable timeouts:
```python
# config.py
EMO_TASK_TIMEOUT_SECONDS = 600    # 10 minutes
SING_TASK_TIMEOUT_SECONDS = 1800  # 30 minutes
TASK_MAX_PROCESSING_TIME = 3600   # 1 hour
```

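These constants only help if the task runner enforces them. One possible way, sketched here under the assumption that tasks run as asyncio coroutines (the source does not say how tasks are executed), is to wrap each task in `asyncio.wait_for`. `run_with_timeout` and the `TIMEOUTS` mapping are hypothetical names.

```python
import asyncio

EMO_TASK_TIMEOUT_SECONDS = 600    # 10 minutes
SING_TASK_TIMEOUT_SECONDS = 1800  # 30 minutes

# Hypothetical mapping from task type to its timeout
TIMEOUTS = {"emo": EMO_TASK_TIMEOUT_SECONDS, "sing": SING_TASK_TIMEOUT_SECONDS}


async def run_with_timeout(coro, timeout_seconds):
    """Await coro and return (status, payload).

    On success: ("succeeded", result).
    On timeout: ("failed", reason); wait_for cancels the coroutine.
    """
    try:
        result = await asyncio.wait_for(coro, timeout=timeout_seconds)
        return "succeeded", result
    except asyncio.TimeoutError:
        return "failed", "Task processing timed out"
```

The caller would then write the returned status and message to the task row, so a slow API call ends in `failed` instead of leaving the task in `running`.
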
### 4. Clean up timed-out tasks periodically

Create a scheduled job:
```python
from datetime import datetime, timedelta

# `scheduler`, `SessionLocal`, and `GenerationTask` are defined elsewhere in the project.

# Check every 5 minutes
@scheduler.scheduled_job('interval', minutes=5)
def cleanup_stuck_tasks():
    db = SessionLocal()
    try:
        # Tasks not updated for 30 minutes are treated as stuck.
        # Note: utcnow() assumes updated_at is stored in UTC.
        timeout = datetime.utcnow() - timedelta(minutes=30)
        stuck_tasks = (
            db.query(GenerationTask)
            .filter(
                GenerationTask.status == "running",
                GenerationTask.updated_at < timeout
            )
            .all()
        )

        for task in stuck_tasks:
            task.status = "failed"
            task.error_msg = "Task processing timed out"
            db.add(task)

        db.commit()
    finally:
        db.close()
```

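The selection rule used by the cleanup job (status is `running` and the last update is older than a cutoff) can be isolated as a pure function, which makes it easy to unit-test without a database. This is a sketch; `TaskRow` is a hypothetical stand-in for a row of `nf_generation_tasks`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class TaskRow:
    """Hypothetical stand-in for a row of nf_generation_tasks."""
    id: int
    status: str
    updated_at: datetime


def find_stuck(tasks: List[TaskRow], now: datetime, max_age: timedelta) -> List[int]:
    """Return IDs of running tasks whose last update is older than max_age."""
    cutoff = now - max_age
    return [t.id for t in tasks if t.status == "running" and t.updated_at < cutoff]
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the rule deterministic in tests and makes the time zone assumption visible at the call site.
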
### 5. Improve error logging

Record detailed information:
```python
import traceback

# Call this inside the `except Exception as exc:` block that handles the task failure,
# so that `exc` and the current traceback are available.
logger.error(
    f"Task {task_id} failed: {error_msg}",
    extra={
        "task_id": task_id,
        "user_id": user_id,
        "lover_id": lover_id,
        "song_id": song_id,
        "image_url": image_url,
        "audio_url": audio_url,
        "error": str(exc),
        "traceback": traceback.format_exc()
    }
)
```

## Monitoring Metrics

### Key metrics

1. **Task success rate**
   ```sql
   -- Share of each status over the last 24 hours.
   -- Uses a window function, which requires MySQL 8.0 or later.
   SELECT
     status,
     COUNT(*) AS count,
     COUNT(*) * 100.0 / SUM(COUNT(*)) OVER() AS percentage
   FROM nf_generation_tasks
   WHERE created_at > DATE_SUB(NOW(), INTERVAL 24 HOUR)
   GROUP BY status;
   ```

2. **Average processing time**
   ```sql
   SELECT
     AVG(TIMESTAMPDIFF(SECOND, created_at, updated_at)) AS avg_seconds,
     MAX(TIMESTAMPDIFF(SECOND, created_at, updated_at)) AS max_seconds
   FROM nf_generation_tasks
   WHERE status = 'succeeded'
     AND created_at > DATE_SUB(NOW(), INTERVAL 24 HOUR);
   ```

3. **Number of stuck tasks**
   ```sql
   SELECT COUNT(*) AS stuck_count
   FROM nf_generation_tasks
   WHERE status = 'running'
     AND TIMESTAMPDIFF(MINUTE, updated_at, NOW()) > 10;
   ```

### Alert rules

- Task success rate < 80%
- Average processing time > 20 minutes
- Number of stuck tasks > 5
- Consecutive failed tasks > 3

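The four thresholds can be evaluated in one place once the metrics have been collected. A minimal sketch follows; the function name and the rule labels are hypothetical, and the inputs are assumed to come from the monitoring queries.

```python
from typing import List


def triggered_alerts(success_rate_pct: float, avg_minutes: float,
                     stuck_count: int, consecutive_failures: int) -> List[str]:
    """Return the labels of the alert rules that fire for the given metrics."""
    rules = [
        ("success_rate_below_80", success_rate_pct < 80),
        ("avg_processing_over_20_min", avg_minutes > 20),
        ("stuck_tasks_over_5", stuck_count > 5),
        ("consecutive_failures_over_3", consecutive_failures > 3),
    ]
    return [name for name, fired in rules if fired]
```

All comparisons are strict, matching the rules as written: exactly 5 stuck tasks or exactly 3 consecutive failures does not raise an alert.
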
## FAQ

### Q1: Tasks still fail after the fix?
A: Check the following:
1. Whether the service has been restarted
2. Whether the configuration file was saved correctly
3. The error message of the new task

### Q2: How do I view task details?
A:
```sql
-- Replace TASK_ID with the numeric task ID.
-- \G works in the mysql command-line client only and prints the row vertically.
SELECT * FROM nf_generation_tasks WHERE id = TASK_ID\G
```

### Q3: How do I retry a failed task?
A: Use the retry endpoint:
```bash
# Replace TASK_ID with the numeric task ID
curl -X POST http://192.168.1.141:30101/sing/retry/TASK_ID
```

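When several tasks need retrying, the same request can be issued from a short script. This is a sketch only: it assumes the endpoint accepts a POST with no body, as the curl command suggests, and it makes no assumption about the response format beyond the HTTP status. The function names are hypothetical.

```python
import urllib.request
from typing import Dict, Iterable

BASE_URL = "http://192.168.1.141:30101"  # host and port taken from the curl command


def build_retry_request(task_id: int) -> urllib.request.Request:
    """Build (but do not send) the POST request that retries one task."""
    return urllib.request.Request(f"{BASE_URL}/sing/retry/{task_id}", method="POST")


def retry_tasks(task_ids: Iterable[int]) -> Dict[int, int]:
    """Send one retry request per task and return {task_id: HTTP status}.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, so a failing
    retry stops the loop unless the caller handles the exception.
    """
    results: Dict[int, int] = {}
    for task_id in task_ids:
        with urllib.request.urlopen(build_retry_request(task_id), timeout=10) as resp:
            results[task_id] = resp.status
    return results
```

For example, `retry_tasks([382, 384])` would resubmit the two tasks that failed because of the bucket misconfiguration.
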
### Q4: How do I clean up all failed tasks?
A:
```sql
-- View only; nothing is deleted
SELECT id, error_msg FROM nf_generation_tasks WHERE status = 'failed';

-- To delete (use with caution)
-- DELETE FROM nf_generation_tasks WHERE status = 'failed' AND created_at < DATE_SUB(NOW(), INTERVAL 7 DAY);
```

## Summary

### Done
- ✅ Fixed the OSS configuration
- ✅ Created the fix script
- ✅ Created the diagnostic tool

### To do
- ⚠️ Clean up stuck tasks (run `修复卡住的任务.bat`)
- ⚠️ Restart the service (run `重启服务.bat`)
- ⚠️ Test and verify

### Long-term improvements
- 📋 Add timeout handling
- 📋 Add resource validation
- 📋 Add a scheduled cleanup job
- 📋 Improve error logging
- 📋 Add monitoring and alerting

Once the to-do steps are complete, singing video generation should work normally again.