# Stuck Task Analysis

## Symptoms

Task 385 is stuck in the `running` state. The logs show:

1. **Repeated retries**: the task status stays `running` while processing is attempted again and again
2. **`undefined` errors**: `undefined` errors appear multiple times
3. **Environment variable error**: `uni.env., 'undefined'`
4. **State storage error**: errors related to `storedStates`

## Possible Causes

### 1. EMO video generation timeout
- The DashScope API call takes too long
- Unstable network connection
- API quota limits

### 2. Audio/video download timeout
- Slow OSS resource downloads
- Files are too large
- Network problems

### 3. Segment video stuck
- A segment video failed to generate and the failure was not handled correctly
- Timed out waiting for DashScope to return a result
- Concurrency limits cause tasks to queue

### 4. Database connection problems
- Long-running transactions left uncommitted
- Database lock waits
- Connection pool exhaustion

### 5. Code logic problems
- Exceptions not caught correctly
- Infinite loop or deadlock
- Unreasonable timeout settings

## Diagnostic Steps

### Step 1: Check the task status
```sql
-- Show the task details
SELECT * FROM nf_generation_tasks WHERE id = 385\G

-- Show the status of each segment video
SELECT sv.*, ss.segment_index
FROM nf_song_segment_video sv
LEFT JOIN nf_song_segment ss ON sv.segment_id = ss.id
WHERE sv.song_id = (SELECT JSON_EXTRACT(payload, '$.song_id') FROM nf_generation_tasks WHERE id = 385)
ORDER BY ss.segment_index;
```

### Step 2: Check the application logs
Look for detailed errors related to task 385:
```bash
# Search the logs for task 385 (the log text uses the Chinese word 任务, "task")
grep "任务 385" lover/logs/*.log

# Show the most recent errors
grep -i "error\|exception\|failed" lover/logs/*.log | tail -50
```

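
When the grep output gets long, the same filtering can be scripted. The sketch below is only an illustration: `summarize_task_log` is a hypothetical helper, and it assumes that relevant log lines contain the text `任务 <id>`, as in the grep pattern above.

```python
import re
from collections import Counter

# Same keywords as the grep pattern above, matched case-insensitively.
ERROR_PATTERN = re.compile(r"error|exception|failed", re.IGNORECASE)


def summarize_task_log(lines, task_id):
    """Return (matching_lines, keyword_counts) for one task ID.

    Assumption: log lines mention the task as "任务 <id>". The negative
    lookahead keeps task 385 from also matching task 3851.
    """
    marker = re.compile(rf"任务 {task_id}(?!\d)")
    matching = [line.rstrip("\n") for line in lines if marker.search(line)]
    counts = Counter()
    for line in matching:
        for hit in ERROR_PATTERN.findall(line):
            counts[hit.lower()] += 1
    return matching, counts
```

Passing an open log file as `lines` works because file objects iterate line by line; the returned counter shows at a glance which kind of failure dominates for that task.
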
### Step 3: Check the DashScope task
If the task has a `dashscope_task_id`, query the status of the DashScope task:
```python
from dashscope import VideoSynthesis

task_id = "<dashscope_task_id taken from the database>"
result = VideoSynthesis.fetch(task_id)
print(result)
```

### Step 4: Check system resources
```bash
# Check memory usage
free -h

# Check disk space
df -h

# Check Python processes
ps aux | grep python
```

## Solutions

### Option 1: Force-mark the task as failed (takes effect immediately)

Run this SQL:
```sql
UPDATE nf_generation_tasks
SET
status = 'failed',
error_msg = 'Task processing timed out; manually marked as failed',
updated_at = NOW()
WHERE id = 385
AND status = 'running'; -- do not overwrite a task that has finished in the meantime
```

### Option 2: Restart the service (clears in-memory state)

1. Stop the Python service
2. Run the SQL from Option 1
3. Restart the Python service

### Option 3: Add timeout handling (prevents future problems)

Add a timeout mechanism to the code:

```python
# lover/routers/sing.py

# Add at the start of the _process_sing_task function.
# Note: signal.SIGALRM exists only on Unix and only works in the main thread.
import signal

def timeout_handler(signum, frame):
    raise TimeoutError("Task processing timed out")

# Set a 30-minute timeout
signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(1800)  # 30 minutes

try:
    # Existing processing logic
    ...
finally:
    signal.alarm(0)  # Cancel the timeout
```

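
Because `SIGALRM` is unavailable on Windows and in worker threads, a thread-based deadline may be a better fit in some deployments. The following is a minimal sketch under that assumption, not the project's implementation; `run_with_timeout` is a hypothetical helper name.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_with_timeout(func, timeout_seconds, *args, **kwargs):
    """Run func in a worker thread and stop waiting after timeout_seconds.

    Raises TimeoutError when the deadline passes. The worker thread cannot
    be killed, so func should still use its own network and I/O timeouts.
    """
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        raise TimeoutError(f"no result after {timeout_seconds} seconds") from None
    finally:
        # Return immediately; the worker finishes (or hangs) in the background.
        executor.shutdown(wait=False)
```

The caller gets control back after the deadline and can mark the task as failed, but the abandoned call keeps running until it returns on its own. This only bounds how long the caller waits, so it complements per-request timeouts rather than replacing them.
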
### Option 4: Add task monitoring

Create a scheduled job that automatically cleans up timed-out tasks:

```python
# lover/task_cleanup.py
import schedule
import time
from db import SessionLocal
from models import GenerationTask
from datetime import datetime, timedelta

def cleanup_stuck_tasks():
    """Mark tasks that have been running for too long as failed."""
    db = SessionLocal()
    try:
        # Find tasks that have been running for more than 30 minutes.
        # This assumes updated_at is stored in UTC; adjust the clock if the
        # column holds local time (for example when it is set with NOW()).
        timeout = datetime.utcnow() - timedelta(minutes=30)
        stuck_tasks = (
            db.query(GenerationTask)
            .filter(
                GenerationTask.status == "running",
                GenerationTask.updated_at < timeout
            )
            .all()
        )

        for task in stuck_tasks:
            task.status = "failed"
            task.error_msg = "Task processing timed out (more than 30 minutes)"
            task.updated_at = datetime.utcnow()
            db.add(task)

        db.commit()
        print(f"Cleaned up {len(stuck_tasks)} timed-out tasks")
    finally:
        db.close()

# Check every 5 minutes
schedule.every(5).minutes.do(cleanup_stuck_tasks)

if __name__ == "__main__":
    while True:
        schedule.run_pending()
        time.sleep(60)
```

## Immediate Actions

### 1. Fix the currently stuck task
```bash
# Connect to the database
mysql -u root -prootx77 fastadmin

# Run the fix SQL (at the mysql prompt)
source xuniYou/修复卡住的任务.sql
```

### 2. Restart the service
```bash
# Double-click to run
重启服务.bat
```

### 3. Verify the fix
- Check that task 385 is now marked as failed
- Try generating the video again
- Watch whether new tasks complete normally

## Preventive Measures

### 1. Set reasonable timeouts
```python
# config.py
EMO_TASK_TIMEOUT_SECONDS = 600  # 10 minutes
SING_TASK_TIMEOUT_SECONDS = 1800  # 30 minutes
```

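
A timeout constant only helps if the waiting loop checks it. One way to apply `EMO_TASK_TIMEOUT_SECONDS` is a polling loop with a deadline, sketched below. `fetch_status` is a hypothetical callable that returns a status string, and the terminal status names are assumptions for this example; the `clock` and `sleep` parameters exist only to make the loop easy to test.

```python
import time

EMO_TASK_TIMEOUT_SECONDS = 600  # same value as in config.py


def wait_for_task(fetch_status, timeout_seconds=EMO_TASK_TIMEOUT_SECONDS,
                  poll_interval=5.0, clock=time.monotonic, sleep=time.sleep):
    """Poll fetch_status() until a terminal state or until the deadline passes.

    Returns the terminal status, or raises TimeoutError so the caller can
    mark the task as failed instead of leaving it in `running`.
    """
    terminal = {"SUCCEEDED", "FAILED"}  # assumed terminal status names
    deadline = clock() + timeout_seconds
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        if clock() >= deadline:
            raise TimeoutError(
                f"task still {status!r} after {timeout_seconds} seconds"
            )
        sleep(poll_interval)
```

Using a monotonic clock keeps the deadline correct even if the system time is adjusted while the task is running.
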
### 2. Add a retry mechanism
```python
MAX_RETRIES = 3
RETRY_DELAY = 60  # seconds
```

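
These two constants could drive a small retry helper such as the one sketched here. It is an illustration only: `call_with_retries` is a hypothetical name, and it interprets `MAX_RETRIES` as the total number of attempts, which is an assumption about the intended meaning.

```python
import time

MAX_RETRIES = 3
RETRY_DELAY = 60  # seconds


def call_with_retries(func, max_retries=MAX_RETRIES, delay=RETRY_DELAY,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(); on a retryable exception wait `delay` seconds and retry.

    Makes at most max_retries attempts in total and re-raises the last
    error, so the caller can still mark the task as failed.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_retries:
                raise
            sleep(delay)
```

Narrowing `retry_on` to transient errors (for example `ConnectionError` and `TimeoutError`) avoids repeating calls that can never succeed, which is one way a task ends up retrying indefinitely.
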
### 3. Improve error handling
- Catch all exceptions
- Log detailed error information
- Update the task status promptly

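
The three points above can be combined in one wrapper around the task body. The sketch below assumes a task object with `status`, `error_msg` and `updated_at` attributes, as in the cleanup script; `process`, `save`, the logger name, and the `succeeded` status value are hypothetical.

```python
import logging
from datetime import datetime, timezone

logger = logging.getLogger("task_runner")  # hypothetical logger name


def run_task_safely(task, process, save):
    """Run process(task) and always leave the task in a final state.

    Any exception is logged with its traceback and recorded on the task,
    so a failure can never leave the status stuck at `running`.
    """
    try:
        process(task)
        task.status = "succeeded"  # assumed name of the success state
    except Exception as exc:
        logger.exception("task %s failed", getattr(task, "id", "?"))
        task.status = "failed"
        task.error_msg = f"{type(exc).__name__}: {exc}"[:500]
    finally:
        # Runs on success and on failure: persist the final state.
        task.updated_at = datetime.now(timezone.utc)
        save(task)
    return task.status
```

Keeping the status update in `finally` means the database write happens on both paths, and the truncated `error_msg` keeps long error text within a bounded length.
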
### 4. Monitoring and alerting
- Monitor tasks that stay in the `running` state beyond a set time
- Send alert notifications
- Automatically clean up timed-out tasks

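
The detection part of such monitoring can be kept separate from the database and the notification channel. The following sketch assumes each task is available as a dict with `id`, `status` and `updated_at` keys; that shape and both function names are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def find_overdue_tasks(tasks, now, max_age=timedelta(minutes=30)):
    """Return (task_id, minutes_running) for running tasks older than max_age."""
    overdue = []
    for task in tasks:
        age = now - task["updated_at"]
        if task["status"] == "running" and age > max_age:
            overdue.append((task["id"], int(age.total_seconds() // 60)))
    return overdue


def format_alert(overdue):
    """Build a one-line alert message, or None when nothing is overdue."""
    if not overdue:
        return None
    details = ", ".join(f"#{task_id} ({minutes} min)" for task_id, minutes in overdue)
    return f"{len(overdue)} task(s) stuck in running: {details}"
```

A scheduler can call `find_overdue_tasks` with `datetime.now(timezone.utc)` and forward any non-empty message to whatever notification channel the team already uses.
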
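## Summary

Tasks usually get stuck because of:
1. External API call timeouts (DashScope)
2. Resource download timeouts (OSS)
3. Code exceptions that are not handled correctly

How to resolve it:
1. Immediately: manually mark the task as failed
2. Short term: restart the service and add timeout handling
3. Long term: add monitoring and an automatic cleanup mechanism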