DPO training fails with RuntimeError: CUDA driver error: invalid argument #4104

listwebit · 2025-05-07T02:20:27Z

Describe the bug
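What the bug is, and how to reproduce it, preferably with screenshots
Using the LLaMA-Factory framework on 12 machines with 8 A100 GPUs each (96 GPUs in total) and a sequence length of 12k, I can train a Qwen2.5 72B DPO model normally. Training with swift, however, fails immediately with: RuntimeError: CUDA driver error: invalid argument

The environment itself is fine, because DPO training runs normally with the smaller Qwen2.5 14B model. How can I solve this? It looks like it is caused by insufficient GPU memory.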

Your hardware and system info
Write your system info here, such as CUDA version, operating system, GPU model, and torch version

Additional context
Add any other context about the problem here


Comments

listwebit commented May 7, 2025