
Strange out of memory error #3964


Open

jfy1016 opened this issue Apr 23, 2025 · 13 comments


jfy1016 commented Apr 23, 2025

The code is unchanged. Training runs fine with the smaller dataset, but the larger one fails with out of memory. Apart from size, the two datasets are identical. Why?

jfy1016 (Author) commented Apr 23, 2025

[Image: error screenshot] This is the error from the larger dataset.


xh-2000 commented Apr 23, 2025

I've run into the same problem. Have you managed to solve it?

jfy1016 (Author) commented Apr 24, 2025

> I've run into the same problem. Have you managed to solve it?

Not solved yet.


Xu-Chen commented Apr 24, 2025

@Jintao-Huang Could you please take a look? I've found the host-memory OOM only happens when packing is enabled.

Jintao-Huang (Collaborator) commented

> @Jintao-Huang Could you please take a look? I've found the host-memory OOM only happens when packing is enabled.

It's a multimodal model, right? Check whether you have --streaming true.


Xu-Chen commented Apr 24, 2025

>> @Jintao-Huang Could you please take a look? I've found the host-memory OOM only happens when packing is enabled.

> It's a multimodal model, right? Check whether you have --streaming true.

It's not a multimodal model, and streaming loading isn't enabled. The OOM happens right as packing is about to finish.
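
For context, a rough sketch, not ms-swift's actual implementation, of why offline packing tends to peak in host RAM near the end: every tokenized sample is materialized first, and the packed blocks are then built on top of them, so peak usage scales with dataset size.

```python
# Illustrative only -- NOT ms-swift's code. Offline packing holds all
# tokenized samples in RAM, then builds packed blocks alongside them,
# so peak memory grows with dataset size and spikes near the end.
def pack_offline(tokenized_samples, block_size=2048):
    blocks, buffer = [], []
    for ids in tokenized_samples:          # whole dataset already in RAM
        buffer.extend(ids)
        while len(buffer) >= block_size:   # emit full fixed-length blocks
            blocks.append(buffer[:block_size])
            buffer = buffer[block_size:]
    if buffer:                             # keep the trailing partial block
        blocks.append(buffer)
    return blocks                          # samples and blocks coexist here

if __name__ == "__main__":
    # stand-in for tokenized data: 1k samples of ~1k tokens each
    samples = [[0] * 1000 for _ in range(1000)]
    print(len(pack_offline(samples)), "packed blocks")
```

Doubling the number of samples roughly doubles the peak, which matches "fine at 30k samples, killed at 60k".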

Jintao-Huang (Collaborator) commented

GPU memory or host RAM?


Xu-Chen commented Apr 24, 2025

> GPU memory or host RAM?

Host RAM OOM.

Jintao-Huang (Collaborator) commented

Add --streaming true.
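
For anyone unsure what the flag changes: with streaming the dataset is read and tokenized lazily instead of being materialized up front. A minimal standalone sketch of the same idea using Hugging Face `datasets` (the file name is a placeholder; ms-swift wires this up internally when --streaming true is set):

```python
# Standalone illustration of streaming (file path is a placeholder).
# In streaming mode the JSON file is read lazily, so host RAM stays
# roughly flat no matter how large the dataset is.
from datasets import load_dataset

ds = load_dataset("json", data_files="train.json", split="train", streaming=True)
for i, sample in enumerate(ds):
    print(sample)        # only the records you iterate over are loaded
    if i == 2:
        break
```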

jfy1016 (Author) commented Apr 25, 2025

@Jintao-Huang I checked the training run this morning and found it errored out again after 200+ steps; the problem still isn't solved.

[Image: error screenshot]

jfy1016 (Author) commented Apr 25, 2025

```shell
CUDA_VISIBLE_DEVICES=0,1,2 \
MAX_PIXELS=1003520 \
swift sft \
    --model /home/jdn/.cache/modelscope/hub/models/deepseek-ai/deepseek-vl2-tiny \
    --dataset /home/jdn/deepseek/save_json/xunlian_CT_and_Xray.json \
    --streaming true \
    --train_type lora \
    --torch_dtype float16 \
    --num_train_epochs 5 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --freeze_vit true \
    --gradient_accumulation_steps 16 \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 5 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir /home/jdn/deepseek/output \
    --warmup_ratio 0.05 \
    --lazy_tokenize true \
    --dataloader_num_workers 0
```

After adding --streaming true it errors out:

[Image: error screenshot] What should I do if bf16 isn't supported?


xh-2000 commented Apr 26, 2025

[Image: error screenshot]
My fine-tuning gets killed at roughly step 1500-2000. With 30k training samples it was fine; after increasing to 60k this happens. Is this a bug?


zideliu commented May 13, 2025

> [Image: error screenshot] My fine-tuning gets killed at roughly step 1500-2000. With 30k training samples it was fine; after increasing to 60k this happens. Is this a bug?

Could this be host RAM being exhausted?
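
A quick way to check that hypothesis is to watch the training process's resident memory while preprocessing runs; a minimal sketch (assumes `psutil` is installed and the PID is passed on the command line):

```python
# Polls resident RAM of a process (PID passed as argv[1]) so you can
# see whether host memory climbs toward the limit before "Killed".
import sys
import time
import psutil

proc = psutil.Process(int(sys.argv[1]))  # PID of the training process
while proc.is_running():
    print(f"resident RAM: {proc.memory_info().rss / 2**30:.2f} GiB")
    time.sleep(10)
```

If the kernel's OOM killer terminated the run, `dmesg` usually records it as well.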
