Strange out of memory error #3964

jfy1016 · 2025-04-23T06:05:49Z

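The code has not changed. Training runs fine on a smaller dataset, but on a larger dataset it fails with an out of memory error. The two datasets are identical apart from their size. Why does this happen?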

jfy1016 · 2025-04-23T06:13:17Z

This is the error from the larger dataset.

xh-2000 · 2025-04-23T11:50:54Z

I ran into the same problem. Have you managed to solve it?

jfy1016 · 2025-04-24T01:10:44Z

> I ran into the same problem. Have you managed to solve it?

No, it is not solved.

Xu-Chen · 2025-04-24T13:51:09Z

@Jintao-Huang could you please take a look? I found that the host memory (RAM) OOM only happens when packing is enabled.

Jintao-Huang · 2025-04-24T14:25:47Z

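> @Jintao-Huang could you please take a look? I found that the host memory (RAM) OOM only happens when packing is enabled.

Is it a multimodal model? Check whether you have set --streaming true.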

Xu-Chen · 2025-04-24T14:29:33Z

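> > @Jintao-Huang could you please take a look? I found that the host memory (RAM) OOM only happens when packing is enabled.
>
> Is it a multimodal model? Check whether you have set --streaming true.

It is not a multimodal model, and I am not using streaming. The OOM is raised when packing is almost finished.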

Jintao-Huang · 2025-04-24T14:29:56Z

GPU memory or host memory (RAM)?

Xu-Chen · 2025-04-24T14:47:05Z

> GPU memory or host memory (RAM)?

Host memory (RAM) OOM.

Jintao-Huang · 2025-04-24T14:48:41Z

Add --streaming true.
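
As a rough illustration of why streaming can relieve host-memory pressure on large datasets, the sketch below compares materializing every record up front with consuming records one at a time. It is a generic Python example, not how ms-swift implements `--streaming`. The record layout and the names `make_records`, `eager`, `lazy` and `peak_bytes` are hypothetical.

```python
import tracemalloc


def make_records(n):
    """Yield n hypothetical training records one at a time."""
    for i in range(n):
        yield {"id": i, "text": f"sample {i} " * 20}


def eager(records):
    # Materialize the whole dataset first, as a non-streaming loader might.
    data = list(records)
    return sum(len(r["text"]) for r in data)


def lazy(records):
    # Touch one record at a time; earlier records can be freed right away.
    return sum(len(r["text"]) for r in records)


def peak_bytes(consume, n):
    """Peak bytes allocated by Python while consume() processes n records."""
    tracemalloc.start()
    try:
        consume(make_records(n))
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


if __name__ == "__main__":
    n = 50_000
    print("eager peak bytes:", peak_bytes(eager, n))
    print("lazy peak bytes: ", peak_bytes(lazy, n))
```

In this sketch the eager peak grows with the number of records, while the lazy peak stays roughly constant. That pattern would be consistent with a run that succeeds on a small dataset and fails on a large one.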

jfy1016 · 2025-04-25T01:39:23Z

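@Jintao-Huang When I checked the training run this morning, it had failed again after a little over 200 steps. The problem is still not solved.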

jfy1016 · 2025-04-25T01:44:47Z

CUDA_VISIBLE_DEVICES=0,1,2 \
MAX_PIXELS=1003520 \
swift sft \
    --model /home/jdn/.cache/modelscope/hub/models/deepseek-ai/deepseek-vl2-tiny \
    --dataset /home/jdn/deepseek/save_json/xunlian_CT_and_Xray.json \
    --streaming true \
    --train_type lora \
    --torch_dtype float16 \
    --num_train_epochs 5 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --freeze_vit true \
    --gradient_accumulation_steps 16 \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 5 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir /home/jdn/deepseek/output \
    --warmup_ratio 0.05 \
    --lazy_tokenize true \
    --dataloader_num_workers 0

After adding --streaming true, it still reports an error.

Also, what should I do if bf16 is not supported?

xh-2000 · 2025-04-26T02:57:26Z

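In my case the process is "killed" after roughly 1500-2000 fine-tuning iterations. With 30k training samples there was no problem, but after increasing to 60k samples this happens. Is this a bug?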

zideliu · 2025-05-13T06:31:14Z

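> In my case the process is "killed" after roughly 1500-2000 fine-tuning iterations. With 30k training samples there was no problem, but after increasing to 60k samples this happens. Is this a bug?

Could this be the host memory (RAM) running out?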
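
One way to check this is to log the training process's peak resident memory at regular intervals and see whether it keeps climbing until the process is killed. Below is a minimal sketch using only the Python standard library on Unix. The helper names `peak_rss_mib` and `log_memory` are hypothetical and not part of ms-swift, and where to call them in a training loop is left open.

```python
import resource
import sys


def peak_rss_mib():
    """Peak resident set size of the current process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(step, every=100):
    """Print peak RSS every `every` steps; return the value when it logs."""
    if step % every != 0:
        return None
    rss = peak_rss_mib()
    # Covers this process only; dataloader worker processes are not included.
    print(f"step {step}: peak host RSS = {rss:.1f} MiB")
    return rss


if __name__ == "__main__":
    for step in range(0, 301, 100):
        log_memory(step)
```

If the logged value rises steadily with the step count and approaches the machine's total RAM shortly before the process is killed, host memory exhaustion is the likely cause rather than GPU memory.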
