
Qwen2VL fine-tuning fails at a certain step #3924


Open
ooochen-30 opened this issue Apr 18, 2025 · 5 comments

Comments

@ooochen-30

Describe the bug
What the bug is, and how to reproduce, better with screenshots
I'm doing LoRA fine-tuning on the all-linear layers of Qwen2-VL for a multimodal task. Training crashes at step 200. I've run the same task before and it trained normally, but that was with single-image samples; now every sample contains two images (I'm not sure whether that matters).
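For reference, a two-image line in the jsonl dataset would look roughly like the sketch below. This assumes the messages/images layout that recent swift versions document for multimodal custom datasets; the prompt text and image paths are placeholders, and the exact field names may differ between versions.

```jsonl
{"messages": [{"role": "user", "content": "<image><image>Describe the differences between these two indoor scenes."}, {"role": "assistant", "content": "The first image shows ..."}], "images": ["images/scene_a.jpg", "images/scene_b.jpg"]}
```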

Run command:

run sh:

```shell
/home/cjy/miniconda3/envs/swift/bin/python /home/cjy/miniconda3/envs/swift/lib/python3.10/site-packages/swift/cli/sft.py \
    --torch_dtype bfloat16 \
    --model /home/cjy/model/Qwen/Qwen2-VL-2B-Instruct \
    --model_type qwen2_vl \
    --template qwen2_vl \
    --system 'You are a helpful and harmless assistant.' \
    --dataset /home/cjy/data/xxx_indoor/label/part1_three.jsonl \
    --max_length 1024 \
    --init_weights True \
    --per_device_train_batch_size 2 \
    --learning_rate 1e-4 \
    --num_train_epochs 1 \
    --attn_impl flash_attn \
    --gradient_accumulation_steps 16 \
    --eval_steps 200 \
    --output_dir /home/cjy/model/xxx/Qwen2-VL-2B-Instruct/ \
    --report_to tensorboard \
    --add_version False \
    --output_dir /home/cjy/model/xxx/Qwen2-VL-2B-Instruct/v9-20250417-183456 \
    --logging_dir /home/cjy/model/xxx/Qwen2-VL-2B-Instruct/v9-20250417-183456/runs \
    --ignore_args_error True
```

```text
Train:  40%|████      | 200/500 [3:18:00<4:53:51, 58.77s/it]
Val: 100%|██████████| 161/161 [03:35<00:00,  1.26s/it]
Train:  40%|████      | 200/500 [3:18:00<4:53:51, 58.77s/it]
Val: 100%|██████████| 161/161 [03:35<00:00,  1.34s/it]
{'eval_loss': 0.14110497, 'eval_token_acc': 0.95689655, 'eval_runtime': 217.6855, 'eval_samples_per_second': 0.74, 'eval_steps_per_second': 0.74, 'epoch': 0.4, 'global_step/max_steps': '200/500', 'percentage': '40.00%', 'elapsed_time': '3h 18m 0s', 'remaining_time': '4h 57m 0s'}
Traceback (most recent call last):
  File "/home/cjy/miniconda3/envs/swift/lib/python3.10/site-packages/swift/cli/sft.py", line 10, in <module>
    sft_main()
  File "/home/cjy/miniconda3/envs/swift/lib/python3.10/site-packages/swift/llm/train/sft.py", line 265, in sft_main
    return SwiftSft(args).main()
  File "/home/cjy/miniconda3/envs/swift/lib/python3.10/site-packages/swift/llm/base.py", line 47, in main
    result = self.run()
  File "/home/cjy/miniconda3/envs/swift/lib/python3.10/site-packages/swift/llm/train/sft.py", line 142, in run
    return self.train(trainer)
  File "/home/cjy/miniconda3/envs/swift/lib/python3.10/site-packages/swift/llm/train/sft.py", line 202, in train
    trainer.train(trainer.args.resume_from_checkpoint)
  File "/home/cjy/miniconda3/envs/swift/lib/python3.10/site-packages/swift/trainers/mixin.py", line 289, in train
    res = super().train(*args, **kwargs)
  File "/home/cjy/miniconda3/envs/swift/lib/python3.10/site-packages/transformers/trainer.py", line 2245, in train
    return inner_training_loop(
  File "/home/cjy/miniconda3/envs/swift/lib/python3.10/site-packages/transformers/trainer.py", line 2556, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
  File "/home/cjy/miniconda3/envs/swift/lib/python3.10/site-packages/transformers/trainer.py", line 3718, in training_step
    loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
  File "/home/cjy/miniconda3/envs/swift/lib/python3.10/site-packages/swift/trainers/trainers.py", line 165, in compute_loss
    outputs = model(**inputs)
  File "/home/cjy/miniconda3/envs/swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cjy/miniconda3/envs/swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
    return inner()
  File "/home/cjy/miniconda3/envs/swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1790, in inner
    result = forward_call(*args, **kwargs)
  File "/home/cjy/miniconda3/envs/swift/lib/python3.10/site-packages/accelerate/utils/operations.py", line 814, in forward
    return model_forward(*args, **kwargs)
  File "/home/cjy/miniconda3/envs/swift/lib/python3.10/site-packages/accelerate/utils/operations.py", line 802, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/cjy/miniconda3/envs/swift/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/cjy/miniconda3/envs/swift/lib/python3.10/site-packages/peft/peft_model.py", line 1756, in forward
    return self.base_model(
  File "/home/cjy/miniconda3/envs/swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cjy/miniconda3/envs/swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cjy/miniconda3/envs/swift/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 193, in forward
    return self.model.forward(*args, **kwargs)
  File "/home/cjy/miniconda3/envs/swift/lib/python3.10/site-packages/accelerate/hooks.py", line 176, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/cjy/miniconda3/envs/swift/lib/python3.10/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 1750, in forward
    logits = logits.float()
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
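Since CUDA reports kernel errors asynchronously, the `logits.float()` frame above may not be where the failure actually happened. A minimal way to get a synchronous (and usually more accurate) trace is to rerun the same command with blocking kernel launches; `CUDA_LAUNCH_BLOCKING` is a standard PyTorch/CUDA environment variable, and the command below is just the run command above, abbreviated:

```shell
# Force synchronous CUDA kernel launches so the Python stack trace points at
# the kernel that actually failed. Training is noticeably slower with this
# set, so use it only for debugging.
CUDA_LAUNCH_BLOCKING=1 /home/cjy/miniconda3/envs/swift/bin/python \
    /home/cjy/miniconda3/envs/swift/lib/python3.10/site-packages/swift/cli/sft.py \
    --torch_dtype bfloat16 \
    --model /home/cjy/model/Qwen/Qwen2-VL-2B-Instruct \
    ...  # remaining flags identical to the run command above
```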

Your hardware and system info
Write your system info like CUDA version/system/GPU/torch version here

CUDA version: 12.6
Driver version: 560.35.05
GPUs: 4x A100
OS: Ubuntu 20.04
torch: 2.5.1

Additional context
Add any other context about the problem here

@nioxinjiang3

I'm hitting the same error.

@Jintao-Huang
Collaborator

maybe OOM

@ooochen-30
Author

@Jintao-Huang I reran it and it still crashed. I recorded the GPU status, as shown below.

While running normally:
[screenshot]

Right before the run aborted:
[screenshot]
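To capture the same information as text, per-GPU memory can be logged continuously alongside the training run, so the tail of the log shows exactly how full each card was just before the crash. A minimal sketch using standard nvidia-smi options (the log file name is arbitrary):

```shell
# Sample per-GPU memory every 10 seconds in the background; the last lines of
# gpu_mem.csv show the state right before the training process died.
nvidia-smi \
    --query-gpu=timestamp,index,memory.used,memory.total \
    --format=csv -l 10 > gpu_mem.csv &
```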

@ooochen-30
Author

I lowered batch_size from 2 to 1 and now it trains normally. It's a bit odd, though, because with 2 the cards weren't actually full either.
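For reference, when halving the per-device batch, doubling gradient accumulation keeps the effective batch size unchanged. A sketch using only flags that already appear in the original command (abbreviated):

```shell
# Halving the per-device batch and doubling gradient accumulation keeps the
# effective batch size (per-device batch x accumulation x data-parallel size)
# the same while roughly halving activation memory per step.
/home/cjy/miniconda3/envs/swift/bin/python \
    /home/cjy/miniconda3/envs/swift/lib/python3.10/site-packages/swift/cli/sft.py \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 32 \
    ...  # remaining flags identical to the original run command
```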

@lucasjinreal

Your GPU memory usage is extremely imbalanced.
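One likely contributor is that two images per sample roughly doubles the number of vision tokens, and large images inflate memory further. If the installed swift version supports it for Qwen2-VL (this is an assumption; check the docs for your version), capping image resolution with the MAX_PIXELS environment variable is a common way to bound per-sample memory:

```shell
# ASSUMPTION: MAX_PIXELS is honored by the Qwen2-VL template in this swift
# version. It caps each image's pixel count so the number of vision tokens
# (and the memory per sample) stays bounded; 1003520 = 1280 * 28 * 28.
MAX_PIXELS=1003520 /home/cjy/miniconda3/envs/swift/bin/python \
    /home/cjy/miniconda3/envs/swift/lib/python3.10/site-packages/swift/cli/sft.py \
    ...  # remaining flags identical to the original run command
```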
