Possible bugs in MolmoE finetuning #3998
Labels: bug (Something isn't working)
Comments
This looks like a bug at the model (modeling code) level; the fix is to add x = x.clone() before the line
x[batch_idx[valid], image_input_idx[valid]] += image_features[valid]
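The clone() workaround can be sketched in isolation. This is a minimal sketch, not the actual MolmoE code: the tensor shapes and index values below are invented for illustration, and only the variable names come from the traceback.

```python
import torch

# Leaf tensor standing in for the embeddings; in-place ops on a leaf
# that requires grad raise a RuntimeError.
embeddings = torch.zeros(2, 16, 8, requires_grad=True)
image_features = torch.ones(2, 4, 8)
batch_idx = torch.tensor([[0, 0, 0, 0], [1, 1, 1, 1]])
image_input_idx = torch.tensor([[0, 1, 2, 3], [4, 5, 6, 7]])
valid = image_input_idx >= 0

x = embeddings.clone()  # clone() is not a leaf, so the in-place add is allowed
x[batch_idx[valid], image_input_idx[valid]] += image_features[valid]
x.sum().backward()      # gradients still flow back to the original leaf
print(embeddings.grad.shape)
```

Because clone() is recorded in the autograd graph, the scatter-add happens on a non-leaf copy while gradients still reach the original tensor.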
This has already been fixed on the main branch.
Thank you very much.
With the following script I can now train normally, but it appears model parallelism is being applied, which makes training very slow:
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
--model /home/ckpt/MolmoE \
--model_type molmoe \
--train_type lora \
--dataset /home/data/general.json \
--torch_dtype bfloat16 \
--num_train_epochs 3 \
--per_device_train_batch_size 1 \
--learning_rate 1e-4 \
--lora_rank 8 \
--lora_alpha 32 \
--target_modules all-linear \
--gradient_accumulation_steps 8 \
--save_strategy epoch \
--logging_steps 10 \
--max_length 4096 \
--max_pixels 980000 \
--output_dir output/molmoe/aitz \
--warmup_ratio 0.05 \
--dataloader_num_workers 4 \
--model_author swift \
--model_name swift-robot
I then launched it as follows:
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=4 \
swift sft \
--model /home/ckpt/MolmoE \
--model_type molmoe \
--train_type lora \
--dataset /home/data/general.json \
--torch_dtype bfloat16 \
--num_train_epochs 3 \
--per_device_train_batch_size 1 \
--learning_rate 1e-4 \
--lora_rank 8 \
--lora_alpha 32 \
--target_modules all-linear \
--gradient_accumulation_steps 8 \
--save_strategy epoch \
--logging_steps 10 \
--max_length 4096 \
--max_pixels 980000 \
--output_dir output/molmoe/aitz \
--warmup_ratio 0.05 \
--dataloader_num_workers 4 \
--model_author swift \
--model_name swift-robot
but ran into the following problem:
It looks like molmoe does not support tensor parallelism. Is there a way to disable it? In theory, training without tensor parallelism should also be fine.
This is a transformers bug; either patch the transformers code directly or downgrade to an earlier version.
What I referred to:
Describe the bug
What the bug is, and how to reproduce, better with screenshots
An error occurs when trying to finetune the MolmoE model. The script:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift sft \
--model /home/ckpt/MolmoE \
--model_type molmoe \
--train_type lora \
--dataset /home/train_data/train.json \
--torch_dtype bfloat16 \
--num_train_epochs 3 \
--per_device_train_batch_size 1 \
--learning_rate 1e-4 \
--lora_rank 8 \
--lora_alpha 32 \
--target_modules all-linear \
--gradient_accumulation_steps 16 \
--save_strategy epoch \
--logging_steps 10 \
--max_length 4096 \
--max_pixels 980000 \
--output_dir output/molmoe \
--warmup_ratio 0.05 \
--dataloader_num_workers 4 \
--model_author swift \
--model_name swift-robot
The error:
Traceback (most recent call last):
File "/home/project/ms-swift/swift/cli/sft.py", line 7, in <module>
sft_main()
File "/home/project/ms-swift/swift/llm/train/sft.py", line 283, in sft_main
return SwiftSft(args).main()
File "/home/project/ms-swift/swift/llm/base.py", line 47, in main
result = self.run()
File "/home/project/ms-swift/swift/llm/train/sft.py", line 144, in run
return self.train(trainer)
File "/home/project/ms-swift/swift/llm/train/sft.py", line 204, in train
trainer.train(trainer.args.resume_from_checkpoint)
File "/home/project/ms-swift/swift/trainers/mixin.py", line 294, in train
res = super().train(*args, **kwargs)
File "/opt/conda/envs/swift/lib/python3.10/site-packages/transformers/trainer.py", line 2245, in train
return inner_training_loop(
File "/opt/conda/envs/swift/lib/python3.10/site-packages/transformers/trainer.py", line 2560, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
File "/opt/conda/envs/swift/lib/python3.10/site-packages/transformers/trainer.py", line 3736, in training_step
loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
File "/home/project/ms-swift/swift/trainers/trainers.py", line 205, in compute_loss
outputs = model(**inputs)
File "/opt/conda/envs/swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1845, in _call_impl
return inner()
File "/opt/conda/envs/swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1793, in inner
result = forward_call(*args, **kwargs)
File "/opt/conda/envs/swift/lib/python3.10/site-packages/accelerate/utils/operations.py", line 814, in forward
return model_forward(*args, **kwargs)
File "/opt/conda/envs/swift/lib/python3.10/site-packages/accelerate/utils/operations.py", line 802, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/opt/conda/envs/swift/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
return func(*args, **kwargs)
File "/opt/conda/envs/swift/lib/python3.10/site-packages/peft/peft_model.py", line 1757, in forward
return self.base_model(
File "/opt/conda/envs/swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/swift/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 193, in forward
return self.model.forward(*args, **kwargs)
File "/opt/conda/envs/swift/lib/python3.10/site-packages/accelerate/hooks.py", line 176, in new_forward
output = module._old_forward(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/MolmoE/modeling_molmoe.py", line 2230, in forward
outputs = self.model(
File "/opt/conda/envs/swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/MolmoE/modeling_molmoe.py", line 1920, in forward
x[batch_idx[valid], image_input_idx[valid]] += image_features[valid]
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
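The RuntimeError above can be reproduced in isolation. A minimal sketch, with arbitrary shapes that have nothing to do with the model:

```python
import torch

# An in-place add on a leaf tensor that requires grad triggers the error.
x = torch.zeros(4, 8, requires_grad=True)  # leaf variable
idx = torch.tensor([0, 2])
try:
    x[idx] += torch.ones(2, 8)  # in-place indexed add directly on the leaf
except RuntimeError as e:
    print(e)  # a leaf Variable that requires grad is being used in an in-place operation.
```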
Your hardware and system info
Write your system info like CUDA version/system/GPU/torch version here
CUDA 12.2
transformers 4.51.0
torch 2.6.0
Additional context
Add any other context about the problem here