Possible bugs in MolmoE fine-tuning #3998

Closed

EthanLeo-LYX opened this issue Apr 25, 2025 · 6 comments
Labels
bug Something isn't working

Comments

@EthanLeo-LYX

Describe the bug

An error occurs when trying to fine-tune the MolmoE model.

The script used:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift sft \
    --model /home/ckpt/MolmoE \
    --model_type molmoe \
    --train_type lora \
    --dataset /home/train_data/train.json \
    --torch_dtype bfloat16 \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 16 \
    --save_strategy epoch \
    --logging_steps 10 \
    --max_length 4096 \
    --max_pixels 980000 \
    --output_dir output/molmoe \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --model_author swift \
    --model_name swift-robot

The error:

Traceback (most recent call last):
File "/home/project/ms-swift/swift/cli/sft.py", line 7, in
sft_main()
File "/home/project/ms-swift/swift/llm/train/sft.py", line 283, in sft_main
return SwiftSft(args).main()
File "/home/project/ms-swift/swift/llm/base.py", line 47, in main
result = self.run()
File "/home/project/ms-swift/swift/llm/train/sft.py", line 144, in run
return self.train(trainer)
File "/home/project/ms-swift/swift/llm/train/sft.py", line 204, in train
trainer.train(trainer.args.resume_from_checkpoint)
File "/home/project/ms-swift/swift/trainers/mixin.py", line 294, in train
res = super().train(*args, **kwargs)
File "/opt/conda/envs/swift/lib/python3.10/site-packages/transformers/trainer.py", line 2245, in train
return inner_training_loop(
File "/opt/conda/envs/swift/lib/python3.10/site-packages/transformers/trainer.py", line 2560, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
File "/opt/conda/envs/swift/lib/python3.10/site-packages/transformers/trainer.py", line 3736, in training_step
loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
File "/home/project/ms-swift/swift/trainers/trainers.py", line 205, in compute_loss
outputs = model(**inputs)
File "/opt/conda/envs/swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1845, in _call_impl
return inner()
File "/opt/conda/envs/swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1793, in inner
result = forward_call(*args, **kwargs)
File "/opt/conda/envs/swift/lib/python3.10/site-packages/accelerate/utils/operations.py", line 814, in forward
return model_forward(*args, **kwargs)
File "/opt/conda/envs/swift/lib/python3.10/site-packages/accelerate/utils/operations.py", line 802, in call
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/opt/conda/envs/swift/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
return func(*args, **kwargs)
File "/opt/conda/envs/swift/lib/python3.10/site-packages/peft/peft_model.py", line 1757, in forward
return self.base_model(
File "/opt/conda/envs/swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/swift/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 193, in forward
return self.model.forward(*args, **kwargs)
File "/opt/conda/envs/swift/lib/python3.10/site-packages/accelerate/hooks.py", line 176, in new_forward
output = module._old_forward(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/MolmoE/modeling_molmoe.py", line 2230, in forward
outputs = self.model(
File "/opt/conda/envs/swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/MolmoE/modeling_molmoe.py", line 1920, in forward
x[batch_idx[valid], image_input_idx[valid]] += image_features[valid]
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.

Your hardware and system info

CUDA 12.2
transformers 4.51.0
torch 2.6.0


@Jintao-Huang added the bug label on Apr 25, 2025
@EthanLeo-LYX
Author

This looks like a bug at the model level. With the following change at modeling_molmoe.py:1920, training proceeds normally:

x = x.clone()  # clone() returns a non-leaf copy, so the in-place add below is allowed
x[batch_idx[valid], image_input_idx[valid]] += image_features[valid]
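For context on why this works: with LoRA plus gradient checkpointing, trainers typically enable requires_grad on the input embeddings, so x likely arrives in this forward as a leaf tensor that requires grad, and PyTorch autograd forbids in-place operations on such leaves. clone() returns a non-leaf copy, so the indexed += becomes an ordinary autograd op and gradients still flow back through the clone. A minimal sketch of the failure and the fix (tensor names and shapes are illustrative, not taken from modeling_molmoe.py):

import torch

# A leaf tensor that requires grad, standing in for the input embeddings `x`
x = torch.zeros(4, 8, requires_grad=True)
image_features = torch.randn(2, 8)

try:
    x[[0, 1]] += image_features  # in-place op on a leaf that requires grad
except RuntimeError as e:
    print(e)  # "a leaf Variable that requires grad is being used in an in-place operation."

# The fix: clone() yields a non-leaf copy, so the in-place add is recorded as a
# normal autograd op and gradients still reach the original tensor.
x_fixed = x.clone()
x_fixed[[0, 1]] += image_features
x_fixed.sum().backward()
print(x.grad.shape)  # torch.Size([4, 8]) -- gradients flow through the clone

The extra clone() costs one copy of the embedding tensor per forward pass, which should be negligible next to the rest of the training step.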

@Jintao-Huang
Collaborator

This has been fixed on the main branch.

@EthanLeo-LYX
Author

> This has been fixed on the main branch.

Thank you very much!

@EthanLeo-LYX
Author

EthanLeo-LYX commented Apr 27, 2025

With the following script I'm now able to train normally, but it appears that model parallelism is being used, so training is very slow.

CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
    --model /home/ckpt/MolmoE \
    --model_type molmoe \
    --train_type lora \
    --dataset /home/data/general.json \
    --torch_dtype bfloat16 \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 8 \
    --save_strategy epoch \
    --logging_steps 10 \
    --max_length 4096 \
    --max_pixels 980000 \
    --output_dir output/molmoe/aitz \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --model_author swift \
    --model_name swift-robot 

Following the DDP script provided in the examples, I added the NPROC_PER_NODE environment variable:

CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=4 \
swift sft \
    --model /home/ckpt/MolmoE \
    --model_type molmoe \
    --train_type lora \
    --dataset /home/data/general.json \
    --torch_dtype bfloat16 \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 8 \
    --save_strategy epoch \
    --logging_steps 10 \
    --max_length 4096 \
    --max_pixels 980000 \
    --output_dir output/molmoe/aitz \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --model_author swift \
    --model_name swift-robot 

But then I hit the following error:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/project/swift/swift/cli/sft.py", line 7, in <module>
[rank0]:     sft_main()
[rank0]:   File "/home/project/swift/swift/llm/train/sft.py", line 281, in sft_main
[rank0]:     return SwiftSft(args).main()
[rank0]:   File "/home/project/swift/swift/llm/train/sft.py", line 31, in __init__
[rank0]:     self._prepare_model_tokenizer()
[rank0]:   File "/home/project/swift/swift/llm/train/sft.py", line 62, in _prepare_model_tokenizer
[rank0]:     self.model, self.processor = args.get_model_processor()
[rank0]:   File "/home/project/swift/swift/llm/argument/base_args/base_args.py", line 274, in get_model_processor
[rank0]:     return get_model_tokenizer(**kwargs)
[rank0]:   File "/home/project/swift/swift/llm/model/register.py", line 571, in get_model_tokenizer
[rank0]:     model, processor = get_function(model_dir, model_info, model_kwargs, load_model, **kwargs)
[rank0]:   File "/home/project/swift/swift/llm/model/model/mllm.py", line 74, in get_model_tokenizer_molmoe
[rank0]:     model, processor = get_model_tokenizer_multimodal(model_dir, model_info, model_kwargs, load_model, **kwargs)
[rank0]:   File "/home/project/swift/swift/llm/model/register.py", line 279, in get_model_tokenizer_multimodal
[rank0]:     model, _ = get_model_tokenizer_with_flash_attn(model_dir, *args, **kwargs)
[rank0]:   File "/home/project/swift/swift/llm/model/register.py", line 272, in get_model_tokenizer_with_flash_attn
[rank0]:     return get_model_tokenizer_from_local(model_dir, model_info, model_kwargs, load_model, **kwargs)
[rank0]:   File "/home/project/swift/swift/llm/model/register.py", line 241, in get_model_tokenizer_from_local
[rank0]:     model = automodel_class.from_pretrained(
[rank0]:   File "/opt/conda/envs/swift/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
[rank0]:     return model_class.from_pretrained(
[rank0]:   File "/home/project/swift/swift/llm/model/patcher.py", line 281, in _new_from_pretrained
[rank0]:     return from_pretrained(cls, *args, **kwargs)
[rank0]:   File "/opt/conda/envs/swift/lib/python3.10/site-packages/transformers/modeling_utils.py", line 279, in _wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/opt/conda/envs/swift/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4400, in from_pretrained
[rank0]:     ) = cls._load_pretrained_model(
[rank0]:   File "/opt/conda/envs/swift/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4793, in _load_pretrained_model
[rank0]:     caching_allocator_warmup(model_to_load, expanded_device_map, factor=2 if hf_quantizer is None else 4)
[rank0]:   File "/opt/conda/envs/swift/lib/python3.10/site-packages/transformers/modeling_utils.py", line 5775, in caching_allocator_warmup
[rank0]:     re.compile("|".join([re.escape(plan) for plan in model._tp_plan]))
[rank0]: TypeError: 'NoneType' object is not iterable

This looks like an issue with MolmoE not supporting tensor parallelism. Is there a way to disable it? In theory, skipping tensor parallelism should be fine.

@EthanLeo-LYX reopened this on Apr 27, 2025
@Jintao-Huang
Collaborator

This is a bug in transformers. Either patch the transformers code directly or downgrade the version.
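For reference, the failing line is inside caching_allocator_warmup in transformers/modeling_utils.py (line 5775 in 4.51.0), which dereferences model._tp_plan unconditionally; MolmoE's remote code defines no tensor-parallel plan, so it is None. A minimal sketch of the kind of local patch meant here, assuming the warmup is purely a CUDA memory pre-allocation optimization with no functional side effects:

# Sketch of a local edit at the top of caching_allocator_warmup() in
# transformers/modeling_utils.py (4.51.0). `model` is the function's own
# parameter name, per the traceback above; the guard is a workaround
# (assumption: skipping the warmup only slows down weight loading).
if getattr(model, "_tp_plan", None) is None:
    return  # no tensor-parallel plan -> skip the warmup instead of crashing

Downgrading transformers, as noted below, avoids the issue without touching site-packages.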

@EthanLeo-LYX
Author

> This is a bug in transformers. Either patch the transformers code directly or downgrade the version.

I followed the transformers version from the README, currently 4.51.0. After downgrading to 4.49.0, single-node DDP works.
