InternVL3 + DPO training error #3870

Closed
fourierer opened this issue Apr 14, 2025 · 4 comments


fourierer commented Apr 14, 2025

Hi, I'm running DPO training on InternVL3-38B with the latest branch. The training command is:
nproc_per_node=2

CUDA_VISIBLE_DEVICES=0,1 \
NPROC_PER_NODE=$nproc_per_node \
swift rlhf \
    --rlhf_type dpo \
    --model_type internvl3 \
    --model /mnt/workspace/internvl3/SFT/universe/InternVL3-38B_sft_lora_universe_extract_v2_1/v0-20250413-203633/checkpoint-615-merged \
    --load_args false \
    --train_type lora \
    --dataset /mnt/workspace/multimodal_database/grounding/train/train_mix_stack_box_PO_norm1000_285.json \
    --output_dir /mnt/workspace/internvl3/SFT/universe/InternVL3-38B_sft_dpo_lora_universe_extract_v2_2 \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 5 \
    --logging_steps 5 \
    --max_length 2048 \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --deepspeed zero2 \
    --dataset_num_proc 4

Then the following error is raised while loading the model:
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/workspace/internvl3/ms-swift/swift/cli/rlhf.py", line 5, in
[rank0]: rlhf_main()
[rank0]: File "/mnt/workspace/internvl3/ms-swift/swift/llm/train/rlhf.py", line 98, in rlhf_main
[rank0]: return SwiftRLHF(args).main()
[rank0]: File "/mnt/workspace/internvl3/ms-swift/swift/llm/train/sft.py", line 31, in init
[rank0]: self._prepare_model_tokenizer()
[rank0]: File "/mnt/workspace/internvl3/ms-swift/swift/llm/train/rlhf.py", line 65, in _prepare_model_tokenizer
[rank0]: super()._prepare_model_tokenizer()
[rank0]: File "/mnt/workspace/internvl3/ms-swift/swift/llm/train/sft.py", line 62, in _prepare_model_tokenizer
[rank0]: self.model, self.processor = args.get_model_processor()
[rank0]: File "/mnt/workspace/internvl3/ms-swift/swift/llm/argument/base_args/base_args.py", line 276, in get_model_processor
[rank0]: return get_model_tokenizer(**kwargs)
[rank0]: File "/mnt/workspace/internvl3/ms-swift/swift/llm/model/register.py", line 564, in get_model_tokenizer
[rank0]: model, processor = get_function(model_dir, model_info, model_kwargs, load_model, **kwargs)
[rank0]: File "/mnt/workspace/internvl3/ms-swift/swift/llm/model/model/internlm.py", line 129, in get_model_tokenizer_internvl
[rank0]: model, tokenizer = get_model_tokenizer_with_flash_attn(model_dir, model_info, model_kwargs, load_model, **kwargs)
[rank0]: File "/mnt/workspace/internvl3/ms-swift/swift/llm/model/register.py", line 265, in get_model_tokenizer_with_flash_attn
[rank0]: return get_model_tokenizer_from_local(model_dir, model_info, model_kwargs, load_model, **kwargs)
[rank0]: File "/mnt/workspace/internvl3/ms-swift/swift/llm/model/register.py", line 234, in get_model_tokenizer_from_local
[rank0]: model = automodel_class.from_pretrained(
[rank0]: File "/usr/local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
[rank0]: return model_class.from_pretrained(
[rank0]: File "/mnt/workspace/internvl3/ms-swift/swift/llm/model/patcher.py", line 285, in _new_from_pretrained
[rank0]: return from_pretrained(cls, *args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 279, in _wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4399, in from_pretrained
[rank0]: ) = cls._load_pretrained_model(
[rank0]: File "/usr/local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4793, in _load_pretrained_model
[rank0]: caching_allocator_warmup(model_to_load, expanded_device_map, factor=2 if hf_quantizer is None else 4)
[rank0]: File "/usr/local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 5779, in caching_allocator_warmup
[rank0]: re.compile("|".join([re.escape(plan) for plan in model._tp_plan]))
[rank0]: TypeError: 'NoneType' object is not iterable
The command above uses an SFT-trained checkpoint, but I get exactly the same error when loading the original InternVL3-38B model directly. How can I solve this?

hjh0119 (Collaborator) commented Apr 14, 2025

You can modify the transformers code:

- tp_plan_regex = (
-     re.compile("|".join([re.escape(plan) for plan in model._tp_plan]))
-     if _torch_distributed_available and torch.distributed.is_initialized()
-     else None
- )
+ tp_plan_regex = None

related issue
#3715
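If you'd rather not edit the installed transformers files by hand, a runtime monkey-patch is another option. The sketch below is only an assumption-based workaround rather than an official fix: it skips the allocator warmup entirely when `_tp_plan` is `None` (slightly blunter than the one-line diff above), and it only takes effect if you can run it in the same Python process before the checkpoint is loaded, e.g. from a small launcher script:

```python
# Hypothetical workaround: wrap transformers' caching_allocator_warmup so it
# is skipped when model._tp_plan is None, which is what raises the TypeError.
import transformers.modeling_utils as modeling_utils

_original_warmup = modeling_utils.caching_allocator_warmup

def _patched_caching_allocator_warmup(model, expanded_device_map, *args, **kwargs):
    if getattr(model, "_tp_plan", None) is None:
        # The warmup only pre-allocates memory to speed up loading; skipping
        # it here avoids iterating over a missing tensor-parallel plan.
        return
    return _original_warmup(model, expanded_device_map, *args, **kwargs)

modeling_utils.caching_allocator_warmup = _patched_caching_allocator_warmup
```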

fourierer (Author) commented

you can modify the transformers code […] related issue #3715

Thanks for the reply. The loading error is fixed now, but I'm hitting OOM: even with 4 × 90 GB GPUs, LoRA training still runs out of memory, while an earlier version could run DPO on InternVL3-38B with just two GPUs. How can I improve this?
nproc_per_node=4

CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=$nproc_per_node \
swift rlhf \
    --rlhf_type dpo \
    --model_type internvl3 \
    --model /mnt/workspace/internvl3/SFT/universe/InternVL3-38B_sft_lora_universe_extract_v2_1/v0-20250413-203633/checkpoint-615-merged \
    --load_args false \
    --train_type lora \
    --dataset /mnt/workspace/multimodal_database/grounding/train/train_mix_stack_box_PO_norm1000_285.json \
    --output_dir /mnt/workspace/internvl3/SFT/universe/InternVL3-38B_sft_dpo_lora_universe_extract_v2_2 \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 5 \
    --logging_steps 5 \
    --max_length 2048 \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --deepspeed zero2 \
    --dataset_num_proc 4

Error:
[rank1]: Traceback (most recent call last):
[rank1]: File "/mnt/workspace/internvl3/ms-swift/swift/cli/rlhf.py", line 5, in
[rank1]: rlhf_main()
[rank1]: File "/mnt/workspace/internvl3/ms-swift/swift/llm/train/rlhf.py", line 98, in rlhf_main
[rank1]: return SwiftRLHF(args).main()
[rank1]: File "/mnt/workspace/internvl3/ms-swift/swift/llm/base.py", line 47, in main
[rank1]: result = self.run()
[rank1]: File "/mnt/workspace/internvl3/ms-swift/swift/llm/train/sft.py", line 144, in run
[rank1]: return self.train(trainer)
[rank1]: File "/mnt/workspace/internvl3/ms-swift/swift/llm/train/sft.py", line 204, in train
[rank1]: trainer.train(trainer.args.resume_from_checkpoint)
[rank1]: File "/mnt/workspace/internvl3/ms-swift/swift/trainers/mixin.py", line 294, in train
[rank1]: res = super().train(*args, **kwargs)
[rank1]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2245, in train
[rank1]: return inner_training_loop(
[rank1]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2560, in _inner_training_loop
[rank1]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank1]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 3736, in training_step
[rank1]: loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
[rank1]: File "/mnt/workspace/internvl3/ms-swift/swift/trainers/rlhf_trainer/rlhf_mixin.py", line 98, in compute_loss
[rank1]: res = super().compute_loss(model, inputs, return_outputs=return_outputs)
[rank1]: File "/usr/local/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 1356, in compute_loss
[rank1]: loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
[rank1]: File "/usr/local/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 1291, in get_batch_loss_metrics
[rank1]: model_output = self.concatenated_forward(model, batch)
[rank1]: File "/mnt/workspace/internvl3/ms-swift/swift/trainers/rlhf_trainer/dpo_trainer.py", line 68, in concatenated_forward
[rank1]: all_logps, size_completion = self.get_batch_logps(
[rank1]: File "/mnt/workspace/internvl3/ms-swift/swift/trainers/rlhf_trainer/dpo_trainer.py", line 128, in get_batch_logps
[rank1]: per_token_logps = torch.gather(logits.log_softmax(-1), dim=2, index=labels.unsqueeze(2)).squeeze(2)
[rank1]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.06 GiB. GPU has a total capacity of 95.62 GiB of which 608.50 MiB is free. Process 3139980 has 95.08 GiB memory in use. Of the allocated memory 87.91 GiB is allocated by PyTorch, and 822.07 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

hjh0119 (Collaborator) commented Apr 14, 2025

Maybe --deepspeed zero3 will work well.
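That is, change --deepspeed zero2 to --deepspeed zero3 in the launch command above: ZeRO-2 only shards optimizer states and gradients, while ZeRO-3 additionally shards the model parameters across the GPUs, which substantially lowers the per-GPU weight footprint for a 38B model.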

fourierer (Author) commented

Maybe --deepspeed zero3 will work well.

Thanks, the problem is solved.
