InternVL3 + DPO training error #3870
You can modify the transformers code as follows:

```diff
- tp_plan_regex = (
-     re.compile("|".join([re.escape(plan) for plan in model._tp_plan]))
-     if _torch_distributed_available and torch.distributed.is_initialized()
-     else None
- )
+ tp_plan_regex = None
```

related issue |
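The same workaround can be written as a defensive guard rather than hard-coding `None`. This is only a sketch (the helper name and parameters are hypothetical, not the actual transformers implementation): it returns `None` whenever the model has no tensor-parallel plan, which is exactly the case that triggers the `TypeError` below.

```python
import re

def build_tp_plan_regex(tp_plan, distributed_initialized):
    """Compile a regex over the TP-plan keys, or return None when absent.

    Hypothetical helper mirroring transformers' caching_allocator_warmup logic.
    """
    # model._tp_plan can be None for models without a tensor-parallel plan;
    # guard against iterating over None (the TypeError in the traceback).
    if not tp_plan or not distributed_initialized:
        return None
    return re.compile("|".join(re.escape(plan) for plan in tp_plan))
```

With this guard, a `None` plan simply disables the regex instead of raising.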
Thanks for the reply. The earlier error is now fixed, but I'm running into OOM: LoRA training on 4×90 GB GPUs still goes OOM, while an earlier version could DPO-train InternVL-38B on just two GPUs. How can I improve this? CUDA_VISIBLE_DEVICES=0,1,2,3 Error: |
maybe |
Thanks, the problem is solved. |
Hi, I'm using the latest branch to run DPO training on InternVL3-38B. The training command is:
```shell
nproc_per_node=2

CUDA_VISIBLE_DEVICES=0,1 \
NPROC_PER_NODE=$nproc_per_node \
swift rlhf \
    --rlhf_type dpo \
    --model_type internvl3 \
    --model /mnt/workspace/internvl3/SFT/universe/InternVL3-38B_sft_lora_universe_extract_v2_1/v0-20250413-203633/checkpoint-615-merged \
    --load_args false \
    --train_type lora \
    --dataset /mnt/workspace/multimodal_database/grounding/train/train_mix_stack_box_PO_norm1000_285.json \
    --output_dir /mnt/workspace/internvl3/SFT/universe/InternVL3-38B_sft_dpo_lora_universe_extract_v2_2 \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 5 \
    --logging_steps 5 \
    --max_length 2048 \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --deepspeed zero2 \
    --dataset_num_proc 4
```
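As a side note, the `$(expr 16 / $nproc_per_node)` expression in the command keeps the effective global batch size constant regardless of GPU count. A minimal sketch of that arithmetic (variable names mirror the launch flags):

```python
nproc_per_node = 2
per_device_train_batch_size = 1

# Mirrors --gradient_accumulation_steps $(expr 16 / $nproc_per_node)
gradient_accumulation_steps = 16 // nproc_per_node

# Effective global batch size = per-device batch x processes x accumulation
effective_batch = (per_device_train_batch_size * nproc_per_node
                   * gradient_accumulation_steps)
print(effective_batch)  # 16
```

So with 2 GPUs the accumulation is 8 steps; with 4 GPUs it would drop to 4, and the optimizer still sees 16 samples per update.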
Then it errors out while loading the model:
```text
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/workspace/internvl3/ms-swift/swift/cli/rlhf.py", line 5, in <module>
[rank0]: rlhf_main()
[rank0]: File "/mnt/workspace/internvl3/ms-swift/swift/llm/train/rlhf.py", line 98, in rlhf_main
[rank0]: return SwiftRLHF(args).main()
[rank0]: File "/mnt/workspace/internvl3/ms-swift/swift/llm/train/sft.py", line 31, in __init__
[rank0]: self._prepare_model_tokenizer()
[rank0]: File "/mnt/workspace/internvl3/ms-swift/swift/llm/train/rlhf.py", line 65, in _prepare_model_tokenizer
[rank0]: super()._prepare_model_tokenizer()
[rank0]: File "/mnt/workspace/internvl3/ms-swift/swift/llm/train/sft.py", line 62, in _prepare_model_tokenizer
[rank0]: self.model, self.processor = args.get_model_processor()
[rank0]: File "/mnt/workspace/internvl3/ms-swift/swift/llm/argument/base_args/base_args.py", line 276, in get_model_processor
[rank0]: return get_model_tokenizer(**kwargs)
[rank0]: File "/mnt/workspace/internvl3/ms-swift/swift/llm/model/register.py", line 564, in get_model_tokenizer
[rank0]: model, processor = get_function(model_dir, model_info, model_kwargs, load_model, **kwargs)
[rank0]: File "/mnt/workspace/internvl3/ms-swift/swift/llm/model/model/internlm.py", line 129, in get_model_tokenizer_internvl
[rank0]: model, tokenizer = get_model_tokenizer_with_flash_attn(model_dir, model_info, model_kwargs, load_model, **kwargs)
[rank0]: File "/mnt/workspace/internvl3/ms-swift/swift/llm/model/register.py", line 265, in get_model_tokenizer_with_flash_attn
[rank0]: return get_model_tokenizer_from_local(model_dir, model_info, model_kwargs, load_model, **kwargs)
[rank0]: File "/mnt/workspace/internvl3/ms-swift/swift/llm/model/register.py", line 234, in get_model_tokenizer_from_local
[rank0]: model = automodel_class.from_pretrained(
[rank0]: File "/usr/local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
[rank0]: return model_class.from_pretrained(
[rank0]: File "/mnt/workspace/internvl3/ms-swift/swift/llm/model/patcher.py", line 285, in _new_from_pretrained
[rank0]: return from_pretrained(cls, *args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 279, in _wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4399, in from_pretrained
[rank0]: ) = cls._load_pretrained_model(
[rank0]: File "/usr/local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4793, in _load_pretrained_model
[rank0]: caching_allocator_warmup(model_to_load, expanded_device_map, factor=2 if hf_quantizer is None else 4)
[rank0]: File "/usr/local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 5779, in caching_allocator_warmup
[rank0]: re.compile("|".join([re.escape(plan) for plan in model._tp_plan]))
[rank0]: TypeError: 'NoneType' object is not iterable
```
The command above uses an SFT-ed model, but I get the same error when running directly on the original InternVL3-38B model. How can I fix this?
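The `TypeError` at the bottom of the traceback can be reproduced in isolation; it has nothing to do with the model weights themselves. The failing line iterates over `model._tp_plan`, which is `None` here, and `str.join` cannot iterate over `None` (a minimal standalone reproduction, not the actual transformers code path):

```python
import re

# model._tp_plan is None for this model, which is what the traceback shows
tp_plan = None

try:
    re.compile("|".join(re.escape(plan) for plan in tp_plan))
except TypeError as err:
    print(err)  # 'NoneType' object is not iterable
```

This is why the workaround above (setting `tp_plan_regex = None`, or guarding on an empty plan) avoids the crash.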