
Qwen2.5vl32B merge lora OOM issue #4145


Open
SeuZL opened this issue May 9, 2025 · 6 comments

Comments

@SeuZL

SeuZL commented May 9, 2025

Hi, I'm using an Ascend machine with 8 x 910B cards.
I'm trying to fine-tune and then run inference with Qwen2.5-VL-32B.
When training directly, the model hit OOM while loading, so I used ZeRO-3 to shard it across the eight cards, which avoided that problem.
Now, at the inference stage, I again can't load the model on a single card (OOM).
I tried to distribute the model across multiple cards, but given my limited coding skills I haven't managed to. Is there a parameter that does this?
Later I also need to merge LoRA for this model, and I hit the same problem of the model not fitting on a single card. Is there a parameter for merge lora that solves this?

Looking forward to your reply, and thanks again for your excellent work.

@Jintao-Huang
Collaborator

try --device_map auto
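
For reference, a minimal sketch of what loading with device_map auto corresponds to at the transformers level (the model class, checkpoint path, and the torch_npu import are assumptions for this Ascend setup, not something taken from ms-swift):

# Sketch only: --device_map auto lets transformers/accelerate shard the
# checkpoint across all visible devices instead of loading it on one card.
import torch
import torch_npu  # registers the Ascend/NPU backend (assumption for 910B setups)
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

ckpt = "/data/output/zhongda/v36-20250506-151902/checkpoint-300-merged"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    ckpt,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # let accelerate spread layers over the 8 NPUs
)
processor = AutoProcessor.from_pretrained(ckpt)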

@SeuZL
Author

SeuZL commented May 9, 2025

try --device_map auto

Thank you very much for the reply. With that flag, merge lora works now, but inference runs into a new bug:

(I tried deleting the NPROC_PER_NODE line and also setting it to 1 and 8: deleting it or using 1 gives the error report below, while 8 gives OOM.)

NPROC_PER_NODE=1 \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
MAX_PIXELS=802816 \
swift infer \
    --model /data/output/zhongda/v36-20250506-151902/checkpoint-300-merged \
    --infer_backend pt \
    --max_batch_size 1 \
    --val_dataset /root/zhangliang/Qwen2.5-VL/zhongda/xiongpian/valdata.jsonl \
    --device_map auto \
    --temperature 0.1 \
    --repetition_penalty 1.2 \
    --top_p 0.95 \
    --max_new_tokens 512
run sh: /root/anaconda3/envs/qwenft/bin/python -m torch.distributed.run --nproc_per_node 1 /root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/cli/infer.py --model /data/output/zhongda/v36-20250506-151902/checkpoint-300-merged --infer_backend pt --max_batch_size 1 --val_dataset /root/zhangliang/Qwen2.5-VL/zhongda/xiongpian/valdata.jsonl --device_map auto --temperature 0.1 --repetition_penalty 1.2 --top_p 0.95 --max_new_tokens 512
[INFO:swift] Successfully registered /root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/dataset/data/dataset_info.json
[INFO:swift] Successfully loaded /data/output/zhongda/v36-20250506-151902/checkpoint-300-merged/args.json.
[INFO:swift] rank: 0, local_rank: 0, world_size: 1, local_world_size: 1
[INFO:swift] Loading the model using model_dir: /data/output/zhongda/v36-20250506-151902/checkpoint-300-merged
[INFO:swift] Because len(args.val_dataset) > 0, setting split_dataset_ratio: 0.0
[INFO:swift] args.result_path: /data/output/zhongda/v36-20250506-151902/checkpoint-300-merged/infer_result/20250509-165006.jsonl
[INFO:swift] Setting args.eval_human: False
[INFO:swift] Global seed set to 42
[INFO:swift] args: InferArguments(model='/data/output/zhongda/v36-20250506-151902/checkpoint-300-merged', model_type='qwen2_5_vl', model_revision=None, task_type='causal_lm', torch_dtype=torch.bfloat16, attn_impl=None, num_labels=None, rope_scaling=None, device_map='auto', local_repo_path=None, template='qwen2_5_vl', system=None, max_length=None, truncation_strategy='delete', max_pixels=None, tools_prompt='react_en', norm_bbox=None, padding_side='right', loss_scale='default', sequence_parallel_size=1, use_chat_template=True, template_backend='swift', dataset=[], val_dataset=['/root/zhangliang/Qwen2.5-VL/zhongda/xiongpian/valdata.jsonl'], split_dataset_ratio=0.0, data_seed=42, dataset_num_proc=1, streaming=False, enable_cache=False, download_mode='reuse_dataset_if_exists', columns={}, strict=False, remove_unused_columns=True, model_name=[None, None], model_author=[None, None], custom_dataset_info=[], quant_method=None, quant_bits=None, hqq_axis=None, bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_quant_type='nf4', bnb_4bit_use_double_quant=True, bnb_4bit_quant_storage=None, max_new_tokens=512, temperature=0.1, top_k=None, top_p=0.95, repetition_penalty=1.2, num_beams=1, stream=False, stop_words=[], logprobs=False, top_logprobs=None, ckpt_dir='/data/output/zhongda/v36-20250506-151902/checkpoint-300-merged', load_dataset_config=None, lora_modules=[], tuner_backend='peft', train_type='lora', adapters=[], seed=42, model_kwargs={'device_map': {'': 'npu:auto'}}, load_args=True, load_data_args=False, use_hf=False, hub_token=None, custom_register_path=[], ignore_args_error=False, use_swift_lora=False, tp=1, session_len=None, cache_max_entry_count=0.8, quant_policy=0, vision_batch_size=1, gpu_memory_utilization=0.9, tensor_parallel_size=1, pipeline_parallel_size=1, max_num_seqs=256, max_model_len=None, disable_custom_all_reduce=False, enforce_eager=False, limit_mm_per_prompt={}, vllm_max_lora_rank=16, enable_prefix_caching=False, merge_lora=False, safe_serialization=True, max_shard_size='5GB', infer_backend='pt', result_path='/data/output/zhongda/v36-20250506-151902/checkpoint-300-merged/infer_result/20250509-165006.jsonl', metric=None, max_batch_size=1, ddp_backend='hccl', val_dataset_sample=None)
[INFO:swift] Using MP + DDP(device_map)
[INFO:swift] Using MP + DDP(device_map)
[INFO:swift] Loading the model using model_dir: /data/output/zhongda/v36-20250506-151902/checkpoint-300-merged
Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
[INFO:swift] model_kwargs: {'device_map': 'auto'}
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/cli/infer.py", line 5, in
[rank0]: infer_main()
[rank0]: File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 241, in infer_main
[rank0]: return SwiftInfer(args).main()
[rank0]: File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 35, in init
[rank0]: model, self.template = prepare_model_template(args)
[rank0]: File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/utils.py", line 144, in prepare_model_template
[rank0]: model, processor = args.get_model_processor(**kwargs)
[rank0]: File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/base_args/base_args.py", line 280, in get_model_processor
[rank0]: return get_model_tokenizer(**kwargs)
[rank0]: File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/model/register.py", line 529, in get_model_tokenizer
[rank0]: model, processor = get_function(model_dir, model_info, model_kwargs, load_model, **kwargs)
[rank0]: File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/model/model/qwen.py", line 585, in get_model_tokenizer_qwen2_5_vl
[rank0]: return get_model_tokenizer_qwen2_vl(*args, **kwargs)
[rank0]: File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/model/model/qwen.py", line 516, in get_model_tokenizer_qwen2_vl
[rank0]: model, tokenizer = get_model_tokenizer_multimodal(*args, **kwargs)
[rank0]: File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/model/register.py", line 251, in get_model_tokenizer_multimodal
[rank0]: model, _ = get_model_tokenizer_with_flash_attn(model_dir, *args, **kwargs)
[rank0]: File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/model/register.py", line 244, in get_model_tokenizer_with_flash_attn
[rank0]: return get_model_tokenizer_from_local(model_dir, model_info, model_kwargs, load_model, **kwargs)
[rank0]: File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/model/register.py", line 223, in get_model_tokenizer_from_local
[rank0]: model = automodel_class.from_pretrained(
[rank0]: File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/transformers/modeling_utils.py", line 262, in _wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4259, in from_pretrained
[rank0]: device_map = infer_auto_device_map(model, dtype=target_dtype, **device_map_kwargs)
[rank0]: File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/model/patcher.py", line 266, in _infer_auto_device_map_patch
[rank0]: max_memory = _get_max_memory(device_ids)
[rank0]: File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/utils/torch_utils.py", line 134, in _get_max_memory
[rank0]: max_memory[i] = torch.cuda.mem_get_info(i)[0]
[rank0]: File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/torch/cuda/memory.py", line 685, in mem_get_info
[rank0]: return torch.cuda.cudart().cudaMemGetInfo(device)
[rank0]: File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/torch/cuda/init.py", line 340, in cudart
[rank0]: _lazy_init()
[rank0]: File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/torch/cuda/init.py", line 305, in _lazy_init
[rank0]: raise AssertionError("Torch not compiled with CUDA enabled")
[rank0]: AssertionError: Torch not compiled with CUDA enabled
[ERROR] 2025-05-09-16:50:22 (PID:876200, Device:0, RankID:-1) ERR99999 UNKNOWN applicaiton exception
E0509 16:50:28.047000 281473490513952 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 876200) of binary: /root/anaconda3/envs/qwenft/bin/python
Traceback (most recent call last):
File "/root/anaconda3/envs/qwenft/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/root/anaconda3/envs/qwenft/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/torch/distributed/run.py", line 905, in
main()
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/cli/infer.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2025-05-09_16:50:28
host : ascendnode5
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 876200)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

I'm on an NPU, and I can hardly find any solutions for this problem online. I've tried for a long time myself without success. Looking forward to your reply, and thank you again.

@Jintao-Huang
Collaborator

Deleting that line should work fine, though.

@Jintao-Huang
Collaborator

Could you test whether this works: torch.npu.mem_get_info(0)?

I don't have an NPU here to test with.
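
As a minimal sketch, assuming the torch_npu plugin is installed (which registers the torch.npu namespace), the check could look like this:

# Sketch only: query free/total memory on NPU device 0.
import torch
import torch_npu  # needed so that torch.npu becomes available (assumption)

free_bytes, total_bytes = torch.npu.mem_get_info(0)  # mirrors torch.cuda.mem_get_info
print(f"NPU 0: free {free_bytes / 1e9:.1f} GB / total {total_bytes / 1e9:.1f} GB")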

@SeuZL
Author

SeuZL commented May 9, 2025

(screenshot attached)

Hi, where am I supposed to put this call?

Could you test whether this works: torch.npu.mem_get_info(0)?

I don't have an NPU here to test with.

@SeuZL
Author

SeuZL commented May 9, 2025

Do I need to modify the infer script?

Could you test whether this works: torch.npu.mem_get_info(0)?

I don't have an NPU here to test with.
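
Purely as an untested sketch: the traceback above shows _get_max_memory in swift/utils/torch_utils.py querying free memory with torch.cuda.mem_get_info, which cannot work on a CUDA-free Ascend build. If torch.npu.mem_get_info(0) does return sensible values, a local workaround along these lines might be worth trying (the exact return shape swift expects is an assumption):

# Hypothetical local patch for _get_max_memory in swift/utils/torch_utils.py.
# Untested sketch: falls back to the NPU memory query when CUDA is unavailable.
import torch

def _get_max_memory(device_ids):
    """Return {device_index: free_bytes} for each visible accelerator."""
    max_memory = {}
    for i in device_ids:
        if torch.cuda.is_available():
            max_memory[i] = torch.cuda.mem_get_info(i)[0]
        else:
            # Ascend path; requires torch_npu to be imported (assumption)
            max_memory[i] = torch.npu.mem_get_info(i)[0]
    return max_memory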
