
Qwen2.5vl32B merge lora OOM issue #4145


Open
SeuZL opened this issue May 9, 2025 · 6 comments

Comments

@SeuZL

SeuZL commented May 9, 2025

Hi, I'm using an Ascend machine with 8 x 910B cards.
I'm trying to fine-tune and then run inference with Qwen2.5-VL-32B.
When training directly, the model hit OOM while loading, so I used ZeRO-3 to shard it across the eight cards, which avoided that problem.
Now, at the inference stage, I again can't load the model on a single card (OOM).
I tried to distribute the model across multiple cards, but given my limited coding skills I haven't managed to. Is there a parameter that does this?
Later I also need to merge LoRA for this model, and I hit the same problem of the model not fitting on a single card. Is there a parameter for merge lora that solves this?

Looking forward to your reply, and thanks again for your excellent work.

@Jintao-Huang
Collaborator

try --device_map auto
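
For reference, a minimal sketch of what loading with device_map auto corresponds to at the transformers level (the model class, checkpoint path, and the torch_npu import are assumptions for this Ascend setup, not something taken from ms-swift):

# Sketch only: --device_map auto lets transformers/accelerate shard the
# checkpoint across all visible devices instead of loading it on one card.
import torch
import torch_npu  # registers the Ascend/NPU backend (assumption for 910B setups)
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

ckpt = "/data/output/zhongda/v36-20250506-151902/checkpoint-300-merged"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    ckpt,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # let accelerate spread layers over the 8 NPUs
)
processor = AutoProcessor.from_pretrained(ckpt)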

@SeuZL
Author

SeuZL commented May 9, 2025

try --device_map auto

Thank you very much for the reply. With that flag, merge lora works now, but inference runs into a new bug:

(I tried deleting the NPROC_PER_NODE line and also setting it to 1 and 8: deleting it or using 1 gives the error report below, while 8 gives OOM.)

NPROC_PER_NODE=1 \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
MAX_PIXELS=802816 \
swift infer \
    --model /data/output/zhongda/v36-20250506-151902/checkpoint-300-merged \
    --infer_backend pt \
    --max_batch_size 1 \
    --val_dataset /root/zhangliang/Qwen2.5-VL/zhongda/xiongpian/valdata.jsonl \
    --device_map auto \
    --temperature 0.1 \
    --repetition_penalty 1.2 \
    --top_p 0.95 \
    --max_new_tokens 512
run sh: /root/anaconda3/envs/qwenft/bin/python -m torch.distributed.run --nproc_per_node 1 /root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/cli/infer.py --model /data/output/zhongda/v36-20250506-151902/checkpoint-300-merged --infer_backend pt --max_batch_size 1 --val_dataset /root/zhangliang/Qwen2.5-VL/zhongda/xiongpian/valdata.jsonl --device_map auto --temperature 0.1 --repetition_penalty 1.2 --top_p 0.95 --max_new_tokens 512
[INFO:swift] Successfully registered /root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/dataset/data/dataset_info.json
[INFO:swift] Successfully loaded /data/output/zhongda/v36-20250506-151902/checkpoint-300-merged/args.json.
[INFO:swift] rank: 0, local_rank: 0, world_size: 1, local_world_size: 1
[INFO:swift] Loading the model using model_dir: /data/output/zhongda/v36-20250506-151902/checkpoint-300-merged
[INFO:swift] Because len(args.val_dataset) > 0, setting split_dataset_ratio: 0.0
[INFO:swift] args.result_path: /data/output/zhongda/v36-20250506-151902/checkpoint-300-merged/infer_result/20250509-165006.jsonl
[INFO:swift] Setting args.eval_human: False
[INFO:swift] Global seed set to 42
[INFO:swift] args: InferArguments(model='/data/output/zhongda/v36-20250506-151902/checkpoint-300-merged', model_type='qwen2_5_vl', model_revision=None, task_type='causal_lm', torch_dtype=torch.bfloat16, attn_impl=None, num_labels=None, rope_scaling=None, device_map='auto', local_repo_path=None, template='qwen2_5_vl', system=None, max_length=None, truncation_strategy='delete', max_pixels=None, tools_prompt='react_en', norm_bbox=None, padding_side='right', loss_scale='default', sequence_parallel_size=1, use_chat_template=True, template_backend='swift', dataset=[], val_dataset=['/root/zhangliang/Qwen2.5-VL/zhongda/xiongpian/valdata.jsonl'], split_dataset_ratio=0.0, data_seed=42, dataset_num_proc=1, streaming=False, enable_cache=False, download_mode='reuse_dataset_if_exists', columns={}, strict=False, remove_unused_columns=True, model_name=[None, None], model_author=[None, None], custom_dataset_info=[], quant_method=None, quant_bits=None, hqq_axis=None, bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_quant_type='nf4', bnb_4bit_use_double_quant=True, bnb_4bit_quant_storage=None, max_new_tokens=512, temperature=0.1, top_k=None, top_p=0.95, repetition_penalty=1.2, num_beams=1, stream=False, stop_words=[], logprobs=False, top_logprobs=None, ckpt_dir='/data/output/zhongda/v36-20250506-151902/checkpoint-300-merged', load_dataset_config=None, lora_modules=[], tuner_backend='peft', train_type='lora', adapters=[], seed=42, model_kwargs={'device_map': {'': 'npu:auto'}}, load_args=True, load_data_args=False, use_hf=False, hub_token=None, custom_register_path=[], ignore_args_error=False, use_swift_lora=False, tp=1, session_len=None, cache_max_entry_count=0.8, quant_policy=0, vision_batch_size=1, gpu_memory_utilization=0.9, tensor_parallel_size=1, pipeline_parallel_size=1, max_num_seqs=256, max_model_len=None, disable_custom_all_reduce=False, enforce_eager=False, limit_mm_per_prompt={}, vllm_max_lora_rank=16, enable_prefix_caching=False, merge_lora=False, safe_serialization=True, max_shard_size='5GB', infer_backend='pt', result_path='/data/output/zhongda/v36-20250506-151902/checkpoint-300-merged/infer_result/20250509-165006.jsonl', metric=None, max_batch_size=1, ddp_backend='hccl', val_dataset_sample=None)
[INFO:swift] Using MP + DDP(device_map)
[INFO:swift] Using MP + DDP(device_map)
[INFO:swift] Loading the model using model_dir: /data/output/zhongda/v36-20250506-151902/checkpoint-300-merged
Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
[INFO:swift] model_kwargs: {'device_map': 'auto'}
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/cli/infer.py", line 5, in
[rank0]: infer_main()
[rank0]: File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 241, in infer_main
[rank0]: return SwiftInfer(args).main()
[rank0]: File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 35, in init
[rank0]: model, self.template = prepare_model_template(args)
[rank0]: File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/utils.py", line 144, in prepare_model_template
[rank0]: model, processor = args.get_model_processor(**kwargs)
[rank0]: File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/base_args/base_args.py", line 280, in get_model_processor
[rank0]: return get_model_tokenizer(**kwargs)
[rank0]: File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/model/register.py", line 529, in get_model_tokenizer
[rank0]: model, processor = get_function(model_dir, model_info, model_kwargs, load_model, **kwargs)
[rank0]: File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/model/model/qwen.py", line 585, in get_model_tokenizer_qwen2_5_vl
[rank0]: return get_model_tokenizer_qwen2_vl(*args, **kwargs)
[rank0]: File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/model/model/qwen.py", line 516, in get_model_tokenizer_qwen2_vl
[rank0]: model, tokenizer = get_model_tokenizer_multimodal(*args, **kwargs)
[rank0]: File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/model/register.py", line 251, in get_model_tokenizer_multimodal
[rank0]: model, _ = get_model_tokenizer_with_flash_attn(model_dir, *args, **kwargs)
[rank0]: File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/model/register.py", line 244, in get_model_tokenizer_with_flash_attn
[rank0]: return get_model_tokenizer_from_local(model_dir, model_info, model_kwargs, load_model, **kwargs)
[rank0]: File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/model/register.py", line 223, in get_model_tokenizer_from_local
[rank0]: model = automodel_class.from_pretrained(
[rank0]: File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/transformers/modeling_utils.py", line 262, in _wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4259, in from_pretrained
[rank0]: device_map = infer_auto_device_map(model, dtype=target_dtype, **device_map_kwargs)
[rank0]: File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/model/patcher.py", line 266, in _infer_auto_device_map_patch
[rank0]: max_memory = _get_max_memory(device_ids)
[rank0]: File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/utils/torch_utils.py", line 134, in _get_max_memory
[rank0]: max_memory[i] = torch.cuda.mem_get_info(i)[0]
[rank0]: File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/torch/cuda/memory.py", line 685, in mem_get_info
[rank0]: return torch.cuda.cudart().cudaMemGetInfo(device)
[rank0]: File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/torch/cuda/init.py", line 340, in cudart
[rank0]: _lazy_init()
[rank0]: File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/torch/cuda/init.py", line 305, in _lazy_init
[rank0]: raise AssertionError("Torch not compiled with CUDA enabled")
[rank0]: AssertionError: Torch not compiled with CUDA enabled
[ERROR] 2025-05-09-16:50:22 (PID:876200, Device:0, RankID:-1) ERR99999 UNKNOWN applicaiton exception
E0509 16:50:28.047000 281473490513952 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 876200) of binary: /root/anaconda3/envs/qwenft/bin/python
Traceback (most recent call last):
File "/root/anaconda3/envs/qwenft/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/root/anaconda3/envs/qwenft/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/torch/distributed/run.py", line 905, in
main()
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/cli/infer.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2025-05-09_16:50:28
host : ascendnode5
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 876200)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

I'm on an NPU, and I can hardly find any solutions for this problem online. I've tried for a long time myself without success. Looking forward to your reply, and thank you again.

@Jintao-Huang
Collaborator

Deleting that line should work fine, though.

@Jintao-Huang
Collaborator

Could you test whether this works: torch.npu.mem_get_info(0)?

I don't have an NPU here to test with.
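
As a minimal sketch, assuming the torch_npu plugin is installed (which registers the torch.npu namespace), the check could look like this:

# Sketch only: query free/total memory on NPU device 0.
import torch
import torch_npu  # needed so that torch.npu becomes available (assumption)

free_bytes, total_bytes = torch.npu.mem_get_info(0)  # mirrors torch.cuda.mem_get_info
print(f"NPU 0: free {free_bytes / 1e9:.1f} GB / total {total_bytes / 1e9:.1f} GB")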

@SeuZL
Author

SeuZL commented May 9, 2025

(screenshot attached)

Hi, where am I supposed to put this call?

Could you test whether this works: torch.npu.mem_get_info(0)?

I don't have an NPU here to test with.

@SeuZL
Author

SeuZL commented May 9, 2025

Do I need to modify the infer script?

Could you test whether this works: torch.npu.mem_get_info(0)?

I don't have an NPU here to test with.
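
Purely as an untested sketch: the traceback above shows _get_max_memory in swift/utils/torch_utils.py querying free memory with torch.cuda.mem_get_info, which cannot work on a CUDA-free Ascend build. If torch.npu.mem_get_info(0) does return sensible values, a local workaround along these lines might be worth trying (the exact return shape swift expects is an assumption):

# Hypothetical local patch for _get_max_memory in swift/utils/torch_utils.py.
# Untested sketch: falls back to the NPU memory query when CUDA is unavailable.
import torch

def _get_max_memory(device_ids):
    """Return {device_index: free_bytes} for each visible accelerator."""
    max_memory = {}
    for i in device_ids:
        if torch.cuda.is_available():
            max_memory[i] = torch.cuda.mem_get_info(i)[0]
        else:
            # Ascend path; requires torch_npu to be imported (assumption)
            max_memory[i] = torch.npu.mem_get_info(i)[0]
    return max_memory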
