
Qwen 2.5 GPTQ int4 quantized model: LoRA fine-tuning error #3910


Closed
wangxiajun68 opened this issue Apr 17, 2025 · 1 comment


wangxiajun68 commented Apr 17, 2025

Describe the bug
Does swift support fine-tuning quantized models? Or is this error caused by my environment?

> **Your hardware and system info**
> [INFO:swift] model_parameter_info: PeftModelForSequenceClassification: 326.2853M Params (14.9688M Trainable [4.5877%]), 371.2943M Buffers.
> /mnt/wangxj/ms-swift/swift/trainers/mixin.py:81: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `RewardTrainer.__init__`. Use `processing_class` instead.
>   super().__init__(
> WARNING:accelerate.utils.other:Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
> No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
> [INFO:swift] The logging file will be saved in: /mnt/wangxj/ms-swift/output/ner_dpo/v16-20250417-070456/logging.jsonl
> ERROR MarlinQuantLinear: `gptqmodel.nn_modules.qlinear.marlin.MarlinQuantLinear` switching to training mode.
> [rank0]: Traceback (most recent call last):
> [rank0]:   File "/mnt/wangxj/ms-swift/swift/cli/rlhf.py", line 5, in <module>
> [rank0]:     rlhf_main()
> [rank0]:   File "/mnt/wangxj/ms-swift/swift/llm/train/rlhf.py", line 98, in rlhf_main
> [rank0]:     return SwiftRLHF(args).main()
> [rank0]:   File "/mnt/wangxj/ms-swift/swift/llm/base.py", line 47, in main
> [rank0]:     result = self.run()
> [rank0]:   File "/mnt/wangxj/ms-swift/swift/llm/train/sft.py", line 144, in run
> [rank0]:     return self.train(trainer)
> [rank0]:   File "/mnt/wangxj/ms-swift/swift/llm/train/sft.py", line 204, in train
> [rank0]:     trainer.train(trainer.args.resume_from_checkpoint)
> [rank0]:   File "/mnt/wangxj/ms-swift/swift/trainers/mixin.py", line 294, in train
> [rank0]:     res = super().train(*args, **kwargs)
> [rank0]:   File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/transformers/trainer.py", line 2245, in train
> [rank0]:     return inner_training_loop(
> [rank0]:   File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/transformers/trainer.py", line 2369, in _inner_training_loop
> [rank0]:     self.model.train()
> [rank0]:   File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2843, in train
> [rank0]:     module.train(mode)
> [rank0]:   File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2843, in train
> [rank0]:     module.train(mode)
> [rank0]:   File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2843, in train
> [rank0]:     module.train(mode)
> [rank0]:   [Previous line repeated 5 more times]
> [rank0]:   File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/gptqmodel/nn_modules/qlinear/__init__.py", line 400, in train
> [rank0]:     raise NotImplementedError(err)
> [rank0]: NotImplementedError: MarlinQuantLinear: `gptqmodel.nn_modules.qlinear.marlin.MarlinQuantLinear` switching to training mode.
> [rank0]:[W417 07:05:04.233287816 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
> E0417 07:05:05.894000 1842662 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 1842693) of binary: /home/tgnet/.conda/envs/swift/bin/python
> Traceback (most recent call last):
>   File "/home/tgnet/.conda/envs/swift/lib/python3.10/runpy.py", line 196, in _run_module_as_main
>     return _run_code(code, main_globals, None,
>   File "/home/tgnet/.conda/envs/swift/lib/python3.10/runpy.py", line 86, in _run_code
>     exec(code, run_globals)
>   File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/torch/distributed/run.py", line 922, in <module>
>     main()
>   File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
>     return f(*args, **kwargs)
>   File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
>     run(args)
>   File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
>     elastic_launch(
>   File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
>     return launch_agent(self._config, self._entrypoint, list(args))
>   File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
>     raise ChildFailedError(
> torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
> ============================================================
> /mnt/wangxj/ms-swift/swift/cli/rlhf.py FAILED
> ------------------------------------------------------------
> Failures:
>   <NO_OTHER_FAILURES>
> ------------------------------------------------------------
> Root Cause (first observed failure):
> [0]:
>   time      : 2025-04-17_07:05:05
>   host      : tgnet
>   rank      : 0 (local_rank: 0)
>   exitcode  : 1 (pid: 1842693)
>   error_file: <N/A>
>   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
> ============================================================
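
The key line is the `NotImplementedError` raised from `gptqmodel/nn_modules/qlinear/__init__.py`: `Trainer` calls `self.model.train()`, `torch.nn.Module.train()` recurses into every submodule, and gptqmodel's `MarlinQuantLinear` deliberately raises when asked to enter training mode, since the Marlin kernel is inference-only. A minimal plain-PyTorch sketch of that mechanism (illustration only, not gptqmodel's actual class):

```python
import torch.nn as nn

class InferenceOnlyLinear(nn.Linear):
    """Stand-in for an inference-only quantized kernel such as MarlinQuantLinear."""
    def train(self, mode: bool = True):
        if mode:
            # Mirrors gptqmodel's behavior: the fused kernel has no backward
            # pass, so switching to training mode is refused outright.
            raise NotImplementedError(
                "InferenceOnlyLinear: switching to training mode.")
        return super().train(mode)

model = nn.Sequential(nn.Linear(8, 8), InferenceOnlyLinear(8, 8))
try:
    model.train()  # nn.Module.train() recurses into children, as in the traceback
except NotImplementedError as e:
    print("reproduced:", e)
```

So the failure is independent of the LoRA settings: any trainer that flips the model into train mode will hit it as long as the Marlin kernel backs the quantized layers.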

CUDA 12.2
GPU RTX 3070
gptqmodel 2.2.0
auto_gptq 0.7.1
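
Since both gptqmodel 2.2.0 and auto_gptq 0.7.1 are installed, which backend swift picks up matters here; a quick sketch for confirming the installed versions (the names below are the PyPI distribution names):

```python
from importlib.metadata import version, PackageNotFoundError

# PyPI distribution names; auto_gptq's distribution is published as "auto-gptq".
for dist in ("gptqmodel", "auto-gptq", "ms-swift", "torch", "transformers"):
    try:
        print(dist, version(dist))
    except PackageNotFoundError:
        print(dist, "not installed")
```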

Additional context

nproc_per_node=1

CUDA_VISIBLE_DEVICES=0 \
NPROC_PER_NODE=$nproc_per_node \
swift rlhf \
    --rlhf_type rm \
    --model /mnt/wangxj/pre_train_model/qwen2.5-3b-gptq-int4 \
    --model_type qwen2_5 \
    --train_type lora \
    --dataset /ztb_evl_rlhf_convert.json \
    --torch_dtype bfloat16 \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps $(expr 8 / $nproc_per_node) \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir /mnt/wangxj/ms-swift/output/ner_dpo/ \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 1 \
    --deepspeed zero2 \
    --dataset_num_proc 1
@Jintao-Huang (Collaborator)

Please try uninstalling gptqmodel.
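
Presumably that means `pip uninstall gptqmodel`: with gptqmodel removed, ms-swift should fall back to loading the GPTQ checkpoint through the already-installed auto_gptq 0.7.1, whose quantized linear layers do not refuse `model.train()`, so the LoRA run can proceed.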
