
Qwen 2.5 GPTQ int4 quantized model: LoRA fine-tuning error #3910


Closed
wangxiajun68 opened this issue Apr 17, 2025 · 1 comment


wangxiajun68 commented Apr 17, 2025

Describe the bug
Does swift support fine-tuning quantized models? Or is this error caused by my environment?

> **Your hardware and system info**
> [INFO:swift] model_parameter_info: PeftModelForSequenceClassification: 326.2853M Params (14.9688M Trainable [4.5877%]), 371.2943M Buffers.
> /mnt/wangxj/ms-swift/swift/trainers/mixin.py:81: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `RewardTrainer.__init__`. Use `processing_class` instead.
>   super().__init__(
> WARNING:accelerate.utils.other:Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
> No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
> [INFO:swift] The logging file will be saved in: /mnt/wangxj/ms-swift/output/ner_dpo/v16-20250417-070456/logging.jsonl
> ERROR MarlinQuantLinear: `gptqmodel.nn_modules.qlinear.marlin.MarlinQuantLinear` switching to training mode.
> [rank0]: Traceback (most recent call last):
> [rank0]:   File "/mnt/wangxj/ms-swift/swift/cli/rlhf.py", line 5, in <module>
> [rank0]:     rlhf_main()
> [rank0]:   File "/mnt/wangxj/ms-swift/swift/llm/train/rlhf.py", line 98, in rlhf_main
> [rank0]:     return SwiftRLHF(args).main()
> [rank0]:   File "/mnt/wangxj/ms-swift/swift/llm/base.py", line 47, in main
> [rank0]:     result = self.run()
> [rank0]:   File "/mnt/wangxj/ms-swift/swift/llm/train/sft.py", line 144, in run
> [rank0]:     return self.train(trainer)
> [rank0]:   File "/mnt/wangxj/ms-swift/swift/llm/train/sft.py", line 204, in train
> [rank0]:     trainer.train(trainer.args.resume_from_checkpoint)
> [rank0]:   File "/mnt/wangxj/ms-swift/swift/trainers/mixin.py", line 294, in train
> [rank0]:     res = super().train(*args, **kwargs)
> [rank0]:   File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/transformers/trainer.py", line 2245, in train
> [rank0]:     return inner_training_loop(
> [rank0]:   File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/transformers/trainer.py", line 2369, in _inner_training_loop
> [rank0]:     self.model.train()
> [rank0]:   File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2843, in train
> [rank0]:     module.train(mode)
> [rank0]:   File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2843, in train
> [rank0]:     module.train(mode)
> [rank0]:   File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2843, in train
> [rank0]:     module.train(mode)
> [rank0]:   [Previous line repeated 5 more times]
> [rank0]:   File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/gptqmodel/nn_modules/qlinear/__init__.py", line 400, in train
> [rank0]:     raise NotImplementedError(err)
> [rank0]: NotImplementedError: MarlinQuantLinear: `gptqmodel.nn_modules.qlinear.marlin.MarlinQuantLinear` switching to training mode.
> [rank0]:[W417 07:05:04.233287816 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
> E0417 07:05:05.894000 1842662 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 1842693) of binary: /home/tgnet/.conda/envs/swift/bin/python
> Traceback (most recent call last):
>   File "/home/tgnet/.conda/envs/swift/lib/python3.10/runpy.py", line 196, in _run_module_as_main
>     return _run_code(code, main_globals, None,
>   File "/home/tgnet/.conda/envs/swift/lib/python3.10/runpy.py", line 86, in _run_code
>     exec(code, run_globals)
>   File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/torch/distributed/run.py", line 922, in <module>
>     main()
>   File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
>     return f(*args, **kwargs)
>   File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
>     run(args)
>   File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
>     elastic_launch(
>   File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
>     return launch_agent(self._config, self._entrypoint, list(args))
>   File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
>     raise ChildFailedError(
> torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
> ============================================================
> /mnt/wangxj/ms-swift/swift/cli/rlhf.py FAILED
> ------------------------------------------------------------
> Failures:
>   <NO_OTHER_FAILURES>
> ------------------------------------------------------------
> Root Cause (first observed failure):
> [0]:
>   time      : 2025-04-17_07:05:05
>   host      : tgnet
>   rank      : 0 (local_rank: 0)
>   exitcode  : 1 (pid: 1842693)
>   error_file: <N/A>
>   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
> ============================================================
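
The key line is the `NotImplementedError` raised from `gptqmodel/nn_modules/qlinear/__init__.py`: `Trainer` calls `self.model.train()`, `torch.nn.Module.train()` recurses into every submodule, and gptqmodel's `MarlinQuantLinear` deliberately raises when asked to enter training mode, since the Marlin kernel is inference-only. A minimal plain-PyTorch sketch of that mechanism (illustration only, not gptqmodel's actual class):

```python
import torch.nn as nn

class InferenceOnlyLinear(nn.Linear):
    """Stand-in for an inference-only quantized kernel such as MarlinQuantLinear."""
    def train(self, mode: bool = True):
        if mode:
            # Mirrors gptqmodel's behavior: the fused kernel has no backward
            # pass, so switching to training mode is refused outright.
            raise NotImplementedError(
                "InferenceOnlyLinear: switching to training mode.")
        return super().train(mode)

model = nn.Sequential(nn.Linear(8, 8), InferenceOnlyLinear(8, 8))
try:
    model.train()  # nn.Module.train() recurses into children, as in the traceback
except NotImplementedError as e:
    print("reproduced:", e)
```

So the failure is independent of the LoRA settings: any trainer that flips the model into train mode will hit it as long as the Marlin kernel backs the quantized layers.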

CUDA 12.2
GPU RTX 3070
gptqmodel 2.2.0
auto_gptq 0.7.1
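
Since both gptqmodel 2.2.0 and auto_gptq 0.7.1 are installed, which backend swift picks up matters here; a quick sketch for confirming the installed versions (the names below are the PyPI distribution names):

```python
from importlib.metadata import version, PackageNotFoundError

# PyPI distribution names; auto_gptq's distribution is published as "auto-gptq".
for dist in ("gptqmodel", "auto-gptq", "ms-swift", "torch", "transformers"):
    try:
        print(dist, version(dist))
    except PackageNotFoundError:
        print(dist, "not installed")
```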

Additional context

nproc_per_node=1

CUDA_VISIBLE_DEVICES=0 \
NPROC_PER_NODE=$nproc_per_node \
swift rlhf \
    --rlhf_type rm \
    --model /mnt/wangxj/pre_train_model/qwen2.5-3b-gptq-int4 \
    --model_type qwen2_5 \
    --train_type lora \
    --dataset /ztb_evl_rlhf_convert.json \
    --torch_dtype bfloat16 \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps $(expr 8 / $nproc_per_node) \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir /mnt/wangxj/ms-swift/output/ner_dpo/ \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 1 \
    --deepspeed zero2 \
    --dataset_num_proc 1
@Jintao-Huang (Collaborator)

Please try uninstalling gptqmodel.
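
Presumably that means `pip uninstall gptqmodel`: with gptqmodel removed, ms-swift should fall back to loading the GPTQ checkpoint through the already-installed auto_gptq 0.7.1, whose quantized linear layers do not refuse `model.train()`, so the LoRA run can proceed.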
