> **Your hardware and system info**
> [INFO:swift] model_parameter_info: PeftModelForSequenceClassification: 326.2853M Params (14.9688M Trainable [4.5877%]), 371.2943M Buffers.
> /mnt/wangxj/ms-swift/swift/trainers/mixin.py:81: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `RewardTrainer.__init__`. Use `processing_class` instead.
> super().__init__(
> WARNING:accelerate.utils.other:Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
> No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
> [INFO:swift] The logging file will be saved in: /mnt/wangxj/ms-swift/output/ner_dpo/v16-20250417-070456/logging.jsonl
> ERROR MarlinQuantLinear: `gptqmodel.nn_modules.qlinear.marlin.MarlinQuantLinear` switching to training mode.
> [rank0]: Traceback (most recent call last):
> [rank0]: File "/mnt/wangxj/ms-swift/swift/cli/rlhf.py", line 5, in <module>
> [rank0]: rlhf_main()
> [rank0]: File "/mnt/wangxj/ms-swift/swift/llm/train/rlhf.py", line 98, in rlhf_main
> [rank0]: return SwiftRLHF(args).main()
> [rank0]: File "/mnt/wangxj/ms-swift/swift/llm/base.py", line 47, in main
> [rank0]: result = self.run()
> [rank0]: File "/mnt/wangxj/ms-swift/swift/llm/train/sft.py", line 144, in run
> [rank0]: return self.train(trainer)
> [rank0]: File "/mnt/wangxj/ms-swift/swift/llm/train/sft.py", line 204, in train
> [rank0]: trainer.train(trainer.args.resume_from_checkpoint)
> [rank0]: File "/mnt/wangxj/ms-swift/swift/trainers/mixin.py", line 294, in train
> [rank0]: res = super().train(*args, **kwargs)
> [rank0]: File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/transformers/trainer.py", line 2245, in train
> [rank0]: return inner_training_loop(
> [rank0]: File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/transformers/trainer.py", line 2369, in _inner_training_loop
> [rank0]: self.model.train()
> [rank0]: File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2843, in train
> [rank0]: module.train(mode)
> [rank0]: File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2843, in train
> [rank0]: module.train(mode)
> [rank0]: File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2843, in train
> [rank0]: module.train(mode)
> [rank0]: [Previous line repeated 5 more times]
> [rank0]: File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/gptqmodel/nn_modules/qlinear/__init__.py", line 400, in train
> [rank0]: raise NotImplementedError(err)
> [rank0]: NotImplementedError: MarlinQuantLinear: `gptqmodel.nn_modules.qlinear.marlin.MarlinQuantLinear` switching to training mode.
> [rank0]:[W417 07:05:04.233287816 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
> E0417 07:05:05.894000 1842662 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 1842693) of binary: /home/tgnet/.conda/envs/swift/bin/python
> Traceback (most recent call last):
> File "/home/tgnet/.conda/envs/swift/lib/python3.10/runpy.py", line 196, in _run_module_as_main
> return _run_code(code, main_globals, None,
> File "/home/tgnet/.conda/envs/swift/lib/python3.10/runpy.py", line 86, in _run_code
> exec(code, run_globals)
> File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/torch/distributed/run.py", line 922, in <module>
> main()
> File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
> return f(*args, **kwargs)
> File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
> run(args)
> File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
> elastic_launch(
> File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
> return launch_agent(self._config, self._entrypoint, list(args))
> File "/home/tgnet/.conda/envs/swift/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
> raise ChildFailedError(
> torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
> ============================================================
> /mnt/wangxj/ms-swift/swift/cli/rlhf.py FAILED
> ------------------------------------------------------------
> Failures:
> <NO_OTHER_FAILURES>
> ------------------------------------------------------------
> Root Cause (first observed failure):
> [0]:
> time : 2025-04-17_07:05:05
> host : tgnet
> rank : 0 (local_rank: 0)
> exitcode : 1 (pid: 1842693)
> error_file: <N/A>
> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
> ============================================================
Describe the bug
Does swift support fine-tuning quantized models? Is this error caused by my environment?
CUDA 12.2
GPU RTX 3070
gptqmodel 2.2.0
auto_gptq 0.7.1
Additional context
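As far as I can tell from the traceback, `self.model.train()` walks down the module tree and the `MarlinQuantLinear` layer raises as soon as it is asked to switch to training mode. A minimal sketch of that behavior follows. It uses hypothetical stand-in classes (`Module`, `InferenceOnlyLinear`, `try_train`) rather than the real torch or gptqmodel code, so it only illustrates the mechanism and is not a reproduction of the library internals:

```python
class Module:
    """Hypothetical stand-in for torch.nn.Module: only the recursive train() logic."""

    def __init__(self, *children):
        self.children = list(children)
        self.training = True

    def train(self, mode=True):
        # Set the flag on this module, then recurse into every child.
        self.training = mode
        for child in self.children:
            child.train(mode)
        return self


class InferenceOnlyLinear(Module):
    """Hypothetical stand-in for a quantized layer without a training path."""

    def train(self, mode=True):
        # Assumption: the layer refuses train(True) but still allows eval mode.
        if mode:
            raise NotImplementedError(
                f"{type(self).__name__}: switching to training mode.")
        return super().train(mode)


def try_train(model):
    """Return None on success, or the error message if a submodule refuses."""
    try:
        model.train()
    except NotImplementedError as exc:
        return str(exc)
    return None


# The refusing layer sits two levels down, as in the nested frames of the log.
model = Module(Module(InferenceOnlyLinear()))
print(try_train(model))
```

If this reading is right, the failure would come from the quantized layer type selected at load time rather than from CUDA or the GPU, but I have not confirmed that.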