run sh: `/home/ma-user/anaconda3/envs/PyTorch-2.1.0/bin/python3.9 /share/code/ms-swift/swift/cli/sft.py --model_type=qwen3 --dataset=/share/code/QwenInfer/gigaspeech_continuation_qwen.jsonl --model=/share/code/Qwen3-8B --num_train_epochs=5 --train_type=full --output_dir=outputs --eval_steps=1000 --save_steps=1000 --device_map=npu --ddp_backend hccl --per_device_train_batch_size=30 --dataloader_num_workers=20 --lazy_tokenize true --torch_dtype=bfloat16 --check_model=false --max_length=2048 --learning_rate=1e-3 --warmup_steps=1000 --lr_scheduler_type=cosine --dataset_prefix=/share/DATA/ --gradient_accumulation_steps=1 --dataset_num_proc=2 --save_total_limit=5`
[2025-05-05 20:04:57,571] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to npu (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[INFO:swift] Successfully registered `/share/code/ms-swift/swift/llm/dataset/data/dataset_info.json`.
[INFO:swift] rank: -1, local_rank: -1, world_size: 1, local_world_size: 1
[INFO:swift] Loading the model using model_dir: /share/code/Qwen3-8B
Traceback (most recent call last):
File "/share/code/ms-swift/swift/cli/sft.py", line 7, in <module>
sft_main()
File "/share/code/ms-swift/swift/llm/train/sft.py", line 281, in sft_main
return SwiftSft(args).main()
File "/share/code/ms-swift/swift/llm/train/sft.py", line 29, in __init__
super().__init__(args)
File "/share/code/ms-swift/swift/llm/base.py", line 18, in __init__
self.args = self._parse_args(args)
File "/share/code/ms-swift/swift/llm/base.py", line 30, in _parse_args
args, remaining_argv = parse_args(self.args_class, args)
File "/share/code/ms-swift/swift/utils/utils.py", line 151, in parse_args
args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/hf_argparser.py", line 358, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "<string>", line 303, in __init__
File "/share/code/ms-swift/swift/llm/argument/train_args.py", line 170, in __post_init__
self.training_args = TrainerFactory.get_training_args(self)
File "/share/code/ms-swift/swift/trainers/trainer_factory.py", line 64, in get_training_args
return training_args_cls(**args_dict)
File "<string>", line 152, in __init__
File "/share/code/ms-swift/swift/trainers/arguments.py", line 132, in __post_init__
super().__post_init__()
File "/share/code/ms-swift/swift/trainers/arguments.py", line 118, in __post_init__
super().__post_init__()
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/training_args.py", line 1761, in __post_init__
self.device
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/training_args.py", line 2297, in device
return self._setup_devices
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/utils/generic.py", line 67, in __get__
cached = self.fget(obj)
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/training_args.py", line 2224, in _setup_devices
self.distributed_state = PartialState(**accelerator_state_kwargs)
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/accelerate/state.py", line 271, in __init__
self.num_processes = torch.distributed.get_world_size()
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1492, in get_world_size
return _get_group_size(group)
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 785, in _get_group_size
default_pg = _get_default_group()
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 940, in _get_default_group
raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
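This `RuntimeError` is raised whenever `torch.distributed.get_world_size()` is called before a process group exists. A minimal sketch of the failure mode and the missing call, assuming that is the root cause here (demonstrated with the `gloo` backend in a single process for portability; on Ascend NPU the backend would be `hccl` and the job is normally started through a distributed launcher such as `torchrun`, which sets `RANK`/`WORLD_SIZE` for each worker):

```python
import os
import torch.distributed as dist

# Rendezvous settings that a launcher like torchrun would normally provide.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# Without this call, dist.get_world_size() raises:
#   RuntimeError: Default process group has not been initialized, ...
dist.init_process_group(backend="gloo", rank=0, world_size=1)

print(dist.get_world_size())  # -> 1

dist.destroy_process_group()
```

In other words, passing `--ddp_backend hccl` makes the stack assume a distributed run, but launching `sft.py` directly with plain `python` means no launcher ever initialized the process group.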
Describe the bug
While running SFT on Qwen3-8B on a Huawei NPU, the error above occurred.
Your hardware and system info
OS:
NPU:
Additional context
SFT script: (see the command at the top)
Thanks!