Launch command:

torchrun \
    --master_addr=${MASTER_ADDR} \
    --master_port ${MASTER_PORT} \
    --nproc_per_node=${NPROC_PER_NODE} \
    --nnodes=${NNODES} \
    --node_rank=${NODE_RANK} \
    ${SCRIPT_DIR}/swift/cli/rlhf.py \
    --rlhf_type grpo \
    --check_model false \
    --model /cache/model \
    --reward_funcs format \
    --use_vllm false \
    --vllm_device auto \
    --gradient_checkpointing_kwargs '{"use_reentrant": false}' \
    --vllm_gpu_memory_utilization 0.6 \
    --train_type lora \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --target_modules all-linear \
    --torch_dtype bfloat16 \
    --dataset /cache/data \
    --max_completion_length 1024 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-5 \
    --gradient_accumulation_steps 1 \
    --eval_strategy 'steps' \
    --eval_steps 100 \
    --save_strategy 'steps' \
    --save_steps 100 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir /cache/output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 4 \
    --num_generations 8 \
    --temperature 0.9 \
    --system /cache/prompt.txt \
    --log_completions true \
    --num_iterations 1 \
    --num_infer_workers 1 \
    --async_generate false \
    --beta 0.0 \
    --max_grad_norm 0.5 \
    --model_type qwen2_5 \
    --tensor_parallel_size 8
For NPU, it is recommended to use an external vLLM server.
https://swift.readthedocs.io/zh-cn/latest/Instruction/GRPO.html#external
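The GRPO documentation linked above describes this external-server mode. A minimal sketch of that setup follows; the exact flag names (`--vllm_server_host`, `--vllm_server_port`) and the `swift rollout` entry point are taken from that doc and may differ across ms-swift versions, so verify against `swift rollout --help` before use:

```shell
# Terminal 1 (inference node): serve the model through an external vLLM
# rollout server instead of colocating vLLM with training.
swift rollout \
    --model /cache/model \
    --port 8000

# Terminal 2 (training node): point GRPO at the external server. This
# replaces `--use_vllm false` in the original launch command; all other
# training flags stay the same. <inference-node-ip> is a placeholder.
torchrun ... ${SCRIPT_DIR}/swift/cli/rlhf.py \
    --rlhf_type grpo \
    --use_vllm true \
    --vllm_server_host <inference-node-ip> \
    --vllm_server_port 8000 \
    ...
```

Separating rollout from training this way keeps long vLLM generation off the training NPUs' compute streams, which is why it is the recommended configuration for NPU.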
Problem description
While running GRPO training of QWQ on D910B NPU cards, the job fails with a compute-stream synchronization timeout (export HCCL_EXEC_TIMEOUT=3600 has already been set).
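For reference, the timeout knobs on Ascend are set via environment variables before launch. A minimal sketch (HCCL_EXEC_TIMEOUT and HCCL_CONNECT_TIMEOUT are documented CANN variables; values are in seconds). Note that the traceback below fails inside `set_module_tensor_to_device` while loading weights, i.e. a device-stream synchronization timeout rather than an HCCL collective, so raising these alone may not resolve it:

```shell
# HCCL_EXEC_TIMEOUT bounds how long ranks wait on each other during
# collective execution; HCCL_CONNECT_TIMEOUT bounds the initial
# connection/handshake phase between ranks.
export HCCL_EXEC_TIMEOUT=3600
export HCCL_CONNECT_TIMEOUT=3600
```
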
Environment configuration
Error output
Traceback (most recent call last):
File "/home/ma-user/modelarts/user-job-dir/ms-swift/swift/cli/rlhf.py", line 5, in
rlhf_main()
File "/home/ma-user/modelarts/user-job-dir/ms-swift/swift/llm/train/rlhf.py", line 98, in rlhf_main
return SwiftRLHF(args).main()
File "/home/ma-user/modelarts/user-job-dir/ms-swift/swift/llm/train/sft.py", line 31, in init
self._prepare_model_tokenizer()
File "/home/ma-user/modelarts/user-job-dir/ms-swift/swift/llm/train/rlhf.py", line 65, in _prepare_model_tokenizer
super()._prepare_model_tokenizer()
File "/home/ma-user/modelarts/user-job-dir/ms-swift/swift/llm/train/sft.py", line 62, in _prepare_model_tokenizer
self.model, self.processor = args.get_model_processor()
File "/home/ma-user/modelarts/user-job-dir/ms-swift/swift/llm/argument/base_args/base_args.py", line 276, in get_model_processor
return get_model_tokenizer(**kwargs)
File "/home/ma-user/modelarts/user-job-dir/ms-swift/swift/llm/model/register.py", line 564, in get_model_tokenizer
model, processor = get_function(model_dir, model_info, model_kwargs, load_model, **kwargs)
File "/home/ma-user/modelarts/user-job-dir/ms-swift/swift/llm/model/register.py", line 265, in get_model_tokenizer_with_flash_attn
return get_model_tokenizer_from_local(model_dir, model_info, model_kwargs, load_model, **kwargs)
File "/home/ma-user/modelarts/user-job-dir/ms-swift/swift/llm/model/register.py", line 234, in get_model_tokenizer_from_local
model = automodel_class.from_pretrained(
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
return model_class.from_pretrained(
File "/home/ma-user/modelarts/user-job-dir/ms-swift/swift/llm/model/patcher.py", line 285, in _new_from_pretrained
return from_pretrained(cls, *args, **kwargs)
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/modeling_utils.py", line 4225, in from_pretrained
) = cls._load_pretrained_model(
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/modeling_utils.py", line 4728, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/modeling_utils.py", line 993, in _load_state_dict_into_meta_model
set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 329, in set_module_tensor_to_device
new_value = value.to(device)
RuntimeError: ACL stream synchronize failed, error code:107020
[W compiler_depend.ts:465] Warning: NPU warning, error code is 107020[Error]: .
EH9999: Inner Error!
rtDeviceSynchronize execute failed, reason=[task timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
EH9999: [PID: 5139] 2025-04-18-15:50:02.969.138 wait for compute device to finish failed, runtime result = 107020.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
TraceBack (most recent call last):
(function npuSynchronizeUsedDevices)