
QWQ: GRPO training fails with "RuntimeError: ACL stream synchronize failed, error code:107020" #3932


Open
miyeeee opened this issue Apr 18, 2025 · 2 comments


miyeeee commented Apr 18, 2025

Problem description

While training QWQ with GRPO on D910B NPU cards, the run fails with the error below: synchronizing the compute stream times out (export HCCL_EXEC_TIMEOUT=3600 has already been set).
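For reference, a minimal sketch of the timeout-related environment before launch; only HCCL_EXEC_TIMEOUT=3600 is confirmed from this run, while HCCL_CONNECT_TIMEOUT and ASCEND_LAUNCH_BLOCKING are illustrative debugging additions, not verified fixes:

# confirmed from this run: raise the HCCL op execution timeout (seconds)
export HCCL_EXEC_TIMEOUT=3600
# illustrative additions (assumptions, not part of the original setup):
# raise the collective connection timeout to match
export HCCL_CONNECT_TIMEOUT=3600
# launch NPU ops synchronously so the failing op reports at its real call site
export ASCEND_LAUNCH_BLOCKING=1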

Environment

  • D910B NPU cards, 32 GB
  • 4 nodes, 8 cards per node

Error output

Traceback (most recent call last):
  File "/home/ma-user/modelarts/user-job-dir/ms-swift/swift/cli/rlhf.py", line 5, in <module>
    rlhf_main()
  File "/home/ma-user/modelarts/user-job-dir/ms-swift/swift/llm/train/rlhf.py", line 98, in rlhf_main
    return SwiftRLHF(args).main()
  File "/home/ma-user/modelarts/user-job-dir/ms-swift/swift/llm/train/sft.py", line 31, in __init__
    self._prepare_model_tokenizer()
  File "/home/ma-user/modelarts/user-job-dir/ms-swift/swift/llm/train/rlhf.py", line 65, in _prepare_model_tokenizer
    super()._prepare_model_tokenizer()
  File "/home/ma-user/modelarts/user-job-dir/ms-swift/swift/llm/train/sft.py", line 62, in _prepare_model_tokenizer
    self.model, self.processor = args.get_model_processor()
  File "/home/ma-user/modelarts/user-job-dir/ms-swift/swift/llm/argument/base_args/base_args.py", line 276, in get_model_processor
    return get_model_tokenizer(**kwargs)
  File "/home/ma-user/modelarts/user-job-dir/ms-swift/swift/llm/model/register.py", line 564, in get_model_tokenizer
    model, processor = get_function(model_dir, model_info, model_kwargs, load_model, **kwargs)
  File "/home/ma-user/modelarts/user-job-dir/ms-swift/swift/llm/model/register.py", line 265, in get_model_tokenizer_with_flash_attn
    return get_model_tokenizer_from_local(model_dir, model_info, model_kwargs, load_model, **kwargs)
  File "/home/ma-user/modelarts/user-job-dir/ms-swift/swift/llm/model/register.py", line 234, in get_model_tokenizer_from_local
    model = automodel_class.from_pretrained(
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
  File "/home/ma-user/modelarts/user-job-dir/ms-swift/swift/llm/model/patcher.py", line 285, in _new_from_pretrained
    return from_pretrained(cls, *args, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/modeling_utils.py", line 4225, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/modeling_utils.py", line 4728, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/modeling_utils.py", line 993, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 329, in set_module_tensor_to_device
    new_value = value.to(device)
RuntimeError: ACL stream synchronize failed, error code:107020
[W compiler_depend.ts:465] Warning: NPU warning, error code is 107020[Error]: .
EH9999: Inner Error!
    rtDeviceSynchronize execute failed, reason=[task timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
EH9999: [PID: 5139] 2025-04-18-15:50:02.969.138 wait for compute device to finish failed, runtime result = 107020.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
TraceBack (most recent call last):
(function npuSynchronizeUsedDevices)

miyeeee (Author) commented Apr 24, 2025

Script:

torchrun --master_addr=${MASTER_ADDR} --master_port ${MASTER_PORT} --nproc_per_node=${NPROC_PER_NODE} --nnodes=${NNODES} --node_rank=${NODE_RANK} \
    ${SCRIPT_DIR}/swift/cli/rlhf.py \
    --rlhf_type grpo \
    --check_model false \
    --model /cache/model \
    --reward_funcs format \
    --use_vllm false \
    --vllm_device auto \
    --gradient_checkpointing_kwargs '{"use_reentrant": false}' \
    --vllm_gpu_memory_utilization 0.6 \
    --train_type lora \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --target_modules all-linear \
    --torch_dtype bfloat16 \
    --dataset /cache/data \
    --max_completion_length 1024 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-5 \
    --gradient_accumulation_steps 1 \
    --eval_strategy 'steps' \
    --eval_steps 100 \
    --save_strategy 'steps' \
    --save_steps 100 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir /cache/output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 4 \
    --num_generations 8 \
    --temperature 0.9 \
    --system /cache/prompt.txt \
    --log_completions true \
    --num_iterations 1 \
    --num_infer_workers 1 \
    --async_generate false \
    --beta 0.0 \
    --max_grad_norm 0.5 \
    --model_type qwen2_5 \
    --tensor_parallel_size 8

A follow-up question: does --async_generate false disable async mode?

hjh0119 (Collaborator) commented May 1, 2025

For NPU, it is recommended to use an external vLLM server.

https://swift.readthedocs.io/zh-cn/latest/Instruction/GRPO.html#external
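A rough sketch of that setup, based on the linked GRPO docs; the flag names and the host/port values here are assumptions, so verify them against your ms-swift version:

# start a standalone rollout (vLLM) server for generation
swift rollout \
    --model /cache/model

# then point the GRPO training job at that server instead of colocated vLLM
swift rlhf \
    --rlhf_type grpo \
    --use_vllm true \
    --vllm_mode server \
    --vllm_server_host 127.0.0.1 \
    --vllm_server_port 8000 \
    ...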
