Megatron SFT raises a CUDA error when context_parallel_size > 1 #4144

Emperorizzis · 2025-05-09T03:20:05Z

Describe the bug
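Using the officially recommended image (modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.4.0-py311-torch2.6.0-vllm0.8.3-modelscope1.25.0-swift3.3.0.post1) and the officially recommended script to train a Qwen3 MoE model, training runs normally with context_parallel_size=1 but fails with an error when the value is greater than 1.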

Error output from one rank (every rank fails; the complete traceback from a single rank is shown below):

[rank6]: Traceback (most recent call last):                                                                                                                                                                                                                                                                                                                                   
[rank6]:   File "/work_dir/0504-ms-swift/ms-swift/swift/cli/_megatron/sft.py", line 4, in <module>                                                                                                                                                                                                                                                                  
[rank6]:     megatron_sft_main()                                                                                                                                                                                                                                                                                                                                              
[rank6]:   File "/work_dir/0504-ms-swift/ms-swift/swift/megatron/train/sft.py", line 71, in megatron_sft_main                                                                                                                                                                                                                                                       
[rank6]:     return MegatronSft(args).main()                                                                                                                                                                                                                                                                                                                                  
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                                                                                                                                  
[rank6]:   File "/work_dir/0504-ms-swift/ms-swift/swift/llm/base.py", line 47, in main                                                                                                                                                                                                                                                                              
[rank6]:     result = self.run()                                                                                                                                                                                                                                                                                                                                              
[rank6]:              ^^^^^^^^^^                                                                                                                                                                           
[rank6]:   File "/work_dir/0504-ms-swift/ms-swift/swift/megatron/train/sft.py", line 56, in run                                                                                                  
[rank6]:     pretrain(                                                                                                                                                                                     
[rank6]:   File "/work_dir/0504-ms-swift/Megatron-LM/megatron/training/training.py", line 408, in pretrain                                                                                       
[rank6]:     iteration, num_floating_point_operations_so_far = train(                                                                                                                                      
[rank6]:                                                       ^^^^^^                                                                                                                                      
[rank6]:   File "/work_dir/0504-ms-swift/Megatron-LM/megatron/training/training.py", line 1493, in train                                                                                         
[rank6]:     train_step(forward_step_func,                                                                                                                                                                 
[rank6]:   File "/work_dir/0504-ms-swift/Megatron-LM/megatron/training/training.py", line 791, in train_step                                                                                     
[rank6]:     losses_reduced = forward_backward_func(                                                                                                                                                       
[rank6]:                      ^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                       
[rank6]:   File "/work_dir/0504-ms-swift/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 453, in forward_backward_no_pipelining                                                  
[rank6]:     output_tensor, num_tokens = forward_step(                                                                                                                                                                                                                                                           
[rank6]:                                 ^^^^^^^^^^^^^                                                                                                                                                                                                               
[rank6]:   File "/work_dir/0504-ms-swift/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 275, in forward_step                                                                    
[rank6]:     output_tensor, loss_func = forward_step_func(data_iterator, model)                                                                                                                                                                                                                                  
[rank6]:                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                      
[rank6]:   File "/work_dir/0504-ms-swift/ms-swift/swift/megatron/train/utils.py", line 167, in forward_step                                                                                      
[rank6]:     output_tensor = model(tokens, position_ids, attention_mask, labels=labels, packed_seq_params=packed_seq_params)                                                                                                         
[rank6]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                         
[rank6]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl                                                                                     
[rank6]:     return self._call_impl(*args, **kwargs)                                                                                                                                                                                                                                                             
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                 
[rank6]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl                                                                                             
[rank6]:     return forward_call(*args, **kwargs)                                                                                                                                                                                                                                                                
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                    
[rank6]:   File "/work_dir/0504-ms-swift/Megatron-LM/megatron/core/distributed/data_parallel_base.py", line 22, in forward                                                                       
[rank6]:     return self.module(*inputs, **kwargs)                                                                                                                                                                                                                                                               
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                   
[rank6]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl                                                                                     
[rank6]:     return self._call_impl(*args, **kwargs)                                                                                                                                                                                                                                                             
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                 
[rank6]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl                                                                                             
[rank6]:     return forward_call(*args, **kwargs)                                                                                                                                                                                                                                                                
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                    
[rank6]:   File "/work_dir/0504-ms-swift/Megatron-LM/megatron/legacy/model/module.py", line 189, in forward                                                                                      
[rank6]:     outputs = self.module(*inputs, **kwargs)                                                                                                                                                                                                                                                            
[rank6]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                
[rank6]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl                                                                                     
[rank6]:     return self._call_impl(*args, **kwargs)                                                                                                                                                                                                                                                             
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                 
[rank6]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl                                                                                             
[rank6]:     return forward_call(*args, **kwargs)                                                                                                                                                                                                                                                                
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                    
[rank6]:   File "/work_dir/0504-ms-swift/Megatron-LM/megatron/core/models/gpt/gpt_model.py", line 264, in forward                                                                                
[rank6]:     hidden_states = self.decoder(                                                                                                                                                                                                                                                                       
[rank6]:                     ^^^^^^^^^^^^^                                                                                                                                                                                                                           
[rank6]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl                                                                                     
[rank6]:     return self._call_impl(*args, **kwargs)                                                                                                                                                                                                                                                             
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                 
[rank6]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl                                                                                             
[rank6]:     return forward_call(*args, **kwargs)                                                                                                                                                                                                                                                                
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                    
[rank6]:   File "/work_dir/0504-ms-swift/Megatron-LM/megatron/core/transformer/transformer_block.py", line 549, in forward                                                                                                 
[rank6]:     hidden_states, context = layer(                                                                                                                                                                                                                         
[rank6]:                              ^^^^^^                                                                                                                                                                                                                         
[rank6]:   File "/work_dir/0504-ms-swift/Megatron-LM/megatron/core/transformer/transformer_layer.py", line 502, in __call__                                                                                                                                
[rank6]:     return super(MegatronModule, self).__call__(*args, **kwargs)                                                                                                                                                                                            
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                            
[rank6]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl                                                                                                                                               
[rank6]:     return self._call_impl(*args, **kwargs)                                                                                                                                                                                                                 
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                 
[rank6]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl                                                                                                                                                                                                   
[rank6]:     return forward_call(*args, **kwargs)                                                                                                                                                                                                                                                                
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                    
[rank6]:   File "/work_dir/0504-ms-swift/Megatron-LM/megatron/core/transformer/transformer_layer.py", line 390, in forward                                                                                                                                 
[rank6]:     attention_output_with_bias = self.self_attention(                                                                                                                                                                                                                                                   
[rank6]:                                  ^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                                                   
[rank6]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl                                                                                                                                                                                           
[rank6]:     return self._call_impl(*args, **kwargs)                                                                                                                                                                                                                                                             
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                                                             
[rank6]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl                                                                                                                                                                                                   
[rank6]:     return forward_call(*args, **kwargs)                                                                                                                                                                                                                                                                
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                                                                
[rank6]:   File "/work_dir/0504-ms-swift/Megatron-LM/megatron/core/transformer/attention.py", line 435, in forward                                                                                                                                                                                     
[rank6]:     query = apply_rotary_pos_emb(                                                                                                                                                                                                                                                                       
[rank6]:             ^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                                                                       
[rank6]:   File "/work_dir/0504-ms-swift/Megatron-LM/megatron/core/models/common/embeddings/rope_utils.py", line 200, in apply_rotary_pos_emb                                                                              
[rank6]:     return fused_apply_rotary_pos_emb_thd(                                                                                                                                                                                                                                                              
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                                                                                                                           
[rank6]:   File "/work_dir/0504-ms-swift/Megatron-LM/megatron/core/extensions/transformer_engine.py", line 1320, in fused_apply_rotary_pos_emb_thd                                                                         
[rank6]:     return FusedRoPEFunc.apply(t, freqs, "thd", cu_seqlens, cp_size, cp_rank)                                                                                                                                               
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                               
[rank6]:   File "/usr/local/lib/python3.11/site-packages/torch/autograd/function.py", line 575, in apply                                                                                                                                                                                                                                                                                                                                                                  
[rank6]:     return super().apply(*args, **kwargs)  # type: ignore[misc]                                                                                                                                                             
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                   
[rank6]:   File "/usr/local/lib/python3.11/site-packages/transformer_engine/pytorch/attention.py", line 4928, in forward                                                                                                             
[rank6]:     output = tex.fused_rope_thd_forward(t, cu_seqlens, freqs, cp_size, cp_rank)                                                                                                                                             
[rank6]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                             
[rank6]: RuntimeError: /tmp/pip-req-build-ytsxlfcy/transformer_engine/common/fused_rope/fused_rope.cu:229 in function fused_rope_thd_forward_launcher: CUDA Error: invalid configuration argument

Note: the Megatron version used is the core_r0.11.0 branch.

Jintao-Huang · 2025-05-11T02:55:12Z

fixed

Jintao-Huang added the bug label May 9, 2025

Jintao-Huang linked a pull request May 11, 2025 that will close this issue

[megatron]Support packing & CP #4163

Merged

Jintao-Huang closed this as completed in #4163 May 11, 2025
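
The thread does not state the root cause. The title of the linked pull request suggests that packing combined with context parallelism was not supported before the fix. As an illustration only, the sketch below assumes a constraint often associated with packed ("thd") inputs under context parallelism: each packed sequence length is padded to a multiple of 2 * cp_size before the cumulative sequence lengths are built. The helper names `pad_to_multiple` and `build_cu_seqlens` are hypothetical and are not taken from ms-swift, Megatron-LM, or Transformer Engine.

```python
from itertools import accumulate


def pad_to_multiple(length: int, multiple: int) -> int:
    """Round length up to the nearest multiple of `multiple`."""
    return -(-length // multiple) * multiple


def build_cu_seqlens(seq_lens, cp_size):
    """Pad each packed sequence length and return (padded_lens, cu_seqlens).

    Assumption (not confirmed in this thread): every sequence length must be
    divisible by 2 * cp_size so it can be split evenly across CP ranks.
    """
    multiple = 2 * cp_size
    padded = [pad_to_multiple(n, multiple) for n in seq_lens]
    # cu_seqlens holds the cumulative start offsets, beginning at 0.
    cu_seqlens = [0, *accumulate(padded)]
    return padded, cu_seqlens


padded, cu_seqlens = build_cu_seqlens([5, 12, 7], cp_size=2)
print(padded, cu_seqlens)
```

Under this assumption, a quick check that every entry of `padded` is divisible by 2 * cp_size before launching training can help narrow down whether the input layout is involved in a failure like the one reported here.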
