Describe the bug
Using the officially recommended image (modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.4.0-py311-torch2.6.0-vllm0.8.3-modelscope1.25.0-swift3.3.0.post1) and the officially recommended training script for a Qwen3 MoE model: training runs normally with context_parallel_size=1, but fails with an error whenever it is greater than 1.
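For context, the training invocation looked roughly like the sketch below. This is a hypothetical reconstruction, not the exact official script: the checkpoint name, dataset, and other parallelism values are placeholders; the only setting relevant to the bug is context_parallel_size (1 runs fine, anything larger crashes).

# Hypothetical sketch -- model/dataset/parallelism values are placeholders,
# not the exact official script; only context_parallel_size matters here.
NPROC_PER_NODE=8 \
megatron sft \
    --load Qwen3-30B-A3B-mcore \
    --dataset train.jsonl \
    --tensor_model_parallel_size 2 \
    --expert_model_parallel_size 2 \
    --context_parallel_size 2 \
    --micro_batch_size 1

With --context_parallel_size 1 the same command completes without the error below.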
Every rank reports the same error; the complete traceback from one rank is below:
[rank6]: Traceback (most recent call last):
[rank6]:   File "/work_dir/0504-ms-swift/ms-swift/swift/cli/_megatron/sft.py", line 4, in <module>
[rank6]:     megatron_sft_main()
[rank6]:   File "/work_dir/0504-ms-swift/ms-swift/swift/megatron/train/sft.py", line 71, in megatron_sft_main
[rank6]:     return MegatronSft(args).main()
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/work_dir/0504-ms-swift/ms-swift/swift/llm/base.py", line 47, in main
[rank6]:     result = self.run()
[rank6]:              ^^^^^^^^^^
[rank6]:   File "/work_dir/0504-ms-swift/ms-swift/swift/megatron/train/sft.py", line 56, in run
[rank6]:     pretrain(
[rank6]:   File "/work_dir/0504-ms-swift/Megatron-LM/megatron/training/training.py", line 408, in pretrain
[rank6]:     iteration, num_floating_point_operations_so_far = train(
[rank6]:                                                       ^^^^^^
[rank6]:   File "/work_dir/0504-ms-swift/Megatron-LM/megatron/training/training.py", line 1493, in train
[rank6]:     train_step(forward_step_func,
[rank6]:   File "/work_dir/0504-ms-swift/Megatron-LM/megatron/training/training.py", line 791, in train_step
[rank6]:     losses_reduced = forward_backward_func(
[rank6]:                      ^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/work_dir/0504-ms-swift/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 453, in forward_backward_no_pipelining
[rank6]:     output_tensor, num_tokens = forward_step(
[rank6]:                                 ^^^^^^^^^^^^^
[rank6]:   File "/work_dir/0504-ms-swift/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 275, in forward_step
[rank6]:     output_tensor, loss_func = forward_step_func(data_iterator, model)
[rank6]:                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/work_dir/0504-ms-swift/ms-swift/swift/megatron/train/utils.py", line 167, in forward_step
[rank6]:     output_tensor = model(tokens, position_ids, attention_mask, labels=labels, packed_seq_params=packed_seq_params)
[rank6]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank6]:     return self._call_impl(*args, **kwargs)
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank6]:     return forward_call(*args, **kwargs)
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/work_dir/0504-ms-swift/Megatron-LM/megatron/core/distributed/data_parallel_base.py", line 22, in forward
[rank6]:     return self.module(*inputs, **kwargs)
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank6]:     return self._call_impl(*args, **kwargs)
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank6]:     return forward_call(*args, **kwargs)
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/work_dir/0504-ms-swift/Megatron-LM/megatron/legacy/model/module.py", line 189, in forward
[rank6]:     outputs = self.module(*inputs, **kwargs)
[rank6]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank6]:     return self._call_impl(*args, **kwargs)
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank6]:     return forward_call(*args, **kwargs)
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/work_dir/0504-ms-swift/Megatron-LM/megatron/core/models/gpt/gpt_model.py", line 264, in forward
[rank6]:     hidden_states = self.decoder(
[rank6]:                     ^^^^^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank6]:     return self._call_impl(*args, **kwargs)
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank6]:     return forward_call(*args, **kwargs)
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/work_dir/0504-ms-swift/Megatron-LM/megatron/core/transformer/transformer_block.py", line 549, in forward
[rank6]:     hidden_states, context = layer(
[rank6]:                              ^^^^^^
[rank6]:   File "/work_dir/0504-ms-swift/Megatron-LM/megatron/core/transformer/transformer_layer.py", line 502, in __call__
[rank6]:     return super(MegatronModule, self).__call__(*args, **kwargs)
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank6]:     return self._call_impl(*args, **kwargs)
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank6]:     return forward_call(*args, **kwargs)
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/work_dir/0504-ms-swift/Megatron-LM/megatron/core/transformer/transformer_layer.py", line 390, in forward
[rank6]:     attention_output_with_bias = self.self_attention(
[rank6]:                                  ^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank6]:     return self._call_impl(*args, **kwargs)
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank6]:     return forward_call(*args, **kwargs)
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/work_dir/0504-ms-swift/Megatron-LM/megatron/core/transformer/attention.py", line 435, in forward
[rank6]:     query = apply_rotary_pos_emb(
[rank6]:             ^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/work_dir/0504-ms-swift/Megatron-LM/megatron/core/models/common/embeddings/rope_utils.py", line 200, in apply_rotary_pos_emb
[rank6]:     return fused_apply_rotary_pos_emb_thd(
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/work_dir/0504-ms-swift/Megatron-LM/megatron/core/extensions/transformer_engine.py", line 1320, in fused_apply_rotary_pos_emb_thd
[rank6]:     return FusedRoPEFunc.apply(t, freqs, "thd", cu_seqlens, cp_size, cp_rank)
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.11/site-packages/torch/autograd/function.py", line 575, in apply
[rank6]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.11/site-packages/transformer_engine/pytorch/attention.py", line 4928, in forward
[rank6]:     output = tex.fused_rope_thd_forward(t, cu_seqlens, freqs, cp_size, cp_rank)
[rank6]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: RuntimeError: /tmp/pip-req-build-ytsxlfcy/transformer_engine/common/fused_rope/fused_rope.cu:229 in function fused_rope_thd_forward_launcher: CUDA Error: invalid configuration argument
Note: the Megatron version used is the core_r0.11.0 branch.
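For reproducibility, the Megatron-LM setup was presumably along these lines (the clone URL is the standard NVIDIA repository; the local path is taken from the traceback):

# Check out Megatron-LM on the core_r0.11.0 branch under the
# /work_dir/0504-ms-swift/ directory seen in the traceback paths.
cd /work_dir/0504-ms-swift
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.11.0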
fixed