
🚀 Best Practices for Training Qwen3/Qwen3-MoE #4030


Open
Jintao-Huang opened this issue Apr 28, 2025 · 68 comments
Labels: good first issue (Good for newcomers)

@Jintao-Huang (Collaborator) commented Apr 28, 2025

Chinese notebook: https://modelscope.cn/notebook/share/ipynb/d4d8765f/qwen3.ipynb

Qwen docs: https://qwen.readthedocs.io/en/latest/training/ms_swift.html

English Version

We are thrilled about the open-source release of Qwen3 and Qwen3-MoE. CPT/SFT/DPO/GRPO for Qwen3/Qwen3-MoE is supported from day one by the ms-swift large-model training framework. ms-swift also provides a Megatron training (CPT/SFT) implementation for Qwen3/Qwen3-MoE, which trains MoE models roughly 10 times faster than the transformers-based implementation.

We will showcase a runnable fine-tuning demo and provide the format for custom datasets.

Before starting the fine-tuning process, please ensure that your environment is properly set up.

# pip install git+https://github.com/modelscope/ms-swift.git
git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e .

pip install liger-kernel transformers -U

Qwen3-8B SFT

The script for training Qwen3-8B is as follows, which can be run on the free A10 computing resources provided by ModelScope: https://modelscope.cn/my/mynotebook

# Training GPU memory: 22GB
# You can specify `--dataset AI-ModelScope/alpaca-gpt4-data-zh` to run the experiment
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen3-8B \
    --train_type lora \
    --dataset '<dataset-path>' \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 4 \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --packing true \
    --use_liger_kernel true \
    --attn_impl flash_attn

The format for a custom dataset is as follows (the system field is optional). Simply specify --dataset <dataset_path>:

For more information, refer to the custom dataset documentation: https://swift.readthedocs.io/en/latest/Customization/Custom-dataset.html

{"messages": [{"role": "user", "content": "Where is the capital of Zhejiang?"}, {"role": "assistant", "content": "<think>\nxxx\n</think>\n\nThe capital of Zhejiang is Hangzhou."}]}

Datasets without thinking content can be handled in two ways to reduce the damage that fine-tuning does to the model's thinking ability:

Option 1: During training, additionally specify --loss_scale ignore_empty_think to exclude the empty <think>\n\n</think>\n\n block from the loss calculation, preventing the loss of thinking ability.

Demo: https://github.com/modelscope/ms-swift/blob/main/examples/train/think_model/qwen3_demo1.sh

{"messages": [{"role": "user", "content": "Where is the capital of Zhejiang?"}, {"role": "assistant", "content": "<think>\n\n</think>\n\nThe capital of Zhejiang is Hangzhou."}]}

Option 2: Add /no_think to the query in the dataset to avoid the loss of thinking ability.

Demo: https://github.com/modelscope/ms-swift/blob/main/examples/train/think_model/qwen3_demo2.sh

{"messages": [{"role": "user", "content": "Where is the capital of Zhejiang? /no_think"}, {"role": "assistant", "content": "<think>\n\n</think>\n\nThe capital of Zhejiang is Hangzhou."}]}

10-Minute Quick Self-Cognition Fine-Tuning Demo (GPU Memory Usage: 22GB)

Reference (swift/llm/dataset/dataset/llm.py):

row['query'] = row['query'] + ' /no_think'

CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen3-8B \
    --train_type lora \
    --dataset 'swift/Qwen3-SFT-Mixin#2000' \
              'swift/self-cognition:qwen3#600' \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 16 \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --use_liger_kernel true \
    --model_author swift \
    --model_name swift-robot

Run inference to test the fine-tuned model:

CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --adapters output/vx-xxx/checkpoint-xxx \
    --stream true \
    --temperature 0 \
    --max_new_tokens 2048
Image
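If you want to merge the LoRA adapter into the base weights before deployment, a sketch of the export command (the checkpoint path is a placeholder, as above):

swift export \
    --adapters output/vx-xxx/checkpoint-xxx \
    --merge_lora true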

Qwen3-8B GRPO

Taking Qwen3-8B as an example, the following uses the ms-swift framework to conduct GRPO training. For more details about GRPO, refer to the GRPO documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO.html

The AI-MO/NuminaMath-TIR dataset is used, and the accuracy function is employed to compute the model’s response accuracy reward. The following environment needs to be installed to calculate rewards:

pip install math_verify==0.5.2

The custom dataset format is similar to SFT, where the assistant part is optional. If using the accuracy reward, a solution column is required to compute the accuracy.

{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "Tell me tomorrow's weather"}]}
{"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}]}
{"messages": [{"role": "user", "content": "What is your name?"}]}

You can also train with custom reward functions or reward models. Columns in the dataset will be passed into **kwargs of the reward function. An example of a custom reward function can be found here: swift/examples/train/grpo/plugin/plugin.py

    --external_plugins examples/train/grpo/plugin/plugin.py \
    --reward_funcs external_math_acc external_math_format \
    --reward_model AI-ModelScope/Skywork-Reward-Llama-3.1-8B-v0.2
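As an illustration, below is a minimal sketch of a custom reward function following the ORM plugin pattern used in the linked plugin.py; the class name and registration key are made up for this example, and the solution column arrives through **kwargs:

from typing import List

from swift.plugin import ORM, orms


class ContainsSolution(ORM):
    # Reward 1.0 if the reference answer appears in the sampled completion, else 0.0.
    def __call__(self, completions: List[str], solution: List[str], **kwargs) -> List[float]:
        return [float(sol.strip() in comp) for comp, sol in zip(completions, solution)]


orms['external_contains_solution'] = ContainsSolution

It would then be referenced with --external_plugins pointing to this file and --reward_funcs external_contains_solution.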

During training, we use vLLM to accelerate the sampling process. By setting num_infer_workers=8, one vLLM engine is deployed on each device.

The training script is as follows:

# 70G*8
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen3-8B \
    --train_type full \
    --dataset AI-MO/NuminaMath-TIR \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --learning_rate 1e-6 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --output_dir output \
    --gradient_accumulation_steps 1 \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --max_completion_length 4096 \
    --vllm_max_model_len 8192 \
    --reward_funcs accuracy \
    --num_generations 16 \
    --use_vllm true \
    --vllm_gpu_memory_utilization 0.4 \
    --sleep_level 1 \
    --offload_model true \
    --offload_optimizer true \
    --gc_collect_after_offload true \
    --deepspeed zero3 \
    --num_infer_workers 8 \
    --tensor_parallel_size 1 \
    --temperature 1.0 \
    --top_p 0.85 \
    --report_to wandb \
    --log_completions true \
    --overlong_filter true 

Qwen3-30B-A3B MoE SFT (Megatron-SWIFT)

ms-swift incorporates Megatron parallelism techniques to accelerate large-model training, including data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, context parallelism, and expert parallelism. It supports pre-training and fine-tuning of models such as Qwen3, Qwen3-MoE, Qwen2.5, Llama3, and the DeepSeek-R1 distillation series.

For environment preparation (container image) and conversion between HF and MCore model weights, please refer to the Megatron-SWIFT training documentation (not repeated here): https://swift.readthedocs.io/en/latest/Instruction/Megatron-SWIFT-Training.html
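For reference, the HF-to-MCore conversion used before training looks roughly like this (a sketch; the output directory is just an example, and the same command appears later in this thread):

CUDA_VISIBLE_DEVICES=0 \
swift export \
    --model Qwen/Qwen3-30B-A3B-Base \
    --to_mcore true \
    --torch_dtype bfloat16 \
    --output_dir Qwen3-30B-A3B-Base-mcore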

We use DLC to launch the training command. The training environment consists of two nodes, each with 8 * 80GiB A800 GPUs:

More multi-node launch methods can be found here: https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-node

# https://help.aliyun.com/zh/pai/user-guide/general-environment-variables
# Please ensure that the weight saving paths are the same for both nodes.
NNODES=$WORLD_SIZE \
NODE_RANK=$RANK \
megatron sft \
    --load Qwen3-30B-A3B-Base-mcore \
    --dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \
    --tensor_model_parallel_size 2 \
    --expert_model_parallel_size 8 \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 0.01 \
    --micro_batch_size 1 \
    --global_batch_size 16 \
    --packing true \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --train_iters 2000 \
    --eval_iters 50 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --lr 1e-5 \
    --lr_warmup_iters 100 \
    --min_lr 1e-6 \
    --save megatron_output/Qwen3-30B-A3B-Base \
    --eval_interval 200 \
    --save_interval 200 \
    --max_length 8192 \
    --num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --sequence_parallel true \
    --use_flash_attn true

Training loss (partial):

Image
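After training, the saved MCore checkpoint can be converted back to HF format. A rough sketch, assuming the --mcore_model/--to_hf flags described in the Megatron-SWIFT documentation (paths are placeholders):

CUDA_VISIBLE_DEVICES=0 \
swift export \
    --mcore_model megatron_output/Qwen3-30B-A3B-Base/vx-xxx \
    --to_hf true \
    --torch_dtype bfloat16 \
    --output_dir megatron_output/Qwen3-30B-A3B-Base/vx-xxx-hf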

The custom dataset format is the same as swift sft, which can be found above. Specify --dataset <dataset_path>.

Below is the comparison of full-parameter training speed/GPU memory usage for the Qwen3-30B-A3B model using megatron sft and swift sft:

                    Megatron-LM    DeepSpeed-ZeRO2    DeepSpeed-ZeRO3
Training Speed      9.6s/it        -                  91.2s/it
GPU Memory Usage    16 * 60GiB     OOM                16 * 80GiB

@Jintao-Huang (Collaborator Author) commented Apr 28, 2025

Model Inference:

Thinking Mode:

CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --model Qwen/Qwen3-8B \
    --infer_backend vllm \
    --stream true \
    --max_new_tokens 2048 \
    --max_model_len 8192
<<<  who are you?
<think>
Okay, the user is asking "who are you?" Let me start by introducing myself as Qwen, the large language model developed by Alibaba Cloud. I should mention my capabilities, like answering questions, creating content, and engaging in conversations. But I need to keep it concise. Also, the user might want to know how I can assist them. Maybe I should ask how I can help them today. Let me check if there's anything else important to include. Oh, I should make sure the tone is friendly and approachable. Alright, that should cover it.
</think>

Hello! I am Qwen, a large language model developed by Alibaba Cloud. I can assist with a wide range of tasks, such as answering questions, creating content, writing stories, coding, and more. How can I help you today? 😊
<<< who are you? /no_think
<think>

</think>

I am Qwen, a large language model developed by Alibaba Cloud. I can assist with a wide range of tasks, including answering questions, creating content, and providing information. How can I help you today?

Non-Thinking Mode:

CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --model Qwen/Qwen3-8B \
    --infer_backend vllm \
    --stream true \
    --max_new_tokens 2048 \
    --max_model_len 8192 \
    --response_prefix '<think>\n\n</think>\n\n'
<<< who are you?
<think>

</think>

I am Qwen, a large-scale language model developed by Alibaba Cloud. I am designed to assist with a wide range of tasks, including answering questions, creating content, and providing information. How can I assist you today?

Model Quantization:

Qwen3-32B-AWQ: https://modelscope.cn/models/swift/Qwen3-32B-AWQ

Qwen3-30B-A3B-AWQ: https://modelscope.cn/models/swift/Qwen3-30B-A3B-AWQ

Qwen3-235B-A22B-AWQ: https://modelscope.cn/models/swift/Qwen3-235B-A22B-AWQ
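These AWQ checkpoints can be used as drop-in values for --model; for example (a sketch reusing the inference flags above):

CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --model swift/Qwen3-30B-A3B-AWQ \
    --infer_backend vllm \
    --stream true \
    --max_new_tokens 2048 \
    --max_model_len 8192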

@EvilCalf

Which version of vLLM should we use?

@Jintao-Huang (Collaborator Author) commented Apr 29, 2025

vllm==0.8.5

@sosofun commented Apr 29, 2025

Converting HF-format weights to Megatron format fails:

CUDA_VISIBLE_DEVICES=0 \
swift export \
    --model Qwen/Qwen3-30B-A3B \
    --to_mcore true \
    --torch_dtype bfloat16 \
    --output_dir Qwen/Qwen3-30B-A3B-mcore

errors:
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/local/lib/python3.11/site-packages/swift/cli/export.py", line 5, in <module>
[rank0]:     export_main()
[rank0]:   File "/usr/local/lib/python3.11/site-packages/swift/llm/export/export.py", line 50, in export_main
[rank0]:     return SwiftExport(args).main()
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.11/site-packages/swift/llm/base.py", line 47, in main
[rank0]:     result = self.run()
[rank0]:              ^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.11/site-packages/swift/llm/export/export.py", line 34, in run
[rank0]:     convert_hf2mcore(args)
[rank0]:   File "/usr/local/lib/python3.11/site-packages/swift/megatron/utils/convert.py", line 72, in convert_hf2mcore
[rank0]:     assert megatron_model_meta is not None, f'Model: {args.model} is not supported.'
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: AssertionError: Model: Qwen/Qwen3-30B-A3B is not supported.

@Jintao-Huang (Collaborator Author) commented Apr 29, 2025

This is currently only on the main branch; ms-swift==3.4.0 will be released tonight.

@NianBroken

Request: please add a notebook for self-cognition training of Qwen3-8B.

While using "self-cognition-sft.ipynb" on the PAI-DSW environment provided by ModelScope to train "Qwen3-8B", I noticed that this notebook cannot train "Qwen3" models.

@yxk9810 commented Apr 29, 2025

Could you add a full-parameter fine-tuning script?

@Jintao-Huang (Collaborator Author)

You can refer to the example here and modify the --model parameter accordingly.

https://github.com/modelscope/ms-swift/blob/main/examples/train/full/qwen2_5_32b.sh

@Jintao-Huang (Collaborator Author)

Request: please add a notebook for self-cognition training of Qwen3-8B.

While using "self-cognition-sft.ipynb" on the PAI-DSW environment provided by ModelScope to train "Qwen3-8B", I noticed that this notebook cannot train "Qwen3" models.

A self-cognition fine-tuning demo has been added.

@qingzhong1 commented Apr 29, 2025

If I currently have data without a reasoning process, but I want to use this data to fine-tune Qwen3, should I simply add /no_think after the prompt and prefix the response with <think>\n\n</think>\n\n?

@Jintao-Huang (Collaborator Author)

Perhaps you can refer to this for a solution:

row['query'] = row['query'] + ' /no_think'

@NianBroken commented Apr 29, 2025

A self-cognition fine-tuning demo has been added.

How can the fine-tuned model be exported to GGUF format?
Request: please add a notebook for converting a model fine-tuned with ms-swift into GGUF format files.

@stephen-nju

Perhaps you can refer to this for a solution:

ms-swift/swift/llm/dataset/dataset/llm.py

Line 835 in 51cafe5

row['query'] = row['query'] + ' /no_think'

@Jintao-Huang Without using reasoning, can the model still be fine-tuned with the Qwen2.5 template?

@Jintao-Huang (Collaborator Author)

When using --packing true, please additionally use --attn_impl flash_attn. This was missed in the best practices.

@Gpwner commented Apr 30, 2025

Running swift deploy on a Huawei NPU fails:

[INFO:swift] model_kwargs: {'device_map': 'npu:0'}
Loading checkpoint shards:   0%|                                                                                                  | 0/5 [00:00<?, ?it/s][2025-05-01 05:26:45,878] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to npu (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[INFO:swift] Successfully registered `/data4/code185/ms-swift/swift/llm/dataset/data/dataset_info.json`.
Loading checkpoint shards:   0%|                                                                                                  | 0/5 [01:28<?, ?it/s]
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/data4/code185/ms-swift/swift/llm/infer/deploy.py", line 207, in deploy_main
    SwiftDeploy(args).main()
  File "/data4/code185/ms-swift/swift/llm/infer/deploy.py", line 39, in __init__
    super().__init__(args)
  File "/data4/code185/ms-swift/swift/llm/infer/infer.py", line 32, in __init__
    model, self.template = prepare_model_template(args)
  File "/data4/code185/ms-swift/swift/llm/infer/utils.py", line 144, in prepare_model_template
    model, processor = args.get_model_processor(**kwargs)
  File "/data4/code185/ms-swift/swift/llm/argument/base_args/base_args.py", line 274, in get_model_processor
    return get_model_tokenizer(**kwargs)
  File "/data4/code185/ms-swift/swift/llm/model/register.py", line 571, in get_model_tokenizer
    model, processor = get_function(model_dir, model_info, model_kwargs, load_model, **kwargs)
  File "/data4/code185/ms-swift/swift/llm/model/register.py", line 272, in get_model_tokenizer_with_flash_attn
    return get_model_tokenizer_from_local(model_dir, model_info, model_kwargs, load_model, **kwargs)
  File "/data4/code185/ms-swift/swift/llm/model/register.py", line 241, in get_model_tokenizer_from_local
    model = automodel_class.from_pretrained(
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 571, in from_pretrained
    return model_class.from_pretrained(
  File "/data4/code185/ms-swift/swift/llm/model/patcher.py", line 282, in _new_from_pretrained
    return from_pretrained(cls, *args, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/modeling_utils.py", line 279, in _wrapper
    return func(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/modeling_utils.py", line 4399, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/modeling_utils.py", line 4833, in _load_pretrained_model
    disk_offload_index, cpu_offload_index = _load_state_dict_into_meta_model(
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/modeling_utils.py", line 787, in _load_state_dict_into_meta_model
    param = param[...]
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/torch/cuda/__init__.py", line 289, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

It seems to be caused by /data4/code185/ms-swift/swift/llm/model/patcher.py. Is there any way to solve this? Thanks.

@Jintao-Huang (Collaborator Author)

Image

This is probably not caused by that file. Please check whether the sample code here runs on your NPU:

https://modelscope.cn/models/Qwen/Qwen3-8B

@NianBroken

Request: please add a notebook for converting a model fine-tuned with ms-swift into GGUF format files.

@zhenhua commented May 2, 2025

sft.py: error: ambiguous option: --model could match --model_type, --model_id_or_path, --model_revision, --model_name, --model_author, --model_layer_cls_name, --model_cache_dir

sft.py raises this error; it doesn't accept --model directly.

@Jintao-Huang (Collaborator Author)

sft.py: error: ambiguous option: --model could match --model_type, --model_id_or_path, --model_revision, --model_name, --model_author, --model_layer_cls_name, --model_cache_dir

sft.py raises this error; it doesn't accept --model directly.

Upgrade to swift>=3.4.0.

@llp1992 commented May 2, 2025

Qwen3-30B-A3B trains successfully, but Qwen3-32B megatron sft raises an error:

2025-05-02T03:37:00.069008389Z [rank24]: raise RuntimeError(
2025-05-02T03:37:00.069009658Z [rank24]: torch._dynamo.exc.TorchRuntimeError: Failed running call_function (*(FakeTensor(..., device='cuda:0', size=(90880, 37984)), (FakeTensor(..., device='cuda:0', size=(90880,), dtype=torch.int64), FakeTensor(..., device='cuda:0', size=(90752,), dtype=torch.int64))), **{}):
2025-05-02T03:37:00.069011338Z [rank24]: Attempting to broadcast a dimension of length 90752 at -1! Mismatching argument at index 1 had torch.Size([90752]); but expected shape should be broadcastable to [90880]
2025-05-02T03:37:00.069012818Z
2025-05-02T03:37:00.069013908Z [rank24]: from user code:
2025-05-02T03:37:00.069015038Z [rank24]: File "xxxx/Megatron-LM-0.11.0/megatron/core/fusions/fused_cross_entropy.py", line 37, in calculate_predicted_logits
2025-05-02T03:37:00.069016538Z [rank24]: VocabParallelCrossEntropy.calculate_predicted_logits(
2025-05-02T03:37:00.069017918Z [rank24]: File "xxx/Megatron-LM-0.11.0/megatron/core/tensor_parallel/cross_entropy.py", line 59, in calculate_predicted_logits
2025-05-02T03:37:00.069019567Z [rank24]: predicted_logits_1d = logits_2d[arange_1d, masked_target_1d]
2025-05-02T03:37:00.069020817Z
2025-05-02T03:37:00.069022057Z [rank24]: Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
2025-05-02T03:37:00.069023517Z
2025-05-02T03:37:00.069024607Z
2025-05-02T03:37:00.069025767Z [rank24]: You can suppress this exception and fall back to eager by setting:
2025-05-02T03:37:00.069027127Z [rank24]: import torch._dynamo
2025-05-02T03:37:00.069028297Z [rank24]: torch._dynamo.config.suppress_errors = True

@Jintao-Huang (Collaborator Author)

Could you check where the error is thrown from? Please share more complete error output, ideally a screenshot.

@zhenhua commented May 2, 2025

sft.py: error: ambiguous option: --model could match --model_type, --model_id_or_path, --model_revision, --model_name, --model_author, --model_layer_cls_name, --model_cache_dir
sft.py raises this error; it doesn't accept --model directly.

Upgrade to swift>=3.4.0.

Yes, it's resolved after upgrading.

@no-execution

Have the trained MoE models been benchmarked? I'm worried there may be numerical issues.

@llp1992 commented May 2, 2025

Could you check where the error is thrown from? Please share more complete error output, ideally a screenshot.

All Qwen3 dense models hit this error when trained with Megatron.

Image

@Jintao-Huang (Collaborator Author)

How to modify the response_prefix of the engine?

engine.default_template.template_meta.response_prefix = '<think>\n\n</think>\n\n'
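For context, a minimal Python sketch of where this line fits, assuming the PtEngine interface from swift.llm (other engines are assumed to expose the same default_template attribute):

from swift.llm import PtEngine, InferRequest, RequestConfig

engine = PtEngine('Qwen/Qwen3-8B')
# Pre-fill an empty think block so the model answers in non-thinking mode.
engine.default_template.template_meta.response_prefix = '<think>\n\n</think>\n\n'

resp = engine.infer([InferRequest(messages=[{'role': 'user', 'content': 'who are you?'}])],
                    RequestConfig(max_tokens=512, temperature=0.))
print(resp[0].choices[0].message.content)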

@no-execution

Is caching of packed datasets supported now? I didn't see it in the code.

@Jintao-Huang (Collaborator Author)

You can try --streaming true to avoid the packing time.

Caching of packed datasets will be implemented later using an idx/bin data format; please stay tuned.

@yxk9810 commented May 7, 2025

Following Option 1 for the dataset and adding the loss_scale flag, full-parameter fine-tuning of Qwen3-4B hangs; it took half an hour before anything happened, then an nvcc timeout error was reported. The training arguments are as follows:
MASTER_PORT=29501 \
NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=3,4,5,6 \
swift sft \
    --model Qwen3-4B \
    --train_type full \
    --dataset dataset.jsonl \
    --torch_dtype bfloat16 \
    --split_dataset_ratio 0.1 \
    --max_steps 2000 \
    --streaming true \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 1 \
    --packing true \
    --eval_steps 200 \
    --save_steps 200 \
    --logging_steps 100 \
    --max_length 10000 \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 8 \
    --dataset_num_proc 8 \
    --save_total_limit 4 \
    --save_only_model true \
    --output_dir output/Qwen3-4B \
    --deepspeed zero3 \
    --use_liger_kernel true \
    --loss_scale ignore_empty_think \
    --attn_impl flash_attn

@Jintao-Huang (Collaborator Author)

Following Option 1 for the dataset and adding the loss_scale flag, full-parameter fine-tuning of Qwen3-4B hangs; it took half an hour before anything happened, then an nvcc timeout error was reported (training arguments quoted above).

Image
MASTER_PORT=29501 \
NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
    --model Qwen/Qwen3-4B \
    --train_type full \
    --dataset AI-ModelScope/alpaca-gpt4-data-zh \
    --torch_dtype bfloat16 \
    --split_dataset_ratio 0.1 \
    --max_steps 2000 \
    --streaming true \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 1 \
    --packing true \
    --eval_steps 200 \
    --save_steps 200 \
    --logging_steps 100 \
    --max_length 10000 \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 8 \
    --dataset_num_proc 8 \
    --save_total_limit 4 \
    --save_only_model true \
    --output_dir output/Qwen3-4B \
    --deepspeed zero3 \
    --use_liger_kernel true \
    --loss_scale ignore_empty_think \
    --attn_impl flash_attn

This shell script runs fine on my side.

@ypz-git commented May 7, 2025

Request: please add a Python demo (not the command-line version) for self-cognition fine-tuning.

@yxk9810 commented May 7, 2025

Following Option 1 for the dataset and adding the loss_scale flag, full-parameter fine-tuning of Qwen3-4B hangs; it took half an hour before anything happened, then an nvcc timeout error was reported (training arguments quoted above).

This shell script runs fine on my side.

Could you verify with data in the Option 1 format? My test works if I don't use that format, but once the response gets the empty think block and the query gets /no_think appended, it seems to fail.

@Arcmoon-Hu

Does megatron only support passing train_iters, not epochs? Also, when packing is used, how can I see how many samples there actually are after packing?

@Jintao-Huang (Collaborator Author)

Support for a max_epochs parameter is in progress; it will force training to stop at the specified epoch and save the weights.

Does megatron only support passing train_iters, not epochs? Also, when packing is used, how can I see how many samples there actually are after packing?

@Jintao-Huang (Collaborator Author)

Could you verify with data in the Option 1 format? My test works if I don't use that format, but once the response gets the empty think block and the query gets /no_think appended, it seems to fail.

https://github.com/modelscope/ms-swift/blob/main/examples/train/think_model/qwen3_demo1.sh

Could you try this one?

@Jintao-Huang (Collaborator Author)

Could you post one sample of your data so I can take a look at its format?

@llp1992 commented May 8, 2025

Can Qwen3-30B-A3B be trained with GRPO under Megatron?

@Arcmoon-Hu

Support for a max_epochs parameter is in progress; it will force training to stop at the specified epoch and save the weights.

Does megatron only support passing train_iters, not epochs? Also, when packing is used, how can I see how many samples there actually are after packing?

With packing enabled, is there a way to see the actual number of samples after packing? If I knew that, I could work it out myself. Right now only train_iters can be set, and with packing it's hard to estimate what value to use.

@EvilCalf commented May 8, 2025

Right now DAPO with Qwen3 can only use vllm 0.8.5, but training still seems to get stuck when using zero3.

@hjh0119 (Collaborator) commented May 9, 2025

@EvilCalf The internal mode is being refactored in #4097; you can use the external mode for now.

@Jintao-Huang (Collaborator Author)

Support for a max_epochs parameter is in progress; it will force training to stop at the specified epoch and save the weights.

Does megatron only support passing train_iters, not epochs? Also, when packing is used, how can I see how many samples there actually are after packing?

With packing enabled, is there a way to see the actual number of samples after packing? If I knew that, I could work it out myself. Right now only train_iters can be set, and with packing it's hard to estimate what value to use.

The statistics are printed to the command line; please look for them there.

#4125

@alanayu commented May 9, 2025

After training Qwen3-30B-A3B-mcore with Megatron-SWIFT, an error is raised when converting the checkpoint back to HF format:
[rank0]: OSError: Can't load tokenizer for '/mnt/nvme2/train/megatron_output/Qwen3-30B-A3B/v2-20250509-050959/iter_0000010'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/mnt/nvme2/train/megatron_output/Qwen3-30B-A3B/v2-20250509-050959/iter_0000010' is the correct path to a directory containing all relevant files for a RobertaTokenizerFast tokenizer.

@Arcmoon-Hu

Training the MoE model on DLC PPU hits an error:

Image
megatron sft \
    --load /mnt/Models/Qwen3-30B-A3B-Macore \
    --dataset xxx \
    --split-dataset-ratio 0.05 \
    --tensor_model_parallel_size 2 \
    --expert_model_parallel_size 8 \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 0.01 \
    --packing true \
    --sequence_parallel true \
    --micro_batch_size 2 \
    --global_batch_size 32 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --train_iters 20000 \
    --eval_iters 200 \
    --finetune true \
    --use_flash_attn true \
    --loss_scale ignore_empty_think \
    --cross_entropy_loss_fusion true \
    --lr 2.5e-5 \
    --lr_warmup_iters 100 \
    --log_interval 1 \
    --min_lr 1e-7 \
    --save /mnt/data/Models/rag_conclusion/saved/Qwen3-30B-A3B-Macore-sft \
    --save_interval 2000 \
    --max_length 8192 \
    --num_workers 32 \
    --no_save_optim true \
    --no_save_rng true \
    --dataset_num_proc 32 \
    --model_author swift \
    --model_name swift-robot

The transformers_engine version is 2.0, installed from the linked reference.
Megatron-LM: pip install git+https://github.com/NVIDIA/Megatron-LM.git@core_r0.12.0

@Arcmoon-Hu

Training the MoE model on DLC PPU hits an error (command and environment quoted above).

I have now set --moe_grouped_gemm to false, and it currently seems to run normally, but I don't really know why.

@aabbccddwasd

Does Megatron-SWIFT support LoRA and QLoRA?

@Jintao-Huang (Collaborator Author)

/mnt/nvme2/train/megatron_output/Qwen3-30B-A3B/v2-20250509-050959

--mcore_model '/mnt/nvme2/train/megatron_output/Qwen3-30B-A3B/v2-20250509-050959'

@Jintao-Huang (Collaborator Author)

Does Megatron-SWIFT support LoRA and QLoRA?

It's planned, but not supported yet.

@Jintao-Huang (Collaborator Author)

moe_grouped_gemm

This feels related to the transformers_engine version.

@aabbccddwasd

Does Megatron-SWIFT support LoRA and QLoRA?

It's planned, but not supported yet.

Could you share the current progress? We really need these two features. Keep up the great work!

@52yyy commented May 12, 2025

Can Qwen3-30B-A3B be trained with GRPO under Megatron?

Same question here.

@Arcmoon-Hu

Just to confirm: in multi-turn training, is the history masked, i.e., is the loss computed only on the last turn's response?

@husthuke

For the Qwen3-235B-A22B model:
"num_experts": 128,
"num_experts_per_tok": 8,
"num_hidden_layers": 94,
"num_key_value_heads": 4,
Roughly how much hardware is needed, and how should tensor_model_parallel_size, expert_model_parallel_size, and pipeline_model_parallel_size be configured for reasonably efficient SFT?

@Jintao-Huang (Collaborator Author)

Just to confirm: in multi-turn training, is the history masked, i.e., is the loss computed only on the last turn's response?

In the SFT stage, the responses of every turn are trained on.

@llp1992 commented May 14, 2025

When exporting an HF model to Megatron format, is TP splitting supported?
