
🚀 Best Practices for Training Qwen3/Qwen3-MoE #4030

@Jintao-Huang

Chinese-version notebook: https://modelscope.cn/notebook/share/ipynb/d4d8765f/qwen3.ipynb

Qwen docs: https://qwen.readthedocs.io/en/latest/training/ms_swift.html

English Version

We are thrilled about the open-source release of Qwen3 and Qwen3-MoE. The ms-swift large-model training framework supported CPT/SFT/DPO/GRPO for Qwen3/Qwen3-MoE from day one. It also provides a Megatron-based training (CPT/SFT) implementation for Qwen3/Qwen3-MoE, which trains MoE models about 10 times faster than the transformers implementation.

We will showcase a runnable fine-tuning demo and provide the format for custom datasets.

Before starting the fine-tuning process, please ensure that your environment is properly set up.

# pip install git+https://github.com/modelscope/ms-swift.git
git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e .

pip install liger-kernel transformers -U

Qwen3-8B SFT

The following script trains Qwen3-8B; it can be run on the free A10 compute provided by ModelScope: https://modelscope.cn/my/mynotebook

# Training GPU memory: 22GB
# You can specify `--dataset AI-ModelScope/alpaca-gpt4-data-zh` to run the experiment
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen3-8B \
    --train_type lora \
    --dataset '<dataset-path>' \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 4 \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --packing true \
    --use_liger_kernel true \
    --attn_impl flash_attn
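
Note: --packing concatenates multiple training samples up to max_length to raise GPU utilization; it is meant to be combined with flash attention, hence --attn_impl flash_attn above. The effective batch size here is per_device_train_batch_size * gradient_accumulation_steps = 1 * 4 = 4.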

The format for a custom dataset is as follows (the system field is optional); simply specify --dataset <dataset_path>.

For more information, refer to the custom dataset documentation: https://swift.readthedocs.io/en/latest/Customization/Custom-dataset.html

{"messages": [{"role": "user", "content": "Where is the capital of Zhejiang?"}, {"role": "assistant", "content": "<think>\nxxx\n</think>\n\nThe capital of Zhejiang is Hangzhou."}]}

Datasets without thinking content can be handled in two ways to reduce how much fine-tuning disrupts the model's thinking ability (a conversion sketch covering both options follows the examples below):

Option 1: During training, additionally specify --loss_scale ignore_empty_think to ignore the loss calculation for <think>\n\n</think>\n\n, preventing the loss of thinking ability.

Demo: https://github.com/modelscope/ms-swift/blob/main/examples/train/think_model/qwen3_demo1.sh

{"messages": [{"role": "user", "content": "Where is the capital of Zhejiang?"}, {"role": "assistant", "content": "<think>\n\n</think>\n\nThe capital of Zhejiang is Hangzhou."}]}

Option 2: Add /no_think to the query in the dataset to avoid the loss of thinking ability.

Demo: https://github.com/modelscope/ms-swift/blob/main/examples/train/think_model/qwen3_demo2.sh

{"messages": [{"role": "user", "content": "Where is the capital of Zhejiang? /no_think"}, {"role": "assistant", "content": "<think>\n\n</think>\n\nThe capital of Zhejiang is Hangzhou."}]}

10-Minute Quick Self-Cognition Fine-Tuning Demo (GPU Memory Usage: 22GB)

For reference, the self-cognition data used below is preprocessed to append /no_think to each query:

row['query'] = row['query'] + ' /no_think'

CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen3-8B \
    --train_type lora \
    --dataset 'swift/Qwen3-SFT-Mixin#2000' \
              'swift/self-cognition:qwen3#600' \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 16 \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --use_liger_kernel true \
    --model_author swift \
    --model_name swift-robot
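
Here --model_author and --model_name fill the author/name placeholders in the swift/self-cognition dataset, so the fine-tuned model introduces itself as "swift-robot" trained by "swift".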

Run inference to test the fine-tuned model:

CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --adapters output/vx-xxx/checkpoint-xxx \
    --stream true \
    --temperature 0 \
    --max_new_tokens 2048

Qwen3-8B GRPO

Taking Qwen3-8B as an example, the following uses the ms-swift framework to conduct GRPO training. For more details about GRPO, refer to the GRPO documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO.html

We use the AI-MO/NuminaMath-TIR dataset, with the accuracy reward function scoring the correctness of the model's responses. Install the following dependency to compute rewards:

pip install math_verify==0.5.2

The custom dataset format is similar to SFT's; the final assistant turn is optional. If using the accuracy reward, a solution column is required to compute accuracy (see the extra example after the rows below).

{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "Tell me tomorrow's weather"}]}
{"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}]}
{"messages": [{"role": "user", "content": "What is your name?"}]}

You can also train with custom reward functions or reward models. Columns in the dataset will be passed into **kwargs of the reward function. An example of a custom reward function can be found here: swift/examples/train/grpo/plugin/plugin.py

    --external_plugins examples/train/grpo/plugin/plugin.py \
    --reward_funcs external_math_acc external_math_format \
    --reward_model AI-ModelScope/Skywork-Reward-Llama-3.1-8B-v0.2
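
As a rough sketch of the shape of such a plugin, modeled on the linked plugin.py (treat the import path, base class, registry, and reward logic as assumptions and start from the real file):

from typing import List

# These names follow examples/train/grpo/plugin/plugin.py in ms-swift.
from swift.plugin import ORM, orms

class ContainsSolutionORM(ORM):
    """Hypothetical reward: 1.0 if the reference answer appears in the completion."""

    def __call__(self, completions: List[str], solution: List[str], **kwargs) -> List[float]:
        # `solution` is a dataset column and arrives aligned with the completions.
        return [float(sol.strip() in comp) for comp, sol in zip(completions, solution)]

# Register, then select with: --reward_funcs external_contains_acc
orms['external_contains_acc'] = ContainsSolutionORM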

During training, we use vLLM to accelerate sampling: with num_infer_workers=8, one vLLM engine is deployed on each device.

The training script is as follows:

# GPU memory: 8 * 70GiB
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen3-8B \
    --train_type full \
    --dataset AI-MO/NuminaMath-TIR \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --learning_rate 1e-6 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --output_dir output \
    --gradient_accumulation_steps 1 \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --max_completion_length 4096 \
    --vllm_max_model_len 8192 \
    --reward_funcs accuracy \
    --num_generations 16 \
    --use_vllm true \
    --vllm_gpu_memory_utilization 0.4 \
    --sleep_level 1 \
    --offload_model true \
    --offload_optimizer true \
    --gc_collect_after_offload true \
    --deepspeed zero3 \
    --num_infer_workers 8 \
    --tensor_parallel_size 1 \
    --temperature 1.0 \
    --top_p 0.85 \
    --report_to wandb \
    --log_completions true \
    --overlong_filter true 
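
As a sanity check on this configuration: the global rollout batch is per_device_train_batch_size * 8 GPUs * gradient_accumulation_steps = 2 * 8 * 1 = 16 completions per step, which matches --num_generations 16; under the usual GRPO batching rule the global batch must be divisible by num_generations, so here each step optimizes over the 16 sampled completions of a single prompt.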

Qwen3-30B-A3B MoE SFT (Megatron-SWIFT)

ms-swift integrates Megatron's parallelism techniques to accelerate the training of large models, including data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, context parallelism, and expert parallelism. It supports pre-training and fine-tuning of models such as Qwen3, Qwen3-MoE, Qwen2.5, Llama3, and the Deepseek-R1 distillation series.

For environment preparation (the Docker image) and conversion between HF and MCore model weights, refer to the Megatron-SWIFT training documentation; this is not covered here: https://swift.readthedocs.io/en/latest/Instruction/Megatron-SWIFT-Training.html

We use DLC to launch the training command. The training environment is two nodes, each with 8 * 80GiB A800 GPUs:

More multi-node launch methods can be found here: https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-node

# https://help.aliyun.com/zh/pai/user-guide/general-environment-variables
# Please ensure that the weight saving paths are the same for both nodes.
NNODES=$WORLD_SIZE \
NODE_RANK=$RANK \
megatron sft \
    --load Qwen3-30B-A3B-Base-mcore \
    --dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \
    --tensor_model_parallel_size 2 \
    --expert_model_parallel_size 8 \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 0.01 \
    --micro_batch_size 1 \
    --global_batch_size 16 \
    --packing true \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --train_iters 2000 \
    --eval_iters 50 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --lr 1e-5 \
    --lr_warmup_iters 100 \
    --min_lr 1e-6 \
    --save megatron_output/Qwen3-30B-A3B-Base \
    --eval_interval 200 \
    --save_interval 200 \
    --max_length 8192 \
    --num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --sequence_parallel true \
    --use_flash_attn true
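
For reference, the parallel layout: across 2 * 8 = 16 GPUs, --tensor_model_parallel_size 2 leaves a data-parallel size of 16 / 2 = 8, and --expert_model_parallel_size 8 shards the MoE experts across 8 ranks (Qwen3-30B-A3B has 128 experts, i.e. 16 experts per expert-parallel rank).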

[Figure: training loss curve (partial)]

The custom dataset format is the same as for swift sft (shown above); specify --dataset <dataset_path>.

Below is the comparison of full-parameter training speed/GPU memory usage for the Qwen3-30B-A3B model using megatron sft and swift sft:

                   Megatron-LM    DeepSpeed-ZeRO2    DeepSpeed-ZeRO3
Training Speed     9.6s/it        -                  91.2s/it
GPU Memory Usage   16 * 60GiB     OOM                16 * 80GiB

