Description
Chinese-language notebook: https://modelscope.cn/notebook/share/ipynb/d4d8765f/qwen3.ipynb
Qwen docs: https://qwen.readthedocs.io/en/latest/training/ms_swift.html
We are thrilled about the open-source release of Qwen3 and Qwen3-MoE. The ms-swift large-model training framework supported CPT/SFT/DPO/GRPO for Qwen3/Qwen3-MoE from day one. It also supports Megatron-based training (CPT/SFT) for Qwen3/Qwen3-MoE, which is 10 times faster than training MoE models with transformers.
We will showcase a runnable fine-tuning demo and provide the format for custom datasets.
Before starting the fine-tuning process, please ensure that your environment is properly set up.
# pip install git+https://github.com/modelscope/ms-swift.git
git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e .
pip install liger-kernel transformers -U
Qwen3-8B SFT
The script for training Qwen3-8B is as follows; it can be run on the free A10 compute provided by ModelScope: https://modelscope.cn/my/mynotebook
# Training GPU memory: 22GB
# You can specify `--dataset AI-ModelScope/alpaca-gpt4-data-zh` to run the experiment
CUDA_VISIBLE_DEVICES=0 \
swift sft \
--model Qwen/Qwen3-8B \
--train_type lora \
--dataset '<dataset-path>' \
--torch_dtype bfloat16 \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--learning_rate 1e-4 \
--lora_rank 8 \
--lora_alpha 32 \
--target_modules all-linear \
--gradient_accumulation_steps 4 \
--eval_steps 50 \
--save_steps 50 \
--save_total_limit 2 \
--logging_steps 5 \
--max_length 2048 \
--output_dir output \
--warmup_ratio 0.05 \
--dataloader_num_workers 4 \
--packing true \
--use_liger_kernel true \
--attn_impl flash_attn
The format for a custom dataset is as follows (the `system` field is optional); simply specify `--dataset <dataset_path>`. For more details, refer to the custom dataset documentation: https://swift.readthedocs.io/en/latest/Customization/Custom-dataset.html
{"messages": [{"role": "user", "content": "Where is the capital of Zhejiang?"}, {"role": "assistant", "content": "<think>\nxxx\n</think>\n\nThe capital of Zhejiang is Hangzhou."}]}
Datasets without a thinking trace can be handled in two ways to reduce the damage that fine-tuning does to the model's thinking ability:
Option 1: During training, additionally specify `--loss_scale ignore_empty_think` to exclude `<think>\n\n</think>\n\n` from the loss calculation, preventing the loss of thinking ability.
Demo: https://github.com/modelscope/ms-swift/blob/main/examples/train/think_model/qwen3_demo1.sh
{"messages": [{"role": "user", "content": "Where is the capital of Zhejiang?"}, {"role": "assistant", "content": "<think>\n\n</think>\n\nThe capital of Zhejiang is Hangzhou."}]}
Option 2: Add `/no_think` to the query in the dataset to avoid the loss of thinking ability.
Demo: https://github.com/modelscope/ms-swift/blob/main/examples/train/think_model/qwen3_demo2.sh
{"messages": [{"role": "user", "content": "Where is the capital of Zhejiang? /no_think"}, {"role": "assistant", "content": "<think>\n\n</think>\n\nThe capital of Zhejiang is Hangzhou."}]}
10-Minute Quick Self-Cognition Fine-Tuning Demo (GPU Memory Usage: 22GB)
Ref: ms-swift/swift/llm/dataset/dataset/llm.py, line 835 (commit 51cafe5)
CUDA_VISIBLE_DEVICES=0 \
swift sft \
--model Qwen/Qwen3-8B \
--train_type lora \
--dataset 'swift/Qwen3-SFT-Mixin#2000' \
'swift/self-cognition:qwen3#600' \
--torch_dtype bfloat16 \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--learning_rate 1e-4 \
--lora_rank 8 \
--lora_alpha 32 \
--target_modules all-linear \
--gradient_accumulation_steps 16 \
--eval_steps 50 \
--save_steps 50 \
--save_total_limit 2 \
--logging_steps 5 \
--max_length 2048 \
--output_dir output \
--warmup_ratio 0.05 \
--dataloader_num_workers 4 \
--use_liger_kernel true \
--model_author swift \
--model_name swift-robot
Run inference to test the fine-tuned model:
CUDA_VISIBLE_DEVICES=0 \
swift infer \
--adapters output/vx-xxx/checkpoint-xxx \
--stream true \
--temperature 0 \
--max_new_tokens 2048
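To deploy the fine-tuned model without keeping the adapter separate, the LoRA weights can be merged back into the base model with `swift export`. A minimal sketch; the flags below are our assumption about the ms-swift export interface, so verify them with `swift export --help` for your installed version:
# Merge the trained LoRA adapter into the base weights (flags assumed; check your ms-swift version).
CUDA_VISIBLE_DEVICES=0 \
swift export \
--adapters output/vx-xxx/checkpoint-xxx \
--merge_lora true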

Qwen3-8B GRPO
Taking Qwen3-8B as an example, the following uses the ms-swift framework for GRPO training. For more details on GRPO, refer to the GRPO documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO.html
The AI-MO/NuminaMath-TIR dataset is used, and the accuracy reward function scores the correctness of the model's responses. The following dependency must be installed to compute the reward:
pip install math_verify==0.5.2
The custom dataset format is similar to SFT, except that the assistant part is optional. If the accuracy reward is used, a `solution` column is required to compute accuracy.
{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "Tell me tomorrow's weather"}]}
{"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}]}
{"messages": [{"role": "user", "content": "What is your name?"}]}
You can also train with custom reward functions or reward models. Columns in the dataset are passed into the `**kwargs` of the reward function. An example of a custom reward function can be found here: swift/examples/train/grpo/plugin/plugin.py
--external_plugins examples/train/grpo/plugin/plugin.py \
--reward_funcs external_math_acc external_math_format \
--reward_model AI-ModelScope/Skywork-Reward-Llama-3.1-8B-v0.2
During training, we use vLLM to accelerate the sampling process. With `num_infer_workers=8`, one vLLM engine is deployed on each device to speed up sampling.
The training script is as follows:
# 70G*8
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
swift rlhf \
--rlhf_type grpo \
--model Qwen/Qwen3-8B \
--train_type full \
--dataset AI-MO/NuminaMath-TIR \
--torch_dtype bfloat16 \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--learning_rate 1e-6 \
--save_total_limit 2 \
--logging_steps 5 \
--output_dir output \
--gradient_accumulation_steps 1 \
--warmup_ratio 0.05 \
--dataloader_num_workers 4 \
--max_completion_length 4096 \
--vllm_max_model_len 8192 \
--reward_funcs accuracy \
--num_generations 16 \
--use_vllm true \
--vllm_gpu_memory_utilization 0.4 \
--sleep_level 1 \
--offload_model true \
--offload_optimizer true \
--gc_collect_after_offload true \
--deepspeed zero3 \
--num_infer_workers 8 \
--tensor_parallel_size 1 \
--temperature 1.0 \
--top_p 0.85 \
--report_to wandb \
--log_completions true \
--overlong_filter true
Qwen3-30B-A3B MoE SFT (Megatron-SWIFT)
ms-swift incorporates Megatron's parallelism techniques to accelerate large-model training, including data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, context parallelism, and expert parallelism. It supports pre-training and fine-tuning of models such as Qwen3, Qwen3-MoE, Qwen2.5, Llama3, and the Deepseek-R1 distillation series.
For environment preparation (the container image) and the conversion between HF and MCore model weights, please refer to the Megatron-SWIFT training documentation; these steps are not covered in detail here: https://swift.readthedocs.io/en/latest/Instruction/Megatron-SWIFT-Training.html
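For orientation, the HF-to-MCore conversion is performed with swift export. The sketch below follows the pattern in the linked documentation, but the exact flags are an assumption on our part, so confirm them against the docs for your ms-swift version:
# Convert HF weights to the MCore format loaded by `megatron sft` (flags assumed; see the Megatron-SWIFT docs).
CUDA_VISIBLE_DEVICES=0 \
swift export \
--model Qwen/Qwen3-30B-A3B-Base \
--to_mcore true \
--torch_dtype bfloat16 \
--output_dir Qwen3-30B-A3B-Base-mcore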
We use DLC to launch the training command. The training environment consists of two machines, each with 8 * 80GiB A800 GPUs:
More multi-node launch methods can be found here: https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-node
# https://help.aliyun.com/zh/pai/user-guide/general-environment-variables
# Please ensure that the weight saving paths are the same for both nodes.
NNODES=$WORLD_SIZE \
NODE_RANK=$RANK \
megatron sft \
--load Qwen3-30B-A3B-Base-mcore \
--dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \
--tensor_model_parallel_size 2 \
--expert_model_parallel_size 8 \
--moe_grouped_gemm true \
--moe_shared_expert_overlap true \
--moe_aux_loss_coeff 0.01 \
--micro_batch_size 1 \
--global_batch_size 16 \
--packing true \
--recompute_granularity full \
--recompute_method uniform \
--recompute_num_layers 1 \
--train_iters 2000 \
--eval_iters 50 \
--finetune true \
--cross_entropy_loss_fusion true \
--lr 1e-5 \
--lr_warmup_iters 100 \
--min_lr 1e-6 \
--save megatron_output/Qwen3-30B-A3B-Base \
--eval_interval 200 \
--save_interval 200 \
--max_length 8192 \
--num_workers 8 \
--dataset_num_proc 8 \
--no_save_optim true \
--no_save_rng true \
--sequence_parallel true \
--use_flash_attn true
Training loss (partial):

The custom dataset format is the same as for `swift sft` and can be found above; simply specify `--dataset <dataset_path>`.
Below is a comparison of full-parameter training speed and GPU memory usage for the Qwen3-30B-A3B model using `megatron sft` versus `swift sft`:
|  | Megatron-LM | DeepSpeed-ZeRO2 | DeepSpeed-ZeRO3 |
| --- | --- | --- | --- |
| Training Speed | 9.6s/it | - | 91.2s/it |
| GPU Memory Usage | 16 * 60GiB | OOM | 16 * 80GiB |