🚀 Best Practices for Training Qwen3/Qwen3-MoE #4030
Comments
Model Inference:
Thinking Mode:
Non-Thinking Mode:
Model Quantization:
Qwen3-32B-AWQ: https://modelscope.cn/models/swift/Qwen3-32B-AWQ
Qwen3-30B-A3B-AWQ: https://modelscope.cn/models/swift/Qwen3-30B-A3B-AWQ
Qwen3-235B-A22B-AWQ: https://modelscope.cn/models/swift/Qwen3-235B-A22B-AWQ
Which vLLM version should be used?
vllm==0.8.5
Converting weights in HF format to Megatron format fails:
errors:
It's still on the main branch now; the version ms-swift==3.4.0 will be released tonight.
Request: please add a Notebook file for self-cognition training of Qwen3-8B. In the PAI-DSW provided by ModelScope I am using “
Could you add a full-parameter fine-tuning script?
You can refer to the example here and modify it: https://github.com/modelscope/ms-swift/blob/main/examples/train/full/qwen2_5_32b.sh
A demo for self-cognition fine-tuning has been added.
If I currently have data without a reasoning process, but I want to use this data to fine-tune Qwen3, should I simply add /no_think after the prompt and prefix the response with
Perhaps you can refer to this for a solution: ms-swift/swift/llm/dataset/dataset/llm.py, line 835 in 51cafe5
How can a successfully fine-tuned model be exported to GGUF format?
@Jintao-Huang Without using reasoning, is it still possible to fine-tune the model with the Qwen2.5 template?
When using --packing true, please additionally use --attn_impl flash_attn. This was missed in the best practices.
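For example, a minimal sketch of combining the two flags (the model ID and <dataset_path> are placeholders, other arguments follow the SFT examples below):

```bash
# Sketch only: enabling sequence packing together with flash attention, as suggested above.
swift sft \
    --model Qwen/Qwen3-8B \
    --dataset <dataset_path> \
    --packing true \
    --attn_impl flash_attn \
    --max_length 8192
```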
Running swift deploy on a Huawei NPU fails:
It seems to be caused by /data4/code185/ms-swift/swift/llm/model/patcher.py. Is there any way to resolve this? Thanks.
Request: please add a Notebook file for converting a model fine-tuned with ms-swift into GGUF format.
sft.py: error: ambiguous option: --model could match --model_type, --model_id_or_path, --model_revision, --model_name, --model_author, --model_layer_cls_name, --model_cache_dir. sft.py errors out and does not support using --model directly.
Please upgrade to swift>=3.4.0.
Qwen3-30B-A3B trains successfully, but Qwen3-32B megatron sft fails: 2025-05-02T03:37:00.069008389Z [rank24]: raise RuntimeError(
Could you check where it is raised from? Please provide the complete error message, ideally a screenshot.
Yes, it has been resolved after upgrading.
Has the trained MoE model been benchmarked? I'm worried about numerical issues.
Updated to Qwen docs: https://qwen.readthedocs.io/en/latest/training/ms_swift.html |
How to modify the response_prefix of the engine?
Is caching of packed datasets supported now? I didn't seem to find it in the code.
You can try --streaming true to avoid the packing time. Caching the packed dataset will later be implemented using an idx/bin data format; please wait a bit.
Following Option 1 to set up the dataset and adding the loss scale, full-parameter fine-tuning of Qwen3-4B hangs; it only moved after waiting half an hour and then reported an nvcc timeout error. The training arguments are as follows:
Request: please add a Python demo (non-command-line version) for self-cognition fine-tuning.
Does Megatron only support passing train_iters rather than epochs? Also, when packing is used, how can I see how many samples there actually are after packing?
A max_epochs parameter is being added, which forcibly stops training at the specified epoch and saves the weights.
Try this: https://github.com/modelscope/ms-swift/blob/main/examples/train/think_model/qwen3_demo1.sh
Could you share one sample of your data format?
Can Qwen3-30B-A3B be trained with GRPO using Megatron?
After enabling packing, can I see the actual number of samples after packing? If I knew it, I could calculate the value myself. Right now only train_iters can be set, and with packing it is hard to estimate what to set it to.
Currently DAPO with Qwen3 can only use vllm 0.8.5, but it seems ZeRO-3 still has the problem of hanging.
The statistics are printed in the command line; please look for them there.
After training Qwen3-30B-A3B-mcore with megatron swift, an error occurs when converting the checkpoint to HF format.
DLC PPU training of MoE fails with an error.
The transformers_engine version is 2.0, installed from here; see the reference.
I have now set --moe_grouped_gemm to false, and it currently appears to run normally, but I don't really know why.
Can megatron swift support LoRA and QLoRA?
--mcore_model '/mnt/nvme2/train/megatron_output/Qwen3-30B-A3B/v2-20250509-050959'
It is planned but not supported yet.
It seems related to the transformers_engine version.
Could you share the current progress? These two features are really needed. Keep up the great work!
Same question.
Just to confirm: in multi-turn training the history is masked, right? That is, only the loss of the last turn is computed.
For the Qwen3-235B-A22B model:
In the SFT stage, the response of every turn is trained.
When exporting an HF model to Megatron, can tensor-parallel splitting be supported?
Chinese-version notebook: https://modelscope.cn/notebook/share/ipynb/d4d8765f/qwen3.ipynb
Qwen docs: https://qwen.readthedocs.io/en/latest/training/ms_swift.html
English Version
We are thrilled about the open-source release of Qwen3 and Qwen3-MoE. The ms-swift large model training framework provides day-one support for CPT/SFT/DPO/GRPO of Qwen3/Qwen3-MoE. It also supports a Megatron training (CPT/SFT) implementation for Qwen3/Qwen3-MoE, which is 10 times faster than training MoE models with transformers.
We will showcase a runnable fine-tuning demo and provide the format for custom datasets.
Before starting the fine-tuning process, please ensure that your environment is properly set up.
Qwen3-8B SFT
The script for training Qwen3-8B is as follows, which can be run on the free A10 computing resources provided by ModelScope: https://modelscope.cn/my/mynotebook
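The script block did not survive extraction; the following is a minimal LoRA SFT sketch assuming the Qwen/Qwen3-8B ModelScope ID and a placeholder dataset path, with illustrative hyperparameters rather than the exact values from the original post:

```bash
# Sketch: LoRA SFT of Qwen3-8B on a single A10 (illustrative hyperparameters).
# <dataset_path> is a placeholder; see the custom dataset format below.
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen3-8B \
    --train_type lora \
    --dataset <dataset_path> \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --max_length 2048 \
    --eval_steps 50 \
    --save_steps 50 \
    --logging_steps 5 \
    --output_dir output
```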
The format for a custom dataset is as follows (the system field is optional). Simply specify --dataset <dataset_path>. For more information, refer to the custom dataset documentation: https://swift.readthedocs.io/en/latest/Customization/Custom-dataset.html
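The example rows themselves were lost in extraction; a sketch of the messages-style JSONL layout described above (file name arbitrary, system row optional) might look like:

```bash
# Write a tiny example dataset in the messages format, then point --dataset at the file.
cat > custom_dataset.jsonl <<'EOF'
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Where is the capital of Zhejiang?"}, {"role": "assistant", "content": "The capital of Zhejiang is Hangzhou."}]}
{"messages": [{"role": "user", "content": "Explain what LoRA is in one sentence."}, {"role": "assistant", "content": "LoRA fine-tunes a model by learning low-rank updates to its weight matrices."}]}
EOF
```

Then pass --dataset custom_dataset.jsonl to swift sft.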
Datasets without thinking can be handled in two ways to reduce the disruption of thinking during fine-tuning:
Option 1: During training, additionally specify --loss_scale ignore_empty_think to ignore the loss calculation for <think>\n\n</think>\n\n, preventing the loss of thinking ability.
Demo: https://github.com/modelscope/ms-swift/blob/main/examples/train/think_model/qwen3_demo1.sh
Option 2: Add /no_think to the query in the dataset to avoid the loss of thinking ability.
Demo: https://github.com/modelscope/ms-swift/blob/main/examples/train/think_model/qwen3_demo2.sh
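To make Option 2 concrete, here is a sketch of what a no-thinking sample might look like after appending /no_think to the query; whether the response should also carry the empty think prefix depends on the template handling, so check the demo scripts linked above rather than treating this as the canonical format:

```bash
# Illustrative only: a no-thinking sample for Option 2, with /no_think appended to the user query.
# The empty <think>\n\n</think>\n\n prefix in the response is an assumption; verify against qwen3_demo2.sh.
cat > no_think_sample.jsonl <<'EOF'
{"messages": [{"role": "user", "content": "What is 1 + 1? /no_think"}, {"role": "assistant", "content": "<think>\n\n</think>\n\n1 + 1 equals 2."}]}
EOF
```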
10-Minute Quick Self-Cognition Fine-Tuning Demo (GPU Memory Usage: 22GB)
ref: ms-swift/swift/llm/dataset/dataset/llm.py, line 835 (commit 51cafe5)
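The demo command itself was lost in extraction; below is a sketch built around the swift/self-cognition dataset and the --model_name / --model_author arguments (both also appear in the sft.py error message quoted earlier in this thread). The dataset mixture and hyperparameters are illustrative, not the exact values of the original demo:

```bash
# Sketch: quick self-cognition LoRA fine-tuning of Qwen3-8B (illustrative values).
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen3-8B \
    --train_type lora \
    --dataset 'swift/self-cognition#600' \
    --loss_scale ignore_empty_think \
    --model_author swift \
    --model_name swift-robot \
    --num_train_epochs 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --output_dir output
```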
Inference and test the fine-tuning results:
CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --adapters output/vx-xxx/checkpoint-xxx \
    --stream true \
    --temperature 0 \
    --max_new_tokens 2048
Qwen3-8B GRPO
Taking Qwen3-8B as an example, the following uses the ms-swift framework to conduct GRPO training. For more details about GRPO, refer to the GRPO documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO.html
The AI-MO/NuminaMath-TIR dataset is used, and the accuracy function is employed to compute the model’s response accuracy reward. The following environment needs to be installed to calculate rewards:
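The install commands were lost in extraction; assuming the accuracy reward relies on the math_verify package (as used by ms-swift's math accuracy reward) and vLLM for sampling, the setup would look roughly like:

```bash
# Sketch: packages assumed to be needed for the accuracy reward and vLLM-accelerated sampling.
pip install math_verify   # verifies mathematical answers for the accuracy reward
pip install -U vllm       # accelerates rollout sampling during GRPO
```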
The custom dataset format is similar to SFT, where the assistant part is optional. If using the accuracy reward, a solution column is required to compute the accuracy.
You can also train with custom reward functions or reward models. Columns in the dataset will be passed into the **kwargs of the reward function. An example of a custom reward function can be found here: swift/examples/train/grpo/plugin/plugin.py
During training, we use vLLM to accelerate the sampling process. By setting num_infer_workers=8, we deploy one vLLM engine on each device to speed up sampling.
The training script is as follows:
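The script block did not survive extraction; the following is a hedged sketch of a single-node GRPO run using the accuracy reward and the per-device vLLM engines mentioned above. The exact hyperparameters of the original may differ:

```bash
# Sketch: GRPO training of Qwen3-8B with the accuracy reward (illustrative values).
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen3-8B \
    --dataset AI-MO/NuminaMath-TIR \
    --reward_funcs accuracy \
    --use_vllm true \
    --num_infer_workers 8 \
    --vllm_gpu_memory_utilization 0.4 \
    --train_type full \
    --torch_dtype bfloat16 \
    --num_generations 8 \
    --max_completion_length 2048 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --learning_rate 1e-6 \
    --temperature 1.0 \
    --output_dir output
```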
Qwen3-30B-A3B MoE SFT (Megatron-SWIFT)
ms-swift introduces Megatron's parallel technology to accelerate large model training, including data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, context parallelism, and expert parallelism. It supports pre-training and fine-tuning of models like Qwen3, Qwen3-MoE, Qwen2.5, Llama3, Deepseek-R1 distillation series, etc.
For environment preparation (image) and the conversion between HF and MCore model weights, please refer to the Megatron-SWIFT training documentation; it is not covered here: https://swift.readthedocs.io/en/latest/Instruction/Megatron-SWIFT-Training.html
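As a pointer (the details are in the Megatron-SWIFT documentation linked above), converting HF weights to the MCore format is done with swift export; the sketch below assumes the Qwen/Qwen3-30B-A3B ModelScope ID and an output directory name of your choosing:

```bash
# Sketch: convert HF weights to the Megatron (MCore) format before running megatron sft.
CUDA_VISIBLE_DEVICES=0 \
swift export \
    --model Qwen/Qwen3-30B-A3B \
    --to_mcore true \
    --torch_dtype bfloat16 \
    --output_dir Qwen3-30B-A3B-mcore
```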
We use DLC to initiate the training command. The training environment consists of 2 machines with 8 * 80GiB A800:
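The DLC command block was lost in extraction; below is a hedged sketch of what a two-node megatron sft launch for Qwen3-30B-A3B might look like. The parallelism sizes and hyperparameters are illustrative only, and NNODES / NODE_RANK / MASTER_ADDR are assumed to be injected by the multi-node launcher (e.g. DLC):

```bash
# Sketch: 2 x 8 GPU megatron sft for Qwen3-30B-A3B (illustrative parallelism and hyperparameters).
# NNODES / NODE_RANK / MASTER_ADDR are assumed to come from the launcher environment.
NPROC_PER_NODE=8 \
megatron sft \
    --load Qwen3-30B-A3B-mcore \
    --dataset <dataset_path> \
    --tensor_model_parallel_size 2 \
    --expert_model_parallel_size 8 \
    --moe_grouped_gemm true \
    --sequence_parallel true \
    --packing true \
    --micro_batch_size 1 \
    --global_batch_size 16 \
    --train_iters 2000 \
    --lr 1e-5 \
    --min_lr 1e-6 \
    --max_length 8192 \
    --save megatron_output/Qwen3-30B-A3B \
    --save_interval 500 \
    --finetune true
```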
More multi-node launch methods can be found here: https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-node
Training loss (partial):
The custom dataset format is the same as for swift sft and can be found above; specify --dataset <dataset_path>.
Below is the comparison of full-parameter training speed / GPU memory usage for the Qwen3-30B-A3B model using megatron sft and swift sft: