Support Qwen3 series #4029

Merged
merged 8 commits into from
Apr 28, 2025
2 changes: 1 addition & 1 deletion docs/source/Instruction/Megatron-SWIFT训练.md
@@ -1,7 +1,7 @@

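 # Megatron-SWIFT Training

-SWIFT incorporates Megatron's parallelization techniques to accelerate the training of large models, including data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, and context parallelism. For models that support Megatron training, please refer to the [Supported Models and Datasets documentation](./支持的模型和数据集.md).
+SWIFT incorporates Megatron's parallelization techniques to accelerate the training of large models, including data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, context parallelism, and expert parallelism. It supports the pre-training and fine-tuning of models such as Qwen3, Qwen3-MoE, Llama3, and the Deepseek-R1 distillation series. For a complete list of supported models, please refer to the [Supported Models and Datasets documentation](./支持的模型和数据集.md).

 ## Environment Setup
 To use Megatron-SWIFT, in addition to installing the swift dependencies, you also need to install the following: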
16 changes: 16 additions & 0 deletions docs/source/Instruction/支持的模型和数据集.md
@@ -182,6 +182,22 @@
|[Qwen/QwQ-32B-Preview](https://modelscope.cn/models/Qwen/QwQ-32B-Preview)|qwq_preview|qwq_preview|transformers>=4.37|✔|-|[Qwen/QwQ-32B-Preview](https://huggingface.co/Qwen/QwQ-32B-Preview)|
|[Qwen/QwQ-32B](https://modelscope.cn/models/Qwen/QwQ-32B)|qwq|qwq|transformers>=4.37|✔|-|[Qwen/QwQ-32B](https://huggingface.co/Qwen/QwQ-32B)|
|[Qwen/QwQ-32B-AWQ](https://modelscope.cn/models/Qwen/QwQ-32B-AWQ)|qwq|qwq|transformers>=4.37|✘|-|[Qwen/QwQ-32B-AWQ](https://huggingface.co/Qwen/QwQ-32B-AWQ)|
|[Qwen/Qwen3-0.6B-Base](https://modelscope.cn/models/Qwen/Qwen3-0.6B-Base)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-0.6B-Base](https://huggingface.co/Qwen/Qwen3-0.6B-Base)|
|[Qwen/Qwen3-1.7B-Base](https://modelscope.cn/models/Qwen/Qwen3-1.7B-Base)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-1.7B-Base](https://huggingface.co/Qwen/Qwen3-1.7B-Base)|
|[Qwen/Qwen3-4B-Base](https://modelscope.cn/models/Qwen/Qwen3-4B-Base)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-4B-Base](https://huggingface.co/Qwen/Qwen3-4B-Base)|
|[Qwen/Qwen3-8B-Base](https://modelscope.cn/models/Qwen/Qwen3-8B-Base)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base)|
|[Qwen/Qwen3-14B-Base](https://modelscope.cn/models/Qwen/Qwen3-14B-Base)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-14B-Base](https://huggingface.co/Qwen/Qwen3-14B-Base)|
|[Qwen/Qwen3-32B-Base](https://modelscope.cn/models/Qwen/Qwen3-32B-Base)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-32B-Base](https://huggingface.co/Qwen/Qwen3-32B-Base)|
|[Qwen/Qwen3-0.6B](https://modelscope.cn/models/Qwen/Qwen3-0.6B)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)|
|[Qwen/Qwen3-1.7B](https://modelscope.cn/models/Qwen/Qwen3-1.7B)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B)|
|[Qwen/Qwen3-4B](https://modelscope.cn/models/Qwen/Qwen3-4B)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B)|
|[Qwen/Qwen3-8B](https://modelscope.cn/models/Qwen/Qwen3-8B)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)|
|[Qwen/Qwen3-14B](https://modelscope.cn/models/Qwen/Qwen3-14B)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B)|
|[Qwen/Qwen3-32B](https://modelscope.cn/models/Qwen/Qwen3-32B)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B)|
|[Qwen/Qwen3-30B-A3B-Base](https://modelscope.cn/models/Qwen/Qwen3-30B-A3B-Base)|qwen3_moe|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-30B-A3B-Base](https://huggingface.co/Qwen/Qwen3-30B-A3B-Base)|
|[Qwen/Qwen3-235B-A22B-Base](https://modelscope.cn/models/Qwen/Qwen3-235B-A22B-Base)|qwen3_moe|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-235B-A22B-Base](https://huggingface.co/Qwen/Qwen3-235B-A22B-Base)|
|[Qwen/Qwen3-30B-A3B](https://modelscope.cn/models/Qwen/Qwen3-30B-A3B)|qwen3_moe|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B)|
|[Qwen/Qwen3-235B-A22B](https://modelscope.cn/models/Qwen/Qwen3-235B-A22B)|qwen3_moe|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B)|
|[iic/gte_Qwen2-1.5B-instruct](https://modelscope.cn/models/iic/gte_Qwen2-1.5B-instruct)|qwen2_gte|dummy|-|✘|-|[Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct)|
|[iic/gte_Qwen2-7B-instruct](https://modelscope.cn/models/iic/gte_Qwen2-7B-instruct)|qwen2_gte|dummy|-|✘|-|[Alibaba-NLP/gte-Qwen2-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct)|
|[codefuse-ai/CodeFuse-QWen-14B](https://modelscope.cn/models/codefuse-ai/CodeFuse-QWen-14B)|codefuse_qwen|codefuse|-|✘|coding|[codefuse-ai/CodeFuse-QWen-14B](https://huggingface.co/codefuse-ai/CodeFuse-QWen-14B)|
2 changes: 1 addition & 1 deletion docs/source_en/Instruction/Megatron-SWIFT-Training.md
@@ -1,7 +1,7 @@

 # Megatron-SWIFT Training

-SWIFT incorporates Megatron's parallelization techniques to accelerate the training of large models, including data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, and context parallelism. For models that support Megatron training, please refer to the [Supported Models and Datasets documentation](./Supported-models-and-datasets.md).
+SWIFT incorporates Megatron's parallelization techniques to accelerate the training of large models, including data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, context parallelism, and expert parallelism. It supports the pre-training and fine-tuning of models such as Qwen3, Qwen3-MoE, Llama3, and the Deepseek-R1 distillation series. For a complete list of supported models, please refer to the [Supported Models and Datasets documentation](./Supported-models-and-datasets.md).

 ## Environment Setup

16 changes: 16 additions & 0 deletions docs/source_en/Instruction/Supported-models-and-datasets.md
@@ -182,6 +182,22 @@ The table below introduces the models integrated with ms-swift:
|[Qwen/QwQ-32B-Preview](https://modelscope.cn/models/Qwen/QwQ-32B-Preview)|qwq_preview|qwq_preview|transformers>=4.37|✔|-|[Qwen/QwQ-32B-Preview](https://huggingface.co/Qwen/QwQ-32B-Preview)|
|[Qwen/QwQ-32B](https://modelscope.cn/models/Qwen/QwQ-32B)|qwq|qwq|transformers>=4.37|✔|-|[Qwen/QwQ-32B](https://huggingface.co/Qwen/QwQ-32B)|
|[Qwen/QwQ-32B-AWQ](https://modelscope.cn/models/Qwen/QwQ-32B-AWQ)|qwq|qwq|transformers>=4.37|✘|-|[Qwen/QwQ-32B-AWQ](https://huggingface.co/Qwen/QwQ-32B-AWQ)|
|[Qwen/Qwen3-0.6B-Base](https://modelscope.cn/models/Qwen/Qwen3-0.6B-Base)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-0.6B-Base](https://huggingface.co/Qwen/Qwen3-0.6B-Base)|
|[Qwen/Qwen3-1.7B-Base](https://modelscope.cn/models/Qwen/Qwen3-1.7B-Base)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-1.7B-Base](https://huggingface.co/Qwen/Qwen3-1.7B-Base)|
|[Qwen/Qwen3-4B-Base](https://modelscope.cn/models/Qwen/Qwen3-4B-Base)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-4B-Base](https://huggingface.co/Qwen/Qwen3-4B-Base)|
|[Qwen/Qwen3-8B-Base](https://modelscope.cn/models/Qwen/Qwen3-8B-Base)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base)|
|[Qwen/Qwen3-14B-Base](https://modelscope.cn/models/Qwen/Qwen3-14B-Base)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-14B-Base](https://huggingface.co/Qwen/Qwen3-14B-Base)|
|[Qwen/Qwen3-32B-Base](https://modelscope.cn/models/Qwen/Qwen3-32B-Base)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-32B-Base](https://huggingface.co/Qwen/Qwen3-32B-Base)|
|[Qwen/Qwen3-0.6B](https://modelscope.cn/models/Qwen/Qwen3-0.6B)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)|
|[Qwen/Qwen3-1.7B](https://modelscope.cn/models/Qwen/Qwen3-1.7B)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B)|
|[Qwen/Qwen3-4B](https://modelscope.cn/models/Qwen/Qwen3-4B)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B)|
|[Qwen/Qwen3-8B](https://modelscope.cn/models/Qwen/Qwen3-8B)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)|
|[Qwen/Qwen3-14B](https://modelscope.cn/models/Qwen/Qwen3-14B)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B)|
|[Qwen/Qwen3-32B](https://modelscope.cn/models/Qwen/Qwen3-32B)|qwen3|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B)|
|[Qwen/Qwen3-30B-A3B-Base](https://modelscope.cn/models/Qwen/Qwen3-30B-A3B-Base)|qwen3_moe|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-30B-A3B-Base](https://huggingface.co/Qwen/Qwen3-30B-A3B-Base)|
|[Qwen/Qwen3-235B-A22B-Base](https://modelscope.cn/models/Qwen/Qwen3-235B-A22B-Base)|qwen3_moe|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-235B-A22B-Base](https://huggingface.co/Qwen/Qwen3-235B-A22B-Base)|
|[Qwen/Qwen3-30B-A3B](https://modelscope.cn/models/Qwen/Qwen3-30B-A3B)|qwen3_moe|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B)|
|[Qwen/Qwen3-235B-A22B](https://modelscope.cn/models/Qwen/Qwen3-235B-A22B)|qwen3_moe|qwen3|transformers>=4.51|✔|-|[Qwen/Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B)|
|[iic/gte_Qwen2-1.5B-instruct](https://modelscope.cn/models/iic/gte_Qwen2-1.5B-instruct)|qwen2_gte|dummy|-|✘|-|[Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct)|
|[iic/gte_Qwen2-7B-instruct](https://modelscope.cn/models/iic/gte_Qwen2-7B-instruct)|qwen2_gte|dummy|-|✘|-|[Alibaba-NLP/gte-Qwen2-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct)|
|[codefuse-ai/CodeFuse-QWen-14B](https://modelscope.cn/models/codefuse-ai/CodeFuse-QWen-14B)|codefuse_qwen|codefuse|-|✘|coding|[codefuse-ai/CodeFuse-QWen-14B](https://huggingface.co/codefuse-ai/CodeFuse-QWen-14B)|
37 changes: 37 additions & 0 deletions examples/train/megatron/qwen3_moe.sh
@@ -0,0 +1,37 @@
# ZeRO3: 91.2s/it; 16 * 80GiB
# Megatron-LM: 9.6s/it; 16 * 60GiB
# Launch using Alibaba Cloud DLC
# ref: https://github.com/modelscope/ms-swift/blob/main/examples/train/multi-node/dlc/train.sh
NNODES=$WORLD_SIZE \
NODE_RANK=$RANK \
megatron sft \
--load Qwen3-30B-A3B-Base-mcore \
--dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \
--tensor_model_parallel_size 2 \
--expert_model_parallel_size 8 \
--moe_grouped_gemm true \
--moe_shared_expert_overlap true \
--moe_aux_loss_coeff 0.01 \
--micro_batch_size 1 \
--global_batch_size 16 \
--packing true \
--recompute_granularity full \
--recompute_method uniform \
--recompute_num_layers 1 \
--train_iters 2000 \
--eval_iters 50 \
--finetune true \
--cross_entropy_loss_fusion true \
--lr 1e-5 \
--lr_warmup_iters 100 \
--min_lr 1e-6 \
--save megatron_output/Qwen3-30B-A3B-Base \
--eval_interval 200 \
--save_interval 200 \
--max_length 8192 \
--num_workers 8 \
--dataset_num_proc 8 \
--no_save_optim true \
--no_save_rng true \
--sequence_parallel true \
--use_flash_attn true
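
The header comments and flags in this script imply a particular GPU layout. The following is a minimal sketch of that arithmetic, not code from the PR. It assumes 16 GPUs in total (read off the `16 * 60GiB` comment), pipeline and context parallel sizes of 1 (neither flag is set in the script), and the usual Megatron convention that the data-parallel size is the world size divided by the product of the tensor, pipeline, and context parallel sizes. The helper names `data_parallel_size` and `grad_accum_steps` are hypothetical and are not part of ms-swift or Megatron-LM.

```python
def data_parallel_size(world_size: int, tp: int, pp: int = 1, cp: int = 1) -> int:
    """Data-parallel replicas left once model parallelism has taken its share of GPUs."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(f'world_size={world_size} is not divisible by tp*pp*cp={model_parallel}')
    return world_size // model_parallel


def grad_accum_steps(global_batch_size: int, micro_batch_size: int, dp: int) -> int:
    """Micro-batches each data-parallel replica processes per optimizer step."""
    samples_per_pass = micro_batch_size * dp
    if global_batch_size % samples_per_pass != 0:
        raise ValueError(f'global_batch_size={global_batch_size} is not divisible by {samples_per_pass}')
    return global_batch_size // samples_per_pass


# Flag values come from the script; WORLD_GPUS = 16 is an assumption based on its header comment.
WORLD_GPUS = 16
dp = data_parallel_size(WORLD_GPUS, tp=2)
accum = grad_accum_steps(global_batch_size=16, micro_batch_size=1, dp=dp)
speedup = 91.2 / 9.6  # ZeRO3 s/it divided by Megatron-LM s/it, both from the header comments

print(dp, accum, round(speedup, 1))
```

Under these assumptions the 16 GPUs form 8 data-parallel replicas of a 2-way tensor-parallel model, each replica accumulates 2 micro-batches per optimizer step, and the quoted iteration times correspond to roughly a 9.5x speedup over ZeRO3. The value of `--expert_model_parallel_size` does not enter this particular formula.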
12 changes: 11 additions & 1 deletion swift/llm/dataset/dataset/llm.py
@@ -829,9 +829,19 @@ def preprocess(self, row: Dict[str, Any]) -> Dict[str, Any]:
         return super().preprocess(row)


+class ThinkSelfCognitionPreprocessor(SelfCognitionPreprocessor):
+
+    def preprocess(self, row: Dict[str, Any]) -> Dict[str, Any]:
+        row['response'] = '<think>\n\n</think>\n\n' + row['response']
+        return super().preprocess(row)
+
+
 register_dataset(
     DatasetMeta(
         ms_dataset_id='swift/self-cognition',
         hf_dataset_id='modelscope/self-cognition',
-        preprocess_func=SelfCognitionPreprocessor(),
+        subsets=[
+            SubsetDataset(preprocess_func=SelfCognitionPreprocessor()),
+            SubsetDataset('think', preprocess_func=ThinkSelfCognitionPreprocessor()),
+        ],
         tags=['chat', 'self-cognition', '🔥']))
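
The `think` subset differs from the default one only in the string it prepends to each response. Here is a minimal standalone sketch of that transformation, written as a plain function rather than a `SelfCognitionPreprocessor` subclass. The name `add_empty_think` is hypothetical, and unlike the class in the diff this sketch returns a copy instead of modifying the row in place.

```python
from typing import Any, Dict

# Empty reasoning block, copied from ThinkSelfCognitionPreprocessor in the diff.
EMPTY_THINK = '<think>\n\n</think>\n\n'


def add_empty_think(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `row` whose response starts with an empty <think> block."""
    new_row = dict(row)  # shallow copy so the caller's row is left untouched
    new_row['response'] = EMPTY_THINK + row['response']
    return new_row


row = {'query': 'Who are you?', 'response': 'I am an AI assistant.'}
print(repr(add_empty_think(row)['response']))
```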
24 changes: 20 additions & 4 deletions swift/llm/model/model/qwen.py
@@ -494,10 +494,22 @@ def _get_cast_dtype(self) -> torch.dtype:
         LLMModelType.qwen3,
         [
             ModelGroup([
-                # Model('Qwen/Qwen3-0.6B-Base', 'Qwen/Qwen3-0.6B-Base'),
+                Model('Qwen/Qwen3-0.6B-Base', 'Qwen/Qwen3-0.6B-Base'),
+                Model('Qwen/Qwen3-1.7B-Base', 'Qwen/Qwen3-1.7B-Base'),
+                Model('Qwen/Qwen3-4B-Base', 'Qwen/Qwen3-4B-Base'),
+                Model('Qwen/Qwen3-8B-Base', 'Qwen/Qwen3-8B-Base'),
+                Model('Qwen/Qwen3-14B-Base', 'Qwen/Qwen3-14B-Base'),
+                Model('Qwen/Qwen3-32B-Base', 'Qwen/Qwen3-32B-Base'),
+                # instruct
+                Model('Qwen/Qwen3-0.6B', 'Qwen/Qwen3-0.6B'),
+                Model('Qwen/Qwen3-1.7B', 'Qwen/Qwen3-1.7B'),
+                Model('Qwen/Qwen3-4B', 'Qwen/Qwen3-4B'),
+                Model('Qwen/Qwen3-8B', 'Qwen/Qwen3-8B'),
+                Model('Qwen/Qwen3-14B', 'Qwen/Qwen3-14B'),
+                Model('Qwen/Qwen3-32B', 'Qwen/Qwen3-32B'),
             ]),
         ],
-        TemplateType.qwen,
+        TemplateType.qwen3,
         get_model_tokenizer_with_flash_attn,
         architectures=['Qwen3ForCausalLM'],
         requires=['transformers>=4.51'],
@@ -508,10 +520,14 @@ def _get_cast_dtype(self) -> torch.dtype:
         LLMModelType.qwen3_moe,
         [
             ModelGroup([
-                # Model('Qwen/Qwen3-15B-A2B-Base', 'Qwen/Qwen3-15B-A2B-Base'),
+                Model('Qwen/Qwen3-30B-A3B-Base', 'Qwen/Qwen3-30B-A3B-Base'),
+                Model('Qwen/Qwen3-235B-A22B-Base', 'Qwen/Qwen3-235B-A22B-Base'),
+                # instruct
+                Model('Qwen/Qwen3-30B-A3B', 'Qwen/Qwen3-30B-A3B'),
+                Model('Qwen/Qwen3-235B-A22B', 'Qwen/Qwen3-235B-A22B'),
             ]),
         ],
-        TemplateType.qwen,
+        TemplateType.qwen3,
         get_model_tokenizer_with_flash_attn,
         architectures=['Qwen3MoeForCausalLM'],
         requires=['transformers>=4.51'],
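
Both registrations declare `requires=['transformers>=4.51']`. How ms-swift enforces that list is not shown in this diff, so the sketch below only illustrates what such a minimum-version check amounts to. It uses a simplified numeric comparison that ignores pre-release tags; real code would more likely rely on `packaging.version`, as `patcher.py` does. The helper names `version_tuple` and `satisfies_minimum` are hypothetical.

```python
import re
from typing import Tuple


def version_tuple(text: str) -> Tuple[int, ...]:
    """Turn '4.51' or '4.51.0' into a three-field tuple such as (4, 51, 0)."""
    parts = [int(part) for part in re.findall(r'\d+', text)[:3]]
    parts += [0] * (3 - len(parts))  # pad so '4.51' and '4.51.0' compare equal
    return tuple(parts)


def satisfies_minimum(installed: str, requirement: str) -> bool:
    """Check a 'name>=X.Y' style requirement against an installed version string."""
    _, minimum = requirement.split('>=')
    return version_tuple(installed) >= version_tuple(minimum)


print(satisfies_minimum('4.51.3', 'transformers>=4.51'))
print(satisfies_minimum('4.50.0', 'transformers>=4.51'))
```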
15 changes: 14 additions & 1 deletion swift/llm/model/patcher.py
@@ -1,4 +1,5 @@
# Copyright (c) Alibaba, Inc. and its affiliates.
import os
from contextlib import contextmanager
from functools import wraps
from types import MethodType
@@ -7,9 +8,9 @@
import accelerate
import torch
import torch.nn as nn
import torch.nn.functional as F
import transformers
from accelerate.utils import find_device
from packaging import version
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
from torch.nn.parallel import DistributedDataParallel as DDP
from transformers import PreTrainedModel, dynamic_module_utils, trainer
@@ -343,3 +344,15 @@ def new_get_cached_module_file(pretrained_model_name_or_path, *args, **kwargs):
         yield
     finally:
         dynamic_module_utils.get_cached_module_file = origin_get_cached_module_file
+
+
+@contextmanager
+def patch_tp_plan():
+    if not is_mp_ddp() or version.parse(transformers.__version__) < version.parse('4.50'):
+        yield
+        return
+    WORLD_SIZE = os.environ.get('WORLD_SIZE')
+    os.environ['_PATCH_WORLD_SIZE'] = WORLD_SIZE
+    os.environ.pop('WORLD_SIZE')
+    yield
+    os.environ['WORLD_SIZE'] = WORLD_SIZE
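
The pattern in `patch_tp_plan` is to hide an environment variable for the duration of a block and put it back afterwards. A generic sketch of that pattern follows. The helper name `hide_env_var` is hypothetical. The sketch wraps the `yield` in `try`/`finally` so the variable comes back even if the block raises, which the version in the diff does not do. It also tolerates the variable being unset and removes the backup key on exit. These are design choices of the sketch, not behavior taken from the PR.

```python
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def hide_env_var(name: str, backup_name: str) -> Iterator[None]:
    """Temporarily remove `name` from os.environ, keeping its value under `backup_name`."""
    value = os.environ.pop(name, None)
    if value is None:
        # Nothing to hide, so run the block unchanged.
        yield
        return
    os.environ[backup_name] = value
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions alike.
        os.environ[name] = value
        os.environ.pop(backup_name, None)


os.environ['DEMO_WORLD_SIZE'] = '8'
with hide_env_var('DEMO_WORLD_SIZE', '_DEMO_BACKUP'):
    inside = (os.environ.get('DEMO_WORLD_SIZE'), os.environ.get('_DEMO_BACKUP'))
after = (os.environ.get('DEMO_WORLD_SIZE'), os.environ.get('_DEMO_BACKUP'))
print(inside, after)
```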
4 changes: 2 additions & 2 deletions swift/llm/model/register.py
@@ -21,7 +21,7 @@
 from swift.utils import get_dist_setting, get_logger, is_mp, is_unsloth_available, patch_getattr, use_torchacc
 from .constant import ModelType
 from .patcher import (patch_automodel, patch_automodel_for_sequence_classification, patch_get_dynamic_module,
-                      patch_mp_ddp)
+                      patch_mp_ddp, patch_tp_plan)
 from .utils import AttnImpl, HfConfigFactory, ModelInfo, safe_snapshot_download

 GetModelTokenizerFunction = Callable[..., Tuple[Optional[PreTrainedModel], PreTrainedTokenizerBase]]
@@ -567,7 +567,7 @@ def get_model_tokenizer(
     kwargs['attn_impl'] = attn_impl
     kwargs['rope_scaling'] = rope_scaling
     kwargs['model_meta'] = model_meta
-    with patch_get_dynamic_module():
+    with patch_get_dynamic_module(), patch_tp_plan():
         model, processor = get_function(model_dir, model_info, model_kwargs, load_model, **kwargs)

     if not isinstance(processor, PreTrainedTokenizerBase) and hasattr(processor, 'tokenizer'):
1 change: 1 addition & 0 deletions swift/llm/template/constant.py
@@ -12,6 +12,7 @@ class LLMTemplateType:
     qwen2_5 = 'qwen2_5'
     qwen2_5_math = 'qwen2_5_math'
     qwen2_5_math_prm = 'qwen2_5_math_prm'
+    qwen3 = 'qwen3'
     qwq_preview = 'qwq_preview'
     qwq = 'qwq'
     marco_o1 = 'marco_o1'
8 changes: 6 additions & 2 deletions swift/llm/template/template/qwen.py
@@ -45,7 +45,7 @@ class Qwen2_5MathTemplateMeta(QwenTemplateMeta):
 register_template(QwenTemplateMeta(LLMTemplateType.qwq_preview, default_system=qwq_preview_system))


-class QwQTemplate(Template):
+class ThinkingTemplate(Template):

     def _swift_encode(self, inputs: StdTemplateInputs):
         if not self.is_training:
@@ -56,7 +56,11 @@ def _swift_encode(self, inputs: StdTemplateInputs):


 register_template(
-    QwenTemplateMeta(LLMTemplateType.qwq, default_system=None, response_prefix='<think>\n', template_cls=QwQTemplate))
+    QwenTemplateMeta(
+        LLMTemplateType.qwq, default_system=None, response_prefix='<think>\n', template_cls=ThinkingTemplate))
+
+# '<think>\n\n</think>\n\n'
+register_template(QwenTemplateMeta(LLMTemplateType.qwen3, default_system=None, template_cls=ThinkingTemplate))

 register_template(Qwen2_5MathTemplateMeta(LLMTemplateType.qwen2_5_math))

2 changes: 1 addition & 1 deletion swift/utils/env.py
@@ -27,7 +27,7 @@ def get_dist_setting() -> Tuple[int, int, int, int]:
     """return rank, local_rank, world_size, local_world_size"""
     rank = int(os.getenv('RANK', -1))
     local_rank = int(os.getenv('LOCAL_RANK', -1))
-    world_size = int(os.getenv('WORLD_SIZE', 1))
+    world_size = int(os.getenv('WORLD_SIZE') or os.getenv('_PATCH_WORLD_SIZE') or 1)
     # compat deepspeed launch
     local_world_size = int(os.getenv('LOCAL_WORLD_SIZE', None) or os.getenv('LOCAL_SIZE', 1))
     return rank, local_rank, world_size, local_world_size
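
The new `world_size` expression is an `or` chain rather than a `getenv` default, so an unset or empty `WORLD_SIZE` falls through to `_PATCH_WORLD_SIZE` and then to 1. Below is a small sketch of that lookup against an explicit mapping, so it can be exercised without touching the real environment. The function name `read_world_size` is hypothetical.

```python
from typing import Mapping


def read_world_size(env: Mapping[str, str]) -> int:
    """Mirror the fallback chain: WORLD_SIZE, then _PATCH_WORLD_SIZE, then 1."""
    # `or` skips both missing keys (None) and empty strings.
    return int(env.get('WORLD_SIZE') or env.get('_PATCH_WORLD_SIZE') or 1)


print(read_world_size({'WORLD_SIZE': '8'}))
print(read_world_size({'_PATCH_WORLD_SIZE': '4'}))
print(read_world_size({}))
```

This is why code that calls `get_dist_setting()` while `WORLD_SIZE` is temporarily hidden still sees the original value.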
11 changes: 10 additions & 1 deletion tests/test_align/test_template/test_llm.py
@@ -44,6 +44,14 @@ def test_qwen2_5():
     assert response == response2


+def test_qwen3():
+    pt_engine = PtEngine('Qwen/Qwen3-4B')
+    response = _infer_model(pt_engine)
+    pt_engine.default_template.template_backend = 'jinja'
+    response2 = _infer_model(pt_engine)
+    assert response == response2
+
+
 def test_phi4():
     pt_engine = PtEngine('LLM-Research/phi-4')
     response = _infer_model(pt_engine)
@@ -408,4 +416,5 @@ def test_gemma3():
     # test_moonlight()
     # test_ling()
     # test_gemma3()
-    test_glm4_0414()
+    # test_glm4_0414()
+    test_qwen3()