[megatron]Support packing & CP #4163

Merged
18 changes: 14 additions & 4 deletions docs/source/Instruction/Megatron-SWIFT训练.md
@@ -134,8 +134,15 @@ I am a language model developed by swift, you can call me swift-robot. How can I
- 🔥micro_batch_size: Batch size per device, default is 1.
- 🔥global_batch_size: Total batch size, equivalent to `micro_batch_size * data parallel size * gradient accumulation steps`. Default is 16.
- 🔥recompute_granularity: Granularity of activation recomputation, options are 'full' and 'selective'. 'full' recomputes the entire transformer layer, while 'selective' only recomputes the core attention part of the transformer layer. 'selective' is generally recommended. Default is 'selective'.
- recompute_method: This parameter takes effect only when recompute_granularity is set to 'full'; options are 'uniform' and 'block'. Default is None.
- recompute_num_layers: This parameter takes effect only when recompute_granularity is set to 'full'. If `recompute_method` is set to uniform, it specifies the number of transformer layers in each uniformly divided recomputation unit, e.g. `--recompute_granularity full --recompute_method uniform --recompute_num_layers 4`. The larger recompute_num_layers is, the smaller the memory usage and the higher the computation cost. Default is None.
- 🔥recompute_method: This parameter takes effect only when recompute_granularity is set to 'full'; options are 'uniform' and 'block'. Default is None.
- 🔥recompute_num_layers: This parameter takes effect only when recompute_granularity is set to 'full'. If `recompute_method` is set to uniform, it specifies the number of transformer layers in each uniformly divided recomputation unit, e.g. `--recompute_granularity full --recompute_method uniform --recompute_num_layers 4`. The larger recompute_num_layers is, the smaller the memory usage and the higher the computation cost. Default is None.
- recompute_modules: Options include "core_attn", "moe_act", "layernorm", "mla_up_proj", "mlp", and "moe". The default is `["core_attn"]`. For example, during MoE training you can reduce memory usage by specifying `--recompute_granularity selective --recompute_modules core_attn moe`. "core_attn", "mlp", and "moe" use normal checkpointing, while "moe_act", "layernorm", and "mla_up_proj" use output-discarding checkpointing.
- "core_attn": Recompute the core attention part of the Transformer layer.
- "mlp": Recompute the dense MLP layer.
- "moe": Recompute the MoE layer.
- "moe_act": Recompute the MLP activation part of the MoE module.
- "layernorm": Recompute input_layernorm and pre_mlp_layernorm.
- "mla_up_proj": Recompute the MLA up-projection and RoPE application parts.
- deterministic_mode: Deterministic mode; this slows down training. Default is False.
- 🔥train_iters: Total number of training iterations, default is None.
- 🔥log_interval: Logging interval (unit: iters), default is 5.
@@ -146,6 +153,7 @@ I am a language model developed by swift, you can call me swift-robot. How can I
- no_rope_fusion: Default is False. Specify `--no_rope_fusion true` to disable RoPE fusion.
- no_gradient_accumulation_fusion: Default is False. Specify `--no_gradient_accumulation_fusion true` to disable gradient accumulation fusion.
- 🔥cross_entropy_loss_fusion: Enable cross-entropy loss fusion. Default is False.
- calculate_per_token_loss: Scale the cross-entropy loss by the number of non-padded tokens in the global batch. Default is True.
- 🔥attention_backend: Attention backend to use (flash, fused, unfused, local, auto). Default is auto.
- optimizer: Optimizer type, options are 'adam' and 'sgd'. Default is adam.
- dataloader_type: Default is 'cyclic'; options are 'single', 'cyclic', 'external'. If `--streaming` is enabled, set it to `external`.
@@ -195,6 +203,8 @@ I am a language model developed by swift, you can call me swift-robot. How can I
- 🔥use_distributed_optimizer: Use a distributed optimizer. Default is True.
- 🔥tensor_model_parallel_size: TP size, default is 1.
- 🔥pipeline_model_parallel_size: PP size, default is 1.
- decoder_first_pipeline_num_layers: Number of Transformer layers in the first pipeline stage of the decoder. Default is None, meaning the Transformer layers are evenly distributed across all pipeline stages.
- decoder_last_pipeline_num_layers: Number of Transformer layers in the last pipeline stage of the decoder. Default is None, meaning the Transformer layers are evenly distributed across all pipeline stages.
- 🔥sequence_parallel: Enable sequence parallel optimization. Default is False.
- 🔥context_parallel_size: CP size, default is 1.
- tp_comm_overlap: Overlap tensor parallel communication with GEMM (general matrix multiplication) kernels (reduces communication time). Default is False.
@@ -260,8 +270,8 @@ I am a language model developed by swift, you can call me swift-robot. How can I
- moe_shared_expert_intermediate_size: Total FFN hidden size for shared experts. If there are multiple shared experts, it should equal `num_shared_experts * ffn_size_of_each_shared_expert`. Default is None. Automatically read from config.json.
- moe_router_topk: Number of experts each token is routed to. Default is None. Automatically read from config.json.
- moe_router_pre_softmax: Enable pre-softmax routing for MoE, meaning softmax is applied before top-k selection. Default is None. Automatically read from config.json.
- moe_aux_loss_coeff: Scaling coefficient for the auxiliary loss; a recommended initial value is 1e-2. Default is None. Automatically read from config.json.
- expert_model_parallel_size: Expert parallelism size, default is 1.
- 🔥moe_aux_loss_coeff: Scaling coefficient for the auxiliary loss; a recommended initial value is 1e-2. Default is None. Automatically read from config.json.
- 🔥expert_model_parallel_size: Expert parallelism size, default is 1.
- moe_token_dispatcher_type: Type of token dispatcher to use. Options include 'allgather', 'alltoall', and 'alltoall_seq'. Default is 'alltoall'.
- moe_grouped_gemm: When each rank contains multiple experts, improve utilization and performance by launching multiple local GEMM kernels across multiple streams, using GroupedLinear from TransformerEngine. Default is False.
- moe_router_load_balancing_type: Determines the router's load-balancing strategy. Options are "aux_loss", "seq_aux_loss", "sinkhorn", and "none". Default is "aux_loss".
18 changes: 14 additions & 4 deletions docs/source_en/Instruction/Megatron-SWIFT-Training.md
@@ -139,8 +139,15 @@ The speed comparison of full-parameter training for Dense/MoE models using `mega
- 🔥micro_batch_size: Batch size per device, default is 1.
- 🔥global_batch_size: Total batch size, equivalent to `micro_batch_size * data parallel size * gradient accumulation steps`. Default is 16.
- 🔥recompute_granularity: Granularity of activation recomputation, options are 'full', 'selective'. 'full' means recomputing the entire transformer layer, while 'selective' means only recomputing the core attention part of the transformer layer. 'selective' is generally recommended. Default is 'selective'.
- recompute_method: This parameter takes effect only when recompute_granularity is set to 'full', options are 'uniform', 'block'. Default is None.
- recompute_num_layers: This parameter takes effect only when recompute_granularity is set to 'full'. If `recompute_method` is set to uniform, this parameter specifies the number of transformer layers in each uniformly divided recomputation unit. For example, you can specify `--recompute_granularity full --recompute_method uniform --recompute_num_layers 4`. The larger recompute_num_layers is, the smaller the memory usage and the higher the computation cost. Default is None.
- 🔥recompute_method: This parameter takes effect only when recompute_granularity is set to 'full', options are 'uniform', 'block'. Default is None.
- 🔥recompute_num_layers: This parameter takes effect only when recompute_granularity is set to 'full'. If `recompute_method` is set to uniform, this parameter specifies the number of transformer layers in each uniformly divided recomputation unit. For example, you can specify `--recompute_granularity full --recompute_method uniform --recompute_num_layers 4`. The larger recompute_num_layers is, the smaller the memory usage and the higher the computation cost. Default is None.
- recompute_modules: Options include "core_attn", "moe_act", "layernorm", "mla_up_proj", "mlp", and "moe". The default value is `["core_attn"]`. For example, during MoE training, you can reduce memory usage by specifying `--recompute_granularity selective --recompute_modules core_attn moe`. Among these, "core_attn", "mlp", and "moe" use normal checkpointing, while "moe_act", "layernorm", and "mla_up_proj" use output-discarding checkpointing.
- "core_attn": Recomputes the core attention part of the Transformer layer.
- "mlp": Recomputes the dense MLP layer.
- "moe": Recomputes the MoE layer.
- "moe_act": Recomputes the MLP activation function part in the MoE module.
- "layernorm": Recomputes the input_layernorm and pre_mlp_layernorm.
- "mla_up_proj": Recomputes the MLA up-projection and RoPE application parts.
- deterministic_mode: Deterministic mode, which may lead to slower training speed, default is False.
- 🔥train_iters: Total number of training iterations, default is None.
- 🔥log_interval: Log interval (unit: iters), default is 5.
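
As a quick aside on the `global_batch_size` relation documented above, here is a minimal sketch of how the gradient accumulation steps fall out of the other settings. The data-parallel size formula `world_size / (tp * pp * cp)` is the usual Megatron layout and is an assumption here, not something stated in this PR:

```python
# Minimal sketch (not part of this PR) of
# global_batch_size = micro_batch_size * data_parallel_size * gradient_accumulation_steps.
# Assumption: data_parallel_size = world_size / (tp * pp * cp), the usual Megatron layout.


def data_parallel_size(world_size: int, tp: int, pp: int, cp: int) -> int:
    assert world_size % (tp * pp * cp) == 0
    return world_size // (tp * pp * cp)


def gradient_accumulation_steps(global_batch_size: int, micro_batch_size: int, dp: int) -> int:
    assert global_batch_size % (micro_batch_size * dp) == 0
    return global_batch_size // (micro_batch_size * dp)


dp = data_parallel_size(world_size=16, tp=2, pp=2, cp=1)           # -> 4
print(gradient_accumulation_steps(16, micro_batch_size=1, dp=dp))  # -> 4
```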
@@ -151,6 +158,7 @@ The speed comparison of full-parameter training for Dense/MoE models using `mega
- no_rope_fusion: Default is False. Specify `--no_rope_fusion true` to disable rope fusion.
- no_gradient_accumulation_fusion: Default is False. Specify `--no_gradient_accumulation_fusion true` to disable gradient accumulation fusion.
- 🔥cross_entropy_loss_fusion: Enables cross-entropy loss calculation fusion. Default is False.
- calculate_per_token_loss: Scales the cross-entropy loss according to the number of non-padded tokens in the global batch. Default is True.
- 🔥attention_backend: The attention backend to use (flash, fused, unfused, local, auto). Defaults to auto.
- optimizer: Optimizer type, options are 'adam', 'sgd'. Default is adam.
- dataloader_type: Default is 'cyclic', options are 'single', 'cyclic', 'external'. If `--streaming` is enabled, set it to external.
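
To make `calculate_per_token_loss` above concrete: scaling by the number of non-padded tokens means every real token contributes equally to the loss, regardless of how samples are padded or packed. A hedged PyTorch sketch (names and the exact reduction are illustrative; this is not Megatron's implementation):

```python
import torch
import torch.nn.functional as F


def per_token_cross_entropy(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Sum the cross-entropy over non-padded tokens (label == -100 marks padding)
    and divide by the count of such tokens in the batch."""
    loss_sum = F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1),
        ignore_index=-100, reduction='sum')
    num_tokens = (labels != -100).sum().clamp(min=1)
    return loss_sum / num_tokens
```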
@@ -203,6 +211,8 @@ seq_length: Defaults to None, meaning it is set to `max_length`. To restrict the
- 🔥use_distributed_optimizer: Use a distributed optimizer. Default is True.
- 🔥tensor_model_parallel_size: TP (Tensor Parallelism) size, default is 1.
- 🔥pipeline_model_parallel_size: PP (Pipeline Parallelism) size, default is 1.
- decoder_first_pipeline_num_layers: The number of Transformer layers in the first pipeline stage of the decoder. Default is None, which means the Transformer layers are evenly distributed across all pipeline stages.
- decoder_last_pipeline_num_layers: The number of Transformer layers in the last pipeline stage of the decoder. Default is None, which means the Transformer layers are evenly distributed across all pipeline stages.
- 🔥sequence_parallel: Enable sequence parallel optimization. Default is False.
- 🔥context_parallel_size: CP (Context Parallelism) size, default is 1.
- tp_comm_overlap: Overlap tensor parallel communication with GEMM (General Matrix Multiplication) kernels (to reduce communication time). Default is False.
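
A sketch of how `decoder_first_pipeline_num_layers` / `decoder_last_pipeline_num_layers` could shape the layer split: the first and last stages take the requested counts and the remaining layers are divided evenly over the middle stages. This mirrors the description above but is an assumption about the exact distribution, not code from Megatron or this PR:

```python
def split_layers(num_layers: int, pp: int, first: int, last: int) -> list:
    """Illustrative layer split: fixed first/last stage sizes, even middle stages."""
    middle_stages = pp - 2
    remaining = num_layers - first - last
    assert middle_stages > 0 and remaining % middle_stages == 0
    return [first] + [remaining // middle_stages] * middle_stages + [last]


# e.g. 30 layers, pp=4, first stage gets 6 layers, last stage gets 8
print(split_layers(30, 4, first=6, last=8))  # [6, 8, 8, 8]
```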
@@ -273,8 +283,8 @@ seq_length: Defaults to None, meaning it is set to `max_length`. To restrict the
- moe_shared_expert_intermediate_size: The total FFN hidden layer size for shared experts. If there are multiple shared experts, it should equal `num_shared_experts * ffn_size_of_each_shared_expert`. Default is None. Automatically read from config.json.
- moe_router_topk: The number of experts each token is routed to. Default is None. Automatically read from config.json.
- moe_router_pre_softmax: Enable pre-softmax routing for MoE, meaning that softmax will be applied before top-k selection. Default is None. Automatically read from config.json.
- moe_aux_loss_coeff: Scaling coefficient for the auxiliary loss: the recommended initial value is 1e-2. Default is None. Automatically read from config.json.
- expert_model_parallel_size: The degree of expert parallelism, default is 1.
- 🔥moe_aux_loss_coeff: Scaling coefficient for the auxiliary loss: the recommended initial value is 1e-2. Default is None. Automatically read from config.json.
- 🔥expert_model_parallel_size: The degree of expert parallelism, default is 1.
- moe_token_dispatcher_type: The type of token dispatcher to use. Options include 'allgather', 'alltoall', and 'alltoall_seq'. Default is 'alltoall'.
- moe_grouped_gemm: When each rank contains multiple experts, improve utilization and performance by launching multiple local GEMM kernels across multiple streams using GroupedLinear in TransformerEngine. Default is False.
- moe_router_load_balancing_type: Determines the load balancing strategy for the router. Options are "aux_loss", "seq_aux_loss", "sinkhorn", "none". Default is "aux_loss".
22 changes: 20 additions & 2 deletions swift/llm/template/base.py
@@ -1062,6 +1062,7 @@ def _encode(self, inputs: StdTemplateInputs) -> Dict[str, Any]:
        encoded['labels'] = labels
        encoded['loss_scale'] = loss_scale
        if self.use_megatron:
            self._handle_megatron_cp(encoded)
            encoded['labels'] = encoded['labels'][1:] + [-100]
            encoded['position_ids'] = list(range(len(encoded['labels'])))
        elif encoded.get('labels') is not None:
@@ -1072,6 +1073,15 @@ def _encode(self, inputs: StdTemplateInputs) -> Dict[str, Any]:
            encoded[k] = None
        return encoded

    def _handle_megatron_cp(self, encoded: Dict[str, Any]) -> None:
        cp_size = self.sequence_parallel_size
        if cp_size == 1:
            return
        input_ids = encoded['input_ids']
        padding_len = math.ceil(len(input_ids) / (cp_size * 2)) * (cp_size * 2) - len(input_ids)
        input_ids += [self.tokenizer.pad_token_id] * padding_len
        encoded['labels'] += [-100] * padding_len

    def debug_logger(self, inputs):
        if not strtobool(os.getenv('SWIFT_DEBUG', 'false')):
            return
@@ -1352,11 +1362,19 @@ def _data_collator(self, batch: List[Dict[str, Any]], *, padding_to: Optional[in

        if self.use_megatron:
            padding_to = math.ceil(max(seq_lens) / 128) * 128
            cp_size = self.sequence_parallel_size
            if cp_size > 1:
                padding_len = padding_to - seq_lens[0]
                position_ids = res['position_ids'][0].tolist()
                position_ids += list(range(cp_size * 2)) * (padding_len // (cp_size * 2))
                res['position_ids'][0] = torch.tensor(position_ids)

        for key, pad_value in zip(keys, pad_values):
            if key not in res:
                continue
            if padding_to is not None:
            if self.use_megatron and key == 'position_ids' and self.sequence_parallel_size > 1:
                pass
            elif padding_to is not None:
                padding_len = padding_to - seq_lens[0]
                if padding_len > 0:
                    res[key][0] = F.pad(res[key][0], (0, padding_len) if padding_right else (padding_len, 0),
@@ -1365,7 +1383,7 @@ def _data_collator(self, batch: List[Dict[str, Any]], *, padding_to: Optional[in

        # multimodal
        res.update(self._data_collator_mm_data(batch))
        if use_torchacc() or self.sequence_parallel_size > 1:
        if not self.use_megatron and (use_torchacc() or self.sequence_parallel_size > 1):
            res = self._torchacc_xtuner_data_collator(res, padding_to, self.tokenizer, padding_side)

        return res
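A standalone summary of the context-parallel padding this diff adds to `base.py`, for readers skimming the hunks above: each sample is padded to a multiple of `2 * cp_size` with pad tokens and `-100` labels, and the collator extends `position_ids` over the padded tail with a repeating `0..2*cp_size-1` pattern. The sketch below is a simplification (plain lists, a hypothetical `pad_token_id` argument), not the actual template code:

```python
import math


def pad_for_cp(input_ids, labels, cp_size, pad_token_id=0):
    """Pad a sample so its length is a multiple of 2 * cp_size, as _handle_megatron_cp
    does; labels get -100 so the padding does not contribute to the loss."""
    target = math.ceil(len(input_ids) / (cp_size * 2)) * (cp_size * 2)
    padding_len = target - len(input_ids)
    return input_ids + [pad_token_id] * padding_len, labels + [-100] * padding_len


def extend_position_ids(position_ids, padding_to, cp_size):
    """Mirror of the collator branch: the padded tail gets repeating 0..2*cp_size-1
    position ids (padding_to - len(position_ids) is assumed to be a multiple of 2*cp_size)."""
    padding_len = padding_to - len(position_ids)
    return position_ids + list(range(cp_size * 2)) * (padding_len // (cp_size * 2))


# cp_size=2: a 10-token sample is padded to 12 (the next multiple of 4)
ids, lbls = pad_for_cp(list(range(10)), list(range(10)), cp_size=2)
print(len(ids), lbls[-2:])  # 12 [-100, -100]
```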
6 changes: 5 additions & 1 deletion swift/megatron/argument/megatron_args.py
@@ -1,7 +1,7 @@
# Copyright (c) Alibaba, Inc. and its affiliates.
import os
import sys
from dataclasses import asdict, dataclass
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Literal, Optional, Tuple, Union

import torch
@@ -31,6 +31,7 @@ class MegatronArguments(ExtraMegatronArguments):
    recompute_granularity: Literal['selective', 'full'] = 'selective'
    recompute_method: Literal['uniform', 'block'] = None
    recompute_num_layers: Optional[int] = None
    recompute_modules: List[str] = field(default_factory=lambda: ['core_attn'])
    use_cpu_initialization: bool = False
    deterministic_mode: bool = False
    train_iters: Optional[int] = None
@@ -42,6 +43,7 @@ class MegatronArguments(ExtraMegatronArguments):
    no_rope_fusion: bool = False
    no_gradient_accumulation_fusion: bool = False
    cross_entropy_loss_fusion: bool = False
    calculate_per_token_loss: bool = True
    use_flash_attn: bool = False
    attention_backend: str = 'auto'  # flash, fused, unfused, local, auto
    optimizer: Literal['adam', 'sgd'] = 'adam'
@@ -84,6 +86,8 @@ class MegatronArguments(ExtraMegatronArguments):
    use_distributed_optimizer: bool = True
    tensor_model_parallel_size: int = 1
    pipeline_model_parallel_size: int = 1
    decoder_first_pipeline_num_layers: Optional[int] = None
    decoder_last_pipeline_num_layers: Optional[int] = None
    sequence_parallel: bool = False
    context_parallel_size: int = 1
    tp_comm_overlap: bool = False
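One note on why this file's import changes from `dataclass` to `dataclass, field`: the new list-typed `recompute_modules` argument needs `default_factory`, because dataclasses reject mutable defaults. A minimal sketch of the behavior:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Args:
    # `recompute_modules: List[str] = ['core_attn']` would raise ValueError at class
    # definition time; default_factory builds a fresh list per instance instead.
    recompute_modules: List[str] = field(default_factory=lambda: ['core_attn'])


a, b = Args(), Args()
a.recompute_modules.append('moe')
print(b.recompute_modules)  # ['core_attn'] -- instances do not share the list
```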
4 changes: 1 addition & 3 deletions swift/megatron/argument/train_args.py
@@ -41,9 +41,7 @@ def _init_save(self):
        os.makedirs(self.save, exist_ok=True)

    def __post_init__(self):
        if self.sequence_parallel_size > 1:
            # please use `--sequence_parallel` or `--context_parallel_size`.
            self.sequence_parallel_size = 1
        self.sequence_parallel_size = self.context_parallel_size
        self.load = to_abspath(self.load, check_path_exist=True)
        BaseArguments.__post_init__(self)
        self._init_save()