使用megatron swift sft微调Qwen3-30B-A3B之后，checkpoint无法转回huggingface格式 #4147

alanayu · 2025-05-09T07:28:07Z

Describe the bug
What the bug is, and how to reproduce, better with screenshots(描述bug以及复现过程，最好有截图)
按照https://huggingface.co/Qwen/Qwen3-30B-A3B/discussions/3中的Qwen3-30B-A3B的微调方法训练之后，checkpoint无法使用swift export方法转换回huggingface 格式

格式转换脚本
CUDA_VISIBLE_DEVICES=0
swift export
--model /mnt/nvme2/train/megatron_output/Qwen3-30B-A3B/v2-20250509-050959/iter_0000010
--model_type qwen3_moe
--torch_dtype bfloat16
--to_hf true
--test_convert_precision false
--output_dir /mnt/nvme2/train/megatron_output/Qwen3-30B-A3B/v0-20250506-143334/iter_0016029_hf

上述格式转换脚本处理Qwen3-30B-A3B-hf-mcore可以成功，处理经过训练后的checkpoint报错

报错信息

run sh: /usr/bin/python /mnt/nvme2/ms-swift/swift/cli/export.py --model /mnt/nvme2/train/megatron_output/Qwen3-30B-A3B/v2-20250509-050959/iter_0000010 --model_type qwen3_moe --torch_dtype bfloat16 --to_hf true --test_convert_precision false --output_dir /mnt/nvme2/train/megatron_output/Qwen3-30B-A3B/v0-20250506-143334/iter_0016029_hf
[INFO:swift] Successfully registered /mnt/nvme2/ms-swift/swift/llm/dataset/data/dataset_info.json.
[INFO:swift] rank: 0, local_rank: 0, world_size: 1, local_world_size: 1
[INFO:swift] Loading the model using model_dir: /mnt/nvme2/train/megatron_output/Qwen3-30B-A3B/v2-20250509-050959/iter_0000010
[INFO:swift] args.output_dir: /mnt/nvme2/train/megatron_output/Qwen3-30B-A3B/v0-20250506-143334/iter_0016029_hf
[INFO:swift] Global seed set to 42
[INFO:swift] args: ExportArguments(model='/mnt/nvme2/train/megatron_output/Qwen3-30B-A3B/v2-20250509-050959/iter_0000010', model_type='qwen3_moe', model_revision=None, task_type='causal_lm', torch_dtype=torch.bfloat16, attn_impl=None, num_labels=None, problem_type=None, rope_scaling=None, device_map=None, max_memory={}, local_repo_path=None, template='qwen3', system=None, max_length=2048, truncation_strategy='delete', max_pixels=None, agent_template=None, norm_bbox=None, response_prefix=None, padding_side='right', loss_scale='default', sequence_parallel_size=1, use_chat_template=True, template_backend='swift', dataset=[], val_dataset=[], split_dataset_ratio=0.01, data_seed=42, dataset_num_proc=1, dataset_shuffle=True, val_dataset_shuffle=False, streaming=False, interleave_prob=None, stopping_strategy='first_exhausted', shuffle_buffer_size=1000, enable_cache=False, download_mode='reuse_dataset_if_exists', columns={}, strict=False, remove_unused_columns=True, model_name=[None, None], model_author=[None, None], custom_dataset_info=[], quant_method=None, quant_bits=None, hqq_axis=None, bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_quant_type='nf4', bnb_4bit_use_double_quant=True, bnb_4bit_quant_storage=None, max_new_tokens=None, temperature=None, top_k=None, top_p=None, repetition_penalty=None, num_beams=1, stream=False, stop_words=[], logprobs=False, top_logprobs=None, ckpt_dir=None, load_dataset_config=None, lora_modules=[], tuner_backend='peft', train_type='lora', adapters=[], external_plugins=[], seed=42, model_kwargs={}, load_args=True, load_data_args=False, use_hf=False, hub_token=None, custom_register_path=[], ignore_args_error=False, use_swift_lora=False, merge_lora=False, safe_serialization=True, max_shard_size='5GB', output_dir='/mnt/nvme2/train/megatron_output/Qwen3-30B-A3B/v0-20250506-143334/iter_0016029_hf', quant_n_samples=256, quant_batch_size=1, group_size=128, to_ollama=False, to_mcore=False, to_hf=True, mcore_model=None, thread_count=None, test_convert_precision=False, push_to_hub=False, hub_model_id=None, hub_private_repo=False, commit_message='update files', to_peft_format=False, exist_ok=False)
[INFO:swift] Start time of running main: 2025-05-09 06:56:15.652839
[INFO:swift] local_repo_path: /root/.cache/modelscope/hub/_github/Megatron-LM
/usr/local/lib/python3.12/dist-packages/modelopt/torch/nas/plugins/init.py:16: DeprecationWarning: The 'megatron.core.transformer.custom_layers.transformer_engine'
module is deprecated and will be removed in 0.10.0. Please use
'megatron.core.extensions.transformer_engine' instead.
from .megatron import *
[NeMo W 2025-05-09 06:56:21 nemo_logging:405] /usr/local/lib/python3.12/dist-packages/pyannote/core/notebook.py:134: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed in 3.11. Use matplotlib.colormaps[name] or matplotlib.colormaps.get_cmap() or pyplot.get_cmap() instead.
cm = get_cmap("Set1")

[INFO:swift] Loading the model using model_dir: /mnt/liufei/train/megatron_output/Qwen3-30B-A3B/v2-20250509-050959/iter_0000010
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/nvme2/ms-swift/swift/cli/export.py", line 5, in
[rank0]: export_main()
[rank0]: File "/mnt/nvme2/ms-swift/swift/llm/export/export.py", line 50, in export_main
[rank0]: return SwiftExport(args).main()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/nvme2/ms-swift/swift/llm/base.py", line 47, in main
[rank0]: result = self.run()
[rank0]: ^^^^^^^^^^
[rank0]: File "/mnt/nvme2/ms-swift/swift/llm/export/export.py", line 37, in run
[rank0]: convert_mcore2hf(args)
[rank0]: File "/mnt/nvme2/ms-swift/swift/megatron/utils/convert.py", line 92, in convert_mcore2hf
[rank0]: hf_model, processor = get_model_tokenizer(**kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/nvme2/ms-swift/swift/llm/model/register.py", line 571, in get_model_tokenizer
[rank0]: model, processor = get_function(model_dir, model_info, model_kwargs, load_model, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/nvme2/ms-swift/swift/llm/model/register.py", line 272, in get_model_tokenizer_with_flash_attn
[rank0]: return get_model_tokenizer_from_local(model_dir, model_info, model_kwargs, load_model, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/nvme2/ms-swift/swift/llm/model/register.py", line 209, in get_model_tokenizer_from_local
[rank0]: tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/transformers/models/auto/tokenization_auto.py", line 1028, in from_pretrained
[rank0]: return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_base.py", line 2046, in from_pretrained
[rank0]: raise EnvironmentError(
[rank0]: OSError: Can't load tokenizer for '/mnt/nvme2/train/megatron_output/Qwen3-30B-A3B/v2-20250509-050959/iter_0000010'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/mnt/nvme2/train/megatron_output/Qwen3-30B-A3B/v2-20250509-050959/iter_0000010' is the correct path to a directory containing all relevant files for a RobertaTokenizerFast tokenizer.
[rank0]:[W509 06:56:23.802697360 ProcessGroupNCCL.cpp:1427] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())

从报错信息来看，似乎checkpoint里找不到tokenizer的信息

Your hardware and system info
Write your system info like CUDA version/system/GPU/torch version here(在这里给出硬件信息和系统信息，如CUDA版本，系统，GPU型号和torch版本等)

torch
'2.6.0a0+ecf3bae40a.nv25.01'

transformers
'4.51.3'

swift
'3.5.0.dev0'

CUDA
12.8

Additional context
Add any other context about the problem here(在这里补充其他信息)

The text was updated successfully, but these errors were encountered:

firefighter-eric · 2025-05-09T07:49:53Z

--model换成--mcore_model试试

alanayu · 2025-05-09T08:07:33Z

运行
CUDA_VISIBLE_DEVICES=0
swift export
--mcore_model /mnt/nvme2/train/megatron_output/Qwen3-30B-A3B/v2-20250509-050959/iter_0000010
--model_type qwen3_moe
--torch_dtype bfloat16
--to_hf true
--test_convert_precision false
--output_dir /mnt/nvme2/train/megatron_output/Qwen3-30B-A3B/v0-20250506-143334/iter_0016029_hf
后报错

run sh: /usr/bin/python /mnt/nvme2/ms-swift/swift/cli/export.py --mcore_model /mnt/nvme2/train/megatron_output/Qwen3-30B-A3B/v2-20250509-050959/iter_0000010 --model_type qwen3_moe --torch_dtype bfloat16 --to_hf true --test_convert_precision false --output_dir /mnt/nvme2/train/megatron_output/Qwen3-30B-A3B/v0-20250506-143334/iter_0016029_hf
[INFO:swift] Successfully registered /mnt/nvme2/ms-swift/swift/llm/dataset/data/dataset_info.json.
[INFO:swift] rank: 0, local_rank: 0, world_size: 1, local_world_size: 1
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/nvme2/ms-swift/swift/cli/export.py", line 5, in
[rank0]: export_main()
[rank0]: File "/mnt/nvme2/ms-swift/swift/llm/export/export.py", line 50, in export_main
[rank0]: return SwiftExport(args).main()
[rank0]: ^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/nvme2/ms-swift/swift/llm/base.py", line 18, in init
[rank0]: self.args = self._parse_args(args)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/nvme2/ms-swift/swift/llm/base.py", line 30, in _parse_args
[rank0]: args, remaining_argv = parse_args(self.args_class, args)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/nvme2/ms-swift/swift/utils/utils.py", line 151, in parse_args
[rank0]: args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/transformers/hf_argparser.py", line 358, in parse_args_into_dataclasses
[rank0]: obj = dtype(**inputs)
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: File "", line 99, in init
[rank0]: File "/mnt/nvme2/ms-swift/swift/llm/argument/export_args.py", line 104, in post_init
[rank0]: BaseArguments.post_init(self)
[rank0]: File "/mnt/nvme2/ms-swift/swift/llm/argument/base_args/base_args.py", line 153, in post_init
[rank0]: ModelArguments.post_init(self)
[rank0]: File "/mnt/nvme2/ms-swift/swift/llm/argument/base_args/model_args.py", line 151, in post_init
[rank0]: raise ValueError(f'Please set --model <model_id_or_path>, model: {self.model}') [rank0]: ValueError: Please set --model <model_id_or_path>, model: None
[rank0]:[W509 08:05:31.525682905 ProcessGroupNCCL.cpp:1427] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())

命令提示设置--model

Jintao-Huang · 2025-05-09T08:09:01Z

--mcore_model /mnt/nvme2/train/megatron_output/Qwen3-30B-A3B/v2-20250509-050959

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

使用megatron swift sft微调Qwen3-30B-A3B之后，checkpoint无法转回huggingface格式 #4147

使用megatron swift sft微调Qwen3-30B-A3B之后，checkpoint无法转回huggingface格式 #4147

alanayu commented May 9, 2025

firefighter-eric commented May 9, 2025

alanayu commented May 9, 2025

Jintao-Huang commented May 9, 2025

使用megatron swift sft微调Qwen3-30B-A3B之后，checkpoint无法转回huggingface格式 #4147

使用megatron swift sft微调Qwen3-30B-A3B之后，checkpoint无法转回huggingface格式 #4147

Comments

alanayu commented May 9, 2025

firefighter-eric commented May 9, 2025

alanayu commented May 9, 2025

Jintao-Huang commented May 9, 2025