After SFT fine-tuning Qwen3-30B-A3B with Megatron SWIFT, the checkpoint cannot be converted back to Hugging Face format #4147
Maintainer comment: Try replacing `--model` with `--mcore_model`.
Reporter comment: When running the script, the command prompts that `--model` must be set.
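Following the maintainer's suggestion, a sketch of the export invocation with `--mcore_model` in place of `--model` (paths are the ones from this report; whether any other flags need to change depends on the installed ms-swift version):

```shell
CUDA_VISIBLE_DEVICES=0 \
swift export \
    --mcore_model /mnt/nvme2/train/megatron_output/Qwen3-30B-A3B/v2-20250509-050959/iter_0000010 \
    --model_type qwen3_moe \
    --torch_dtype bfloat16 \
    --to_hf true \
    --test_convert_precision false \
    --output_dir /mnt/nvme2/train/megatron_output/Qwen3-30B-A3B/v0-20250506-143334/iter_0016029_hf
```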
Describe the bug
What the bug is, and how to reproduce, better with screenshots.
After training with the Qwen3-30B-A3B fine-tuning method described in https://huggingface.co/Qwen/Qwen3-30B-A3B/discussions/3, the checkpoint cannot be converted back to Hugging Face format with `swift export`.
Conversion script:
CUDA_VISIBLE_DEVICES=0 \
swift export \
    --model /mnt/nvme2/train/megatron_output/Qwen3-30B-A3B/v2-20250509-050959/iter_0000010 \
    --model_type qwen3_moe \
    --torch_dtype bfloat16 \
    --to_hf true \
    --test_convert_precision false \
    --output_dir /mnt/nvme2/train/megatron_output/Qwen3-30B-A3B/v0-20250506-143334/iter_0016029_hf
The conversion script above succeeds on the Qwen3-30B-A3B-hf-mcore model, but fails on the checkpoint produced by training.
Error message:
run sh:
/usr/bin/python /mnt/nvme2/ms-swift/swift/cli/export.py --model /mnt/nvme2/train/megatron_output/Qwen3-30B-A3B/v2-20250509-050959/iter_0000010 --model_type qwen3_moe --torch_dtype bfloat16 --to_hf true --test_convert_precision false --output_dir /mnt/nvme2/train/megatron_output/Qwen3-30B-A3B/v0-20250506-143334/iter_0016029_hf
[INFO:swift] Successfully registered
/mnt/nvme2/ms-swift/swift/llm/dataset/data/dataset_info.json
[INFO:swift] rank: 0, local_rank: 0, world_size: 1, local_world_size: 1
[INFO:swift] Loading the model using model_dir: /mnt/nvme2/train/megatron_output/Qwen3-30B-A3B/v2-20250509-050959/iter_0000010
[INFO:swift] args.output_dir:
/mnt/nvme2/train/megatron_output/Qwen3-30B-A3B/v0-20250506-143334/iter_0016029_hf
[INFO:swift] Global seed set to 42
[INFO:swift] args: ExportArguments(model='/mnt/nvme2/train/megatron_output/Qwen3-30B-A3B/v2-20250509-050959/iter_0000010', model_type='qwen3_moe', model_revision=None, task_type='causal_lm', torch_dtype=torch.bfloat16, attn_impl=None, num_labels=None, problem_type=None, rope_scaling=None, device_map=None, max_memory={}, local_repo_path=None, template='qwen3', system=None, max_length=2048, truncation_strategy='delete', max_pixels=None, agent_template=None, norm_bbox=None, response_prefix=None, padding_side='right', loss_scale='default', sequence_parallel_size=1, use_chat_template=True, template_backend='swift', dataset=[], val_dataset=[], split_dataset_ratio=0.01, data_seed=42, dataset_num_proc=1, dataset_shuffle=True, val_dataset_shuffle=False, streaming=False, interleave_prob=None, stopping_strategy='first_exhausted', shuffle_buffer_size=1000, enable_cache=False, download_mode='reuse_dataset_if_exists', columns={}, strict=False, remove_unused_columns=True, model_name=[None, None], model_author=[None, None], custom_dataset_info=[], quant_method=None, quant_bits=None, hqq_axis=None, bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_quant_type='nf4', bnb_4bit_use_double_quant=True, bnb_4bit_quant_storage=None, max_new_tokens=None, temperature=None, top_k=None, top_p=None, repetition_penalty=None, num_beams=1, stream=False, stop_words=[], logprobs=False, top_logprobs=None, ckpt_dir=None, load_dataset_config=None, lora_modules=[], tuner_backend='peft', train_type='lora', adapters=[], external_plugins=[], seed=42, model_kwargs={}, load_args=True, load_data_args=False, use_hf=False, hub_token=None, custom_register_path=[], ignore_args_error=False, use_swift_lora=False, merge_lora=False, safe_serialization=True, max_shard_size='5GB', output_dir='/mnt/nvme2/train/megatron_output/Qwen3-30B-A3B/v0-20250506-143334/iter_0016029_hf', quant_n_samples=256, quant_batch_size=1, group_size=128, to_ollama=False, to_mcore=False, to_hf=True, mcore_model=None, thread_count=None, 
test_convert_precision=False, push_to_hub=False, hub_model_id=None, hub_private_repo=False, commit_message='update files', to_peft_format=False, exist_ok=False)
[INFO:swift] Start time of running main: 2025-05-09 06:56:15.652839
[INFO:swift] local_repo_path: /root/.cache/modelscope/hub/_github/Megatron-LM
/usr/local/lib/python3.12/dist-packages/modelopt/torch/nas/plugins/init.py:16: DeprecationWarning: The 'megatron.core.transformer.custom_layers.transformer_engine'
module is deprecated and will be removed in 0.10.0. Please use
'megatron.core.extensions.transformer_engine' instead.
from .megatron import *
[NeMo W 2025-05-09 06:56:21 nemo_logging:405] /usr/local/lib/python3.12/dist-packages/pyannote/core/notebook.py:134: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed in 3.11. Use matplotlib.colormaps[name] or matplotlib.colormaps.get_cmap() or pyplot.get_cmap() instead. cm = get_cmap("Set1")
[INFO:swift] Loading the model using model_dir: /mnt/liufei/train/megatron_output/Qwen3-30B-A3B/v2-20250509-050959/iter_0000010
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/nvme2/ms-swift/swift/cli/export.py", line 5, in
[rank0]: export_main()
[rank0]: File "/mnt/nvme2/ms-swift/swift/llm/export/export.py", line 50, in export_main
[rank0]: return SwiftExport(args).main()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/nvme2/ms-swift/swift/llm/base.py", line 47, in main
[rank0]: result = self.run()
[rank0]: ^^^^^^^^^^
[rank0]: File "/mnt/nvme2/ms-swift/swift/llm/export/export.py", line 37, in run
[rank0]: convert_mcore2hf(args)
[rank0]: File "/mnt/nvme2/ms-swift/swift/megatron/utils/convert.py", line 92, in convert_mcore2hf
[rank0]: hf_model, processor = get_model_tokenizer(**kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/nvme2/ms-swift/swift/llm/model/register.py", line 571, in get_model_tokenizer
[rank0]: model, processor = get_function(model_dir, model_info, model_kwargs, load_model, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/nvme2/ms-swift/swift/llm/model/register.py", line 272, in get_model_tokenizer_with_flash_attn
[rank0]: return get_model_tokenizer_from_local(model_dir, model_info, model_kwargs, load_model, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/nvme2/ms-swift/swift/llm/model/register.py", line 209, in get_model_tokenizer_from_local
[rank0]: tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/transformers/models/auto/tokenization_auto.py", line 1028, in from_pretrained
[rank0]: return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_base.py", line 2046, in from_pretrained
[rank0]: raise EnvironmentError(
[rank0]: OSError: Can't load tokenizer for '/mnt/nvme2/train/megatron_output/Qwen3-30B-A3B/v2-20250509-050959/iter_0000010'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/mnt/nvme2/train/megatron_output/Qwen3-30B-A3B/v2-20250509-050959/iter_0000010' is the correct path to a directory containing all relevant files for a RobertaTokenizerFast tokenizer.
[rank0]:[W509 06:56:23.802697360 ProcessGroupNCCL.cpp:1427] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
Judging from the error message, it seems the tokenizer files cannot be found in the checkpoint directory.
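This diagnosis can be checked directly on disk: `AutoTokenizer.from_pretrained` raises exactly this `OSError` when none of the usual tokenizer files exist in the directory. A minimal sketch (the file list is an assumption based on what a typical Qwen Hugging Face checkpoint contains, not taken from ms-swift internals):

```python
import os

# Files the Hugging Face tokenizer loader typically looks for; this list is an
# assumption based on common Qwen checkpoints, not on ms-swift internals.
TOKENIZER_FILES = ("tokenizer.json", "tokenizer_config.json", "vocab.json", "merges.txt")


def missing_tokenizer_files(checkpoint_dir):
    """Return the tokenizer files that are absent from checkpoint_dir."""
    return [name for name in TOKENIZER_FILES
            if not os.path.isfile(os.path.join(checkpoint_dir, name))]
```

If this returns the full list for the `iter_0000010` directory, copying the tokenizer files from the original Qwen3-30B-A3B model directory into it is a plausible workaround, though using `--mcore_model` as suggested above is the cleaner fix.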
Your hardware and system info
Write your system info like CUDA version/system/GPU/torch version here.
torch: 2.6.0a0+ecf3bae40a.nv25.01
transformers: 4.51.3
swift: 3.5.0.dev0
CUDA: 12.8
Additional context
Add any other context about the problem here.