Error when using Streaming + Packing + resume_from_checkpoint #4083

Open
hertz-pj opened this issue May 5, 2025 · 3 comments
Labels
bug Something isn't working

Comments

@hertz-pj

hertz-pj commented May 5, 2025

Describe the bug
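Training fails with an error when Streaming, Packing, and resume_from_checkpoint are used together. The failure appears to happen while skipping the batches that were already trained.
Error log: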

[rank0]:   File "/usr/local/lib/python3.12/dist-packages/swift/cli/sft.py", line 7, in <module>
[rank0]:     sft_main()
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/swift/llm/train/sft.py", line 281, in sft_main
[rank0]:     return SwiftSft(args).main()
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/swift/llm/base.py", line 47, in main
[rank0]:     result = self.run()
[rank0]:              ^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/swift/llm/train/sft.py", line 147, in run
[rank0]:     return self.train(trainer)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/swift/llm/train/sft.py", line 207, in train
[rank0]:     trainer.train(trainer.args.resume_from_checkpoint)
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/swift/trainers/mixin.py", line 321, in train
[rank0]:     res = super().train(*args, **kwargs)
[rank0]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/transformers/trainer.py", line 2241, in train
[rank0]:     return inner_training_loop(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/transformers/trainer.py", line 2482, in _inner_training_loop
[rank0]:     epoch_dataloader = skip_first_batches(epoch_dataloader, steps_trained_in_current_epoch)
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/accelerate/data_loader.py", line 1338, in skip_first_batches
[rank0]:     dataset = dataloader.dataset
[rank0]:               ^^^^^^^^^^^^^^^^^^
[rank0]: AttributeError: 'DataLoaderDispatcher' object has no attribute 'dataset'
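
The traceback ends inside `skip_first_batches`, which reads `dataloader.dataset` on an object that does not expose that attribute. As a rough illustration only (this is not the code of `accelerate` or `transformers`), the sketch below reproduces the same failure mode on a hypothetical stand-in class, `DispatcherLike`, and shows a hypothetical helper, `skip_batches`, that skips by consuming the iterator instead of touching `.dataset`. Both names are assumptions made for this example.

```python
from itertools import islice


class DispatcherLike:
    """Hypothetical stand-in: iterable over batches, but with no `.dataset` attribute."""

    def __init__(self, batches):
        self._batches = list(batches)

    def __iter__(self):
        return iter(self._batches)


def skip_batches(loader, num_batches):
    """Return an iterator over `loader` with the first `num_batches` discarded.

    Only relies on the loader being iterable, so it also works for
    wrappers that do not expose `.dataset`.
    """
    return islice(iter(loader), num_batches, None)


loader = DispatcherLike([[0, 1], [2, 3], [4, 5], [6, 7]])

# Attribute access fails in the same way as in the traceback.
try:
    loader.dataset
except AttributeError as exc:
    error_message = str(exc)

# Iterator-based skipping still works on the same object.
remaining = list(skip_batches(loader, 2))
```

Skipping this way still reads and discards every skipped batch, so with a streaming dataset it can be slow when many batches have to be skipped.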

Launch script:

swift sft \
    --custom_register_path train/custom_model.py \
    --model $model_path \
    --model_type $model_type \
    --dataset  $train_data_path  \
    --val_dataset  $val_data_path  \
    --dataset_num_proc 1 \
    --train_type full \
    --torch_dtype bfloat16 \
    --num_train_epochs $epoch \
    --per_device_train_batch_size $batch_size \
    --per_device_eval_batch_size $batch_size \
    --learning_rate 1e-4 \
    --gradient_accumulation_steps 8 \
    --eval_steps 10000000 \
    --save_steps 20000 \
    --logging_steps 100 \
    --max_steps 10000000 \
    --max_length $max_length \
    --output_dir $output_dir \
    --warmup_ratio 0 \
    --packing true \
    --attn_impl flash_attn \
    --streaming true \
    --resume_from_checkpoint $checkpoint-80000 \
    --dataloader_num_workers 1 2>&1 | tee $output_dir/train.log

Your hardware and system info
torch==2.5.1
ms-swift==3.4.0

@Jintao-Huang added the bug label on May 5, 2025
@hertz-pj
Author

hertz-pj commented May 6, 2025

Is there a planned timeline for fixing this bug, or is there a workaround in the meantime?

@Jintao-Huang
Collaborator

You can try --resume_only_model true for now.

@hertz-pj
Author

hertz-pj commented May 6, 2025

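You can try --resume_only_model true for now.

Thanks for the reply. That probably won't work here: continuing the training run requires the optimizer state and the dataset position.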
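
To make the "dataset position" concrete, here is a rough sketch of how many per-device batches would have to be skipped on resume. It assumes the skip count is the number of optimizer steps completed in the current epoch multiplied by `gradient_accumulation_steps`, that `checkpoint-80000` corresponds to global step 80000, and that for a streaming dataset without a known length one "epoch" spans `max_steps` updates. The function name `batches_to_skip` is hypothetical.

```python
def batches_to_skip(global_step, updates_per_epoch, grad_accum_steps):
    """Number of per-device batches already consumed in the current epoch.

    Assumption: optimizer steps done in this epoch, multiplied by the
    number of gradient accumulation steps per optimizer step.
    """
    steps_in_epoch = global_step % updates_per_epoch
    return steps_in_epoch * grad_accum_steps


# Values from the launch script; global_step is inferred from the checkpoint name.
to_skip = batches_to_skip(
    global_step=80_000,
    updates_per_epoch=10_000_000,
    grad_accum_steps=8,
)
print(to_skip)  # 640000
```

Under these assumptions, resuming means fast-forwarding the stream by 640000 batches per device, which is the step that fails in the traceback.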
