Training forcibly terminated: args.max_epochs <= math.ceil(state.epoch), TypeError: '<=' not supported between instances of 'NoneType' and 'int' #4161

Closed
YUANMU227 opened this issue May 10, 2025 · 2 comments

@YUANMU227

[rank0]: File "/home/ms-swift/swift/trainers/callback.py", line 95, in on_epoch_end
[rank0]: if args.max_epochs <= math.ceil(state.epoch):
[rank0]: TypeError: '<=' not supported between instances of 'NoneType' and 'int'
Train: 50%|███████████████████████████████ | 269/538 [18:24<18:24, 4.11s/it]
[rank0]:[W511 00:19:28.037565019 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

Training was forcibly terminated partway through.
--num_train_epochs was set to 2, and two checkpoints had already been saved: checkpoint-100 and checkpoint-200.

I noticed there was a recent update. Is this error related to it?
#4125

Image

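Support is being added for a max_epochs parameter, which forces training to stop at the specified epoch and save the weights.
Does Megatron only support passing train_iters, and not epochs? Also, when packing is used, how can I see how many samples there actually are after packing?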
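
The traceback above indicates that `args.max_epochs` was `None` when `on_epoch_end` compared it with `math.ceil(state.epoch)`, which is what happens when only `--num_train_epochs` is set. Below is a minimal sketch of a guard for that comparison. It assumes that an unset `max_epochs` should simply disable the forced stop; `should_stop_at_epoch` is a hypothetical helper written for illustration and is not the fix that was actually applied in ms-swift.

```python
import math
from typing import Optional


def should_stop_at_epoch(max_epochs: Optional[int], epoch: Optional[float]) -> bool:
    """Decide whether training should be force-stopped at the end of an epoch.

    Hypothetical helper (an assumption, not the ms-swift implementation).
    """
    # An unset max_epochs (or an unknown epoch) means the forced stop is
    # disabled, so never compare None against an int.
    if max_epochs is None or epoch is None:
        return False
    # Same comparison as in the traceback, now only reached with real numbers.
    return max_epochs <= math.ceil(epoch)


# The unguarded comparison raises the TypeError shown in the traceback.
try:
    None <= math.ceil(1.0)
except TypeError as exc:
    print(f"unguarded comparison fails: {exc}")

print(should_stop_at_epoch(None, 1.0))  # max_epochs not set: keep training
print(should_stop_at_epoch(2, 1.0))     # epoch 1 of 2: keep training
print(should_stop_at_epoch(2, 2.0))     # epoch 2 of 2: stop
```

With a guard like this, a run that sets only `--num_train_epochs` would pass through `on_epoch_end` without raising, and the forced stop would apply only when `max_epochs` is given explicitly.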

@Jintao-Huang
Collaborator

This has been fixed.

@YUANMU227
Author

This has been fixed.

Impressive, that was really fast!
