How are steps calculated? #3954

Open
toufunao opened this issue Apr 22, 2025 · 3 comments

Comments

@toufunao

I trained with the following script. The dataset has about 33,000 samples, with per_device_train_batch_size=16, gradient_accumulation_steps=32, epochs=3, and 4 GPUs.
nproc_per_node=4

NPROC_PER_NODE=$nproc_per_node \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift pt \
    --model Qwen/Qwen2.5-7B \
    --train_type full \
    --dataset $CUSTOM_DATASET \
    --torch_dtype bfloat16 \
    --num_train_epochs 3 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-5 \
    --gradient_accumulation_steps $(expr 128 / $nproc_per_node) \
    --packing true \
    --eval_steps 10 \
    --save_steps 50 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --deepspeed zero3 \
    --max_length 8192 \
    --warmup_ratio 0.05 \
    --save_only_model true \
    --output_dir XXXXX

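By a straightforward calculation the step count should be 33000*3/16/32/4=48 (rounded down), but the progress bar actually shows 193 steps. How does ms-swift compute the number of steps automatically?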
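
The manual estimate in the question can be reproduced with a short calculation. This is only a sketch of the formula used above (total samples seen, divided by samples consumed per optimizer step). The function name `estimate_optimizer_steps` and its parameters are hypothetical, not an ms-swift API, and the exact rounding inside the trainer may differ.

```python
import math


def estimate_optimizer_steps(num_samples, num_epochs, per_device_batch_size,
                             gradient_accumulation_steps, data_parallel_size):
    """Rough estimate of optimizer steps (hypothetical helper, not an ms-swift API)."""
    # Samples consumed by one optimizer step across all data-parallel ranks.
    samples_per_step = (per_device_batch_size
                        * gradient_accumulation_steps
                        * data_parallel_size)
    # Total samples seen over all epochs, divided by samples per step.
    return math.floor(num_samples * num_epochs / samples_per_step)


# Values from the question: ~33000 samples, 3 epochs, batch 16, accumulation 32, 4 GPUs.
print(estimate_optimizer_steps(33000, 3, 16, 32, 4))  # -> 48
```

Because the dataset size is only approximate, the result should be read as a ballpark figure rather than an exact prediction of the progress bar.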

@Jintao-Huang
Collaborator

You have packing enabled.

@Jintao-Huang
Collaborator

Or check whether NPROC_PER_NODE is set correctly.

toufunao reopened this Apr 23, 2025
@toufunao
Author

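Quoting: "You have packing enabled."
Thanks for the correction. I just rechecked the launch script: it did not use packing, it used sequence_parallel for training. I also verified that NPROC_PER_NODE is set correctly, and world_size in the log is 4. However, the step count still differs from the manually calculated value.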

nproc_per_node=4

NPROC_PER_NODE=$nproc_per_node \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift pt \
    --model Qwen/Qwen2.5-7B \
    --train_type full \
    --dataset $CUSTOM_DATASET \
    --torch_dtype bfloat16 \
    --num_train_epochs 3 \
    --sequence_parallel 4 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-5 \
    --gradient_accumulation_steps $(expr 128 / $nproc_per_node) \
    --eval_steps 10 \
    --save_steps 50 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --deepspeed zero3 \
    --max_length 8192 \
    --warmup_ratio 0.05 \
    --save_only_model true \
    --output_dir XXXXX
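
One possible explanation, not confirmed in this thread: if the 4 GPUs selected by --sequence_parallel 4 form a single sequence-parallel group, they all work on the same batch, so the number of data-parallel replicas would be world_size / sequence_parallel = 1 rather than 4. Under that assumption the rough estimate lands on the observed 193 steps. The helper name below is hypothetical, the relation between world size and sequence parallelism is an assumption, and the trainer's real rounding may differ.

```python
import math


def effective_data_parallel_size(world_size, sequence_parallel_size):
    """Assumed relation: ranks in one sequence-parallel group share the same batch."""
    if world_size % sequence_parallel_size != 0:
        raise ValueError("world_size must be divisible by sequence_parallel_size")
    return world_size // sequence_parallel_size


# Values from the thread: world_size=4, --sequence_parallel 4.
dp_size = effective_data_parallel_size(world_size=4, sequence_parallel_size=4)

# ~33000 samples, 3 epochs, per-device batch 16, gradient accumulation 128 / 4 = 32.
samples_per_step = 16 * 32 * dp_size
estimated_steps = math.floor(33000 * 3 / samples_per_step)

print(dp_size)          # -> 1
print(estimated_steps)  # -> 193
```

If this assumption holds, keeping the intended global batch size would mean deriving gradient_accumulation_steps from the data-parallel size rather than from the raw GPU count.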
