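When resuming training on a large dataset, training does not proceed normally: the progress bar stays stuck at 0%.
On a small dataset, resuming works as expected.
Am I passing the wrong arguments?
Training script for the large dataset: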
CUDA_VISIBLE_DEVICES=0,1,2,3 \
HF_ENDPOINT=https://hf-mirror.com \
NPROC_PER_NODE=4 \
nohup swift sft \
    --model /HDD/train/output/Qwen2.5-0.5B-Instruct/v1-20250401-234003/checkpoint-68086 \
    --resume_from_checkpoint /HDD/train/output/checkpoint-68086/v9-20250420-214017/checkpoint-2000 \
    --resume_only_model false \
    --custom_register_path /HDD/train/convert/convert.py \
    --train_type full \
    --torch_dtype bfloat16 \
    --system "" \
    --dataset "/HDD/train/data/dataset/lemon/all.27946228.jsonl" \
    --val_dataset "/HDD/train/data/dataset/ecspell/test.jsonl" \
    --streaming true \
    --truncation_strategy delete \
    --num_train_epochs 1 \
    --learning_rate 5.89e-4 \
    --lr_scheduler_type 'cosine_with_min_lr' \
    --lr_scheduler_kwargs '{"min_lr":0.00001}' \
    --eval_steps 500 \
    --max_steps 10713 \
    --save_steps 500 \
    --max_length 220 \
    --warmup_ratio 0 \
    --gradient_accumulation_steps 66 \
    --per_device_train_batch_size 12 \
    --per_device_eval_batch_size 12 \
    --save_total_limit 5 > ./log/lemon.log &
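One possible explanation, not confirmed anywhere in this thread: the Hugging Face `Trainer` by default (`ignore_data_skip=False`) fast-forwards through the batches that were already consumed before the checkpoint, and with a streaming dataset that means reading them again one by one. Assuming `swift sft` follows that behavior here, and that `checkpoint-2000` corresponds to 2000 optimizer steps, the sketch below estimates how many samples would have to be skipped before the progress bar moves. The function name is made up for illustration; the numbers come from the command above.

```python
def samples_to_skip(resume_step: int, grad_accum: int,
                    per_device_batch: int, world_size: int) -> int:
    """Samples consumed across all ranks before `resume_step` optimizer steps."""
    return resume_step * grad_accum * per_device_batch * world_size


# Values taken from the training command; resume_step is an assumption
# inferred from the "checkpoint-2000" directory name.
resume_step = 2000
grad_accum = 66          # --gradient_accumulation_steps
per_device_batch = 12    # --per_device_train_batch_size
world_size = 4           # NPROC_PER_NODE

skipped = samples_to_skip(resume_step, grad_accum, per_device_batch, world_size)
print(f"samples to skip before training resumes: {skipped:,}")
```

If this is indeed the cause, the apparent hang at 0% would be the skip phase rather than a deadlock, which would also fit the observation that a small dataset resumes quickly. Whether `swift sft` accepts an `--ignore_data_skip` flag is an assumption to check against its documentation; setting it would trade exact data-order reproducibility for a faster restart.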
I'm running into the same problem and urgently need a solution!