The parameters I used are:
NCCL_IB_DISABLE=1 \
NCCL_IB_MERGE_NICS=0 \
NCCL_DEBUG=INFO \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NNODES=$NNODES \
NODE_RANK=$NODE_RANK \
MASTER_ADDR=$MASTER_ADDR \
NPROC_PER_NODE=$NPROC_PER_NODE \
swift sft \
    --model $MODEL_PATH \
    --train_type lora \
    --torch_dtype bfloat16 \
    --output_dir $OUTPUT_DIR \
    --num_train_epochs 5 \
    --max_new_tokens 256 \
    --eval_strategy steps \
    --eval_steps 20 \
    --save_steps 50 \
    --gradient_checkpointing true \
    --gradient_accumulation_steps 2 \
    --per_device_train_batch_size 1 \
    --weight_decay 0.01 \
    --learning_rate 1e-5 \
    --logging_steps 10 \
    --dataset $DATASET_PATH \
    --deepspeed zero3 \
    --save_on_each_node true \
    --save_only_model true \
    --lora_rank 8 \
    --lora_alpha 16
Training finishes without errors, but loading the checkpoint to merge it fails with safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer
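For anyone debugging the same error: MetadataIncompleteBuffer typically means the file is shorter than its own header says it should be, i.e. the checkpoint was truncated or only partially written. Below is a minimal sketch for checking this without loading any tensors. It relies only on the documented safetensors layout (an 8-byte little-endian header length, a JSON header, then the tensor data). The function name `check_safetensors` is made up for illustration, and the file to pass in (for example the adapter file inside a checkpoint directory) depends on your setup.

```python
import json
import struct
import sys
from pathlib import Path


def check_safetensors(path):
    """Return (ok, message) telling whether the file size matches its header.

    Hypothetical helper, not part of swift or safetensors.
    """
    size = Path(path).stat().st_size
    if size < 8:
        return False, f"file is only {size} bytes; too short for the 8-byte header length"
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        if 8 + header_len > size:
            return False, f"header claims {header_len} bytes but the file has only {size}"
        try:
            header = json.loads(f.read(header_len))
        except ValueError as exc:  # covers invalid UTF-8 and invalid JSON
            return False, f"header is not valid JSON: {exc}"
    # Each tensor entry stores [begin, end) offsets relative to the data section.
    data_end = max(
        (entry["data_offsets"][1] for name, entry in header.items() if name != "__metadata__"),
        default=0,
    )
    expected = 8 + header_len + data_end
    if expected != size:
        return False, f"expected {expected} bytes, found {size}"
    return True, "ok"


if __name__ == "__main__":
    # Usage: python check_safetensors.py <file1.safetensors> [<file2.safetensors> ...]
    for arg in sys.argv[1:]:
        ok, message = check_safetensors(arg)
        print(f"{arg}: {'OK' if ok else 'BROKEN'} ({message})")
```

If a file is reported as shorter than expected, the write was probably interrupted or overwritten, which would be consistent with several nodes writing the same path at once.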
Is it a single disk?
Try removing --save_on_each_node true.