Description
The parameters used are:
NCCL_IB_DISABLE=1 \
NCCL_IB_MERGE_NICS=0 \
NCCL_DEBUG=INFO \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NNODES=$NNODES \
NODE_RANK=$NODE_RANK \
MASTER_ADDR=$MASTER_ADDR \
NPROC_PER_NODE=$NPROC_PER_NODE \
swift sft --model $MODEL_PATH \
--train_type lora \
--torch_dtype bfloat16 \
--output_dir $OUTPUT_DIR \
--num_train_epochs 5 \
--max_new_tokens 256 \
--eval_strategy steps \
--eval_steps 20 \
--save_steps 50 \
--gradient_checkpointing true \
--gradient_accumulation_steps 2 \
--per_device_train_batch_size 1 \
--weight_decay 0.01 \
--learning_rate 1e-5 \
--logging_steps 10 \
--dataset $DATASET_PATH \
--deepspeed zero3 \
--save_on_each_node true \
--save_only_model true \
--lora_rank 8 \
--lora_alpha 16
Training completes without any errors, but loading the checkpoint to merge it fails with: safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer
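
To narrow this down, it may help to check whether the saved file is structurally complete. As far as I understand the safetensors layout, a file starts with an 8-byte little-endian header length, followed by a JSON header, followed by the tensor data, and this error tends to appear when the file size does not match what the header declares (for example, a truncated or partially written file). The following is a minimal sketch using only the Python standard library; `check_safetensors` is a hypothetical helper written for this report, not part of swift or safetensors.

```python
import json
import os
import struct


def check_safetensors(path):
    """Return (ok, message) after comparing the declared and actual file size."""
    size = os.path.getsize(path)
    if size < 8:
        return False, f"file is only {size} bytes, too short to hold a header length"
    with open(path, "rb") as f:
        # First 8 bytes: length of the JSON header as a little-endian u64.
        (header_len,) = struct.unpack("<Q", f.read(8))
        if 8 + header_len > size:
            return False, "header is truncated"
        try:
            header = json.loads(f.read(header_len))
        except ValueError:
            return False, "header is not valid JSON"
    # Every entry except "__metadata__" describes one tensor and its byte range.
    data_end = max(
        (info["data_offsets"][1] for name, info in header.items() if name != "__metadata__"),
        default=0,
    )
    expected = 8 + header_len + data_end
    if size != expected:
        return False, f"header implies {expected} bytes, file has {size}"
    return True, "ok"
```

Running this against the adapter weights in the checkpoint directory (assumed here to be `adapter_model.safetensors`, the usual PEFT file name) on each node should show whether the file was written out completely.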