
After multi-node multi-GPU zero3 LoRA fine-tuning, loading the checkpoint for merge fails with safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer #3854


Open
tytcc opened this issue Apr 12, 2025 · 2 comments

Comments


tytcc commented Apr 12, 2025

The parameters I used are:

NCCL_IB_DISABLE=1 \
NCCL_IB_MERGE_NICS=0 \
NCCL_DEBUG=INFO \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NNODES=$NNODES \
NODE_RANK=$NODE_RANK \
MASTER_ADDR=$MASTER_ADDR \
NPROC_PER_NODE=$NPROC_PER_NODE \
swift sft --model $MODEL_PATH \
    --train_type lora \
    --torch_dtype bfloat16 \
    --output_dir $OUTPUT_DIR \
    --num_train_epochs 5 \
    --max_new_tokens 256 \
    --eval_strategy steps \
    --eval_steps 20 \
    --save_steps 50 \
    --gradient_checkpointing true \
    --gradient_accumulation_steps 2 \
    --per_device_train_batch_size 1 \
    --weight_decay 0.01 \
    --learning_rate 1e-5 \
    --logging_steps 10 \
    --dataset $DATASET_PATH \
    --deepspeed zero3 \
    --save_on_each_node true \
    --save_only_model true \
    --lora_rank 8 \
    --lora_alpha 16

Training runs without errors, but loading the checkpoint to merge fails with safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer

@Jintao-Huang
Collaborator

Are you using a single disk?

@Jintao-Huang
Collaborator

Try removing --save_on_each_node true.
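
For anyone hitting the same error, a quick way to confirm that a checkpoint file is damaged before retraining is to compare the size its header declares with its actual size on disk. The sketch below relies on the documented safetensors layout (an 8-byte little-endian header length, a JSON header, then the tensor data); MetadataIncompleteBuffer generally indicates that these sizes disagree, for example because the file was truncated or written by more than one process. The function name check_safetensors is hypothetical and not part of swift or safetensors, and whether concurrent writes are the cause in this issue is an assumption the thread does not confirm.

```python
import json
import struct
from pathlib import Path


def check_safetensors(path):
    """Return (ok, message) after comparing the declared size with the file size."""
    size = Path(path).stat().st_size
    if size < 8:
        return False, f"file is only {size} bytes; too small to hold a header length"
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        if 8 + header_len > size:
            return False, f"header claims {header_len} bytes but file has only {size}"
        try:
            header = json.loads(f.read(header_len))
        except ValueError as exc:  # covers invalid UTF-8 and invalid JSON
            return False, f"header is not valid JSON: {exc}"
    # Each tensor entry stores [begin, end) byte offsets into the data section.
    data_end = max(
        (entry["data_offsets"][1]
         for name, entry in header.items()
         if name != "__metadata__"),
        default=0,
    )
    expected = 8 + header_len + data_end
    if expected != size:
        return False, f"header implies {expected} bytes, file has {size}"
    return True, "sizes are consistent"
```

Calling check_safetensors on the adapter file inside the checkpoint directory (PEFT adapters are usually saved as adapter_model.safetensors, though the exact filename here is an assumption) returns a pair whose first element is False when the sizes do not match. A mismatch suggests the file itself is incomplete rather than the merge step being at fault.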
