Bug! After resuming training, the info line doesn't show details anymore ... #3993

Open
tjoymeed opened this issue Apr 25, 2025 · 3 comments

Comments

@tjoymeed

As you can see, after resuming training from a checkpoint, the [INFO:swift] line no longer shows any details. It prints only a row of dashes: "--------------------"

All I did was add this line to the training script:

--resume_from_checkpoint /myprojects/ms-swift/output/Qwen2.5-7B-32GPUs/v3-20250423-132415/checkpoint-400

Nothing else.

What's wrong?

Could anybody please help me?

Thanks a lot!


[INFO:swift] --------------------------------------
INFO 04-24 22:03:27 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-24 22:03:27 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-24 22:03:28 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-24 22:03:28 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-24 22:03:29 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-24 22:03:29 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-24 22:03:29 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-24 22:03:29 [prefix_caching_block.py:479] Successfully reset prefix cache

@slin000111
Collaborator

Try adding --load_args true

@tjoymeed
Author

The thing is, when I use --resume_from_checkpoint, all the other parameters in the training sh script stay exactly the same as in the previous training run. The only change is the added "--resume_from_checkpoint".

So why do I need to set "load_args" explicitly?

Also, the manual says that for training it should be "False".

Thanks a lot!

@slin000111
Collaborator

slin000111 commented Apr 28, 2025

If you only add the resume_from_checkpoint parameter and do not remove any other parameters, there is no need to add load_args.
In addition, I could not reproduce the problem on the latest main branch, that is, the [INFO:swift] line showing only dashes ("--------------------").

The training script I used is below. The only change from the original training script is the added resume_from_checkpoint parameter.

CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset 'ServiceNow-AI/R1-Distill-SFT:v1#2000' \
    --resume_from_checkpoint '/mnt/workspace/output/v4-20250428-102132/checkpoint-100' \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 16 \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 1 \
    --gradient_checkpointing_kwargs '{"use_reentrant": false}'
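
Both scripts in this thread hard-code the checkpoint path. As a minimal sketch, the path could instead be picked from the run directory, assuming the run directory holds subdirectories named checkpoint-<step> as in the paths above. `latest_checkpoint` is a hypothetical helper written for illustration; it is not part of ms-swift.

```shell
# Hypothetical helper (not part of ms-swift): print the path of the
# highest-numbered checkpoint-<step> directory inside a run directory.
# Assumes the checkpoint-<step> naming seen in this thread.
latest_checkpoint() {
    run_dir="$1"
    best=""
    best_step=-1
    for dir in "$run_dir"/checkpoint-*; do
        # Skip non-directories (also covers the unmatched-glob case).
        [ -d "$dir" ] || continue
        # Keep only the text after the last hyphen, e.g. "400".
        step="${dir##*-}"
        # Ignore suffixes that are not purely numeric.
        case "$step" in
            ''|*[!0-9]*) continue ;;
        esac
        if [ "$step" -gt "$best_step" ]; then
            best_step="$step"
            best="$dir"
        fi
    done
    # Fail with a non-zero status if no checkpoint was found.
    [ -n "$best" ] || return 1
    printf '%s\n' "$best"
}

# Example use (commented out so the sketch has no side effects):
# ckpt="$(latest_checkpoint /mnt/workspace/output/v4-20250428-102132)" || exit 1
# swift sft ... --resume_from_checkpoint "$ckpt"
```

Failing early when no checkpoint directory is found avoids launching a run with an empty --resume_from_checkpoint value. The rest of the swift sft arguments would stay unchanged, as described above.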
