Bug! After resuming training, the info line doesn't show details anymore ... #3993

tjoymeed · 2025-04-25T05:07:21Z

As you can see, after resuming training, the [INFO:swift] line no longer shows any details; only a row of dashes "--------------------" is printed.

All I did was add this line to the training script:

--resume_from_checkpoint /myprojects/ms-swift/output/Qwen2.5-7B-32GPUs/v3-20250423-132415/checkpoint-400

Nothing else.

What's wrong?

Could anybody please help me?

Thanks a lot!

[INFO:swift] --------------------------------------
INFO 04-24 22:03:27 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-24 22:03:27 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-24 22:03:28 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-24 22:03:28 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-24 22:03:29 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-24 22:03:29 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-24 22:03:29 [prefix_caching_block.py:479] Successfully reset prefix cache
INFO 04-24 22:03:29 [prefix_caching_block.py:479] Successfully reset prefix cache

slin000111 · 2025-04-27T09:57:14Z

--load_args true

tjoymeed · 2025-04-27T16:51:27Z

The problem is that when I use --resume_from_checkpoint,

all the parameters in the training sh script are kept the same as in the previous training run; the only change is the added "--resume_from_checkpoint".

So why do I need to "load_args" explicitly?

Also, the manual says that during training it should be "False".

Thanks a lot!

slin000111 · 2025-04-28T03:11:19Z

If you only add the resume_from_checkpoint parameter and do not delete any other parameters, there is no need to add load_args.
In addition, the problem above (the [INFO:swift] line showing only dashes "--------------------") does not reproduce on the latest main branch.

The training script is as follows; only the resume_from_checkpoint parameter was added to the original training script.

CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset 'ServiceNow-AI/R1-Distill-SFT:v1#2000' \
    --resume_from_checkpoint '/mnt/workspace/output/v4-20250428-102132/checkpoint-100' \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 16 \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 1 \
    --gradient_checkpointing_kwargs '{"use_reentrant": false}'
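
To make the distinction between the two ways of resuming concrete, here is a rough sketch that only prints the command it would run; it does not invoke swift. The helper `build_resume_cmd` and the placeholder `CKPT` path are hypothetical names for illustration, and the "minimal" variant reflects my reading of this thread (other parameters removed, so `--load_args true` is added); whether that alone is sufficient for a given setup is an assumption to verify.

```shell
#!/bin/sh
# Sketch only: prints a resume command instead of running it.
# CKPT is a placeholder, not a real checkpoint path.
CKPT='output/vX-YYYYMMDD-HHMMSS/checkpoint-100'

# build_resume_cmd is a hypothetical helper for illustration.
#   full    -> original script kept as-is, only the resume flag added
#   minimal -> other parameters removed, so load_args is requested (assumption)
build_resume_cmd() {
    mode="$1"
    cmd="swift sft --resume_from_checkpoint $CKPT"
    if [ "$mode" = "full" ]; then
        # Abbreviated: in practice every original parameter stays unchanged.
        cmd="$cmd --model Qwen/Qwen2.5-7B-Instruct --learning_rate 1e-4 --lora_rank 8"
    else
        cmd="$cmd --load_args true"
    fi
    printf '%s\n' "$cmd"
}

build_resume_cmd full
build_resume_cmd minimal
```

Printing the command first makes it easy to compare it against the original training script before launching a long run.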
