[WARNING:swift] No training was carried out, which may be due to the dataset being too small or incorrect usage of resume_from_checkpoint. #3863

Closed
@Henchen99

My GRPO training ran to the end; however, on the final step it gave me this message:

Train: 100%|█████████████████████████████████████| 811/812 [53:11:32<03:56, 236.12s/it]
[WARNING:swift] No training was carried out, which may be due to the dataset being too small or incorrect usage of resume_from_checkpoint.
[INFO:swift] End time of running main: 2025-04-13 06:05:03.922258

As a result, I did not get any trained checkpoints after such a long training run. I have been able to get checkpoints before; this time I only changed some hyperparameters (the data remains the same 16k size), and I do not use resume_from_checkpoint, so I am confused about why this happened and how to fix it.

This is my bash script, which I run with the command bash train_GRPO.sh:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 \
NPROC_PER_NODE=5 \
swift rlhf \
    --rlhf_type grpo \
    --model /vault/ultraz/open-r1/data/llama3-3b-lora-checkpoint_1023 \
    --model_type llama3_2 \
    --train_type lora \
    --dataset /vault/ultraz/unsloth_grpo/skythought_no_tokens.csv \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --eval_steps 2000 \
    --save_steps 2000 \
    --learning_rate 1e-6 \
    --save_total_limit 2 \
    --logging_steps 1 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 8 \
    --reward_funcs accuracy format cosine repetition \
    --num_generations 4 \
    --use_vllm true \
    --vllm_gpu_memory_utilization 0.9 \
    --vllm_max_model_len 5000 \
    --max_completion_length 5000 \
    --max_length 7000 \
    --num_infer_workers 2 \
    --deepspeed zero3 \
    --temperature 1.0 \
    --system examples/train/grpo/prompt.txt \
    --deepspeed zero2 \
    --log_completions true \
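
Two details in the post can be checked with plain shell arithmetic and text processing, sketched below. The numbers are copied from the progress bar and the script above; the variable names and the abbreviated flag list are illustrative only and are not part of ms-swift. The sketch shows only that --save_steps is larger than the total step count and that --deepspeed is passed twice. Whether either fact is related to the warning is not established here.

```shell
#!/usr/bin/env bash
# Values copied from the post: total steps from the progress bar (811/812),
# save interval from the --save_steps flag in the script.
total_steps=812
save_steps=2000

# A periodic save every save_steps steps can only trigger if the run
# reaches at least that many steps.
if [ "$save_steps" -gt "$total_steps" ]; then
    echo "save_steps ($save_steps) > total steps ($total_steps): no periodic save can trigger"
fi

# Abbreviated, illustrative list of flags from the script, in script order.
flags="--eval_steps --save_steps --deepspeed --temperature --system --deepspeed"

# Print one flag per line, sort, and keep only the names that occur more than once.
duplicates=$(printf '%s\n' $flags | sort | uniq -d)
echo "flags passed more than once: $duplicates"
```

If a checkpoint is expected before the run ends, --save_steps would have to be below the 812 total steps. Which of the two --deepspeed values (zero3 or zero2) takes effect depends on how the arguments are parsed, so keeping only one of them removes the ambiguity.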

My library versions are:

vllm==0.8.3
ms_swift version: 3.3.0.dev0
