Description
During GRPO training, I noticed something odd in the wandb logs: train/completions/max_length sometimes exceeds the max_completion_length I set in the script (which is 512). Also, completions/clipped_ratio stays at 0, even when completions clearly run past the limit.
Running the exact same code with Hugging Face TRL, however, train/completions/max_length never goes beyond max_completion_length, and completions/clipped_ratio is non-zero, which makes sense because that metric should reflect the fraction of truncated completions.
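To make the expectation concrete, here is a minimal sketch (not the actual TRL or ms-swift implementation) of what I understand these two metrics to mean, assuming completions are hard-capped at max_completion_length tokens:

def completion_metrics(completion_lengths, max_completion_length=512):
    """Compute the two wandb metrics from per-sample completion lengths (in tokens)."""
    # completions/max_length: longest completion in the batch;
    # it should never exceed max_completion_length if truncation works.
    max_len = max(completion_lengths)
    # completions/clipped_ratio: fraction of completions that hit the cap
    # (i.e. were cut off before emitting EOS), so it should be > 0 whenever
    # any completion runs into the limit.
    clipped = sum(l >= max_completion_length for l in completion_lengths)
    return {
        "completions/max_length": max_len,
        "completions/clipped_ratio": clipped / len(completion_lengths),
    }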
MS-SWIFT GRPO training with max_completion_length = 512:
HF TRL GRPO training with max_completion_length = 512:
Also, the performance of the model trained with ms-swift is worse than the one trained with TRL, even with the same hyperparameters (I'm not sure about the reason).
ms-swift == 3.6.0.dev (installed from the git repo as of July 7)
trl == 0.18.1
Script I used:
CUDA_VISIBLE_DEVICES=3 \
NPROC_PER_NODE=1 \
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-7B \
    --external_plugins PLUG_IN \
    --reward_funcs REWARD \
    --train_type full \
    --loss_type bnpo \
    --torch_dtype bfloat16 \
    --dataset CUSTOM_DATASET \
    --max_length 512 \
    --max_completion_length 512 \
    --num_train_epochs 3 \
    --seed 42 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --gradient_accumulation_steps 4 \
    --learning_rate 1e-6 \
    --lr_scheduler_type="constant_with_warmup" \
    --temperature 0.9 \
    --warmup_ratio 0.05 \
    --max_grad_norm 0.2 \
    --save_strategy="steps" \
    --save_steps 250 \
    --save_total_limit 20 \
    --logging_steps 1 \
    --output_dir output/BLEUBERI_1000/1gpu \
    --dataloader_num_workers 4 \
    --num_generations 8 \
    --system 'You are a helpful assistant.' \
    --deepspeed zero3_offload \
    --log_completions true \
    --report_to wandb \
    --num_iterations 1 \
    --use_hf 1 \
    --split_dataset_ratio 0 \
    --weight_decay 0.0 \
    --adam_beta2 0.999 \
    --top_p 1.0
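For the TRL side of the comparison I used the standard GRPOConfig / GRPOTrainer API with the same hyperparameters. This is only a rough sketch, not the exact script I ran; the dataset and reward function below are placeholders standing in for my custom ones:

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder reward function (the real one comes from my plugin).
def my_reward(completions, **kwargs):
    return [float(len(c)) for c in completions]  # dummy reward for illustration

# Placeholder dataset standing in for CUSTOM_DATASET.
dataset = load_dataset("trl-lib/tldr", split="train")

config = GRPOConfig(
    output_dir="output/BLEUBERI_1000/trl",
    loss_type="bnpo",
    bf16=True,
    max_completion_length=512,
    num_generations=8,
    num_iterations=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=1e-6,
    lr_scheduler_type="constant_with_warmup",
    warmup_ratio=0.05,
    max_grad_norm=0.2,
    temperature=0.9,
    top_p=1.0,
    weight_decay=0.0,
    adam_beta2=0.999,
    num_train_epochs=3,
    seed=42,
    logging_steps=1,
    log_completions=True,
    report_to="wandb",
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B",
    reward_funcs=my_reward,
    args=config,
    train_dataset=dataset,
)
trainer.train()

With this TRL setup, completions/max_length stays at or below 512 in wandb and completions/clipped_ratio is non-zero, as described above.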