
fix grpo doc #3920

Merged
merged 2 commits on Apr 17, 2025
Changes from 1 commit
fix doc
hjh0119 committed Apr 17, 2025
commit 763a6e4217278dbe15f29ccf9c83f2d2da7d0fe8
2 changes: 1 addition & 1 deletion docs/source/BestPractices/GRPO多模态训练.md
@@ -37,7 +37,7 @@ register_dataset(

```json
{
- 'images': [{'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x01\xe0\x00\x00\x01@\x08\x06\x00\x00\x00d\xc8\xafB`\x82 ...', 'path': 'CLEVR_trainA_000000.png'}],
+ 'images': ['image_path1', 'image_path2'],
'messages': [{'role': 'user', 'content': 'How many items are there in the image? Output the thinking process in <think> </think> and\n final answer (number) in <answer> </answer> tags.'}],
'solution': '<answer> 3 </answer>'
}
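As a side note for readers preparing their own data, here is a minimal sketch (not part of this PR) of writing one sample in the corrected path-based format to a JSONL file; the file name, image path, and field values are placeholder assumptions.

```python
import json

# Hypothetical sample in the documented format: image paths plus a user
# message and the reference solution used by the accuracy reward.
sample = {
    "images": ["clevr/CLEVR_trainA_000000.png"],
    "messages": [
        {
            "role": "user",
            "content": (
                "How many items are there in the image? Output the thinking "
                "process in <think> </think> and final answer (number) in "
                "<answer> </answer> tags."
            ),
        }
    ],
    "solution": "<answer> 3 </answer>",
}

# Write one sample per line (JSONL), matching the layout shown above.
with open("clevr_counting.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```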
8 changes: 6 additions & 2 deletions docs/source/Instruction/GRPO.md
@@ -10,7 +10,11 @@ pip install math_verify # reward function
pip install -U trl
```

- **Note**: It is normal for the loss to be close to 0 during training; see this [issue](https://github.com/huggingface/open-r1/issues/239#issuecomment-2646297851).
+ **FAQ**
+ 1. It is normal for the loss to be close to 0 during training; see this [issue](https://github.com/huggingface/open-r1/issues/239#issuecomment-2646297851).
+ 2. How are the training steps calculated? See this [issue](https://github.com/modelscope/ms-swift/issues/3912) (a rough estimate is sketched below).
+ 3. Why is clip_ratio always 1? See this [issue](https://github.com/huggingface/open-r1/issues/239#issuecomment-2646297851).
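For FAQ item 2 the linked issue is authoritative; as a rough, unofficial back-of-envelope, if each prompt expands into num_generations completions and each optimizer step consumes per_device_batch_size * nproc_per_node * gradient_accumulation_steps completions, the step count can be estimated as below. All numbers are made-up examples.

```python
# Rough estimate only; assumes completions (not prompts) fill the batch and
# ignores dataset truncation and packing details. Values are illustrative.
num_prompts = 10_000          # samples in the training dataset
num_generations = 8           # G: completions sampled per prompt
per_device_batch_size = 4
nproc_per_node = 8
gradient_accumulation_steps = 2
num_train_epochs = 1

completions_per_epoch = num_prompts * num_generations
completions_per_step = (
    per_device_batch_size * nproc_per_node * gradient_accumulation_steps
)
steps_per_epoch = completions_per_epoch // completions_per_step
total_steps = steps_per_epoch * num_train_epochs
print(total_steps)  # 1250 with the numbers above
```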


## Cluster Support

@@ -112,7 +116,7 @@ A conversation between User and Assistant. The user asks a question, and the Ass

## Arguments and Running Script
Arguments
- - num_generations: the number of samples per prompt (the G value in the paper); it needs to be divisible by per_device_eval_batch_size * nproc_per_node
+ - num_generations: the number of samples per prompt (the G value in the paper); it needs to be divisible by per_device_batch_size * nproc_per_node (a sanity check is sketched below)
- max_completion_length: the maximum length of sampled generations; defaults to 512
- ds3_gather_for_generation: applies to DeepSpeed ZeRO-3. If enabled, the policy model weights are gathered for generation, which speeds up generation; disabling it allows training models that exceed a single GPU's VRAM, at the cost of slower generation. Disabling it is incompatible with vLLM generation. Defaults to True
- reward_funcs: reward functions that score the model's generations; four rule-based functions are built in (accuracy, format, cosine, and repetition); see swift/plugin/orm.py for details
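A small, hypothetical sanity check that follows the num_generations constraint exactly as worded above; the values are placeholders, and the trainer's own validation remains the authoritative check.

```python
# Follows the constraint as stated in the doc above: num_generations must be
# divisible by per_device_batch_size * nproc_per_node. Placeholder values.
num_generations = 8
per_device_batch_size = 4
nproc_per_node = 2

group_slots = per_device_batch_size * nproc_per_node
if num_generations % group_slots != 0:
    raise ValueError(
        f"num_generations={num_generations} is not divisible by "
        f"per_device_batch_size * nproc_per_node = {group_slots}"
    )
print("configuration looks consistent")
```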
2 changes: 1 addition & 1 deletion docs/source_en/BestPractices/GRPO-Multi-Modal-Training.md
@@ -40,7 +40,7 @@ The purpose of redefining the dataset preprocessor here is to modify the query.

```json
{
- 'images': [{'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x01\xe0\x00\x00\x01@\x08\x06\x00\x00\x00d\xc8\xafB`\x82 ...', 'path': 'CLEVR_trainA_000000.png'}],
+ 'images': ['image_path1', 'image_path2'],
'messages': [{'role': 'user', 'content': 'How many items are there in the image? Output the thinking process in <think> </think> and\n final answer (number) in <answer> </answer> tags.'}],
'solution': '<answer> 3 </answer>'
}
9 changes: 7 additions & 2 deletions docs/source_en/Instruction/GRPO.md
@@ -11,7 +11,12 @@ pip install math_verify # reward function
pip install -U trl
```

- **Note**: It is normal for the loss to approach zero during training. Refer to this [issue](https://github.com/huggingface/open-r1/issues/239#issuecomment-2646297851) for more details.
+ **FAQ**
+ 1. It is normal for the loss to approach zero during training. Refer to this [issue](https://github.com/huggingface/open-r1/issues/239#issuecomment-2646297851) for more details (a toy illustration follows below).
+ 2. How are the training steps calculated? Refer to this [issue](https://github.com/modelscope/ms-swift/issues/3912) for more details.
+ 3. Why is clip_ratio always 1? Refer to this [issue](https://github.com/huggingface/open-r1/issues/239#issuecomment-2646297851) for more details.
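The toy numbers below are a minimal, unofficial illustration of FAQ items 1 and 3: on policy, the importance ratio exp(logp_new - logp_old) is exactly 1, so clipping never activates, and the group-normalized advantages average to zero, which is why the reported loss starts near zero. See the linked issue for the full discussion.

```python
import math

# Toy group of rewards for one prompt; values are made up.
rewards = [1.0, 0.0, 0.0, 1.0]
mean = sum(rewards) / len(rewards)
std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
advantages = [(r - mean) / (std + 1e-4) for r in rewards]

# On-policy: the policy has not been updated since sampling, so the
# log-probs match and the PPO-style ratio is exactly 1 for every token.
logp_new = logp_old = -2.3
ratio = math.exp(logp_new - logp_old)
assert ratio == 1.0  # clipping to [1 - eps, 1 + eps] never triggers

# With ratio == 1 the per-sample surrogate is just -advantage, and the
# group-normalized advantages average to ~0, so the mean loss starts near 0.
loss = -sum(ratio * a for a in advantages) / len(advantages)
print(round(loss, 6))  # ~0.0
```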



## Cluster Support

@@ -115,7 +120,7 @@ In addition to rule-based reward functions, this framework also supports using r
## Arguments and Execution Script
Arguments

- - num_generations: The number of samples for each prompt, referred to as the G value in the paper; it needs to be divisible by per_device_eval_batch_size * nproc_per_node.
+ - num_generations: The number of samples for each prompt, referred to as the G value in the paper; it needs to be divisible by per_device_batch_size * nproc_per_node.
- max_completion_length: The maximum length for sampling generation, default is 512.
- ds3_gather_for_generation: This parameter applies to DeepSpeed ZeRO-3. If enabled, the policy model weights are gathered for generation, improving generation speed. However, disabling this option allows training models that exceed the VRAM capacity of a single GPU, albeit at the cost of slower generation. Disabling this option is not compatible with vLLM generation. The default is True.
- reward_funcs: Reward functions to score the results generated by the model. Includes built-in accuracy, format, cosine, and repetition rule-based functions, detailed in the swift/plugin/orm.py file (a simplified example appears below).
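To make the reward_funcs idea concrete, here is a standalone, simplified sketch of a rule-based format reward in the spirit of the built-ins; the real implementations and the plugin interface live in swift/plugin/orm.py, so the function signature here is only an illustrative assumption.

```python
import re

# Simplified, illustrative format reward: 1.0 when the completion contains a
# <think>...</think> block followed by an <answer>...</answer> block, else 0.0.
# The real built-ins (accuracy, format, cosine, repetition) are defined in
# swift/plugin/orm.py and may use a different signature.
FORMAT_PATTERN = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)

def format_reward(completions: list[str]) -> list[float]:
    """Score each sampled completion for matching the required tag format."""
    return [1.0 if FORMAT_PATTERN.search(c) else 0.0 for c in completions]

if __name__ == "__main__":
    samples = [
        "<think>count the cubes</think> <answer> 3 </answer>",
        "The answer is 3.",
    ]
    print(format_reward(samples))  # [1.0, 0.0]
```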