
grpo liger loss #3781

Merged
merged 80 commits into from May 22, 2025
Changes from 1 commit
Commits
80 commits
c3f859d
liger grpo loss
hjh0119 Apr 7, 2025
5224a4a
merge main
hjh0119 Apr 14, 2025
bbce4b2
update
hjh0119 Apr 14, 2025
63fdcea
fix
hjh0119 Apr 14, 2025
5915901
move args
hjh0119 Apr 14, 2025
d0c290c
fix
hjh0119 Apr 14, 2025
0a3794f
fix
hjh0119 Apr 14, 2025
3b9ee6d
fix
hjh0119 Apr 14, 2025
d643ab9
fix
hjh0119 Apr 14, 2025
93fdb71
require
hjh0119 Apr 15, 2025
f87b042
compatible with zero3
hjh0119 Apr 15, 2025
b82cbf4
fix
hjh0119 Apr 15, 2025
9c20051
merge main
hjh0119 May 1, 2025
fc7fabe
wip
hjh0119 May 1, 2025
8f67b13
update liger loss
hjh0119 May 1, 2025
8b4e346
liger&peft
hjh0119 May 1, 2025
edc1fd1
init
hjh0119 May 6, 2025
07a1040
fix default
hjh0119 May 6, 2025
0303461
fix
hjh0119 May 7, 2025
854f357
fix seed
hjh0119 May 7, 2025
7df2b5d
fix
hjh0119 May 7, 2025
fda82ee
wip
hjh0119 May 7, 2025
5d8d4a2
wip multi turn
hjh0119 May 7, 2025
ac52340
multi turn
hjh0119 May 7, 2025
578a365
fix comment
hjh0119 May 7, 2025
9a49fb5
fix peft model inspect and labels
hjh0119 May 7, 2025
5579c3e
fix multi turn
hjh0119 May 7, 2025
7de8aab
update multi turn
hjh0119 May 7, 2025
438f1f7
multi turn not remove response
hjh0119 May 8, 2025
d69a9ae
fix
hjh0119 May 8, 2025
451fd02
fix multi turn concate response
hjh0119 May 8, 2025
c3a1aa9
fix multi turn message check
hjh0119 May 8, 2025
300610e
fix infer
hjh0119 May 8, 2025
fd08ccd
external async generate
hjh0119 May 8, 2025
9da6242
clean argument check
hjh0119 May 8, 2025
8a22c9b
fix async generate
hjh0119 May 8, 2025
8ba0330
fix server infer to list
hjh0119 May 8, 2025
0926a3c
fix server infer
hjh0119 May 8, 2025
0c3827a
catch async generate error
hjh0119 May 8, 2025
fbc2b54
fix infer inputs
hjh0119 May 8, 2025
57445b4
fix async generate
hjh0119 May 8, 2025
e2330f9
fix size
hjh0119 May 8, 2025
37a06f9
remove vllm context
hjh0119 May 9, 2025
66ad138
reward model prepare ds
hjh0119 May 9, 2025
a1f1636
merge main
hjh0119 May 12, 2025
f4a05d3
lint
hjh0119 May 12, 2025
2b5198e
fix multi turn + TP
hjh0119 May 12, 2025
a479465
external path image
hjh0119 May 12, 2025
1fb25db
fix async generate and doc
hjh0119 May 12, 2025
7394dc9
update doc
hjh0119 May 12, 2025
4160ad3
remove async mode script
hjh0119 May 12, 2025
47bb902
doc wip and deprecate patch
hjh0119 May 12, 2025
37c68d2
lint
hjh0119 May 12, 2025
f7700fa
doc and scipt wip
hjh0119 May 13, 2025
6a572fa
doc update
hjh0119 May 13, 2025
4afbdc3
doc
hjh0119 May 13, 2025
df2ce3d
doc update
hjh0119 May 13, 2025
b101e4b
doc update
hjh0119 May 13, 2025
1939873
update doc and readme
hjh0119 May 13, 2025
dae81c1
update grpo doc
hjh0119 May 13, 2025
05054d0
update scripts
hjh0119 May 13, 2025
11307be
rm script
hjh0119 May 13, 2025
7bbed3f
update completion_length_limit_scope argument
hjh0119 May 13, 2025
53a08d0
merge refactor
hjh0119 May 13, 2025
829a7ea
fix epsilon
hjh0119 May 13, 2025
f2b4aac
update stable doc reference
hjh0119 May 13, 2025
cb7ff52
remove lmdeploy
hjh0119 May 13, 2025
5e9e3b5
set different seed bewteen processes
hjh0119 May 13, 2025
25ac346
fix seed
hjh0119 May 13, 2025
427a32f
merge refactor
hjh0119 May 13, 2025
c4dc72e
merge main
hjh0119 May 13, 2025
346396f
remove liger check
hjh0119 May 13, 2025
3045802
fix epsilon
hjh0119 May 13, 2025
4bf7996
remvoe unused import
hjh0119 May 14, 2025
f7080f5
Merge remote-tracking branch 'origin' into liger
hjh0119 May 22, 2025
83b3845
use_liger_kernel
hjh0119 May 22, 2025
8a10681
update
hjh0119 May 22, 2025
79834e6
Merge remote-tracking branch 'origin' into liger
hjh0119 May 22, 2025
169882f
remove require
hjh0119 May 22, 2025
3c7e763
lint
hjh0119 May 22, 2025
fix async generate and doc
hjh0119 committed May 12, 2025
commit 1fb25db0235c9c9a790e1d0ac94420930fb018bf
64 changes: 44 additions & 20 deletions docs/source/Instruction/GRPO.md
@@ -11,7 +11,7 @@ pip install -U trl
```

**Changelog**

- **2025-05-13** — Refactored the Internal mode; vLLM >= 0.8 is now supported.
- **2025-05-11** — Generative reward models are now supported; customize the reward-model logic via reward_model_plugin. For more details, see the [Custom Reward Model](#自定义奖励模型) section.
- **2025-04-30** — The launch command for the external vLLM server has changed to `swift rollout`.

@@ -27,38 +27,62 @@ pip install -U trl

The GRPO training framework supports integrating a high-performance inference engine (such as vLLM) to accelerate the sampling process, offering the following two deployment modes:

### 1. Internal Integration Mode
### 1. Colocate Mode

- Training and inference share GPU resources; the inference service is launched inside the Trainer.

- Launch the inference service directly inside the Trainer
- Provides two resource allocation strategies:
  - **Colocate mode**: training and inference share GPU resources
  - **Async mode**: training and inference use separate GPU resources
Launch parameters
```bash
--vllm_mode colocate
```

### GRPO Training Resource Allocation Scheme
| Configuration | NPROC_PER_NODE | num_infer_workers | Resource allocation |
|--------------------------|----------------|------------------|------------------------|
| **Colocate** | = total GPUs | = total GPUs | Training and inference share all GPU resources |
| **Async** | = training GPUs | = inference GPUs | Must satisfy: training GPUs + inference GPUs = total GPUs |
#### Memory Optimization in Colocate Mode
When running in colocate mode, out-of-memory (OOM) errors occur easily. The following memory optimization techniques and parameter settings are effective; a combined example follows the list:

**Note:**
1. In colocate mode it is recommended to set `sleep_level=1` to release the GPU memory occupied by vLLM during training.
2. Total GPUs refers to the total number of visible GPU devices.
1. During the training phase, release the GPU memory occupied by vLLM:

### 2. External Service Mode
Connect to an external vLLM inference server.
When using this mode, configure the external vLLM server with the following parameters:
```bash
--vllm_server_host <Server IP> \
--vllm_server_port <Server Port> \
--vllm_server_timeout <Timeout> \
--sleep_level 1
```

2. During the vLLM inference phase, release the GPU memory occupied by the training model and optimizer:

```bash
--offload_optimizer true \
--offload_model true \
--gc_collect_after_offload true \
```

3. Use tensor parallelism in vLLM:

```bash
--tensor_parallel_size [tp_size]
```

4. Gather model weights in batches (when synchronizing vLLM weights under ZeRO-3):
```bash
--move_model_batches [number_of_batches]
```
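
Taken together, the flags above can be combined in a single launch. The sketch below is illustrative rather than taken from this PR: the model, dataset, reward function, GPU count, and the `swift rlhf --rlhf_type grpo` entry point are assumptions about a typical single-node colocate run.

```bash
# Illustrative 8-GPU colocate launch combining the memory optimizations above.
# Model, dataset, and reward function are placeholders; adjust them to your setup.
NPROC_PER_NODE=8 \
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset AI-MO/NuminaMath-TIR \
    --reward_funcs accuracy \
    --vllm_mode colocate \
    --sleep_level 1 \
    --offload_optimizer true \
    --offload_model true \
    --gc_collect_after_offload true \
    --tensor_parallel_size 4 \
    --move_model_batches 16
```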

### 2. Async Mode

- Training and inference use separate resources; a standalone inference server is launched externally.

Deploy the vLLM server with the `swift rollout` command; currently only the vLLM backend is supported:
```bash
CUDA_VISIBLE_DEVICES=2 \
swift rollout \
--model Qwen/Qwen2.5-VL-7B-Instruct \
--tensor_parallel_size 2 \
```

During training, use the following parameters to connect to the external vLLM server:
```bash
--vllm_mode server \
--vllm_server_host <Server IP> \
--vllm_server_port <Server Port> \
--vllm_server_timeout <Timeout> \
```
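
As a rough end-to-end sketch (not part of this PR), the async workflow has two steps: start the rollout server first, then launch training pointed at it. The address, port, timeout, dataset, reward function, and the `swift rlhf --rlhf_type grpo` entry point below are placeholder assumptions.

```bash
# Step 1: start the rollout server on a dedicated GPU (values are placeholders).
CUDA_VISIBLE_DEVICES=7 \
swift rollout \
    --model Qwen/Qwen2.5-VL-7B-Instruct

# Step 2: launch GRPO training on the remaining GPUs and connect to the server.
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=4 \
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --dataset <dataset_id_or_path> \
    --reward_funcs accuracy \
    --vllm_mode server \
    --vllm_server_host 127.0.0.1 \
    --vllm_server_port 8000 \
    --vllm_server_timeout 240
```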

The complete script can be found [here](../../../examples/train/grpo/multi_node/Qwen2_5_32B_full.sh).


65 changes: 45 additions & 20 deletions docs/source_en/Instruction/GRPO.md
@@ -29,42 +29,67 @@ pip install -U trl

The GRPO training framework supports the integration of high-performance inference engines (such as vLLM) to accelerate the sampling process, offering the following two deployment modes:

### 1. Internal Integration Mode
### 1. Colocate Mode

- Launch the inference service directly within the Trainer.
- Provides two resource allocation strategies:
- **Colocate Mode**: Training and inference share GPU resources.
- **Async Mode**: Training and inference use separate GPU resources.
Training and inference share GPU resources; the inference service is started inside the Trainer.

### GRPO Training Resource Allocation Scheme
Launch Parameters
```bash
--vllm_mode colocate
```

| Configuration Scenario | NPROC_PER_NODE | num_infer_workers | Resource Allocation Description |
|-------------------------|----------------|-------------------|---------------------------------------|
| **Colocate** | = Total GPUs | = Total GPUs | Training and inference share all GPU resources. |
| **Async** | = Training GPUs| = Inference GPUs | Must satisfy: Training GPUs + Inference GPUs = Total GPUs. |
#### Memory Optimization Strategies in Colocate Mode
When running in Colocate Mode, out-of-memory (OOM) errors are common because training and inference workloads run on the same GPUs. Below are effective memory optimization strategies and configuration parameters; a combined example follows the list:

**Note:**
1. In Colocate mode, it is recommended to set `sleep_level=1` to release the GPU memory occupied by vLLM during model training.
2. Total GPUs refers to the total number of visible GPU devices.
1. Release vLLM memory during training:

### 2. External Service Mode
```bash
--sleep_level 1
```

Connect to an external vLLM inference server.
When using this mode, configure the external vLLM server with the following parameters:
2. Offload training model and optimizer memory during vLLM inference:

```bash
--vllm_server_host <Server IP> \
--vllm_server_port <Server Port> \
--vllm_server_timeout <Timeout> \
--offload_optimizer true \
--offload_model true \
--gc_collect_after_offload true \
```

Deploy the vLLM server using the `swift rollout` command. Currently, only the vLLM backend is supported.
3. Use Tensor Parallelism in vLLM:

```bash
--tensor_parallel_size [tp_size]
```

4. Batched gathering of model weights (when synchronizing vLLM weights under ZeRO-3):

```bash
--move_model_batches [number_of_batches]
```
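
For reference, a minimal sketch of how these options might be combined in one colocate launch is shown below; it is not taken from this PR. The model, dataset, reward function, GPU count, and the `swift rlhf --rlhf_type grpo` entry point are illustrative assumptions.

```bash
# Illustrative 4-GPU colocate launch applying the optimizations above.
# Model, dataset, and reward function are placeholders; adjust them to your setup.
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=4 \
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset AI-MO/NuminaMath-TIR \
    --reward_funcs accuracy \
    --vllm_mode colocate \
    --sleep_level 1 \
    --offload_model true \
    --offload_optimizer true \
    --gc_collect_after_offload true \
    --tensor_parallel_size 2 \
    --move_model_batches 8
```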


### 2. Async Mode

Training and inference use separate resources; a dedicated inference server is launched externally.

Deploy the vLLM server using the `swift rollout` command. Currently, only the vLLM backend is supported:

```bash
CUDA_VISIBLE_DEVICES=2 \
swift rollout \
--model Qwen/Qwen2.5-VL-7B-Instruct \
--tensor_parallel_size 2 \
```

Use the following parameters in training to connect to an external vLLM server:

```bash
--vllm_mode server \
--vllm_server_host <Server IP> \
--vllm_server_port <Server Port> \
--vllm_server_timeout <Timeout> \
```
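
Putting the two pieces together, a training-side launch might look like the sketch below (GPU 2 is assumed to be occupied by the rollout server started above). The dataset, reward function, address, and the `swift rlhf --rlhf_type grpo` entry point are placeholders, not values from this PR.

```bash
# Illustrative training command connecting to the external rollout server.
CUDA_VISIBLE_DEVICES=0,1 \
NPROC_PER_NODE=2 \
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --dataset <dataset_id_or_path> \
    --reward_funcs accuracy \
    --vllm_mode server \
    --vllm_server_host 127.0.0.1 \
    --vllm_server_port 8000 \
    --vllm_server_timeout 240
```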

The complete script can be found [here](../../../examples/train/grpo/multi_node/Qwen2_5_32B_full.sh).

## Reward Functions
4 changes: 2 additions & 2 deletions swift/trainers/rlhf_trainer/grpo_trainer.py
@@ -559,7 +559,7 @@ def _infer(self, inputs: InputsType, request_config: RequestConfig, is_global_in
# keys from InferRequest
per_device_size = len(inputs)
if is_global_inputs:
per_device_size /= self.accelerator.num_processes
per_device_size //= self.accelerator.num_processes
infer_inputs = [{
k: v
for k, v in inp.items() if k in ['messages', 'images', 'audios', 'videos', 'tools', 'objects']
@@ -722,7 +722,7 @@ def infer_task():
def done(future):
try:
result = future.result()
current_queue.put(DataCache(inputs, result))
current_queue.put(DataCache(all_inputs, result))
except Exception as e:
logger.error('Error in async_infer callback: %s', str(e))
