Commit f523ebf

Author: sangchengmeng
Commit message: add-minuerU
1 parent 1865d84 commit f523ebf

File tree: 1 file changed

_posts/2025-11-11-minerU.md
Lines changed: 41 additions & 46 deletions

---
title: Accelerating MinuerU Multimodal Inference with LightLLM
tags:
  - MTC Team
  - New Feature
excerpt: |
  LightLLM now provides optimized support for the MinuerU multimodal model: we reduce RPyC communication overhead, speed up image preprocessing, and refine ViT batching and downstream dispatch to significantly improve end-to-end performance.
---

In LightLLM, multimodal inference consists of two stages: first, the input images are preprocessed and fed into the vision encoder to obtain image embeddings; then these embeddings are concatenated with the text embeddings and passed to the LLM for generation.

While integrating MinuerU, we optimized the communication layer, refactored the ViT batching and dispatch logic, and streamlined the image preprocessing pipeline. These changes yield clear performance gains across different resolutions and hardware setups.

## MinuerU Multimodal Inference Flow in LightLLM

1. **Image preprocessing** (resizing, normalization, and other operations based on the visual spec).
2. Use **RPyC to call the remote ViT** and generate image embeddings.
3. **Embedding fusion**: concatenate the image embeddings with the text embeddings.
4. **LLM decoding**: feed the combined sequence into the LLM for generation.
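
For orientation, here is a minimal, self-contained sketch of this flow using dummy tensors. The shapes (448×448 input, 256 visual tokens, hidden size 1024, 32 prompt tokens) are illustrative assumptions rather than MinuerU's actual configuration, and the remote ViT call is stubbed out.

```python
import torch

hidden = 1024  # illustrative hidden size, not the real model width

# 1. Image preprocessing: a fake 448x448 RGB image, already resized/normalized.
pixels = torch.rand(1, 3, 448, 448)

# 2. Remote ViT call (stubbed): in LightLLM this goes through RPyC and returns
#    the image embeddings, e.g. 256 visual tokens of width `hidden`.
image_embeds = torch.randn(256, hidden)

# 3. Embedding fusion: concatenate image embeddings with the text embeddings.
text_embeds = torch.randn(32, hidden)  # 32 prompt tokens
inputs_embeds = torch.cat([image_embeds, text_embeds], dim=0)

# 4. LLM decoding would consume `inputs_embeds` as its prefill input.
print(inputs_embeds.shape)  # torch.Size([288, 1024])
```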

During integration, we noticed that the TCP behavior of RPyC, together with the tight coupling between the ViT batch size and downstream dispatch, was a major source of latency, especially for small images and high-concurrency workloads.

## Reducing RPyC Overhead with `TCP_NODELAY`

During inference we noticed that the default RPyC calls carried a fixed latency of roughly 20 ms. The reason is that TCP enables Nagle's algorithm by default, so small packets are coalesced before being sent, which introduces extra wait time for certain RPyC calls. To avoid this, we explicitly enable **`TCP_NODELAY`** on the RPyC connection.
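
As a sketch, disabling Nagle's algorithm is a one-line socket option. For an RPyC client the option has to be set on the socket behind the connection object; the attribute path shown here (`_channel.stream.sock`) is an RPyC implementation detail and may differ between versions, so treat it as an assumption.

```python
import socket
import rpyc

# Plain TCP socket: disable Nagle's algorithm so small packets are sent immediately.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# For an existing RPyC client connection, set the same option on the socket that
# backs the connection (the attribute path is internal and version-dependent).
conn = rpyc.connect("127.0.0.1", 18861)
conn._channel.stream.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
```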

With **`TCP_NODELAY`** enabled, we ran batch inference on an H200 with **448×448** low-resolution images over a fixed test set (about 1000 images at 448×448): QPS rose from 30 req/s to 60 req/s, doubling throughput and **greatly reducing time-to-first-token**.

This change is especially beneficial in scenarios with many small requests or strict latency SLAs.

## Optimizing ViT Batching and Dispatch Behavior

Previously, the ViT batch size was fully determined by the `visual_infer_batch_size` parameter: the ViT ran inference in batches of `visual_infer_batch_size`, accumulated that many image embeddings, and once the threshold was reached it triggered `infer_imgs` and **immediately** dispatched the associated requests downstream. On GPUs with limited memory (such as the 4090D), `visual_infer_batch_size` can only be set to 1, since larger values cause OOM (out of memory). This led to:

- A small-granularity `send_pyobj` call after every ViT inference, adding overhead.
- The visual process passing image embeddings to the LLM side with a batch size of 1, so the LLM-side prefill was fixed at 1 and could not fully exploit the GPU.

We refactored the main loop logic to **decouple ViT batching from embedding dispatch**. This allows us to:

- **Reduce the number of small-granularity RPyC messages**, amortizing `send_pyobj` overhead.
- Use `visual_infer_batch_size` to keep the ViT highly utilized, while `visual_send_batch_size` makes it easier for the downstream LLM prefill to form efficient batches.
- Lower **end-to-end latency jitter** under high concurrency.
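
The sketch below illustrates the decoupling idea with stand-in functions. `visual_infer_batch_size` and `visual_send_batch_size` are the parameters discussed above, while the loop structure and helper names (`vit_infer`, `send_downstream`) are illustrative only, not LightLLM's actual code.

```python
from typing import Any, Callable, List

def visual_loop(
    image_queue: List[Any],
    vit_infer: Callable[[List[Any]], List[Any]],
    send_downstream: Callable[[List[Any]], None],
    visual_infer_batch_size: int = 1,
    visual_send_batch_size: int = 8,
) -> None:
    """Sketch: ViT runs in small batches (memory-bound), but embeddings are
    buffered and dispatched downstream in larger chunks."""
    send_buffer: List[Any] = []
    while image_queue:
        # ViT inference batch size is capped by GPU memory (can be 1 on a 4090D).
        batch = [image_queue.pop(0) for _ in range(min(visual_infer_batch_size, len(image_queue)))]
        send_buffer.extend(vit_infer(batch))
        # Dispatch is decoupled: send only once enough embeddings have accumulated,
        # so the LLM side can form a larger prefill batch with fewer RPyC messages.
        if len(send_buffer) >= visual_send_batch_size or not image_queue:
            send_downstream(send_buffer)
            send_buffer = []

# Toy usage with stand-in functions:
visual_loop(
    image_queue=list(range(10)),
    vit_infer=lambda imgs: [f"emb{i}" for i in imgs],
    send_downstream=lambda embs: print(f"send {len(embs)} embeddings"),
    visual_infer_batch_size=1,
    visual_send_batch_size=4,
)
```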

## Speeding Up Image Preprocessing

We also found that when images are very large (for example 4K or 8K), the time spent in **image preprocessing** has a non-negligible impact on end-to-end multimodal inference performance, so we streamlined several of its steps.

In transformers, image preprocessing for the qwen2-vl family of models runs on the CPU. Our comparison showed that resize and similar interpolation operations take noticeably longer on the CPU than on the GPU: for a 4K image, the resize takes around 20 ms on the CPU but only about 3 ms on the GPU. We therefore moved some of these operations to the GPU, greatly reducing the **image preprocessing** overhead.
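
A rough way to reproduce this comparison is to time a bicubic resize of a 4K tensor on CPU and GPU with PyTorch. The target size (1024×1024) and iteration count are arbitrary choices for this sketch, and absolute numbers depend on the hardware and on whether the preprocessing stack operates on PIL images or tensors.

```python
import time
import torch
import torch.nn.functional as F

def bench_resize(device: str, iters: int = 20) -> float:
    # A fake 4K RGB image in float32, shape (1, 3, 2160, 3840).
    img = torch.rand(1, 3, 2160, 3840, device=device)
    # Warm-up (also initializes the CUDA context on GPU).
    F.interpolate(img, size=(1024, 1024), mode="bicubic", align_corners=False)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        F.interpolate(img, size=(1024, 1024), mode="bicubic", align_corners=False)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3  # ms per resize

print(f"cpu : {bench_resize('cpu'):.1f} ms")
if torch.cuda.is_available():
    print(f"cuda: {bench_resize('cuda'):.1f} ms")
```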

## Choosing the Flash-Attention Kernel

As is well known, the attention operator is one of the most expensive operators in LLM inference, and many vendors ship their own Flash-Attention implementations. On H-series cards the best-performing option is without doubt Tri Dao's flash-attention3 kernel, but on the 4090 the FA kernels used by different vendors differ slightly. We therefore compared the Flash-Attention implementations from several common open-source repositories on the 4090:

[SHAPE] B=1 H=16 L=7956 D=80 dtype=torch.bfloat16 (using the MinuerU2.5 model as an example)

| Implementation | Latency |
|:--|:--:|
| sgl_kernel | 2.711 ms |
| xFormers | 2.791 ms |
| torch.sdpa | 2.906 ms |

Based on this, we switched the flash-attention implementation to the one in sgl_kernel (on the 4090D, sgl_kernel's FA operator is the Flash-Attention from Flashinfer).
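
For reference, the torch.sdpa baseline above can be reproduced with a simple timing loop at the stated shape. The sgl_kernel and xFormers kernels have their own APIs and are not shown here; the iteration count is arbitrary, and the snippet assumes a CUDA GPU is available.

```python
import time
import torch
import torch.nn.functional as F

# Shape from the comparison above: B=1, H=16, L=7956, D=80, bfloat16, on GPU.
B, H, L, D = 1, 16, 7956, 80
q = torch.randn(B, H, L, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn(B, H, L, D, device="cuda", dtype=torch.bfloat16)
v = torch.randn(B, H, L, D, device="cuda", dtype=torch.bfloat16)

def bench(fn, iters: int = 50) -> float:
    fn()                      # warm-up
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3  # ms per call

print(f"torch.sdpa: {bench(lambda: F.scaled_dot_product_attention(q, k, v)):.3f} ms")
```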

## Performance Evaluation

On 4090D hardware, we benchmarked LightLLM with the MinuerU model integrated against vLLM, using the same model and comparable configurations.

### RTX 4090D, 10 Concurrency, Test Set of 1000 Images

| Metric | vLLM | LightLLM |
|:--|:--:|:--:|
| QPS (req/s) | 1.40 | 1.56 |
| Prefill P50 (ms) | 1140 | 640 |
| Decode P50 (ms) | 5.88 | 5.80 |

Overall, under comparable settings, **MinuerU running on LightLLM achieves slightly higher QPS than on vLLM**, while the optimized communication, batching, and preprocessing strategies help stabilize and improve end-to-end performance.
