Changes from all commits (1180 commits)
fc46e94
Remove duplicate entry in vllm.attention.__all__ (#23296)
russellb Aug 21, 2025
9474927
[CI Bugfix] Fix CI by fully removing --enable-prompt-adapter (#23284)
mgoin Aug 21, 2025
794baf8
[Optimization] Make new_block_ids None if empty (#23262)
WoosukKwon Aug 21, 2025
d527ab9
[CPU] Refactor CPU W8A8 scaled_mm (#23071)
bigPYJ1151 Aug 21, 2025
d499194
[CI/Build] Split out mm processor tests (#23260)
DarkLight1337 Aug 21, 2025
5524da1
[V1][Mamba1] - Full CUDA and Piecewise CUDA Graphs Support (#23035)
Josephasafg Aug 21, 2025
42cca78
[Compile] Fix Compile Warning SM100 Cutlass MLA (#23287)
yewentao256 Aug 21, 2025
eeb71e5
[Model][VLM] Support R-4B Model (#23246)
yannqi Aug 21, 2025
d5d73c4
[CI] Delete images older than 24h. (#23291)
QiliangCui Aug 21, 2025
a4a7e15
[CI] Block the cu126 wheel build while broken (#23285)
mgoin Aug 21, 2025
32273ef
[Sampler] Support returning final logprobs (#22387)
22quinn Aug 21, 2025
d7277cd
[Bugfix] Fix extra whitespace in strings caused by newline (#23272)
DarkLight1337 Aug 21, 2025
3b04ade
[BugFix] Fix Python 3.9 Support (#23306)
jaredoconnell Aug 21, 2025
26105b5
[Model] Add LFM2 architecture (#22845)
paulpak58 Aug 21, 2025
b1115ff
[Refactor] Simplify code for MM budget (#23310)
DarkLight1337 Aug 21, 2025
19d3e3c
[Doc] Fix batch-level DP example (#23325)
DarkLight1337 Aug 21, 2025
a0da309
[Performance] V1 Pooling Models E2E Performance Optimization (#23162)
noooop Aug 21, 2025
44b7c6f
[V1] Remove unnecessary check for main thread (#23298)
robertgshaw2-redhat Aug 21, 2025
f7f3296
[Bugfix] set system_message in phi4mini chat template (#23309)
zhuangqh Aug 21, 2025
f9034f3
[Multimodal] Always enable hashing mm data (#23308)
ywang96 Aug 21, 2025
8483595
[ci/build] Fix abi tag for aarch64 (#23329)
youkaichao Aug 21, 2025
aee3ad2
Migrate MolmoImageInputs to TensorSchema (#22022)
bbeckca Aug 21, 2025
d14b0d3
Fix nvfp4 swizzling (#23140)
yiliu30 Aug 21, 2025
c2af529
add tg-mxfp4-moe-test (#22540)
IwakuraRein Aug 21, 2025
6c11e51
[Bug] Fix R1 Accuracy 0 Bug (#23294)
yewentao256 Aug 21, 2025
07413f6
[Bugfix] Fix port conflict by obtaining a list of open ports upfront …
minosfuture Aug 21, 2025
9d95b6f
[Misc] Misc code cleanup/simplification (#23304)
njhill Aug 21, 2025
83b5808
[BugFix][gpt-oss] Fix Chat Completion with Multiple Output Message (#…
heheda12345 Aug 21, 2025
545bbfd
[Misc] Convert VLLM_TORCH_PROFILER_DIR path to absolute (#23191)
andyxning Aug 21, 2025
57f3eca
[Core] Always use tensor cores for Flashinfer Decode Wrapper (#23214)
pavanimajety Aug 21, 2025
fdef2a4
Make sure that vectorize_with_alignment produced vectorized global lo…
elvircrn Aug 21, 2025
45a5195
[Structured Outputs] Refactor bitmask construction into get_grammar_b…
WoosukKwon Aug 21, 2025
4d07ebf
[CI] Clean up actions: remove helm, publish workflows and improve pr …
simon-mo Aug 21, 2025
b0f6cfb
[CI] improve pr comments bot (#23380)
simon-mo Aug 21, 2025
da9caae
[Perf] Small optimizations for silu_mul_fp8_quant_deep_gemm (#23265)
mgoin Aug 21, 2025
9d48e2c
Always use cache mounts when installing vllm to avoid populating pip …
tvalentyn Aug 21, 2025
6b4e2a0
[Feature][Responses API] Support logprobs(non-stream) (#23319)
kebe7jun Aug 21, 2025
9eb9476
[Core] Support custom executor qualname (#23314)
22quinn Aug 22, 2025
5420d20
[Kernel] Add FP8 support with FlashMLA backend (#22668)
MatthewBonanni Aug 22, 2025
a4aa575
[Deprecation] Remove `prompt_token_ids` arg fallback in `LLM.generate…
DarkLight1337 Aug 22, 2025
c02e054
Migrate MllamaImagePixelInputs to TensorSchema (#22020)
bbeckca Aug 22, 2025
926fcf7
[CI/Build] Skip Idefics3 and SmolVLM generation test again (#23356)
Isotr0py Aug 22, 2025
f7e5140
[Feature] Enable DeepGEMM Linear on B200; 1.5% E2E throughput improve…
yewentao256 Aug 22, 2025
b3a9c50
[CI] Add end-to-end V1 min_tokens test coverage (#22495)
arjunbreddy22 Aug 22, 2025
acf11b8
[Misc] Add gemma3 chat template with pythonic-style function calling …
philipchung Aug 22, 2025
25233c7
[New Model] Add Seed-Oss model (#23241)
FoolPlayer Aug 22, 2025
3d717bc
[Attention] Refactor AttentionMetadata Preparation for Encoder-only M…
heheda12345 Aug 22, 2025
8b25ae8
[P/D][Nixl] Make kv cache register compatible with hybrid memory allo…
sfeng33 Aug 22, 2025
adf1313
[gpt-oss] add input/output usage in responses api when harmony contex…
gcalmettes Aug 22, 2025
5c83480
Migrate MiniCPMOAudioInputs to TensorSchema (#21847)
bbeckca Aug 22, 2025
1c30ae0
[Bugfix] Fix pooling models on CPU backend (#23392)
bigPYJ1151 Aug 22, 2025
92c5302
[V0 Deprecation] Remove V0 LoRA test (#23418)
jeejeelee Aug 22, 2025
5ff7880
[Misc] Move M-RoPE init logic to _init_mrope_positions (#23422)
WoosukKwon Aug 22, 2025
18bbc98
[Attention] Allow V1 flash_attn to support cross-attention (#23297)
russellb Aug 22, 2025
8aa599a
[misc] Remove outdate comment about runai_model_streamer (#23421)
carlory Aug 22, 2025
c5d0b5e
[Doc] Update the doc for log probs + prefix caching (#23399)
heheda12345 Aug 22, 2025
7814c6d
[Misc] local import code clean (#23420)
andyxning Aug 22, 2025
f4a55f5
[Bug fix] Dynamically setting the backend variable for genai_perf_tes…
namanlalitnyu Aug 22, 2025
4f8a714
[Fix] Bump triton version in rocm-build requirements (#21630)
bringlein Aug 22, 2025
0cfaec0
[Bugfix]: Installing dev environment due to pydantic incompatible ver…
hickeyma Aug 22, 2025
555e30c
[Speculators][Speculative Decoding] Fix Qwen 2 Eagle3 Support (#23337)
PapaGoose Aug 22, 2025
497e0cb
[BugFix] Fix the issue where image embeddings were incorrectly split.…
bppps Aug 22, 2025
50bf85d
fix(tests): Ensure reliable CUDA cache clearing in MoE test (#23416)
AzizCode92 Aug 22, 2025
2fdc400
Add unit tests for batched guided and non-guided requests (#23389)
sarckk Aug 22, 2025
67073c0
[Doc]: fix various typos in multiple files (#23179)
didier-durand Aug 22, 2025
ff25bfb
[Model] Add Ovis2.5 PP support (#23405)
Isotr0py Aug 22, 2025
3d6894e
[Bugfix] Fix broken Florence-2 model (#23426)
Isotr0py Aug 22, 2025
22babad
[Quantization] Allow GGUF quantization to skip unquantized layer (#23…
Isotr0py Aug 22, 2025
8aa9d69
add an env var for path to pre-downloaded flashinfer cubin files (#22…
842974287 Aug 22, 2025
a42b855
[CI/Build] add EP dependencies to docker (#21976)
zhewenl Aug 22, 2025
143bc29
[PERF] PyTorch Symmetric Memory All-Reduce (#20759)
ilmarkov Aug 22, 2025
103a825
[BugFix][AMD][Quantization] Fix torch.compile issue where wvSplitKQ n…
rasmith Aug 22, 2025
e910038
[NVIDIA][torch.compile] Support Flashinfer TRTLLM FP8-q/kv NVFP4-out …
elvischenv Aug 22, 2025
f3f15db
[BugFix] Fix batch updates for pooling models (#23398)
njhill Aug 23, 2025
ddf0d07
[BugFix] Fix `MinPLogitsProcessor.update_states()` (#23401)
njhill Aug 23, 2025
382bde7
[Model] Support DP for ViT on MiniCPM-V-4 (#23327)
david6666666 Aug 23, 2025
e134bb8
[UX] Move Dockerfile DeepGEMM install to tools/install_deepgemm.sh (#…
mgoin Aug 23, 2025
578b3d4
Quantization: support FP4 quantized models on AMD CDNA2/CDNA3 GPUs (#…
fengli1702 Aug 23, 2025
c30b187
Add glm4.5v tp2,4 fp8 config on H100_80GB (#23443)
chenxi-yang Aug 23, 2025
fbaa487
Revert "[PERF] Use faster way of decode in tokenizer: avoid useless l…
DarkLight1337 Aug 23, 2025
7e09574
fix(tests): Correct unreachable assertion in truncation test (#23425)
AzizCode92 Aug 23, 2025
f2a40d7
Support DeepSeek-V3.1 tool call (#23454)
Xu-Wenqing Aug 23, 2025
465a8d6
[Misc] Modify CacheConfig import (#23459)
jeejeelee Aug 23, 2025
67bd525
[gpt-oss] Streaming Output for Python Tool (#23409)
ZJY0516 Aug 24, 2025
ad9ffde
Migrate Pixtral inputs to TensorSchema (#23472)
bbeckca Aug 24, 2025
42d3a4d
[Bugfix] Add strong reference to CUDA pluggable allocator callbacks (…
22quinn Aug 24, 2025
4a88e15
Migrate Paligemma inputs to TensorSchema (#23470)
bbeckca Aug 24, 2025
4384249
[kernel] Support W4A8 on Hopper (#23198)
czhu-cohere Aug 24, 2025
f35ad06
[Misc] update dict parse to EPLBConfig from json dumps to dict unpack…
lengrongfu Aug 24, 2025
17c43e5
(Misc): add missing test for zero truncation size. (#23457)
teekenl Aug 24, 2025
aee6b56
[New Model]Donut model (#23229)
princepride Aug 24, 2025
194b7bc
[Model] Enable BLOOM on V1 (#23488)
DarkLight1337 Aug 24, 2025
086c121
[Misc] Remove unused slot_mapping buffer (#23502)
WoosukKwon Aug 24, 2025
d262c13
fix incompatibililty with non cuda platform for nvfp4 (#23478)
luccafong Aug 24, 2025
045c520
[Doc: ]fix various typos in multiple files (#23487)
didier-durand Aug 25, 2025
c737530
[Perf] Add Triton config for DeepSeek V3 FP8 EP32 H200 (#23504)
minosfuture Aug 25, 2025
66cb149
Frontend: Adding LM Format Enforcer support to V1 engine (#22564)
noamgat Aug 25, 2025
664007b
[Bugfix] Fix Qwen2.5-VL quantized model weights loading (#23512)
zifeitong Aug 25, 2025
12a82c6
[Misc] Unified linear print info (#23516)
jeejeelee Aug 25, 2025
8e534ae
Migrate tarsier inputs to TensorSchema (#23500)
bbeckca Aug 25, 2025
760cb67
Migrate skyworkr1v inputs to TensorSchema (#23499)
bbeckca Aug 25, 2025
f78c28c
Migrate DonutImagePixelInputs to TensorSchema (#23509)
bbeckca Aug 25, 2025
3e4eb1b
[Bugfix] Fix Dense module loading for sentence-transformers embedding…
FFFfff1FFFfff Aug 25, 2025
990dc37
[gpt-oss] use reasoning channel for reasoning text in serving_chat (#…
yuguo68 Aug 25, 2025
7b15eba
[Refactor] Dynamic `target` and `content` for prompt updates (#23411)
DarkLight1337 Aug 25, 2025
b8c4744
[Core][Multimodal] Track encode cache entries by mm_hash and enable e…
fake0fan Aug 25, 2025
928538a
[Fix] DeepSeek V3.1 tool parser error message (#23492)
skyloevil Aug 25, 2025
62996b0
Feature/benchmark/random mm data/images (#23119)
h-brenoskuk Aug 25, 2025
12c996a
[Bugfix] Allow dynamic number of patches for llava_onevision (#23525)
DarkLight1337 Aug 25, 2025
73ce938
[misc] add shanghai meetup (#23535)
youkaichao Aug 25, 2025
4d26868
[Attention] Unify mamba and attention backend selection (#23171)
ayushsatyam146 Aug 25, 2025
89fab31
[Doc] Add caution for API server scale-out (#23550)
DarkLight1337 Aug 25, 2025
e51dc07
[Refactor] Pass `tokenizer` explicitly instead of binding to prompt u…
DarkLight1337 Aug 25, 2025
457c2cf
Updates to Flex + VLLm integration (#21416)
drisspg Aug 25, 2025
01445c6
[Bugfix] Fix Qwen3 MoE GPTQ inference (#23490)
Isotr0py Aug 25, 2025
5f64f3c
[Refactor] Refactor persistent buffers with CpuGpuBuffer (#23515)
WoosukKwon Aug 25, 2025
9c5eae2
[test][RL] Add sleep level 2 test and fix reload with sleep mode (#23…
22quinn Aug 25, 2025
7b4fc41
[Kernel] Add fused grouped_topk kernel for MoE (#23274)
xyang16 Aug 25, 2025
74d4c65
[Bugfix][V1][P/D]Fix the issue where repeated requests for the same i…
Abatom Aug 25, 2025
df7f89f
[XPU] Delay BF16 check to worker init for spawn compatibility (#22979)
chaojun-zhang Aug 25, 2025
bd7980a
[TPU][Bugfix] Fixes prompt_token_ids error in tpu tests. (#23574)
patemotter Aug 25, 2025
eab6f40
[Docs] Update Documentation of Cohere Command-A Models (#23584)
Terrencezzj Aug 25, 2025
7323ee8
[Misc] Simplify FlashInfer attention metadata (#23585)
WoosukKwon Aug 25, 2025
ed08f90
[Misc] Add release note draft to PR template (#23598)
simon-mo Aug 25, 2025
32599ef
[CI Fix] Pin deepep and pplx tags in tools/ep_kernels/, gate multigpu…
mgoin Aug 26, 2025
a6b8ab6
Update Flashinfer to 0.2.14.post1 (#23537)
weireweire Aug 26, 2025
f3e336e
[Bug] Fix DeepGEMM Env Control (#23591)
yewentao256 Aug 26, 2025
f2d9520
[CI/Build] Use vLLM client's user agent to fetch images (#23561)
DarkLight1337 Aug 26, 2025
a668488
Remove graph_pool as member of VllmBackend and argument to CUDAGraphW…
Copilot Aug 26, 2025
8df8082
[Disagg][Perf] Use CUDA event sync instead of blocking `tolist` to av…
liuzijing2014 Aug 26, 2025
a4a6de6
[CI/Build] Fix typo in #23561 (#23616)
DarkLight1337 Aug 26, 2025
335ca74
[fix] fix seed-oss-parser (#23560)
FoolPlayer Aug 26, 2025
2b76c52
[mypy] Fix incorrect type hint for EAGLE3 support (#23617)
DarkLight1337 Aug 26, 2025
0db4cee
[Benchmarks] add benchmark for embedding models (#23000)
ZJY0516 Aug 26, 2025
949ed80
[Docs] Fix titles for multi-file examples that are rendered in the do…
hmellor Aug 26, 2025
69dafdb
Fix CLI parameter documentation inconsistency in pooling_models.md (#…
oneraghavan Aug 26, 2025
f150043
[Bugfix] Fix Qwen25VL packed_modules_mapping (#23604)
jeejeelee Aug 26, 2025
5ab4f17
[Bugfix] Fix scheduling when repeated images in one request (#23544)
ywang96 Aug 26, 2025
55229d5
[V1] Enable V1 for compute capability < 8.0 + FP32 (#23614)
DarkLight1337 Aug 26, 2025
43299f6
Fix nits from #20059 (#23548)
hmellor Aug 26, 2025
6a6c41a
Fix writing benchmark results with tuple keys (#23633)
huydhn Aug 26, 2025
3512a4f
[Perf] Remove duplicated NVFP4 blockscales to save memory (#23379)
mgoin Aug 26, 2025
b4ac27b
[Model] fix DeepSeek e_score_correction_bias dtype to fp32 (#23640)
jeejeelee Aug 26, 2025
bcf79b8
[Bugfix] Add missing enable_log_outputs parameter to init_app_state f…
lordmathis Aug 26, 2025
f664cd9
feat: add usage to TranscriptionResponse (text and json response_form…
gcalmettes Aug 26, 2025
f88c974
Support FlashAttention Backend for Hybrid SSM Models (#23299)
heheda12345 Aug 26, 2025
1fb1881
[Docs] Fix broken links to `docs/api/summary.md` (#23637)
hmellor Aug 26, 2025
3efebb1
[Hardware][Mac] Fix the installation fail for Apple Silicon (CPU) (#…
OYE93 Aug 26, 2025
92af8eb
[Kernel] Added flashinfer fp8 per-tensor gemms (#22895)
nvjullin Aug 26, 2025
e28d1e1
[Doc]: fix various spelling issues in multiple files (#23636)
didier-durand Aug 26, 2025
89b76c3
[CPU] add cpu fused moe pytorch native implementation (#23146)
TianyuLi0 Aug 26, 2025
204558f
[ROCm] Starting to add AMD code reviewers for ROCm components (#23496)
hongxiayang Aug 26, 2025
1012240
[Docs] Reduce requirements for docs build (#23651)
hmellor Aug 26, 2025
1221095
[Bugfix] fix bf16 multimodal model hash (#23623)
yuekaizhang Aug 26, 2025
dad809c
[model] support qwen2audio embedding input (#23625)
yuekaizhang Aug 26, 2025
b90d384
[Misc] Add override for allreduce fusion thresholds (#23639)
nvjullin Aug 26, 2025
3c429f2
[CI] [Doc]: Add GH Action for auto labeling issues with `rocm` tag (#…
vllmellm Aug 26, 2025
9baf4c0
[Bugfix] Fix cuda event usage with CPU model runner (#23643)
bigPYJ1151 Aug 26, 2025
f99713a
[Docs] Fix warnings in `mkdocs build` (#23649)
Zerohertz Aug 26, 2025
55ced96
[Docs] [V1] [Hybrid] Update docs to remove FlashInfer constraint for …
tdoublep Aug 26, 2025
bc50bb4
[v1] Add cross-attention KV cache support for encoder-decoder models …
russellb Aug 26, 2025
24c4a6b
[Bugfix] Fix incorrect original shape in hashing (#23672)
DarkLight1337 Aug 26, 2025
506786a
[Misc] Fix comments in `tests/kernels/quantization` (#23675)
ZJY0516 Aug 26, 2025
a120953
[Model] Enable video support for InternVL3.5 models (#23658)
Isotr0py Aug 26, 2025
981fde7
[doc] Hybrid KV Cache Manager design doc (#22688)
heheda12345 Aug 26, 2025
b9d80bf
Enhance the pre-notification policy (#23532)
sidhpurwala-huzaifa Aug 26, 2025
6a56bff
[Docs] Move quant supported hardware table to README (#23663)
hmellor Aug 26, 2025
e07bb41
[V1][P/D]P2pNcclConnector supports flashinfer (#23536)
Abatom Aug 26, 2025
c9bcc12
[V1] [Hybrid] Enable Full CUDA graph by default for hybrid models in …
tdoublep Aug 26, 2025
4435856
[Compile] Fix Cmake Warning (#23689)
yewentao256 Aug 26, 2025
ed63d51
[Bugfix] UnboundLocalError when GptOss reasoning specified (#23054)
coval3nte Aug 27, 2025
b207f01
feat: add triton fused moe config for GLM-4.5-Air-FP8 on B200 (#23695)
zixuanzhang226 Aug 27, 2025
3ca36ca
[Feature][Responses API] Support MCP tool in background mode (#23494)
wuhang2014 Aug 27, 2025
496d0a6
fix pynccl reduce_scatter (#23648)
youzhedian Aug 27, 2025
8c32c4c
[quantization] use channel scales for w4a8 + misc fixes (#23570)
czhu-cohere Aug 27, 2025
97273a2
[gpt-oss] Enable unit test for response API harmony integration (#23533)
heheda12345 Aug 27, 2025
a989369
[Bugfix] Lazy import gpt_oss_triton_kernels_moe for mxfp4 (#23678)
mgoin Aug 27, 2025
7d7d2a0
[Docs] Fix math rendering in docs (#23676)
hmellor Aug 27, 2025
0a86210
[Bugfix][gpt-oss] passing the cache config in gpt-oss (#23613)
frank-wei Aug 27, 2025
a8439dc
[Bugfix]: Qwen3 Coder Tool Parser (#23099)
ranpox Aug 27, 2025
7f416f7
[Core] Asynchronous h2d in merge_multimodal_embeddings via pinned mem…
huachenheli Aug 27, 2025
66b2f22
[Model] Add Ernie4.5 VL Model Support (#22514)
CSWYF3634076 Aug 27, 2025
fb4345f
[Frontend] Add --log-error-stack to print stack trace for error respo…
heheda12345 Aug 27, 2025
5da9022
[Frontend] Optimize beam search performance by limiting concurrency (…
heheda12345 Aug 27, 2025
072be89
[Quantization] Expand compressed-tensors MoE matching logic to suppor…
dsikka Aug 27, 2025
61be621
[XPU] Add xpu torch.compile support (#22609)
jikunshang Aug 27, 2025
f556f93
[CI/Build] Remove redundant LoRA model tests (#23706)
jeejeelee Aug 27, 2025
8b423af
[Bugfix] fix when config.yaml config value is list parse error (#23528)
lengrongfu Aug 27, 2025
600fa4a
[Core] Use key-only cache for `BaseMultiModalProcessor` (#23018)
DarkLight1337 Aug 27, 2025
f1b97c6
[XPU]fix cuda event used in XPU model runner (#23708)
jikunshang Aug 27, 2025
a60f9f4
[CI/Build] Remove redundant register in model init tests (#23715)
DarkLight1337 Aug 27, 2025
9c70d2e
[Docs] Fix an admonition important (#23726)
windsonsea Aug 27, 2025
4998de9
Optimize input preparation for FlashInfer [2/N] (#23174)
WoosukKwon Aug 27, 2025
fff48d0
[Misc] Move CpuGpuBuffer to vllm/v1/utils.py (#23728)
WoosukKwon Aug 27, 2025
841515a
[FlashInfer] Cache hyper params in metadata builder (#23732)
WoosukKwon Aug 27, 2025
61c1175
[CI/Build] Reduce LoRA layer test cases (#23721)
jeejeelee Aug 27, 2025
d532e23
[XPU] Fix OOM issue for data parallel with Ray backend (#22500)
faaany Aug 27, 2025
f18d0e1
[Docs] Fix a 1-2-3 list and style issues in tpu.md (#23729)
windsonsea Aug 27, 2025
7eb290f
[model] Support MiniCPM-V 4.5 (#23586)
tc-mb Aug 27, 2025
4d7326d
[Bugfix] Fix task field initialization when PYTHONOPTIMIZE is enabled…
cndoit18 Aug 27, 2025
e8945ca
[Misc] Remove unnecessary `_send_reconfig_message()` in `core_client.…
njhill Aug 27, 2025
424f8b2
[V1] [Hybrid] Disable prefix caching by default for hybrid or mamba-b…
tdoublep Aug 27, 2025
aaca477
[Model] Explicit `default_pooling_type` interface (#23736)
DarkLight1337 Aug 27, 2025
410ecb8
Add vLLM Korea Meetup in the README.md and meetups.md (#23746)
rebel-hongseok Aug 27, 2025
1bcb935
Fix pre-commit on main (#23747)
hmellor Aug 27, 2025
a3983d5
[Model] Interface to enable batch-level DP support (#23733)
DarkLight1337 Aug 27, 2025
8c1d986
Only run `get_attr_docs` if generating help text (#23723)
hmellor Aug 27, 2025
de1bb52
[Feature] Add Hopper DeepGEMM E8M0 for DeepSeekV3.1 scale_fmt (#23666)
yewentao256 Aug 27, 2025
79ed3fd
[Model] Enable native HF format InternVL support (#23742)
Isotr0py Aug 27, 2025
ed51563
[Doc]: upgrade version of crate-ci tool for improved typo detection (…
didier-durand Aug 27, 2025
fa8a2fa
[LogitsProcs] Deduplicate built-in LP implementation logic (#23362)
njhill Aug 27, 2025
7be5f5b
[Docs] Remove in-tree Gaudi install instructions (#23628)
hmellor Aug 27, 2025
afa3e06
[BugFix] Fix topk_softmax assert (#19764)
ProExpertProg Aug 27, 2025
8323233
[Model] Merge `SupportsMultiModalWithRawInput` with `SupportsMultiMod…
DarkLight1337 Aug 27, 2025
91f6600
[V1] [Hybrid] Enable compile and piecewise CUDA graph for MiniMax-Tex…
tdoublep Aug 27, 2025
fbe97db
[Docs] Fix warnings in `mkdocs build` (continued) (#23743)
Zerohertz Aug 27, 2025
9ca3b5f
ci: Add arm64 docker build to release pipeline (#23210)
seemethere Aug 27, 2025
9bc6170
Disable `torch.compile` for dynamic rope models in Transformers backe…
hmellor Aug 27, 2025
f2f783d
[Multimodal] Generate mm_hash based on request metadata when caching …
ywang96 Aug 27, 2025
e715ac4
[V1][Mamba] - Enable V1 by default for Mamba Models (#23650)
Josephasafg Aug 27, 2025
36ad5a4
DP/EP Support for gpt-oss with deepep-ht comm kernel on SM100 (#23608)
zyongye Aug 27, 2025
f0c4bb7
[Bugfix] Fix Marlin NVFP4 for modelopt (#23659)
mgoin Aug 27, 2025
873816b
[Feature] Add `VLLM_DISABLE_PAD_FOR_CUDAGRAPH` to Avoid Hang Issue (#…
yewentao256 Aug 27, 2025
4fcb57b
[Bugfix] Fix for V1 priority scheduling crashes at preemption (#23713)
Hanchenli Aug 28, 2025
f01a7ad
Migrate Qwen inputs to TensorSchema (#23473)
bbeckca Aug 28, 2025
5b9c41f
[Feature] models: pass layer prefix to replace_linear_class for per-l…
Shrey1306 Aug 28, 2025
df7d16e
[Perf] Tune configs for triton block fp8 gemm H100/H200 (#23748)
mgoin Aug 28, 2025
96a5321
Gracefully handle edge cases in harmony utils (#23155)
Ithanil Aug 28, 2025
b66e297
[CI] make all multi-gpu weight loading tests run nightly (#23792)
killershrimp Aug 28, 2025
5bbc0ae
Add deprecation warning for lora_extra_vocab_size (#23635)
ahengljh Aug 28, 2025
dc5cc66
[Transform] [Quantization] Add transforms to compressed tensors (#22486)
kylesayrs Aug 28, 2025
ba3a5f4
[CI] enable idefics3 and fuyu-8b test in multimodal test (#23790)
ZJY0516 Aug 28, 2025
571a5a3
[Bugfix] when set offline model running error (#23711)
lengrongfu Aug 28, 2025
eb7a14c
[Kernel] cuda kernels for upcoming decode context parallel feature (#…
youzhedian Aug 28, 2025
907bd2a
[New Model]: Support GteNewModelForSequenceClassification (#23524)
noooop Aug 28, 2025
6f857c2
[Model] Add PP support and VLM backbone compatability for GPT-OSS (#2…
Isotr0py Aug 28, 2025
728904b
[FIXBUG] Add return_success parameter to moe_wna16_weight_loader func…
JartX Aug 28, 2025
5f70938
[Doc]: fix typos in .md files (including those of #23751) (#23825)
didier-durand Aug 28, 2025
549f9c8
[CI/Build][Bugfix] Fix Qwen VL tests on CPU (#23818)
bigPYJ1151 Aug 28, 2025
77bbe2d
[BugFix][Spec Decode] Use float64 for uniform_probs (#23803)
WoosukKwon Aug 28, 2025
c322eb7
[Model] [gpt-oss] fix gpt-oss pp support (#23815)
ZJY0516 Aug 28, 2025
61ea8df
[Doc]: fix typos in Python scripts (#23828)
didier-durand Aug 28, 2025
bb3f0f7
[Bugfix] Fix benchmark_moe.py for blockwise fp8. (#23823)
crischeng Aug 28, 2025
dd59828
[CI] Fix linting error on main (#23835)
tdoublep Aug 28, 2025
bd818fa
[Model][gpt-oss] Support DP+EP for GPT-OSS with FlashInfer trtllm-gen…
nvpohanh Aug 28, 2025
8ce890f
[Bugfix] Add fake mode around passes (#23349)
angelayi Aug 28, 2025
3655d88
[ci] breaks down V1 Test into 3 groups of approx 30 minutes runtime (…
jeanschmidt Aug 28, 2025
1aab3a8
Add scale_config.yml file for Meta autoscalers for GH Actions (#23840)
jeanschmidt Aug 28, 2025
a5d43ce
Migrate Llama4ImagePatchInputs to TensorSchema (#22021)
bbeckca Aug 28, 2025
52f5a72
update bc linter
zhewenl Aug 28, 2025
23 changes: 21 additions & 2 deletions .buildkite/generate_index.py
@@ -8,7 +8,8 @@
<html>
<body>
<h1>Links for vLLM</h1/>
<a href="../{wheel_html_escaped}">{wheel}</a><br/>
<a href="../{x86_wheel_html_escaped}">{x86_wheel}</a><br/>
<a href="../{arm_wheel_html_escaped}">{arm_wheel}</a><br/>
</body>
</html>
"""
@@ -21,7 +22,25 @@

with open("index.html", "w") as f:
print(f"Generated index.html for {args.wheel}")
# sync the abi tag with .buildkite/scripts/upload-wheels.sh
if "x86_64" in filename:
x86_wheel = filename
arm_wheel = filename.replace("x86_64", "aarch64").replace(
"manylinux1", "manylinux2014"
)
elif "aarch64" in filename:
x86_wheel = filename.replace("aarch64", "x86_64").replace(
"manylinux2014", "manylinux1"
)
arm_wheel = filename
else:
raise ValueError(f"Unsupported wheel: {filename}")
# cloudfront requires escaping the '+' character
f.write(
template.format(wheel=filename, wheel_html_escaped=filename.replace("+", "%2B"))
template.format(
x86_wheel=x86_wheel,
x86_wheel_html_escaped=x86_wheel.replace("+", "%2B"),
arm_wheel=arm_wheel,
arm_wheel_html_escaped=arm_wheel.replace("+", "%2B"),
)
)
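
For readers who want the new logic outside the diff context, here is a minimal, self-contained sketch of what the updated script does (the helper name `build_wheel_links`, the inline HTML fragment, and the example wheel filename are illustrative; the actual script reads the wheel path via argparse and writes `index.html` from a module-level template):

```python
# Sketch of the wheel-link generation added above. Illustration only.
def build_wheel_links(filename: str) -> str:
    # Derive the sibling wheel name so the index lists both architectures.
    # Keep the abi tag in sync with .buildkite/scripts/upload-wheels.sh.
    if "x86_64" in filename:
        x86_wheel = filename
        arm_wheel = filename.replace("x86_64", "aarch64").replace(
            "manylinux1", "manylinux2014"
        )
    elif "aarch64" in filename:
        x86_wheel = filename.replace("aarch64", "x86_64").replace(
            "manylinux2014", "manylinux1"
        )
        arm_wheel = filename
    else:
        raise ValueError(f"Unsupported wheel: {filename}")

    # CloudFront requires escaping the '+' character in the href.
    def esc(name: str) -> str:
        return name.replace("+", "%2B")

    return (
        f'<a href="../{esc(x86_wheel)}">{x86_wheel}</a><br/>\n'
        f'<a href="../{esc(arm_wheel)}">{arm_wheel}</a><br/>\n'
    )


if __name__ == "__main__":
    # Example input; the real wheel name comes from the build pipeline.
    print(build_wheel_links("vllm-0.10.1+cu128-cp38-abi3-manylinux1_x86_64.whl"))
```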
12 changes: 0 additions & 12 deletions .buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-QQQ.yaml

This file was deleted.

1 change: 0 additions & 1 deletion .buildkite/lm-eval-harness/configs/models-large.txt
@@ -3,4 +3,3 @@ Meta-Llama-3-70B-Instruct.yaml
Mixtral-8x7B-Instruct-v0.1.yaml
Qwen2-57B-A14-Instruct.yaml
DeepSeek-V2-Lite-Chat.yaml
Meta-Llama-3-8B-QQQ.yaml
@@ -2,7 +2,7 @@
# We can use this script to compute baseline accuracy on GSM for transformers.
#
# Make sure you have lm-eval-harness installed:
# pip install lm-eval==0.4.4
# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d#egg=lm-eval[api]

usage() {
echo``
@@ -3,7 +3,7 @@
# We use this for fp8, which HF does not support.
#
# Make sure you have lm-eval-harness installed:
# pip install lm-eval==0.4.4
# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d#egg=lm-eval[api]

usage() {
echo``
54 changes: 25 additions & 29 deletions .buildkite/nightly-benchmarks/README.md
@@ -7,7 +7,7 @@ This directory contains two sets of benchmark for vllm.
- Performance benchmark: benchmark vllm's performance under various workload, for **developers** to gain clarity on whether their PR improves/degrades vllm's performance
- Nightly benchmark: compare vllm's performance against alternatives (tgi, trt-llm and lmdeploy), for **the public** to know when to choose vllm.

See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performance benchmark results and [vLLM GitHub README](https://github.com/vllm-project/vllm/blob/main/README.md) for latest nightly benchmark results.
See [vLLM performance dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm) for the latest performance benchmark results and [vLLM GitHub README](https://github.com/vllm-project/vllm/blob/main/README.md) for latest nightly benchmark results.

## Performance benchmark quick overview

@@ -28,6 +28,7 @@ See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performanc
## Trigger the benchmark

Performance benchmark will be triggered when:

- A PR being merged into vllm.
- Every commit for those PRs with `perf-benchmarks` label AND `ready` label.

@@ -38,6 +39,7 @@ bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
```

Runtime environment variables:

- `ON_CPU`: set the value to '1' on Intel® Xeon® Processors. Default value is 0.
- `SERVING_JSON`: JSON file to use for the serving tests. Default value is empty string (use default file).
- `LATENCY_JSON`: JSON file to use for the latency tests. Default value is empty string (use default file).
@@ -46,12 +48,14 @@ Runtime environment variables:
- `REMOTE_PORT`: Port for the remote vLLM service to benchmark. Default value is empty string.

Nightly benchmark will be triggered when:

- Every commit for those PRs with `perf-benchmarks` label and `nightly-benchmarks` label.

## Performance benchmark details

See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.
> NOTE: For Intel® Xeon® Processors, use `tests/latency-tests-cpu.json`, `tests/throughput-tests-cpu.json`, `tests/serving-tests-cpu.json` instead.
>
### Latency test

Here is an example of one test inside `latency-tests.json`:
@@ -74,21 +78,21 @@
In this example:

- The `test_name` attributes is a unique identifier for the test. In `latency-tests.json`, it must start with `latency_`.
- The `parameters` attribute control the command line arguments to be used for `benchmark_latency.py`. Note that please use underline `_` instead of the dash `-` when specifying the command line arguments, and `run-performance-benchmarks.sh` will convert the underline to dash when feeding the arguments to `benchmark_latency.py`. For example, the corresponding command line arguments for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`
- The `parameters` attribute control the command line arguments to be used for `vllm bench latency`. Note that please use underline `_` instead of the dash `-` when specifying the command line arguments, and `run-performance-benchmarks.sh` will convert the underline to dash when feeding the arguments to `vllm bench latency`. For example, the corresponding command line arguments for `vllm bench latency` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`

Note that the performance numbers are highly sensitive to the value of the parameters. Please make sure the parameters are set correctly.

WARNING: The benchmarking script will save json results by itself, so please do not configure `--output-json` parameter in the json file.
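
To make the underline-to-dash rule above concrete, here is a rough Python sketch of the conversion (the actual conversion happens in bash inside `run-performance-benchmarks.sh`; the function name `to_cli_args` is made up for illustration):

```python
# Sketch: turning a "parameters" object from latency-tests.json into CLI flags
# for `vllm bench latency`. Illustration only; the real logic lives in
# .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh.
def to_cli_args(parameters: dict) -> list[str]:
    args = []
    for key, value in parameters.items():
        flag = "--" + key.replace("_", "-")  # underline -> dash
        if value == "":                      # flag-only options, e.g. disable_log_stats
            args.append(flag)
        else:
            args.extend([flag, str(value)])
    return args


print(" ".join(to_cli_args({
    "model": "meta-llama/Meta-Llama-3-8B",
    "tensor_parallel_size": 1,
    "load_format": "dummy",
    "num_iters_warmup": 5,
    "num_iters": 15,
})))
# --model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy
# --num-iters-warmup 5 --num-iters 15
```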

### Throughput test

The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except for that the parameters will be fed forward to `benchmark_throughput.py`.
The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except for that the parameters will be fed forward to `vllm bench throughput`.

The number of this test is also stable -- a slight change on the value of this number might vary the performance numbers by a lot.

### Serving test

We test the throughput by using `benchmark_serving.py` with request rate = inf to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:
We test the throughput by using `vllm bench serve` with request rate = inf to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:

```json
[
@@ -100,7 +104,6 @@ We test the throughput by using `benchmark_serving.py` with request rate = inf t
"tensor_parallel_size": 1,
"swap_space": 16,
"disable_log_stats": "",
"disable_log_requests": "",
"load_format": "dummy"
},
"client_parameters": {
@@ -118,8 +121,8 @@ Inside this example:

- The `test_name` attribute is also a unique identifier for the test. It must start with `serving_`.
- The `server-parameters` includes the command line arguments for vLLM server.
- The `client-parameters` includes the command line arguments for `benchmark_serving.py`.
- The `qps_list` controls the list of qps for test. It will be used to configure the `--request-rate` parameter in `benchmark_serving.py`
- The `client-parameters` includes the command line arguments for `vllm bench serve`.
- The `qps_list` controls the list of qps for test. It will be used to configure the `--request-rate` parameter in `vllm bench serve`

The number of this test is less stable compared to the delay and latency benchmarks (due to randomized sharegpt dataset sampling inside `benchmark_serving.py`), but a large change on this number (e.g. 5% change) still vary the output greatly.

@@ -135,27 +138,20 @@ The raw benchmarking results (in the format of json files) are in the `Artifacts

The `compare-json-results.py` helps to compare benchmark results JSON files converted using `convert-results-json-to-markdown.py`.
When run, benchmark script generates results under `benchmark/results` folder, along with the `benchmark_results.md` and `benchmark_results.json`.
`compare-json-results.py` compares two `benchmark_results.json` files and provides performance ratio e.g. for Output Tput, Median TTFT and Median TPOT.
`compare-json-results.py` compares two `benchmark_results.json` files and provides performance ratio e.g. for Output Tput, Median TTFT and Median TPOT.
If only one benchmark_results.json is passed, `compare-json-results.py` compares different TP and PP configurations in the benchmark_results.json instead.

Here is an example using the script to compare result_a and result_b without detail test name.
`python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json --ignore_test_name`
Here is an example using the script to compare result_a and result_b with Model, Dataset name, input/output length, max concurrency and qps.
`python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json`

| | results_a/benchmark_results.json | results_b/benchmark_results.json | perf_ratio |
|----|----------------------------------------|----------------------------------------|----------|
| 0 | 142.633982 | 156.526018 | 1.097396 |
| 1 | 241.620334 | 294.018783 | 1.216863 |
| 2 | 218.298905 | 262.664916 | 1.203235 |
| 3 | 242.743860 | 299.816190 | 1.235113 |
| | Model | Dataset Name | Input Len | Output Len | # of max concurrency | qps | results_a/benchmark_results.json | results_b/benchmark_results.json | perf_ratio |
|----|---------------------------------------|--------|-----|-----|------|-----|-----------|----------|----------|
| 0 | meta-llama/Meta-Llama-3.1-8B-Instruct | random | 128 | 128 | 1000 | 1 | 142.633982 | 156.526018 | 1.097396 |
| 1 | meta-llama/Meta-Llama-3.1-8B-Instruct | random | 128 | 128 | 1000 | inf| 241.620334 | 294.018783 | 1.216863 |

Here is an example using the script to compare result_a and result_b with detail test name.
`python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json`
| | results_a/benchmark_results.json_name | results_a/benchmark_results.json | results_b/benchmark_results.json_name | results_b/benchmark_results.json | perf_ratio |
|---|---------------------------------------------|----------------------------------------|---------------------------------------------|----------------------------------------|----------|
| 0 | serving_llama8B_tp1_sharegpt_qps_1 | 142.633982 | serving_llama8B_tp1_sharegpt_qps_1 | 156.526018 | 1.097396 |
| 1 | serving_llama8B_tp1_sharegpt_qps_16 | 241.620334 | serving_llama8B_tp1_sharegpt_qps_16 | 294.018783 | 1.216863 |
| 2 | serving_llama8B_tp1_sharegpt_qps_4 | 218.298905 | serving_llama8B_tp1_sharegpt_qps_4 | 262.664916 | 1.203235 |
| 3 | serving_llama8B_tp1_sharegpt_qps_inf | 242.743860 | serving_llama8B_tp1_sharegpt_qps_inf | 299.816190 | 1.235113 |
| 4 | serving_llama8B_tp2_random_1024_128_qps_1 | 96.613390 | serving_llama8B_tp4_random_1024_128_qps_1 | 108.404853 | 1.122048 |
A comparison diagram will be generated below the table.
Here is an example to compare between 96c/results_gnr_96c_091_tp2pp3 and 128c/results_gnr_128c_091_tp2pp3
<img width="1886" height="828" alt="image" src="https://github.com/user-attachments/assets/c02a43ef-25d0-4fd6-90e5-2169a28682dd" />
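
For orientation, a rough sketch of the kind of per-test ratio computation `compare-json-results.py` performs (the field names `test_name` and `output_throughput` are assumed here for illustration; check the script for the actual schema and its pandas-based table output):

```python
# Sketch: compare two benchmark_results.json files and print a perf ratio per test.
# Field names are assumptions; see compare-json-results.py for the real schema.
import json

def perf_ratios(path_a: str, path_b: str, metric: str = "output_throughput") -> None:
    with open(path_a) as fa, open(path_b) as fb:
        a = {r["test_name"]: r for r in json.load(fa)}
        b = {r["test_name"]: r for r in json.load(fb)}
    for name in sorted(a.keys() & b.keys()):
        va, vb = float(a[name][metric]), float(b[name][metric])
        print(f"{name}: {va:.2f} -> {vb:.2f} (ratio {vb / va:.4f})")

# Example (paths as in the tables above):
# perf_ratios("results_a/benchmark_results.json", "results_b/benchmark_results.json")
```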

## Nightly test details

@@ -164,9 +160,9 @@ See [nightly-descriptions.md](nightly-descriptions.md) for the detailed descript
### Workflow

- The [nightly-pipeline.yaml](nightly-pipeline.yaml) specifies the docker containers for different LLM serving engines.
- Inside each container, we run [run-nightly-suite.sh](run-nightly-suite.sh), which will probe the serving engine of the current container.
- The `run-nightly-suite.sh` will redirect the request to `tests/run-[llm serving engine name]-nightly.sh`, which parses the workload described in [nightly-tests.json](tests/nightly-tests.json) and performs the benchmark.
- At last, we run [scripts/plot-nightly-results.py](scripts/plot-nightly-results.py) to collect and plot the final benchmarking results, and update the results to buildkite.
- Inside each container, we run [scripts/run-nightly-benchmarks.sh](scripts/run-nightly-benchmarks.sh), which will probe the serving engine of the current container.
- The `scripts/run-nightly-benchmarks.sh` will parse the workload described in [nightly-tests.json](tests/nightly-tests.json) and launch the right benchmark for the specified serving engine via `scripts/launch-server.sh`.
- At last, we run [scripts/summary-nightly-results.py](scripts/summary-nightly-results.py) to collect and plot the final benchmarking results, and update the results to buildkite.

### Nightly tests

@@ -176,6 +172,6 @@ In [nightly-tests.json](tests/nightly-tests.json), we include the command line a

The docker containers for benchmarking are specified in `nightly-pipeline.yaml`.

WARNING: the docker versions are HARD-CODED and SHOULD BE ALIGNED WITH `nightly-descriptions.md`. The docker versions need to be hard-coded as there are several version-specific bug fixes inside `tests/run-[llm serving engine name]-nightly.sh`.
WARNING: the docker versions are HARD-CODED and SHOULD BE ALIGNED WITH `nightly-descriptions.md`. The docker versions need to be hard-coded as there are several version-specific bug fixes inside `scripts/run-nightly-benchmarks.sh` and `scripts/launch-server.sh`.

WARNING: populating `trt-llm` to latest version is not easy, as it requires updating several protobuf files in [tensorrt-demo](https://github.com/neuralmagic/tensorrt-demo.git).
21 changes: 11 additions & 10 deletions .buildkite/nightly-benchmarks/nightly-annotation.md
@@ -1,3 +1,4 @@
# Nightly benchmark annotation

## Description

@@ -13,15 +14,15 @@ Please download the visualization scripts in the post

- Find the docker we use in `benchmarking pipeline`
- Deploy the docker, and inside the docker:
- Download `nightly-benchmarks.zip`.
- In the same folder, run the following code:

```bash
export HF_TOKEN=<your HF token>
apt update
apt install -y git
unzip nightly-benchmarks.zip
VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
```
- Download `nightly-benchmarks.zip`.
- In the same folder, run the following code:

```bash
export HF_TOKEN=<your HF token>
apt update
apt install -y git
unzip nightly-benchmarks.zip
VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
```

And the results will be inside `./benchmarks/results`.
34 changes: 17 additions & 17 deletions .buildkite/nightly-benchmarks/nightly-descriptions.md
@@ -13,25 +13,25 @@ Latest reproduction guilde: [github issue link](https://github.com/vllm-project/
## Setup

- Docker images:
- vLLM: `vllm/vllm-openai:v0.6.2`
- SGLang: `lmsysorg/sglang:v0.3.2-cu121`
- LMDeploy: `openmmlab/lmdeploy:v0.6.1-cu12`
- TensorRT-LLM: `nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3`
- *NOTE: we uses r24.07 as the current implementation only works for this version. We are going to bump this up.*
- Check [nightly-pipeline.yaml](nightly-pipeline.yaml) for the concrete docker images, specs and commands we use for the benchmark.
- vLLM: `vllm/vllm-openai:v0.6.2`
- SGLang: `lmsysorg/sglang:v0.3.2-cu121`
- LMDeploy: `openmmlab/lmdeploy:v0.6.1-cu12`
- TensorRT-LLM: `nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3`
- *NOTE: we use r24.07 as the current implementation only works for this version. We are going to bump this up.*
- Check [nightly-pipeline.yaml](nightly-pipeline.yaml) for the concrete docker images, specs and commands we use for the benchmark.
- Hardware
- 8x Nvidia A100 GPUs
- 8x Nvidia A100 GPUs
- Workload:
- Dataset
- ShareGPT dataset
- Prefill-heavy dataset (in average 462 input tokens, 16 tokens as output)
- Decode-heavy dataset (in average 462 input tokens, 256 output tokens)
- Check [nightly-tests.json](tests/nightly-tests.json) for the concrete configuration of datasets we use.
- Models: llama-3 8B, llama-3 70B.
- We do not use llama 3.1 as it is incompatible with trt-llm r24.07. ([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105)).
- Average QPS (query per second): 2, 4, 8, 16, 32 and inf.
- Queries are randomly sampled, and arrival patterns are determined via Poisson process, but all with fixed random seed.
- Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better).
- Dataset
- ShareGPT dataset
- Prefill-heavy dataset (in average 462 input tokens, 16 tokens as output)
- Decode-heavy dataset (in average 462 input tokens, 256 output tokens)
- Check [nightly-tests.json](tests/nightly-tests.json) for the concrete configuration of datasets we use.
- Models: llama-3 8B, llama-3 70B.
- We do not use llama 3.1 as it is incompatible with trt-llm r24.07. ([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105)).
- Average QPS (query per second): 2, 4, 8, 16, 32 and inf.
- Queries are randomly sampled, and arrival patterns are determined via Poisson process, but all with fixed random seed.
- Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better).
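
As a side note on the arrival pattern mentioned above, Poisson arrivals with a fixed seed can be sketched as follows (illustration only, not the benchmark client's actual code; the function name `poisson_arrivals` is made up):

```python
# Sketch: request arrival times drawn from a Poisson process at a given QPS,
# reproducible via a fixed seed. Illustration only.
import random

def poisson_arrivals(qps: float, num_requests: int, seed: int = 0) -> list[float]:
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(num_requests):
        t += rng.expovariate(qps)  # exponential inter-arrival gaps
        times.append(t)
    return times

print(poisson_arrivals(qps=2.0, num_requests=5))
```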

## Known issues

@@ -1,3 +1,4 @@
# Performance benchmarks descriptions

## Latency tests
