
Commit 3eb4d4d

Update llama4 documentation to use the right settings for long context (#48)
1 parent de34b1c commit 3eb4d4d

_posts/2025-04-05-llama4.md

Lines changed: 11 additions & 8 deletions
@@ -25,35 +25,39 @@ On 8x H100 GPUs:

* Scout (up to 1M context):

```
- vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
+ VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
--tensor-parallel-size 8 \
- --max-model-len 1000000
+ --max-model-len 1000000 --override-generation-config='{"attn_temperature_tuning": true}'
```

* Maverick (up to \~430K context):

```
- vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
+ VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
--tensor-parallel-size 8 \
- --max-model-len 430000
+ --max-model-len 430000 --override-generation-config='{"attn_temperature_tuning": true}'
```

On 8x H200 GPUs:

* Scout (up to 3.6M context):

```
- vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
+ VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
--tensor-parallel-size 8 \
- --max-model-len 3600000
+ --max-model-len 3600000 --override-generation-config='{"attn_temperature_tuning": true}'
```

* Maverick (up to 1M context):

```
- vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
+ VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
--tensor-parallel-size 8
+ --max-model-len 1000000 --override-generation-config='{"attn_temperature_tuning": true}'
```
+
+ Note: we highly recommend turning on `attn_temperature_tuning` to improve accuracy for contexts longer than 32K tokens; `VLLM_DISABLE_COMPILE_CACHE=1` is required.
+
**Multimodality:**

The Llama 4 models excel at image understanding with up to 8-10 images. By default, the vLLM server accepts 1 image per request. Please pass `--limit-mm-per-prompt image=10` to serve up to 10 images per request with the OpenAI-compatible API. We also recommend checking out our multi-image offline inference example with Llama-4 [here](https://github.com/vllm-project/vllm/blob/v0.8.3/examples/offline_inference/vision_language_multi_image.py).
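For a quick illustration, here is a minimal sketch of a two-image request against the OpenAI-compatible endpoint; it assumes a server launched with `--limit-mm-per-prompt image=10` at the default local address, and the image URLs are placeholders.

```
# Sketch: assumes a vLLM server on localhost:8000 started with --limit-mm-per-prompt image=10;
# the image URLs below are placeholders.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What do these two images have in common?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/first.jpg"}},
        {"type": "image_url", "image_url": {"url": "https://example.com/second.jpg"}}
      ]
    }]
  }'
```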
@@ -70,7 +74,6 @@ While more performance enhancements are on the way, we believe the Llama 4 model

* **Boost Performance & Context Length:** Set `--kv-cache-dtype fp8` to potentially double the usable context window and gain a performance boost. We observe little to no accuracy drop in relevant evaluations with this setting.
* **Maximize Context Window (up to 10M):** To fully utilize the maximum context windows (up to 10M for Scout), we recommend serving across multiple nodes using tensor parallelism or pipeline parallelism. Follow our distributed inference guide [here](https://docs.vllm.ai/en/latest/serving/distributed_serving.html).
- * **Improve Long Context Accuracy (\>32K):** We highly recommend adding `--override-generation-config='{"attn_temperature_tuning": true}'` to improve accuracy for contexts longer than 32K tokens.

**Other Hardware Support & Quantizations:**

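Putting the pieces together, here is a sketch of a single-node launch that combines this commit's long-context settings with the `--kv-cache-dtype fp8` tip above; the chosen model and `--max-model-len` are illustrative and should be tuned to your hardware.

```
# Sketch only: combines the long-context settings from this commit with an fp8 KV cache.
# Tune --max-model-len to what actually fits on your GPUs; for multi-node serving
# (e.g. toward 10M context on Scout), see the distributed inference guide linked above.
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
  --tensor-parallel-size 8 \
  --max-model-len 1000000 \
  --kv-cache-dtype fp8 \
  --override-generation-config='{"attn_temperature_tuning": true}'
```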