
Commit 1c2bf70

Authored by iofu728, liyucheng09, and Starmys
Hotfix(MInference): fix the configs in pip (microsoft#14)
Co-authored-by: Yucheng Li <[email protected]>
Co-authored-by: Chengruidong Zhang <[email protected]>
1 parent 00666fb commit 1c2bf70

File tree

3 files changed (+9, -2 lines)


MANIFEST.in

Lines changed: 2 additions & 0 deletions
@@ -1,2 +1,4 @@
 recursive-include csrc *.cu
 recursive-include csrc *.cpp
+
+recursive-include minference *.json
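The added `recursive-include minference *.json` rule is what actually ships the package's JSON config files with a pip source distribution; without it, an sdist built from the repo omits them and any attempt to open a config at runtime fails. A minimal sketch of reading such a bundled file, assuming a hypothetical `configs/` directory inside the `minference` package (the directory and file names are illustrative, not the repo's real layout):

```python
import json
import os


def load_bundled_config(name: str) -> dict:
    """Load a JSON config that was packaged next to the installed module.

    The configs/ directory and file name are hypothetical; the point is that
    open() only finds the file if the .json files were included in the
    distribution, which the new MANIFEST.in rule ensures for sdists.
    """
    path = os.path.join(os.path.dirname(__file__), "configs", f"{name}.json")
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```

Note that MANIFEST.in only governs source distributions; for wheels, setuptools also needs `include_package_data=True` (or an explicit `package_data` entry) to copy the same files into the built package.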

README.md

Lines changed: 1 addition & 0 deletions
@@ -80,6 +80,7 @@ pipe(prompt, max_length=10)
 ```
 
 for vLLM,
+> For now, please use vllm==0.4.1
 
 ```diff
 from vllm import LLM, SamplingParams
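The new README note pins vLLM to 0.4.1 because the integration imports from `vllm.attention.backends.flash_attn`, whose layout changed in later releases. As a rough sketch (not part of the repo), a caller could verify the installed version up front instead of waiting for the import to fail; `importlib.metadata` is standard library, and the warning text here is illustrative:

```python
import warnings
from importlib.metadata import PackageNotFoundError, version


def vllm_version_matches(expected: str = "0.4.1") -> bool:
    """Return True if the installed vLLM is the version the patch targets."""
    try:
        installed = version("vllm")
    except PackageNotFoundError:
        warnings.warn("vLLM is not installed; the vLLM integration will be skipped.")
        return False
    if installed != expected:
        warnings.warn(
            f"vLLM {installed} detected, but this integration targets vllm=={expected}."
        )
        return False
    return True
```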

minference/modules/minference_forward.py

Lines changed: 6 additions & 2 deletions
@@ -4,13 +4,17 @@
 import inspect
 import json
 import os
+import warnings
 from importlib import import_module
 
 from transformers.models.llama.modeling_llama import *
 from transformers.utils.import_utils import _is_package_available
 
 if _is_package_available("vllm"):
-    from vllm.attention.backends.flash_attn import *
+    try:
+        from vllm.attention.backends.flash_attn import *
+    except:
+        warnings.warn("Only support 'vllm==0.4.1'. Please update your vllm version.")
 
 from ..ops.block_sparse_flash_attention import block_sparse_attention
 from ..ops.pit_sparse_flash_attention_v2 import vertical_slash_sparse_attention
@@ -768,7 +772,7 @@ def forward(
     key: torch.Tensor,
     value: torch.Tensor,
     kv_cache: torch.Tensor,
-    attn_metadata: AttentionMetadata[FlashAttentionMetadata],
+    attn_metadata,
     kv_scale: float,
     layer_idx: int,
 ) -> torch.Tensor:
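The guarded import keeps `import minference` from failing outright when a newer vLLM no longer exposes `vllm.attention.backends.flash_attn`, and the second hunk drops the `AttentionMetadata[FlashAttentionMetadata]` annotation because those names are only defined when that import succeeds. A slightly tighter variant of the same guard (illustrative only, not the committed code) catches `ImportError` specifically so that unrelated errors still propagate:

```python
import warnings

from transformers.utils.import_utils import _is_package_available

if _is_package_available("vllm"):
    try:
        # Re-export the flash-attention backend symbols that the vLLM patch
        # relies on (vllm==0.4.1 layout).
        from vllm.attention.backends.flash_attn import *  # noqa: F401,F403
    except ImportError:
        # Swallow only import failures; anything else should surface.
        warnings.warn("Only support 'vllm==0.4.1'. Please update your vllm version.")
```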
