
Commit 010b708

Feature(MInference): update experiments details
1 parent abc5aba commit 010b708

4 files changed: +32 -7 lines changed


experiments/README.md

Lines changed: 20 additions & 3 deletions
````diff
@@ -1,3 +1,15 @@
+# Experimemts
+
+- [Offline Kernel-Aware Sparse Pattern Search](#Offline-Kernel-Aware-Sparse-Pattern-Search)
+- [MInference Benchmark Experiments](#MInference-Benchmark-Experiments)
+  - [End-to-End Benchmark](#End-to-End-Benchmark)
+  - [Micro-Benchmark](#Micro-Benchmark)
+- [MInference Downstream Tasks Experiments](#MInference-Downstream-Tasks-Experiments)
+  - [InfiniteBench](#InfiniteBench)
+  - [RULER](#RULER)
+  - [PPL](#PPL)
+  - [Needle in A Haystack](#Needle-in-A-Haystack)
+
 ## Offline Kernel-Aware Sparse Pattern Search

 You can use the following scripts to search for the optimal head sparse pattern:
@@ -19,7 +31,8 @@ python run_infinitebench.py \

 ## MInference Benchmark Experiments

-Note: All experiments were run on a single A100 GPU with 80GB of VRAM.
+> [!NOTE]
+> All experiments were run on a single A100 GPU with 80GB of VRAM.

 Environment parameters:
 - CUDA 12.3
@@ -62,12 +75,16 @@ python experiments/benchmarks/benchmark_e2e.py --run_benchmark
 1000K 1765.56387 107.85639 328.58551 179.12031
 ```

+> [!TIP]
+> Based on our tests, **a single A100 can support up to 1.8M** context prompts during the pre-filling stage using LLaMA-3-8B-4M with **bf16**.
+
 ### Micro-Benchmark


 ## MInference Downstream Tasks Experiments

-Note: All of these experiments were run on one A100 GPUs with 80GB of VRAM. You may need to modify commands to fit your own computing environment (e.g., changing the batch size, the max memory per GPU, the number of GPUs, etc)
+> [!NOTE]
+> All of these experiments were run on one A100 GPUs with 80GB of VRAM. You may need to modify commands to fit your own computing environment (e.g., changing the batch size, the max memory per GPU, the number of GPUs, etc)


 ### InfiniteBench

@@ -78,7 +95,7 @@ InfiniteBench consists of the following tasks: `kv_retrieval`, `longbook_choice_
 1. Run InfiniteBench with `MInference`:

 ```bash
-bash experiments/infinite_bench/run_infinitebench.sh gradientai/Llama-3-8B-Instruct-262k 128000 -1 minference
+bash experiments/infinite_bench/run_infinitebench.sh gradientai/Llama-3-8B-Instruct-262k 160000 -1 minference
 ```

 2. Experimental results
````
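
For context on the end-to-end latency numbers quoted in this hunk, here is a minimal, hedged sketch of how a pre-filling (time-to-first-token) latency measurement is typically taken; `benchmark_e2e.py`'s actual implementation is not part of this commit, and the warm-up and run counts below are illustrative assumptions.

```python
# Hedged sketch only: benchmark_e2e.py's real logic is not shown in this diff.
import time

import torch


@torch.no_grad()
def prefill_latency_ms(model, input_ids, warmup: int = 1, runs: int = 3) -> float:
    """Average pre-filling (prompt-processing) latency in milliseconds."""
    for _ in range(warmup):  # warm-up runs to trigger kernel compilation/caching
        model(input_ids)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        model(input_ids)
    torch.cuda.synchronize()
    return (time.time() - start) / runs * 1000
```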

experiments/infinite_bench/args.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -70,6 +70,7 @@ def parse_args() -> Namespace:
             "inf_llm",
             "flash_attn",
             "minference",
+            "minference_with_dense",
             "dilated1",
             "dilated2",
         ],
```
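
The hunk only shows the tail of the choices list, so the flag it belongs to is not visible here. Below is a hedged reconstruction of how such an argparse option is commonly declared; the `--attn_type` name and the default value are assumptions, not taken from this commit.

```python
# Hedged reconstruction: flag name and default are assumptions.
from argparse import ArgumentParser, Namespace


def parse_args() -> Namespace:
    parser = ArgumentParser()
    parser.add_argument(
        "--attn_type",
        type=str,
        choices=[
            "inf_llm",
            "flash_attn",
            "minference",
            "minference_with_dense",  # dense-fallback baseline added in this commit
            "dilated1",
            "dilated2",
        ],
        default="minference",
    )
    return parser.parse_args()
```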

minference/minference_configuration.py

Lines changed: 7 additions & 4 deletions
```diff
@@ -7,15 +7,18 @@


 class MInferenceConfig:
-    ATTENTION_TYPES = [
+    MINFERENCE_ATTENTION_TYPES = [
         "minference",
+        "vllm",
+    ]
+    STASTIC_ATTENTION_TYPES = [
         "minference_with_dense",
         "static",
         "dilated1",
         "dilated2",
         "streaming",
         "inf_llm",
-        "vllm",
+        "hf",
     ]

     def __init__(
@@ -33,7 +36,7 @@ def __init__(
     ):
         super(MInferenceConfig, self).__init__()
         assert (
-            attn_type in self.ATTENTION_TYPES
+            attn_type in self.MINFERENCE_ATTENTION_TYPES + self.STASTIC_ATTENTION_TYPES
         ), f"The attention_type {attn_type} you specified is not supported."
         self.attn_type = attn_type
         self.config_path = self.update_config_path(config_path, model_name)
@@ -46,7 +49,7 @@ def __init__(
         self.attn_kwargs = attn_kwargs

     def update_config_path(self, config_path: str, model_name: str):
-        if config_path is not None:
+        if config_path is not None or self.attn_type in self.STASTIC_ATTENTION_TYPES:
             return config_path
         assert (
             model_name in MODEL2PATH
```
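
Taken together, these changes mean that attention types in `STASTIC_ATTENTION_TYPES` (the dense and static baselines) no longer need a sparse-pattern config file: `update_config_path` now returns the given `config_path` (possibly `None`) for them instead of asserting that the model is listed in `MODEL2PATH`. A hedged usage sketch follows; the constructor keyword names beyond `attn_type`, `config_path`, and `model_name` are assumed from the lines visible above, and the model name is only an example.

```python
# Hedged sketch based only on the lines visible in this diff.
from minference.minference_configuration import MInferenceConfig

# "minference" is in MINFERENCE_ATTENTION_TYPES: with config_path=None the
# sparse-pattern config is resolved via MODEL2PATH, so the model must be listed there.
sparse_cfg = MInferenceConfig(
    attn_type="minference",
    model_name="gradientai/Llama-3-8B-Instruct-262k",  # example model name
)

# "hf" (like "minference_with_dense", "static", ...) is in STASTIC_ATTENTION_TYPES:
# update_config_path now short-circuits, so no pattern config is required.
dense_cfg = MInferenceConfig(
    attn_type="hf",
    model_name="any-model",  # not consulted for static/dense baselines
)
```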

minference/models_patch.py

Lines changed: 4 additions & 0 deletions
```diff
@@ -6,6 +6,8 @@
 from .minference_configuration import MInferenceConfig
 from .patch import minference_patch, minference_patch_vllm, patch_hf

+os.environ["TOKENIZERS_PARALLELISM"] = "false"
+

 class MInference:
     def __init__(
@@ -76,6 +78,8 @@ def patch_model(self, model):
                 attn_type="streaming",
                 attn_kwargs={"n_local": 3968, "n_init": 128, **self.config.attn_kwargs},
             )
+        elif self.config.attn_type == "hf":
+            pass
+        elif self.config.attn_type == "inf_llm":
             model = patch_hf(
                 model,
```
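
The new `"hf"` branch is deliberately a no-op, so that attention type yields an unpatched, dense HuggingFace baseline while sharing the same entry point as the sparse variants. A hedged end-to-end sketch is below; only the `patch_model` name is visible in this hunk, so the keyword arguments and the assumption that it returns the (possibly unmodified) model are illustrative, not confirmed by this commit.

```python
# Hedged sketch: patch_model's signature and return value are assumed here.
import torch
from transformers import AutoModelForCausalLM

from minference.models_patch import MInference

model_name = "gradientai/Llama-3-8B-Instruct-262k"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# attn_type="hf" hits the new `pass` branch: the model is left untouched and
# serves as the dense baseline; "minference" would rewrite its attention instead.
baseline = MInference(attn_type="hf", model_name=model_name).patch_model(model)
```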
