Support vllm quantization #4003

Merged
merged 3 commits into from
Apr 26, 2025
1 change: 1 addition & 0 deletions docs/source/Instruction/命令行参数.md
@@ -306,6 +306,7 @@ Vera uses the three parameters `target_modules`, `target_regex`, and `modules_to_save`.
- enforce_eager: Whether vllm uses PyTorch eager mode or builds a CUDA graph. Defaults to `False`. Setting it to True saves GPU memory but affects efficiency.
- 🔥limit_mm_per_prompt: Controls multi-image usage in vllm. Defaults to `None`. For example, pass `--limit_mm_per_prompt '{"image": 5, "video": 2}'`.
- vllm_max_lora_rank: Defaults to `16`. This is the parameter vllm supports for LoRA.
- vllm_quantization: vllm can quantize the model internally; the supported values for this argument are listed [here](https://docs.vllm.ai/en/latest/serving/engine_args.html).
- enable_prefix_caching: Enables vllm's automatic prefix caching to save processing time on repeated query prefixes. Defaults to `False`.


1 change: 1 addition & 0 deletions docs/source_en/Instruction/Command-line-parameters.md
@@ -316,6 +316,7 @@ Parameter meanings can be found in the [vllm documentation](https://docs.vllm.ai
- enforce_eager: Determines whether vllm uses PyTorch eager mode or constructs a CUDA graph, default is `False`. Setting it to True can save memory but may affect efficiency.
- 🔥limit_mm_per_prompt: Controls the use of multiple media in vllm, default is `None`. For example, you can pass in `--limit_mm_per_prompt '{"image": 5, "video": 2}'`.
- vllm_max_lora_rank: Default is `16`. This is the parameter supported by vllm for lora.
- vllm_quantization: vllm can quantize the model internally with this argument; the supported values can be found [here](https://docs.vllm.ai/en/latest/serving/engine_args.html).
- enable_prefix_caching: Enable the automatic prefix caching of vllm to save processing time for querying repeated prefixes. The default is `False`.

### Merge Arguments
2 changes: 2 additions & 0 deletions swift/llm/argument/infer_args.py
@@ -75,6 +75,7 @@ class VllmArguments:
limit_mm_per_prompt: Optional[Union[dict, str]] = None # '{"image": 5, "video": 2}'
vllm_max_lora_rank: int = 16
enable_prefix_caching: bool = False
vllm_quantization: Optional[str] = None  # forwarded to vllm as the 'quantization' engine argument

def __post_init__(self):
self.limit_mm_per_prompt = ModelArguments.parse_to_dict(self.limit_mm_per_prompt)
@@ -96,6 +97,7 @@ def get_vllm_engine_kwargs(self):
'enable_lora': len(adapters) > 0,
'max_loras': max(len(adapters), 1),
'enable_prefix_caching': self.enable_prefix_caching,
'quantization': self.vllm_quantization,
}
if dist.is_initialized():
kwargs.update({'device': dist.get_rank()})
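For context, here is a minimal standalone sketch of the pass-through added above. `VllmArgumentsSketch` is a hypothetical stand-in for `VllmArguments` (not the real class), and `'awq'` is only a placeholder value; the point is that `vllm_quantization` is copied verbatim into the engine kwargs under vllm's own key name, `quantization`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VllmArgumentsSketch:
    # Simplified stand-in for VllmArguments, mirroring the fields in the hunk above.
    vllm_max_lora_rank: int = 16
    enable_prefix_caching: bool = False
    vllm_quantization: Optional[str] = None  # e.g. 'awq' (placeholder)

    def get_vllm_engine_kwargs(self) -> dict:
        # The new field is forwarded unchanged under the key vllm expects.
        return {
            'max_lora_rank': self.vllm_max_lora_rank,
            'enable_prefix_caching': self.enable_prefix_caching,
            'quantization': self.vllm_quantization,
        }


args = VllmArgumentsSketch(vllm_quantization='awq')
print(args.get_vllm_engine_kwargs())
# {'max_lora_rank': 16, 'enable_prefix_caching': False, 'quantization': 'awq'}
```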
4 changes: 4 additions & 0 deletions swift/llm/infer/infer_engine/vllm_engine.py
@@ -64,6 +64,7 @@ def __init__(
num_infer_workers: int = 1,
enable_sleep_mode: bool = False,
distributed_executor_backend: Optional[str] = None,
quantization: Optional[str] = None,
engine_kwargs: Optional[Dict[str, Any]] = None,
) -> None:
self.use_async_engine = use_async_engine
@@ -94,6 +95,7 @@ def __init__(
device=device,
distributed_executor_backend=distributed_executor_backend,
enable_sleep_mode=enable_sleep_mode,
quantization=quantization,
engine_kwargs=engine_kwargs,
)
nnodes = get_node_setting()[1]
@@ -130,6 +132,7 @@ def _prepare_engine_kwargs(
enable_prefix_caching: bool = False,
distributed_executor_backend: Optional[str] = None,
enable_sleep_mode: bool = False,
quantization: Optional[str] = None,
engine_kwargs: Optional[Dict[str, Any]] = None,
) -> None:
if engine_kwargs is None:
@@ -156,6 +159,7 @@ def _prepare_engine_kwargs(
if 'enable_sleep_mode' in parameters:
engine_kwargs['enable_sleep_mode'] = enable_sleep_mode

engine_kwargs['quantization'] = quantization
model_info = self.model_info
if self.config.architectures is None:
architectures = {'deepseek_vl2': ['DeepseekVLV2ForCausalLM']}[self.model_meta.model_type]
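Downstream, `engine_kwargs['quantization']` is handed to vllm itself. As a rough illustration of what the engine ultimately receives, the snippet below shows the equivalent direct vllm call, assuming a recent vllm release where `LLM(...)` accepts a `quantization` argument; the model id and the `'awq'` value are placeholders, not taken from this PR.

```python
from vllm import LLM

# The value of --vllm_quantization is forwarded unchanged to vllm's
# `quantization` engine argument, e.g. for a (hypothetical) AWQ checkpoint.
llm = LLM(model='Qwen/Qwen2-7B-Instruct-AWQ', quantization='awq')

outputs = llm.generate(['Hello, vllm!'])
print(outputs[0].outputs[0].text)
```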