
Conversation

@XenonMolecule (Collaborator)

Previously, the vLLM HFClient used a different URL for each request, which fragmented the cache: identical calls routed to different hosts produced separate cache entries. This PR copies the way the HF TGI client handles multiple URLs and ports: a wrapper around the call function overrides the URL and port in kwargs so they are identical for every call made by the instantiated class. Since the cache ignores the request-specific URL/port argument, it stores and reuses results regardless of which host the request was routed to. This leads to significantly better performance with multi-host vLLM models!
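
Below is a minimal sketch of the caching pattern described above. The names (`cached_generate`, `_send_request`, `_cache`) are illustrative, not DSPy's actual API, and an in-memory dict stands in for DSPy's on-disk cache; the point is that the cache key deliberately excludes the URL and port, so any host can satisfy a repeated request.

```python
import json
import random

# Hypothetical module-level cache; DSPy uses a disk-backed cache,
# but a plain dict illustrates the same idea.
_cache: dict = {}

def _send_request(url: str, port: int, payload: dict) -> dict:
    """Stand-in for the real HTTP POST to a vLLM server."""
    # In practice this would be something like:
    #   requests.post(f"http://{url}:{port}/generate", json=payload).json()
    return {"text": f"completion served by {url}:{port}"}

def cached_generate(urls: list[tuple[str, int]], payload: dict) -> dict:
    # The cache key is built only from the request payload, never from
    # the URL/port, so identical prompts hit the cache no matter which
    # host served the first request.
    key = json.dumps(payload, sort_keys=True)
    if key not in _cache:
        url, port = random.choice(urls)  # route to any available host
        _cache[key] = _send_request(url, port, payload)
    return _cache[key]
```

With the old behavior, the URL/port participated in the cache key, so a repeated prompt sent to a different host of the same multi-host deployment missed the cache; with this wrapper, the second call returns the cached result immediately.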

@XenonMolecule changed the title from Fixed VLLM Cache to Fully Fixed VLLM Cache on Apr 21, 2024
@okhat merged commit 362350b into main on Apr 21, 2024