
Conversation

@XenonMolecule (Collaborator)

Previously, the vLLM HFClient used a different URL for each request, which fragmented the cache: identical calls routed to different hosts produced separate cache entries. This PR copies the way the HF TGI client handles multiple URLs and ports: a wrapper around the call function overrides the URL and port in kwargs so they are identical for every call made by the instantiated class. Since the cache ignores the request-specific URL/port argument, it stores and reuses results regardless of which host the request was routed to. This leads to significantly better performance with multi-host vLLM models!
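
Below is a minimal sketch of the caching pattern described above. The names (`cached_generate`, `_send_request`, `_cache`) are illustrative, not DSPy's actual API, and an in-memory dict stands in for DSPy's on-disk cache; the point is that the cache key deliberately excludes the URL and port, so any host can satisfy a repeated request.

```python
import json
import random

# Hypothetical module-level cache; DSPy uses a disk-backed cache,
# but a plain dict illustrates the same idea.
_cache: dict = {}

def _send_request(url: str, port: int, payload: dict) -> dict:
    """Stand-in for the real HTTP POST to a vLLM server."""
    # In practice this would be something like:
    #   requests.post(f"http://{url}:{port}/generate", json=payload).json()
    return {"text": f"completion served by {url}:{port}"}

def cached_generate(urls: list[tuple[str, int]], payload: dict) -> dict:
    # The cache key is built only from the request payload, never from
    # the URL/port, so identical prompts hit the cache no matter which
    # host served the first request.
    key = json.dumps(payload, sort_keys=True)
    if key not in _cache:
        url, port = random.choice(urls)  # route to any available host
        _cache[key] = _send_request(url, port, payload)
    return _cache[key]
```

With the old behavior, the URL/port participated in the cache key, so a repeated prompt sent to a different host of the same multi-host deployment missed the cache; with this wrapper, the second call returns the cached result immediately.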

@XenonMolecule changed the title from Fixed VLLM Cache to Fully Fixed VLLM Cache on Apr 21, 2024
@okhat merged commit 362350b into main on Apr 21, 2024