
server : PoC implementation of "interim" server #13400

Draft · ngxson wants to merge 1 commit into master

Conversation

ngxson
Collaborator

@ngxson ngxson commented May 9, 2025

This PR acts as a PoC to illustrate my idea in #13367

The way it works is to spawn an "interim" server that exposes a /load endpoint.

For example:

# run server without specifying model
llama-server

# then, load it via API
curl --header "Content-Type: application/json" \
  --request POST \
  --data '{"hf_repo": "ggml-org/gemma-3-4b-it-GGUF"}' \
  http://localhost:8080/load

The implementation separates run_interim_server and run_main_server so that run_main_server can later be converted to spawn a child process, though I'm not sure if this is the preferable way to go.
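
Roughly, the split could look like this. This is only a minimal sketch assuming cpp-httplib and nlohmann::json (both already vendored in the repo); load_params and the handoff details are simplified placeholders, not the actual PR code:

// minimal sketch of the interim/main split, assuming cpp-httplib and
// nlohmann::json; load_params and the handoff below are simplified
#include "httplib.h"
#include <nlohmann/json.hpp>
#include <string>

struct load_params {
    std::string hf_repo; // e.g. "ggml-org/gemma-3-4b-it-GGUF"
};

// block until a client POSTs /load, then hand the params back to main()
static load_params run_interim_server(const std::string & host, int port) {
    httplib::Server svr;
    load_params params;

    svr.Post("/load", [&](const httplib::Request & req, httplib::Response & res) {
        auto body      = nlohmann::json::parse(req.body);
        params.hf_repo = body.at("hf_repo").get<std::string>();
        res.set_content(R"({"status":"loading"})", "application/json");
        svr.stop(); // release the port so the main server can bind it
    });

    svr.listen(host.c_str(), port);
    return params;
}

static void run_main_server(const load_params & params) {
    // load the model from params.hf_repo and serve the full API here;
    // alternatively, spawn a child process with the same params
}

int main() {
    run_main_server(run_interim_server("0.0.0.0", 8080));
}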

WDYT about this approach, @ggerganov @slaren?

@ngxson ngxson marked this pull request as draft May 9, 2025 09:21
@ggerganov
Member

Nice.

Maybe the interim API should also have logic to route main API requests to the respective server based on the model id. This way, 3rd-party apps can always communicate with a single network port.
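
For instance, the router could inspect the "model" field of the request body and proxy to the right backend. A rough sketch, assuming cpp-httplib; the model-to-port bookkeeping and the forwarding are hypothetical, and streaming responses would need extra handling:

// rough sketch of model-id-based routing, assuming cpp-httplib;
// the model -> port map and the forwarding below are hypothetical
#include "httplib.h"
#include <nlohmann/json.hpp>
#include <map>
#include <string>

int main() {
    // hypothetical: one child server per loaded model
    std::map<std::string, int> model_ports = {
        { "gemma-3-4b-it", 8081 },
    };

    httplib::Server router;

    router.Post("/v1/chat/completions", [&](const httplib::Request & req, httplib::Response & res) {
        const auto model = nlohmann::json::parse(req.body).at("model").get<std::string>();

        const auto it = model_ports.find(model);
        if (it == model_ports.end()) {
            res.status = 404;
            res.set_content(R"({"error":"model not loaded"})", "application/json");
            return;
        }

        // forward the request verbatim to the server that owns this model
        httplib::Client cli("127.0.0.1", it->second);
        if (auto upstream = cli.Post("/v1/chat/completions", req.body, "application/json")) {
            res.status = upstream->status;
            res.set_content(upstream->body, "application/json");
        } else {
            res.status = 502;
        }
    });

    router.listen("0.0.0.0", 8080);
}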

@ngxson
Collaborator Author

ngxson commented May 9, 2025

Yes, that could be a good idea. I'm thinking about abstracting out the HTTP server implementation, so we can implement the routing logic more easily.

In any case, I think separating the HTTP layer from the handler code will be one of our main goals in the very short term, before we can do anything else. The problem is that server.cpp currently takes 30 seconds to compile, which makes development not very pleasant 😂

@isaac-mcfadyen
Contributor

> Maybe the interim API should also have logic to route main API requests to the respective server based on the model id. This way, 3rd-party apps can always communicate with a single network port.

In case it helps: the way llama-swap (a third-party tool with a similar idea) does this is by adding an endpoint that "passes through" the request to the model named in the path.

For example, to route to the model with ID gemma-3-4b-it-GGUF (loading it if needed):

curl -X POST http://127.0.0.1:8080/upstream/gemma-3-4b-it-GGUF/v1/chat/completions # etc...
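
In cpp-httplib terms that could map to a regex route along these lines. Just a sketch; ensure_loaded() and the port lookup are hypothetical placeholders:

// sketch of a llama-swap-style "pass through" route, assuming
// cpp-httplib regex routes; ensure_loaded() is hypothetical
#include "httplib.h"
#include <string>

// hypothetical: load the model if needed, return the port of its server
static int ensure_loaded(const std::string & model_id) {
    (void) model_id;
    return 8081; // stub for illustration
}

int main() {
    httplib::Server router;

    // /upstream/<model-id>/<rest> -> forward to that model's server
    router.Post(R"(/upstream/([^/]+)/(.+))", [](const httplib::Request & req, httplib::Response & res) {
        const std::string model = req.matches[1].str();
        const std::string path  = "/" + req.matches[2].str();

        httplib::Client cli("127.0.0.1", ensure_loaded(model));
        if (auto upstream = cli.Post(path, req.body, "application/json")) {
            res.status = upstream->status;
            res.set_content(upstream->body, "application/json");
        } else {
            res.status = 502;
        }
    });

    router.listen("0.0.0.0", 8080);
}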
