
What's the best way to persist a loaded LLaMA model across multiple queries in a long-running process? #14579

YuXing404 asked this question in Q&A
Answered by officiallyutso

To avoid reloading the model on every query, you should keep it inside a long-running process where the model and its context stay in memory.

Best Practices:

Use the Python bindings (llama-cpp-python) if you're working in Python. They wrap the C API cleanly and allow persistent contexts.

from llama_cpp import Llama

# Load once at startup; the model stays resident for the process lifetime.
llm = Llama(model_path="path/to/your/model.gguf", n_ctx=4096, n_threads=8)

The llm object keeps the model in memory. You can call llm(prompt) multiple times without reloading.
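
For instance, a minimal usage sketch (the prompts and max_tokens value here are illustrative, not from the original answer):

# The same llm object serves several prompts; the weights were loaded
# only once, when the Llama constructor ran.
for prompt in ["Explain mmap in one sentence.", "List two uses of it."]:
    out = llm(prompt, max_tokens=128)
    print(out["choices"][0]["text"])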

Maintain a rolling context manually: concatenate the new prompt with the past few interactions to simulate short-term memory/chat history, and track the token count against the context limit (n_ctx) to avoid overflow; see the sketch below.
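
One way to sketch this, assuming the llm object from above; the history format, the reserve budget, and the ask helper are illustrative assumptions, not part of llama-cpp-python:

history = []  # (user_message, assistant_reply) pairs; format is an assumption

def ask(llm, user_msg, n_ctx=4096, reserve=256):
    # Rebuild the prompt from recent turns, dropping the oldest ones until
    # the tokenized prompt fits in n_ctx with `reserve` tokens left for the reply.
    while True:
        prompt = "".join(f"User: {u}\nAssistant: {a}\n" for u, a in history)
        prompt += f"User: {user_msg}\nAssistant:"
        if len(llm.tokenize(prompt.encode("utf-8"))) <= n_ctx - reserve or not history:
            break
        history.pop(0)  # trim the oldest turn to avoid context overflow
    out = llm(prompt, max_tokens=reserve, stop=["User:"])
    reply = out["choices"][0]["text"].strip()
    history.append((user_msg, reply))
    return reply

Trimming whole turns keeps the prompt coherent; cutting raw tokens mid-sentence tends to degrade output quality.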

Answer selected by YuXing404