CPU Inference Trick with KV Cache Reuse — Sub-200ms Calls with llama.cpp #14556
env3d started this conversation in Show and tell
I’ve been using a simple trick to speed up local inference on CPU by reusing llama.cpp's KV cache. Seeing sub-200ms per call, even in GitHub Codespaces.
The trick:

- Load the system prompt once and reuse the KV cache for repeated short inputs
- Structure prompts so only the input changes (maximizes cache hits)
- Wrap it all in a "prompt-as-function" pattern for easier reuse (see the sketch after this list)
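Here's a minimal sketch of the idea, assuming llama-cpp-python and a placeholder model filename (this is illustrative, not taken from the repo): keep the system prompt as a fixed prefix and append only the short, changing input, so the library can skip re-evaluating the tokens it already processed on the previous call.

```python
from llama_cpp import Llama

# Placeholder model path; any small instruct-tuned .gguf model works.
llm = Llama(model_path="qwen2-0.5b-instruct-q4_k_m.gguf", n_ctx=2048, verbose=False)

# Fixed prefix: identical on every call, so the tokens evaluated for it
# can be reused instead of re-processed.
SYSTEM = (
    "You are a sentiment classifier. Reply with exactly one word: "
    "positive, negative, or neutral.\n\n"
)

def classify(text: str) -> str:
    # Only the short suffix after the shared prefix changes between calls,
    # so most of the prompt should already be in the KV cache after the
    # first call.
    prompt = f"{SYSTEM}Input: {text}\nLabel:"
    out = llm(prompt, max_tokens=4, temperature=0.0, stop=["\n"])
    return out["choices"][0]["text"].strip()

print(classify("The battery lasts all day, love it."))  # cold call: prefix evaluated once
print(classify("It broke after two days."))             # warm call: prefix reused
```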
Setup:

- Uses llama-cpp-python with .gguf models (tested 0.5B to 7B)
- Bundles pre-built llama.cpp binaries (no compilation, so Codespaces start fast)
- KV cache reuse through simple prompt templating (a sketch of the wrapper idea follows this list)
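The "prompt-as-function" part can be as simple as a small factory that closes over a fixed system prompt; the `prompt_fn` helper below is illustrative and not the repo's actual API.

```python
from llama_cpp import Llama

# Placeholder model path; swap in whatever .gguf file you have locally.
llm = Llama(model_path="qwen2-0.5b-instruct-q4_k_m.gguf", n_ctx=2048, verbose=False)

def prompt_fn(system: str, max_tokens: int = 16):
    """Turn a fixed system prompt into a plain Python callable (hypothetical helper)."""
    prefix = system.strip() + "\n\n"

    def call(user_input: str) -> str:
        # Shared prefix first, variable input last: keeps the cacheable part
        # of the prompt identical across calls.
        out = llm(
            f"{prefix}Input: {user_input}\nOutput:",
            max_tokens=max_tokens,
            temperature=0.0,
            stop=["\n"],
        )
        return out["choices"][0]["text"].strip()

    return call

extract_city = prompt_fn("Extract the city mentioned in the sentence. Reply with only the city name.")
print(extract_city("I flew from Toronto to Lisbon last week."))
```

One caveat: the cache effectively holds a single prefix at a time, so interleaving functions built from different system prompts forces the prefix to be re-evaluated on each switch; batching calls per function keeps the hit rate high.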
Results (a quick timing sketch follows the list):

- With cache reuse: 150–200ms per call (0.5B model)
- Without it: 1000ms+ per call (same setup)
- 7B model stays under ~2.5s per call on CPU
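If you want to sanity-check those numbers on your own machine, something like this works (it reuses the hypothetical `classify()` from the first sketch and is not the repo's benchmark): do one warm-up call so the shared prefix is cached, then average a few timed calls.

```python
import time

# One warm-up call so the shared system-prompt prefix is already evaluated.
classify("warm-up call to populate the KV cache")

samples = ["great product", "terrible support", "it arrived on time"]
t0 = time.perf_counter()
for s in samples:
    classify(s)
elapsed = time.perf_counter() - t0
print(f"avg per call: {elapsed / len(samples) * 1000:.0f} ms")
```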
This makes local models much more usable for classification, extraction, and similar tasks, with no GPU or API needed.
Repo with examples + Codespaces setup:
https://github.com/env3d/prompt-as-function
Nothing groundbreaking — just a practical trick that made local LLMs more usable for my needs. I’m planning to use this in an AI literacy class. Curious if others are doing similar KV cache tricks or have ideas to push this further.