CPU Inference Trick with KV Cache Reuse — Sub-200ms Calls with llama.cpp #14556
env3d started this conversation in Show and tell
I’ve been using a simple trick to speed up local inference on CPU by reusing llama.cpp's KV cache. Seeing sub-200ms per call, even in GitHub Codespaces.
The trick:

- Load the system prompt once and reuse the KV cache for repeated short inputs
- Structure prompts so only the input changes (maximizes cache hits)
- Wrap it all in a "prompt-as-function" pattern for easier reuse (see the sketch after this list)
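Here's a minimal sketch of the idea, assuming llama-cpp-python and a placeholder model filename (this is illustrative, not taken from the repo): keep the system prompt as a fixed prefix and append only the short, changing input, so the library can skip re-evaluating the tokens it already processed on the previous call.

```python
from llama_cpp import Llama

# Placeholder model path; any small instruct-tuned .gguf model works.
llm = Llama(model_path="qwen2-0.5b-instruct-q4_k_m.gguf", n_ctx=2048, verbose=False)

# Fixed prefix: identical on every call, so the tokens evaluated for it
# can be reused instead of re-processed.
SYSTEM = (
    "You are a sentiment classifier. Reply with exactly one word: "
    "positive, negative, or neutral.\n\n"
)

def classify(text: str) -> str:
    # Only the short suffix after the shared prefix changes between calls,
    # so most of the prompt should already be in the KV cache after the
    # first call.
    prompt = f"{SYSTEM}Input: {text}\nLabel:"
    out = llm(prompt, max_tokens=4, temperature=0.0, stop=["\n"])
    return out["choices"][0]["text"].strip()

print(classify("The battery lasts all day, love it."))  # cold call: prefix evaluated once
print(classify("It broke after two days."))             # warm call: prefix reused
```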
Setup:

- Uses llama-cpp-python with .gguf models (tested 0.5B to 7B)
- Bundles pre-built llama.cpp binaries (no compilation, so Codespaces start fast)
- KV cache reuse through simple prompt templating (a sketch of the wrapper idea follows this list)
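The "prompt-as-function" part can be as simple as a small factory that closes over a fixed system prompt; the `prompt_fn` helper below is illustrative and not the repo's actual API.

```python
from llama_cpp import Llama

# Placeholder model path; swap in whatever .gguf file you have locally.
llm = Llama(model_path="qwen2-0.5b-instruct-q4_k_m.gguf", n_ctx=2048, verbose=False)

def prompt_fn(system: str, max_tokens: int = 16):
    """Turn a fixed system prompt into a plain Python callable (hypothetical helper)."""
    prefix = system.strip() + "\n\n"

    def call(user_input: str) -> str:
        # Shared prefix first, variable input last: keeps the cacheable part
        # of the prompt identical across calls.
        out = llm(
            f"{prefix}Input: {user_input}\nOutput:",
            max_tokens=max_tokens,
            temperature=0.0,
            stop=["\n"],
        )
        return out["choices"][0]["text"].strip()

    return call

extract_city = prompt_fn("Extract the city mentioned in the sentence. Reply with only the city name.")
print(extract_city("I flew from Toronto to Lisbon last week."))
```

One caveat: the cache effectively holds a single prefix at a time, so interleaving functions built from different system prompts forces the prefix to be re-evaluated on each switch; batching calls per function keeps the hit rate high.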
Results (a quick timing sketch follows the list):

- With cache reuse: 150–200ms per call (0.5B model)
- Without it: 1000ms+ per call (same setup)
- 7B model stays under ~2.5s per call on CPU
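If you want to sanity-check those numbers on your own machine, something like this works (it reuses the hypothetical `classify()` from the first sketch and is not the repo's benchmark): do one warm-up call so the shared prefix is cached, then average a few timed calls.

```python
import time

# One warm-up call so the shared system-prompt prefix is already evaluated.
classify("warm-up call to populate the KV cache")

samples = ["great product", "terrible support", "it arrived on time"]
t0 = time.perf_counter()
for s in samples:
    classify(s)
elapsed = time.perf_counter() - t0
print(f"avg per call: {elapsed / len(samples) * 1000:.0f} ms")
```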
This makes local models much more usable for classification, extraction, and similar tasks, with no GPU or API needed.
Repo with examples + Codespaces setup:
https://github.com/env3d/prompt-as-function
Nothing groundbreaking — just a practical trick that made local LLMs more usable for my needs. I’m planning to use this in an AI literacy class. Curious if others are doing similar KV cache tricks or have ideas to push this further.