
Conversation


@chaxu01 chaxu01 commented Nov 4, 2025

Benchmarks from MacBook M4:

W/ KleidiAI

GGML_KLEIDIAI_SME=1 ./bin/llama-bench -m ./Llama-3.2-1B-Instruct-Q8_0.gguf -ngl 0 -t 1
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       1 |           pp512 |        504.01 ± 2.70 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       1 |           tg128 |         93.68 ± 0.16 |

GGML_KLEIDIAI_SME=0 ./bin/llama-bench -m ./Llama-3.2-1B-Instruct-Q8_0.gguf -ngl 0 -t 1,4
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       1 |           pp512 |        193.94 ± 1.22 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       1 |           tg128 |         43.45 ± 0.34 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       4 |           pp512 |        692.11 ± 0.71 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       4 |           tg128 |       132.24 ± 16.44 |

W/O KleidiAI

./bin/llama-bench -m ./Llama-3.2-1B-Instruct-Q8_0.gguf -ngl 0 -t 1,4
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       1 |           pp512 |         44.39 ± 0.52 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       1 |           tg128 |         41.61 ± 0.25 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       4 |           pp512 |        156.83 ± 0.62 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       4 |           tg128 |        115.41 ± 1.82 |
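Reading the single-thread rows across the tables, the SME kernels give roughly an 11x speedup on prompt processing (pp512) and about 2.25x on token generation (tg128) over the plain CPU path. A quick sketch of that arithmetic, with the t/s figures copied from the tables above:

```python
# Throughput figures (t/s) copied from the 1-thread rows of the tables above.
kleidiai_sme = {"pp512": 504.01, "tg128": 93.68}  # GGML_KLEIDIAI_SME=1, 1 thread
baseline_cpu = {"pp512": 44.39,  "tg128": 41.61}  # without KleidiAI, 1 thread

for test in ("pp512", "tg128"):
    speedup = kleidiai_sme[test] / baseline_cpu[test]
    print(f"{test}: {speedup:.2f}x")
# pp512: 11.35x
# tg128: 2.25x
```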

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Nov 4, 2025

chaxu01 commented Nov 6, 2025

Hi @ggerganov, this PR adds Q8_0 optimization kernels for the KleidiAI backend.
CI shows three failing jobs, but they appear to be unrelated, since KleidiAI isn't enabled in those jobs.
Please take a look when you have a moment, thanks!
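For readers who want to reproduce the benchmarks above, a minimal build-and-run sketch; it assumes `GGML_CPU_KLEIDIAI` is the CMake option that enables the KleidiAI backend on this branch (check the repository's build docs if it differs):

```shell
# Sketch: build llama.cpp with the KleidiAI CPU backend enabled.
# GGML_CPU_KLEIDIAI is an assumed option name; verify against the build docs.
cmake -B build -DGGML_CPU_KLEIDIAI=ON
cmake --build build --config Release -j

# Re-run the benchmark from the tables above, toggling the SME kernels
# at runtime via the GGML_KLEIDIAI_SME environment variable.
GGML_KLEIDIAI_SME=1 ./build/bin/llama-bench \
    -m ./Llama-3.2-1B-Instruct-Q8_0.gguf -ngl 0 -t 1
```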

@ggerganov

@chaxu01 Shall we first merge the CI runner (#17021) and then this PR?


chaxu01 commented Nov 7, 2025

@ggerganov That makes sense — #17021 will add the Arm-hosted Graviton4 runner for KleidiAI builds and tests. We are still verifying and confirming it internally, and it may take a bit longer to complete than we initially expected.

In the meantime, the kleidiai builds and tests for this PR have already been verified successfully here:
https://github.com/ggml-org/llama.cpp/actions/runs/19065056994/job/54453545058?pr=16993

Given that the changes are self-contained and verified, we could proceed with merging this PR first if that works for you.

@ggerganov ggerganov left a comment

Let's wait for @slaren to take a look as well.

