
Conversation


@chaxu01 chaxu01 commented Nov 4, 2025

Benchmarks from MacBook M4:

W/ KleidiAI

GGML_KLEIDIAI_SME=1 ./bin/llama-bench -m ./Llama-3.2-1B-Instruct-Q8_0.gguf -ngl 0 -t 1
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       1 |           pp512 |        504.01 ± 2.70 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       1 |           tg128 |         93.68 ± 0.16 |

GGML_KLEIDIAI_SME=0 ./bin/llama-bench -m ./Llama-3.2-1B-Instruct-Q8_0.gguf -ngl 0 -t 1,4
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       1 |           pp512 |        193.94 ± 1.22 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       1 |           tg128 |         43.45 ± 0.34 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       4 |           pp512 |        692.11 ± 0.71 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       4 |           tg128 |       132.24 ± 16.44 |

W/O KleidiAI

./bin/llama-bench -m ./Llama-3.2-1B-Instruct-Q8_0.gguf -ngl 0 -t 1,4
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       1 |           pp512 |         44.39 ± 0.52 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       1 |           tg128 |         41.61 ± 0.25 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       4 |           pp512 |        156.83 ± 0.62 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       4 |           tg128 |        115.41 ± 1.82 |
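Reading the single-thread rows across the tables, the SME kernels give roughly an 11x speedup on prompt processing (pp512) and about 2.25x on token generation (tg128) over the plain CPU path. A quick sketch of that arithmetic, with the t/s figures copied from the tables above:

```python
# Throughput figures (t/s) copied from the 1-thread rows of the tables above.
kleidiai_sme = {"pp512": 504.01, "tg128": 93.68}  # GGML_KLEIDIAI_SME=1, 1 thread
baseline_cpu = {"pp512": 44.39,  "tg128": 41.61}  # without KleidiAI, 1 thread

for test in ("pp512", "tg128"):
    speedup = kleidiai_sme[test] / baseline_cpu[test]
    print(f"{test}: {speedup:.2f}x")
# pp512: 11.35x
# tg128: 2.25x
```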

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Nov 4, 2025

chaxu01 commented Nov 6, 2025

Hi @ggerganov, this PR adds Q8_0 optimization kernels for the KleidiAI backend.
CI shows three failing jobs, but they appear to be unrelated, since KleidiAI isn't enabled in those jobs.
Please take a look when you have a moment, thanks!
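For readers who want to reproduce the benchmarks above, a minimal build-and-run sketch; it assumes `GGML_CPU_KLEIDIAI` is the CMake option that enables the KleidiAI backend on this branch (check the repository's build docs if it differs):

```shell
# Sketch: build llama.cpp with the KleidiAI CPU backend enabled.
# GGML_CPU_KLEIDIAI is an assumed option name; verify against the build docs.
cmake -B build -DGGML_CPU_KLEIDIAI=ON
cmake --build build --config Release -j

# Re-run the benchmark from the tables above, toggling the SME kernels
# at runtime via the GGML_KLEIDIAI_SME environment variable.
GGML_KLEIDIAI_SME=1 ./build/bin/llama-bench \
    -m ./Llama-3.2-1B-Instruct-Q8_0.gguf -ngl 0 -t 1
```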

@ggerganov

@chaxu01 Shall we first merge the CI runner (#17021) and then this PR?


chaxu01 commented Nov 7, 2025

@ggerganov That makes sense — #17021 will add the Arm-hosted Graviton4 runner for KleidiAI builds and tests. We are still verifying and confirming it internally, and it may take a bit longer to complete than we initially expected.

In the meantime, the kleidiai builds and tests for this PR have already been verified successfully here:
https://github.com/ggml-org/llama.cpp/actions/runs/19065056994/job/54453545058?pr=16993

Given that the changes are self-contained and verified, we could proceed with merging this PR first if that works for you.

@ggerganov ggerganov left a comment

Let's wait for @slaren to take a look as well.

