arm64: optimize q6_k_q8_k kernel with i8mm #13519

cyb70289 · 2025-05-14T01:36:48Z

This PR optimizes the q6_k_q8_k GEMM kernel using the arm64 i8mm (int8 matrix multiply) instructions.
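For readers unfamiliar with i8mm, the sketch below shows what a single SMMLA operation computes: a 2x8 by 8x2 signed int8 product accumulated into a 2x2 int32 tile. It is a minimal illustration, not the kernel from this PR; the helper names `smmla_ref` and `smmla_hw` are made up for this example, and the hardware path only compiles when the compiler defines `__ARM_FEATURE_MATMUL_INT8`.

```c
#include <stdint.h>

/* Scalar reference for one SMMLA operation (hypothetical helper, for illustration).
 * a:   2x8 int8 matrix, row-major
 * b:   2x8 int8 matrix, row-major, used as its transpose (8x2)
 * acc: 2x2 int32 matrix, row-major, updated in place: acc += a * b^T */
static void smmla_ref(int32_t acc[4], const int8_t a[16], const int8_t b[16]) {
    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 2; j++) {
            int32_t sum = 0;
            for (int k = 0; k < 8; k++) {
                sum += (int32_t)a[8 * i + k] * (int32_t)b[8 * j + k];
            }
            acc[2 * i + j] += sum;
        }
    }
}

#if defined(__ARM_FEATURE_MATMUL_INT8)
#include <arm_neon.h>

/* Same operation through the NEON intrinsic; only built when i8mm is available. */
static void smmla_hw(int32_t acc[4], const int8_t a[16], const int8_t b[16]) {
    int32x4_t r = vld1q_s32(acc);
    r = vmmlaq_s32(r, vld1q_s8(a), vld1q_s8(b));
    vst1q_s32(acc, r);
}
#endif
```

One instruction produces four dot products of length 8, so a kernel built on it works on two rows and two columns at a time instead of one dot product per call.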

Tested on Neoverse N2 with a Llama 3 8B model quantized to Q6_K.

- 40% ~ 54% S_PP uplift for all batch sizes
- 16% ~ 47% S_TG uplift for batch size 4 and above

Perplexity doesn't change with this PR.

// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q6_K.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |    78.52 |   109.18 |    18.63 |    18.88 |
|   128 |    128 |    2 |    84.62 |   123.94 |    34.54 |    36.92 |
|   128 |    128 |    4 |    84.36 |   122.49 |    52.65 |    61.32 |
|   128 |    128 |    8 |    90.52 |   138.87 |    63.46 |    84.41 |
|   128 |    128 |   16 |    90.11 |   138.56 |    71.04 |   101.33 |
|   128 |    128 |   32 |    89.81 |   137.79 |    75.14 |   110.47 |
---------------------------------------------------------------------
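The uplift ranges quoted in the description can be recomputed from the table as `(this pr / original - 1) * 100`. The sketch below does that for every row; the function names are made up for this example, and the throughput values are copied from the table above.

```c
#include <stdio.h>

/* Relative uplift in percent: how much faster `after` is than `before`. */
static double uplift_pct(double before, double after) {
    return (after / before - 1.0) * 100.0;
}

/* Print the S_PP and S_TG uplift for each batch size in the benchmark table. */
static void print_uplifts(void) {
    static const int    batch[]   = {1, 2, 4, 8, 16, 32};
    static const double pp_orig[] = {78.52, 84.62, 84.36, 90.52, 90.11, 89.81};
    static const double pp_new[]  = {109.18, 123.94, 122.49, 138.87, 138.56, 137.79};
    static const double tg_orig[] = {18.63, 34.54, 52.65, 63.46, 71.04, 75.14};
    static const double tg_new[]  = {18.88, 36.92, 61.32, 84.41, 101.33, 110.47};

    for (int i = 0; i < 6; i++) {
        printf("B=%2d  S_PP %+6.1f%%  S_TG %+6.1f%%\n",
               batch[i],
               uplift_pct(pp_orig[i], pp_new[i]),
               uplift_pct(tg_orig[i], tg_new[i]));
    }
}
```

Calling `print_uplifts()` from a `main` prints one line per batch size, which makes it easy to see that the S_TG gain is small at batch sizes 1 and 2 and grows from batch size 4 upward.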


cyb70289 · 2025-05-14T02:07:49Z

I ran the CI locally and verified this PR. There is one failure related to q4_0 quantization; it comes from the master branch and is not related to this PR.

slaren

I see ~30% end to end improvement on M3 Max, and ~60% on Snapdragon XE, with pp128 on a Q6K model. test-backend-ops vs Metal passes. Good job!

github-actions bot added the `ggml` label (changes relating to the ggml tensor library for machine learning) May 14, 2025

slaren approved these changes May 14, 2025


slaren merged commit 5ab5d5f into ggml-org:master May 14, 2025
44 checks passed

cyb70289 deleted the q6k branch May 15, 2025 01:26
