sycl: use oneDNN for matrices multiplication #12972

Merged: 3 commits merged into ggml-org:master on May 15, 2025

Conversation

@lslusarczyk (Contributor) commented Apr 16, 2025

Summary:

  • SYCL now performs matrix multiplication using oneDNN instead of MKL
  • the previous behavior (using MKL instead of oneDNN) can be forced with a new runtime flag: GGML_SYCL_DISABLE_DNN
  • compilation with oneDNN can be explicitly disabled with the CMake flag GGML_SYCL_DNN
  • several assertions were added; they do not introduce any new limitations, they only document limitations and assumptions that were already present in the code
  • some cleanups to reduce code and remove unused variables

This is the first change towards making SYCL graphs work, by avoiding calls that issue sycl::wait inside the recording phase. MKL uses waits heavily while oneDNN does not. Also, oneDNN should be preferred over MKL for AI workloads, hence the change.
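
For context, a minimal standalone oneDNN matmul looks roughly like the sketch below (CPU engine, f32, oneDNN v3.x API assumed; illustrative only, the backend's actual SYCL integration differs):

```cpp
// Illustrative oneDNN matmul sketch, not the ggml SYCL backend implementation.
// Assumes oneDNN v3.x (dnnl.hpp). The backend uses its own engine/stream setup.
#include <dnnl.hpp>
#include <vector>

int main() {
    using namespace dnnl;
    const memory::dim M = 4, K = 8, N = 16;

    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    // Row-major f32 descriptors for C[MxN] = A[MxK] * B[KxN].
    memory::desc a_md({M, K}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc b_md({K, N}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc c_md({M, N}, memory::data_type::f32, memory::format_tag::ab);

    std::vector<float> a(M * K, 1.0f), b(K * N, 1.0f), c(M * N, 0.0f);
    memory a_mem(a_md, eng, a.data());
    memory b_mem(b_md, eng, b.data());
    memory c_mem(c_md, eng, c.data());

    // The primitive is submitted to the stream asynchronously; the caller
    // decides when to synchronize (a single wait here).
    matmul::primitive_desc pd(eng, a_md, b_md, c_md);
    matmul(pd).execute(strm, {{DNNL_ARG_SRC, a_mem},
                              {DNNL_ARG_WEIGHTS, b_mem},
                              {DNNL_ARG_DST, c_mem}});
    strm.wait();
    return 0;
}
```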

The runtime flag is meant to ease performance comparison between MKL and oneDNN and acts as a safeguard for customers in case someone finds that oneDNN is slower in a particular case.

The compile-time flag makes configuring SYCL with oneDNN cleaner.
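
As a rough illustration of how the two switches interact (a sketch only; the macro and helper names below are placeholders, not the PR's actual symbols):

```cpp
// Hypothetical sketch of the compile-time and runtime switches described above.
// USE_ONEDNN, gemm_with_onednn and gemm_with_mkl are illustrative placeholders.
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Runtime switch: GGML_SYCL_DISABLE_DNN=1 forces the legacy MKL path.
static bool dnn_disabled_by_env() {
    const char * v = std::getenv("GGML_SYCL_DISABLE_DNN");
    return v != nullptr && std::strcmp(v, "1") == 0;
}

static void gemm_with_onednn() { std::puts("matmul via oneDNN"); } // placeholder
static void gemm_with_mkl()    { std::puts("matmul via oneMKL"); } // placeholder

int main() {
#ifdef USE_ONEDNN // stands in for the compile-time switch (CMake: GGML_SYCL_DNN)
    if (!dnn_disabled_by_env()) {
        gemm_with_onednn(); // default path after this PR
        return 0;
    }
#endif
    gemm_with_mkl();        // fallback / legacy path
    return 0;
}
```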

@github-actions bot added the ggml and SYCL labels Apr 16, 2025
@NeoZhangJianyu (Collaborator)

  1. Please make sure CI/UT passes with this PR.
  2. There are several assert() calls being added.
    Why add them?
    The new code should support the same cases as the legacy code.

@github-actions bot added the testing label Apr 23, 2025
@lslusarczyk force-pushed the todnn branch 2 times, most recently from e711ab9 to ac04335 on May 7, 2025 08:06
@github-actions bot added the documentation label May 7, 2025
@lslusarczyk (Contributor, Author)

> 1. Please make sure CI/UT passes with this PR.

sure

> 2. There are several assert() calls being added.
>    Why add them?
>    The new code should support the same cases as the legacy code.

I've updated the description of this PR to explain this. There are no new limitations added. The new assertions just document already-existing assumptions or dependencies in the code. Such assertions are self-verifying documentation of the code.
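
To make the point concrete, here is an illustrative example (not taken from the PR) of an assertion used as self-verifying documentation: it adds no new restriction, it only makes a pre-existing assumption fail loudly instead of silently miscomputing.

```cpp
// Illustrative only, not an excerpt from the PR. The dot product below already
// assumes both vectors have the same length; the assert documents and checks
// that pre-existing assumption without restricting any previously valid input.
#include <cassert>
#include <cstddef>

static float dot(const float * a, const float * b, size_t n_a, size_t n_b) {
    assert(n_a == n_b && "callers already had to pass equal lengths");
    float acc = 0.0f;
    for (size_t i = 0; i < n_a; ++i) {
        acc += a[i] * b[i];
    }
    return acc;
}

int main() {
    const float a[3] = {1, 2, 3}, b[3] = {4, 5, 6};
    return dot(a, b, 3, 3) == 32.0f ? 0 : 1;
}
```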

@lslusarczyk lslusarczyk changed the title sycl: use DNN in the first part of ggml_sycl_mul_mat_batched_sycl sycl: use oneDNN for matrices multiplication May 7, 2025
@lslusarczyk lslusarczyk marked this pull request as ready for review May 7, 2025 08:47
@Alcpz (Collaborator) left a comment

I'll get some performance numbers and post them here.

@@ -741,6 +742,7 @@ use 1 SYCL GPUs: [0] with Max compute units:512
| GGML_SYCL_DEBUG | 0 (default) or 1 | Enable log function by macro: GGML_SYCL_DEBUG |
| GGML_SYCL_DISABLE_OPT | 0 (default) or 1 | Disable optimize features based on Intel GPU type, to compare the performance increase |
| GGML_SYCL_DISABLE_GRAPH | 0 or 1 (default) | Disable running computations through SYCL Graphs feature. Disabled by default because graph performance isn't yet better than non-graph performance. |
| GGML_SYCL_DISABLE_DNN | 0 (default) or 1 | Disable running computations through oneDNN and always use oneMKL. |
Collaborator

I think this is fine for now for development purposes, but I would reconsider exposing this to users as a backend option.

Contributor Author

Agreed, the simpler the better. Once the oneDNN implementation is tuned I would opt for completely removing the MKL path together with these options. But we will decide that in the future.

@Alcpz (Collaborator) commented May 12, 2025

@lslusarczyk Mind rebasing on top of master? I'm seeing very bad performance in this PR, but it seems to be related to not having #13343 in this branch. A previous PR disabled non-contiguous src1 mul_mats due to some bugs, which translates to very poor performance in Prompt Processing. Numbers are not reliable until you merge it. It's caused by this check:

```cpp
if (!ggml_is_contiguous(b)) {
    return false;
}
```
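
For readers unfamiliar with the check above: conceptually, a "contiguous" tensor is one whose strides describe a densely packed layout. A rough 2-D sketch of such a test (illustrative only, not the actual ggml_is_contiguous() implementation):

```cpp
// Illustrative stride check, not the actual ggml_is_contiguous() code.
#include <cstddef>

struct tensor2d {
    size_t ne[2];     // elements per dimension: ne[0] = row length, ne[1] = rows
    size_t nb[2];     // stride in bytes per dimension
    size_t type_size; // bytes per element
};

// Contiguous here means: elements are packed within a row and rows are packed
// back to back, so the data can be treated as one dense row-major block.
static bool is_contiguous_2d(const tensor2d & t) {
    return t.nb[0] == t.type_size &&
           t.nb[1] == t.nb[0] * t.ne[0];
}
```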

@lslusarczyk (Contributor, Author)

> @lslusarczyk Mind rebasing on top of master? I'm seeing very bad performance in this PR, but it seems to be related to not having #13343 in this branch. A previous PR disabled non-contiguous src1 mul_mats due to some bugs, which translates to very poor performance in Prompt Processing. Numbers are not reliable until you merge it. It's caused by this check:
>
> ```cpp
> if (!ggml_is_contiguous(b)) {
>     return false;
> }
> ```
Yes. I've spent the last few days resolving conflicts and fixing test cases that started to fail after rebasing onto recent changes. Everything is rebased and works for me locally now. I will test performance on my own again, and if all looks good I will ask you here to test performance again.

@Rbiessy (Collaborator) left a comment

LGTM overall!

@@ -731,6 +731,7 @@ use 1 SYCL GPUs: [0] with Max compute units:512
| GGML_SYCL_DEVICE_ARCH | Optional (except for AMD) | Set the SYCL device architecture, optional except for AMD. Setting the device architecture can improve the performance. See the table [--offload-arch](https://github.com/intel/llvm/blob/sycl/sycl/doc/design/OffloadDesign.md#--offload-arch) for a list of valid architectures. |
| GGML_SYCL_F16 | OFF *(default)* \|ON *(optional)* | Enable FP16 build with SYCL code path. |
| GGML_SYCL_GRAPH | ON *(default)* \|OFF *(Optional)* | Enable build with [SYCL Graph extension](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc). |
| GGML_SYCL_DNN | ON *(default)* \|OFF *(Optional)* | Enable build with oneDNN. |
Collaborator

What is the reason for adding this flag? Currently oneDNN is enabled if CMake is able to find it. Given that we want to prioritize oneDNN and will probably remove the option to use oneMKL/oneMath, I feel we will remove this option soon.

Contributor Author

Until oneDNN is the only option, there should be a way to control which one, MKL or oneDNN, is used, e.g. for us to compare performance. Not having this option means that anyone who does not want to use oneDNN with llama.cpp needs to uninstall oneDNN or resort to ugly hacks to keep it from being detected.

This option will probably be removed soon together with oneMKL/oneMath, which is OK. But we may also need it a bit longer in case of problems with oneDNN. Let's keep it in the meantime.

Collaborator

I think using the environment variable would be enough for users who want to test disabling oneDNN. It seems to me this just adds more noise to this PR and to the one that is going to remove oneMKL/oneMath.
It's not a big concern, so feel free to ignore this if no one else has an opinion.

@Alcpz (Collaborator) commented May 14, 2025

Here's the performance I got compared to master using FP32:

| model | device | test | t/s bb1681f (5376) | t/s 53113d0 (5352) | speedup |
| --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | Data Max 1100 | pp512 | 5153.25 | 5176.24 | 1.0 |
| qwen2 1.5B Q4_0 | Data Max 1100 | tg128 | 146.65 | 146.3 | 1.0 |
| llama 7B Q4_0 | Data Max 1100 | pp512 | 1170.74 | 1170.12 | 1.0 |
| llama 7B Q4_0 | Data Max 1100 | tg128 | 71.48 | 72.16 | 1.01 |
| phi3 3B Q4_0 | Data Max 1100 | pp512 | 1909.14 | 1906.37 | 1.0 |
| phi3 3B Q4_0 | Data Max 1100 | tg128 | 109.66 | 111.03 | 1.01 |
| qwen2 1.5B Q4_0 | Arc B580 | pp512 | 3298.62 | 3327.74 | 1.01 |
| qwen2 1.5B Q4_0 | Arc B580 | tg128 | 79.35 | 81.92 | 1.03 |
| llama 7B Q4_0 | Arc B580 | pp512 | 781.32 | 781.11 | 1.0 |
| llama 7B Q4_0 | Arc B580 | tg128 | 61.99 | 63.41 | 1.02 |
| phi3 3B Q4_0 | Arc B580 | pp512 | 1259.04 | 1259.22 | 1.0 |
| phi3 3B Q4_0 | Arc B580 | tg128 | 80.08 | 79.38 | 0.99 |
| qwen2 1.5B Q4_0 | Arc A770 | pp512 | 3246.14 | 3268.02 | 1.01 |
| qwen2 1.5B Q4_0 | Arc A770 | tg128 | 44.78 | 44.77 | 1.0 |
| llama 7B Q4_0 | Arc A770 | pp512 | 790.7 | 789.74 | 1.0 |
| llama 7B Q4_0 | Arc A770 | tg128 | 33.87 | 33.79 | 1.0 |
| phi3 3B Q4_0 | Arc A770 | pp512 | 1257.26 | 1257.94 | 1.0 |
| phi3 3B Q4_0 | Arc A770 | tg128 | 39.4 | 39.53 | 1.0 |
| qwen2 1.5B Q4_0 | Arc V140 | pp512 | 886.79 | 884.95 | 1.0 |
| qwen2 1.5B Q4_0 | Arc V140 | tg128 | 47.65 | 47.91 | 1.01 |
| llama 7B Q4_0 | Arc V140 | pp512 | 207.99 | 208.04 | 1.0 |
| llama 7B Q4_0 | Arc V140 | tg128 | 20.91 | 20.87 | 1.0 |
| phi3 3B Q4_0 | Arc V140 | pp512 | 334.03 | 333.69 | 1.0 |
| phi3 3B Q4_0 | Arc V140 | tg128 | 31.26 | 31.11 | 1.0 |

Speedup is calculated by dividing b5352 / b5376 (this PR / master).
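For example, for qwen2 1.5B Q4_0 tg128 on Arc B580 in the table above: 81.92 / 79.35 ≈ 1.03.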

I'm re-running with FP16 and some more models / quantizations, but so far it looks good.

@Alcpz (Collaborator) commented May 14, 2025

Results for FP16:

So far the results are roughly equivalent, and even better in some cases. There are a couple of cases that may require a closer look, like qwen2 1.5B Q4_K - Medium on Battlemage during TG and gemma2 2B Q4_K - Medium during prompt processing.

Benchmark Results
| model | device | test | t/s bb1681f (5376) | t/s 53113d0 (5352) | speedup |
| --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | Data Max 1100 | pp512 | 6981.83 | 7050.21 | 1.01 |
| qwen2 1.5B Q4_0 | Data Max 1100 | tg128 | 143.85 | 144.1 | 1 |
| qwen2 1.5B Q4_K - Medium | Data Max 1100 | pp512 | 10254.4 | 10172.7 | 0.99 |
| qwen2 1.5B Q4_K - Medium | Data Max 1100 | tg128 | 107.35 | 107.43 | 1 |
| llama 7B Q4_0 | Data Max 1100 | pp512 | 1751.95 | 1751.15 | 1 |
| llama 7B Q4_0 | Data Max 1100 | tg128 | 71.06 | 71.28 | 1 |
| llama 7B Q4_K - Medium | Data Max 1100 | pp512 | 3369.32 | 3378.41 | 1 |
| llama 7B Q4_K - Medium | Data Max 1100 | tg128 | 38.98 | 38.99 | 1 |
| gemma2 2B Q4_K - Medium | Data Max 1100 | pp512 | 7825.8 | 7798.84 | 1 |
| gemma2 2B Q4_K - Medium | Data Max 1100 | tg128 | 82.58 | 82.4 | 1 |
| phi3 3B Q4_0 | Data Max 1100 | pp512 | 2818.08 | 2825.46 | 1 |
| phi3 3B Q4_0 | Data Max 1100 | tg128 | 108.64 | 109.39 | 1.01 |
| phi3 3B Q4_K - Medium | Data Max 1100 | pp512 | 4505.99 | 4626.3 | 1.03 |
| phi3 3B Q4_K - Medium | Data Max 1100 | tg128 | 65.41 | 65.43 | 1 |
| llama 34B Q6_K | Data Max 1100 | pp512 | 1755.48 | 1763.75 | 1 |
| llama 34B Q6_K | Data Max 1100 | tg128 | 23.53 | 23.3 | 0.99 |
| llama 8B Q4_K - Medium | Data Max 1100 | pp512 | 3383.7 | 3338.83 | 0.99 |
| llama 8B Q4_K - Medium | Data Max 1100 | tg128 | 36.26 | 35.93 | 0.99 |
| qwen2 1.5B Q4_0 | Arc B580 | pp512 | 5959.04 | 6052.54 | 1.02 |
| qwen2 1.5B Q4_0 | Arc B580 | tg128 | 138.28 | 138.52 | 1 |
| qwen2 1.5B Q4_K - Medium | Arc B580 | pp512 | 7564.61 | 7502.2 | 0.99 |
| qwen2 1.5B Q4_K - Medium | Arc B580 | tg128 | 106.54 | 100.88 | 0.95 |
| llama 7B Q4_0 | Arc B580 | pp512 | 1624.43 | 1624.43 | 1 |
| llama 7B Q4_0 | Arc B580 | tg128 | 65.43 | 65.09 | 0.99 |
| llama 7B Q4_K - Medium | Arc B580 | pp512 | 2206.23 | 2201.93 | 1 |
| llama 7B Q4_K - Medium | Arc B580 | tg128 | 38.3 | 38.2 | 1 |
| gemma2 2B Q4_K - Medium | Arc B580 | pp512 | 5752.78 | 5722.25 | 0.99 |
| gemma2 2B Q4_K - Medium | Arc B580 | tg128 | 75.26 | 72.35 | 0.96 |
| phi3 3B Q4_0 | Arc B580 | pp512 | 2410.01 | 2408.05 | 1 |
| phi3 3B Q4_0 | Arc B580 | tg128 | 100.07 | 95.97 | 0.96 |
| phi3 3B Q4_K - Medium | Arc B580 | pp512 | 3167.63 | 3165.56 | 1 |
| phi3 3B Q4_K - Medium | Arc B580 | tg128 | 55.43 | 54.65 | 0.99 |
| llama 34B Q6_K | Arc B580 | pp512 | 1486.23 | 1483.83 | 1 |
| llama 34B Q6_K | Arc B580 | tg128 | 20.43 | 20.27 | 0.99 |
| llama 8B Q4_K - Medium | Arc B580 | pp512 | 2075.02 | 2067.28 | 1 |
| llama 8B Q4_K - Medium | Arc B580 | tg128 | 34.68 | 34.06 | 0.98 |
| qwen2 1.5B Q4_0 | Arc A770 | pp512 | 4082.11 | 4071.16 | 1 |
| qwen2 1.5B Q4_0 | Arc A770 | tg128 | 44.44 | 44.5 | 1 |
| qwen2 1.5B Q4_K - Medium | Arc A770 | pp512 | 4460.65 | 4458.04 | 1 |
| qwen2 1.5B Q4_K - Medium | Arc A770 | tg128 | 43.85 | 44.37 | 1.01 |
| llama 7B Q4_0 | Arc A770 | pp512 | 1468.23 | 1468.99 | 1 |
| llama 7B Q4_0 | Arc A770 | tg128 | 33.55 | 33.42 | 1 |
| llama 7B Q4_K - Medium | Arc A770 | pp512 | 1727.55 | 1727.02 | 1 |
| llama 7B Q4_K - Medium | Arc A770 | tg128 | 25.85 | 25.72 | 0.99 |
| gemma2 2B Q4_K - Medium | Arc A770 | pp512 | 3724.92 | 3623.25 | 0.97 |
| gemma2 2B Q4_K - Medium | Arc A770 | tg128 | 35.11 | 35.06 | 1 |
| phi3 3B Q4_0 | Arc A770 | pp512 | 2163.5 | 2165.57 | 1 |
| phi3 3B Q4_0 | Arc A770 | tg128 | 37.23 | 37.48 | 1.01 |
| phi3 3B Q4_K - Medium | Arc A770 | pp512 | 2502.81 | 2499.9 | 1 |
| phi3 3B Q4_K - Medium | Arc A770 | tg128 | 29.86 | 29.91 | 1 |
| llama 34B Q6_K | Arc A770 | pp512 | 1063.01 | 1029.61 | 0.97 |
| llama 34B Q6_K | Arc A770 | tg128 | 15.42 | 14.95 | 0.97 |
| llama 8B Q4_K - Medium | Arc A770 | pp512 | 1676.83 | 1620.34 | 0.97 |
| llama 8B Q4_K - Medium | Arc A770 | tg128 | 23.19 | 22.64 | 0.98 |
| qwen2 1.5B Q4_0 | Arc V140 | pp512 | 1100.25 | 1086.64 | 0.99 |
| qwen2 1.5B Q4_0 | Arc V140 | tg128 | 47.9 | 47.55 | 0.99 |
| qwen2 1.5B Q4_K - Medium | Arc V140 | pp512 | 1490 | 1512.93 | 1.02 |
| qwen2 1.5B Q4_K - Medium | Arc V140 | tg128 | 37.24 | 37.26 | 1 |
| llama 7B Q4_0 | Arc V140 | pp512 | 323.82 | 323.29 | 1 |
| llama 7B Q4_0 | Arc V140 | tg128 | 20.49 | 20.94 | 1.02 |
| llama 7B Q4_K - Medium | Arc V140 | pp512 | 548.41 | 549.69 | 1 |
| llama 7B Q4_K - Medium | Arc V140 | tg128 | 13.35 | 13.37 | 1 |
| gemma2 2B Q4_K - Medium | Arc V140 | pp512 | 647.98 | 594.44 | 0.92 |
| gemma2 2B Q4_K - Medium | Arc V140 | tg128 | 25.2 | 25.2 | 1 |
| phi3 3B Q4_0 | Arc V140 | pp512 | 517.41 | 517.47 | 1 |
| phi3 3B Q4_0 | Arc V140 | tg128 | 31.41 | 31.41 | 1 |
| phi3 3B Q4_K - Medium | Arc V140 | pp512 | 841.13 | 843.35 | 1 |
| phi3 3B Q4_K - Medium | Arc V140 | tg128 | 21.19 | 21.21 | 1 |
| llama 34B Q6_K | Arc V140 | pp512 | 368.28 | 367.2 | 1 |
| llama 34B Q6_K | Arc V140 | tg128 | 7.34 | 7.33 | 1 |
| llama 8B Q4_K - Medium | Arc V140 | pp512 | 507.21 | 504.61 | 0.99 |
| llama 8B Q4_K - Medium | Arc V140 | tg128 | 11.77 | 11.73 | 1 |

Edit:

llama-bench command:

```sh
export GGML_SYCL_PRIORITIZE_DMMV=0
export GGML_SYCL_DISABLE_OPT=0
${LLAMA_BUILD_DIR}/bin/llama-bench \
  -m /home/shared/llama.cpp-models/DeepSeek-R1-Distill-Qwen-1.5B-Q4_0.gguf \
  -m /home/shared/llama.cpp-models/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf \
  -m /home/shared/llama.cpp-models/emma-500-llama2-7b-Q4_0.gguf \
  -m /home/shared/llama.cpp-models/emma-500-llama2-7b-Q4_K_M.gguf \
  -m /home/shared/llama.cpp-models/gemma-2-2b-it-Q4_K_M.gguf \
  -m /home/shared/llama.cpp-models/Phi-3.5-mini-instruct-Q4_0.gguf \
  -m /home/shared/llama.cpp-models/Phi-3.5-mini-instruct-Q4_K_M.gguf \
  -m /home/shared/llama.cpp-models/solar-10.7b-instruct-v1.0.Q6_K.gguf \
  -m /home/shared/llama.cpp-models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -p 512 \
  -n 128 \
  -pg 0,0 \
  -mmp 0 \
  -t 8 \
  -r 5 \
  -sm none \
  -ngl 99 \
  -o md
```

@Rbiessy (Collaborator) left a comment

That looks good to me. To my understanding this patch should not impact TG. There were some concerns about potential regressions in PP, but overall it looks the same.

@lslusarczyk (Contributor, Author)

All tests have passed. @Alcpz, @Rbiessy, could you please merge?

@Alcpz (Collaborator) left a comment

LGTM as well. Thanks for your contribution.

@Rbiessy merged commit 9c404ed into ggml-org:master on May 15, 2025
44 checks passed
Labels: documentation, ggml, SYCL, testing
5 participants