sycl: use oneDNN for matrices multiplication #12972

Merged: 3 commits merged into ggml-org:master on May 15, 2025

Conversation

@lslusarczyk (Contributor) commented Apr 16, 2025

Summary:

  • SYCL now performs matrix multiplication using oneDNN instead of MKL
  • the previous behavior (using MKL instead of oneDNN) can be forced with a new runtime flag: GGML_SYCL_DISABLE_DNN
  • compilation with oneDNN can be explicitly disabled with the CMake flag GGML_SYCL_DNN
  • several assertions were added; they do not introduce any new limitations, they only document limitations and assumptions that were already present in the code
  • some cleanups to reduce code and remove unused variables

This is the first change towards making SYCL graphs work, by avoiding calls that issue sycl::wait inside the recording phase. MKL uses waits heavily while oneDNN does not. Also, oneDNN should be preferred over MKL for AI workloads, hence the change.
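
For context, a minimal standalone oneDNN matmul looks roughly like the sketch below (CPU engine, f32, oneDNN v3.x API assumed; illustrative only, the backend's actual SYCL integration differs):

```cpp
// Illustrative oneDNN matmul sketch, not the ggml SYCL backend implementation.
// Assumes oneDNN v3.x (dnnl.hpp). The backend uses its own engine/stream setup.
#include <dnnl.hpp>
#include <vector>

int main() {
    using namespace dnnl;
    const memory::dim M = 4, K = 8, N = 16;

    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    // Row-major f32 descriptors for C[MxN] = A[MxK] * B[KxN].
    memory::desc a_md({M, K}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc b_md({K, N}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc c_md({M, N}, memory::data_type::f32, memory::format_tag::ab);

    std::vector<float> a(M * K, 1.0f), b(K * N, 1.0f), c(M * N, 0.0f);
    memory a_mem(a_md, eng, a.data());
    memory b_mem(b_md, eng, b.data());
    memory c_mem(c_md, eng, c.data());

    // The primitive is submitted to the stream asynchronously; the caller
    // decides when to synchronize (a single wait here).
    matmul::primitive_desc pd(eng, a_md, b_md, c_md);
    matmul(pd).execute(strm, {{DNNL_ARG_SRC, a_mem},
                              {DNNL_ARG_WEIGHTS, b_mem},
                              {DNNL_ARG_DST, c_mem}});
    strm.wait();
    return 0;
}
```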

The runtime flag is meant to ease performance comparison between MKL and oneDNN and acts as a safeguard for customers in case someone finds that oneDNN is slower in a particular case.

The compile-time flag makes configuring SYCL with oneDNN cleaner.
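
As a rough illustration of how the two switches interact (a sketch only; the macro and helper names below are placeholders, not the PR's actual symbols):

```cpp
// Hypothetical sketch of the compile-time and runtime switches described above.
// USE_ONEDNN, gemm_with_onednn and gemm_with_mkl are illustrative placeholders.
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Runtime switch: GGML_SYCL_DISABLE_DNN=1 forces the legacy MKL path.
static bool dnn_disabled_by_env() {
    const char * v = std::getenv("GGML_SYCL_DISABLE_DNN");
    return v != nullptr && std::strcmp(v, "1") == 0;
}

static void gemm_with_onednn() { std::puts("matmul via oneDNN"); } // placeholder
static void gemm_with_mkl()    { std::puts("matmul via oneMKL"); } // placeholder

int main() {
#ifdef USE_ONEDNN // stands in for the compile-time switch (CMake: GGML_SYCL_DNN)
    if (!dnn_disabled_by_env()) {
        gemm_with_onednn(); // default path after this PR
        return 0;
    }
#endif
    gemm_with_mkl();        // fallback / legacy path
    return 0;
}
```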

@github-actions bot added the ggml and SYCL labels Apr 16, 2025
@NeoZhangJianyu (Collaborator)

  1. Please make sure CI/UT passes with this PR.
  2. There are several assert() calls being added.
    Why add them?
    The new code should support the same cases as the legacy code.

@github-actions bot added the testing label Apr 23, 2025
@lslusarczyk force-pushed the todnn branch 2 times, most recently from e711ab9 to ac04335 on May 7, 2025 08:06
@github-actions bot added the documentation label May 7, 2025
@lslusarczyk (Contributor, Author)

> 1. Please make sure CI/UT passes with this PR.

sure

> 2. There are several assert() calls being added.
>    Why add them?
>    The new code should support the same cases as the legacy code.

I've updated the description of this PR to explain this. There are no new limitations added. The new assertions just document already-existing assumptions or dependencies in the code. Such assertions are self-verifying documentation of the code.
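
To make the point concrete, here is an illustrative example (not taken from the PR) of an assertion used as self-verifying documentation: it adds no new restriction, it only makes a pre-existing assumption fail loudly instead of silently miscomputing.

```cpp
// Illustrative only, not an excerpt from the PR. The dot product below already
// assumes both vectors have the same length; the assert documents and checks
// that pre-existing assumption without restricting any previously valid input.
#include <cassert>
#include <cstddef>

static float dot(const float * a, const float * b, size_t n_a, size_t n_b) {
    assert(n_a == n_b && "callers already had to pass equal lengths");
    float acc = 0.0f;
    for (size_t i = 0; i < n_a; ++i) {
        acc += a[i] * b[i];
    }
    return acc;
}

int main() {
    const float a[3] = {1, 2, 3}, b[3] = {4, 5, 6};
    return dot(a, b, 3, 3) == 32.0f ? 0 : 1;
}
```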

@lslusarczyk lslusarczyk changed the title sycl: use DNN in the first part of ggml_sycl_mul_mat_batched_sycl sycl: use oneDNN for matrices multiplication May 7, 2025
@lslusarczyk lslusarczyk marked this pull request as ready for review May 7, 2025 08:47
@Alcpz (Collaborator) left a comment

I'll get some performance numbers and post them here.

@@ -741,6 +742,7 @@ use 1 SYCL GPUs: [0] with Max compute units:512
| GGML_SYCL_DEBUG | 0 (default) or 1 | Enable log function by macro: GGML_SYCL_DEBUG |
| GGML_SYCL_DISABLE_OPT | 0 (default) or 1 | Disable optimize features based on Intel GPU type, to compare the performance increase |
| GGML_SYCL_DISABLE_GRAPH | 0 or 1 (default) | Disable running computations through SYCL Graphs feature. Disabled by default because graph performance isn't yet better than non-graph performance. |
| GGML_SYCL_DISABLE_DNN | 0 (default) or 1 | Disable running computations through oneDNN and always use oneMKL. |
Collaborator

I think this is fine for now for development purposes, but I would reconsider exposing this to users as a backend option.

Contributor Author

Agreed, the simpler the better. Once the oneDNN implementation is tuned I would opt for completely removing the MKL path together with these options. But we will decide that in the future.

@Alcpz (Collaborator) commented May 12, 2025

@lslusarczyk Mind rebasing on top of master? I'm seeing very bad performance in this PR, but it seems to be related to not having #13343 in this branch. A previous PR disabled non-contiguous src1 mul_mats due to some bugs, which translates to very poor performance in Prompt Processing. Numbers are not reliable until you merge it. It's caused by this check:

```cpp
if (!ggml_is_contiguous(b)) {
    return false;
}
```
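
For readers unfamiliar with the check above: conceptually, a "contiguous" tensor is one whose strides describe a densely packed layout. A rough 2-D sketch of such a test (illustrative only, not the actual ggml_is_contiguous() implementation):

```cpp
// Illustrative stride check, not the actual ggml_is_contiguous() code.
#include <cstddef>

struct tensor2d {
    size_t ne[2];     // elements per dimension: ne[0] = row length, ne[1] = rows
    size_t nb[2];     // stride in bytes per dimension
    size_t type_size; // bytes per element
};

// Contiguous here means: elements are packed within a row and rows are packed
// back to back, so the data can be treated as one dense row-major block.
static bool is_contiguous_2d(const tensor2d & t) {
    return t.nb[0] == t.type_size &&
           t.nb[1] == t.nb[0] * t.ne[0];
}
```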

@lslusarczyk (Contributor, Author)

> @lslusarczyk Mind rebasing on top of master? I'm seeing very bad performance in this PR, but it seems to be related to not having #13343 in this branch. A previous PR disabled non-contiguous src1 mul_mats due to some bugs, which translates to very poor performance in Prompt Processing. Numbers are not reliable until you merge it. It's caused by this check:
>
> ```cpp
> if (!ggml_is_contiguous(b)) {
>     return false;
> }
> ```
Yes. I've spent the last few days resolving conflicts and fixing test cases that started to fail after rebasing onto recent changes. Everything is rebased and works for me locally now. I will test performance on my own again, and if all looks good I will ask you here to test performance again.

@Rbiessy (Collaborator) left a comment

LGTM overall!

@@ -731,6 +731,7 @@ use 1 SYCL GPUs: [0] with Max compute units:512
| GGML_SYCL_DEVICE_ARCH | Optional (except for AMD) | Set the SYCL device architecture, optional except for AMD. Setting the device architecture can improve the performance. See the table [--offload-arch](https://github.com/intel/llvm/blob/sycl/sycl/doc/design/OffloadDesign.md#--offload-arch) for a list of valid architectures. |
| GGML_SYCL_F16 | OFF *(default)* \|ON *(optional)* | Enable FP16 build with SYCL code path. |
| GGML_SYCL_GRAPH | ON *(default)* \|OFF *(Optional)* | Enable build with [SYCL Graph extension](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc). |
| GGML_SYCL_DNN | ON *(default)* \|OFF *(Optional)* | Enable build with oneDNN. |
Collaborator

What is the reason for adding this flag? Currently oneDNN is enabled if CMake is able to find it. Given that we want to prioritize oneDNN and will probably remove the option to use oneMKL/oneMath, I feel we will remove this option soon.

Contributor Author

Until oneDNN is the only option, there should be a way to control which one, MKL or oneDNN, is used, e.g. for us to compare performance. Not having this option means that anyone who does not want to use oneDNN with llama.cpp needs to uninstall oneDNN or resort to ugly hacks to keep it from being detected.

This option will probably be removed soon together with oneMKL/oneMath, which is OK. But we may also need it a bit longer in case of problems with oneDNN. Let's keep it in the meantime.

Collaborator

I think using the environment variable would be enough for users who want to test disabling oneDNN. It seems to me this just adds more noise to this PR and to the one that is going to remove oneMKL/oneMath.
It's not a big concern, so feel free to ignore this if no one else has an opinion.

@Alcpz (Collaborator) commented May 14, 2025

Here's the performance I got compared to master using FP32:

| model | device | test | t/s bb1681f (5376) | t/s 53113d0 (5352) | speedup |
| --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | Data Max 1100 | pp512 | 5153.25 | 5176.24 | 1.0 |
| qwen2 1.5B Q4_0 | Data Max 1100 | tg128 | 146.65 | 146.3 | 1.0 |
| llama 7B Q4_0 | Data Max 1100 | pp512 | 1170.74 | 1170.12 | 1.0 |
| llama 7B Q4_0 | Data Max 1100 | tg128 | 71.48 | 72.16 | 1.01 |
| phi3 3B Q4_0 | Data Max 1100 | pp512 | 1909.14 | 1906.37 | 1.0 |
| phi3 3B Q4_0 | Data Max 1100 | tg128 | 109.66 | 111.03 | 1.01 |
| qwen2 1.5B Q4_0 | Arc B580 | pp512 | 3298.62 | 3327.74 | 1.01 |
| qwen2 1.5B Q4_0 | Arc B580 | tg128 | 79.35 | 81.92 | 1.03 |
| llama 7B Q4_0 | Arc B580 | pp512 | 781.32 | 781.11 | 1.0 |
| llama 7B Q4_0 | Arc B580 | tg128 | 61.99 | 63.41 | 1.02 |
| phi3 3B Q4_0 | Arc B580 | pp512 | 1259.04 | 1259.22 | 1.0 |
| phi3 3B Q4_0 | Arc B580 | tg128 | 80.08 | 79.38 | 0.99 |
| qwen2 1.5B Q4_0 | Arc A770 | pp512 | 3246.14 | 3268.02 | 1.01 |
| qwen2 1.5B Q4_0 | Arc A770 | tg128 | 44.78 | 44.77 | 1.0 |
| llama 7B Q4_0 | Arc A770 | pp512 | 790.7 | 789.74 | 1.0 |
| llama 7B Q4_0 | Arc A770 | tg128 | 33.87 | 33.79 | 1.0 |
| phi3 3B Q4_0 | Arc A770 | pp512 | 1257.26 | 1257.94 | 1.0 |
| phi3 3B Q4_0 | Arc A770 | tg128 | 39.4 | 39.53 | 1.0 |
| qwen2 1.5B Q4_0 | Arc V140 | pp512 | 886.79 | 884.95 | 1.0 |
| qwen2 1.5B Q4_0 | Arc V140 | tg128 | 47.65 | 47.91 | 1.01 |
| llama 7B Q4_0 | Arc V140 | pp512 | 207.99 | 208.04 | 1.0 |
| llama 7B Q4_0 | Arc V140 | tg128 | 20.91 | 20.87 | 1.0 |
| phi3 3B Q4_0 | Arc V140 | pp512 | 334.03 | 333.69 | 1.0 |
| phi3 3B Q4_0 | Arc V140 | tg128 | 31.26 | 31.11 | 1.0 |

Speedup is calculated by dividing b5352 / b5376 (this PR / master).
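For example, for qwen2 1.5B Q4_0 tg128 on Arc B580 in the table above: 81.92 / 79.35 ≈ 1.03.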

I'm re-running with FP16 and some more models / quantizations, but so far it looks good.

@Alcpz (Collaborator) commented May 14, 2025

Results for FP16:

So far the results are roughly equivalent, and even better in some cases. There are a couple of cases that may require a closer look, like qwen2 1.5B Q4_K - Medium on Battlemage during TG and gemma2 2B Q4_K - Medium during prompt processing.

Benchmark Results
| model | device | test | t/s bb1681f (5376) | t/s 53113d0 (5352) | speedup |
| --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | Data Max 1100 | pp512 | 6981.83 | 7050.21 | 1.01 |
| qwen2 1.5B Q4_0 | Data Max 1100 | tg128 | 143.85 | 144.1 | 1 |
| qwen2 1.5B Q4_K - Medium | Data Max 1100 | pp512 | 10254.4 | 10172.7 | 0.99 |
| qwen2 1.5B Q4_K - Medium | Data Max 1100 | tg128 | 107.35 | 107.43 | 1 |
| llama 7B Q4_0 | Data Max 1100 | pp512 | 1751.95 | 1751.15 | 1 |
| llama 7B Q4_0 | Data Max 1100 | tg128 | 71.06 | 71.28 | 1 |
| llama 7B Q4_K - Medium | Data Max 1100 | pp512 | 3369.32 | 3378.41 | 1 |
| llama 7B Q4_K - Medium | Data Max 1100 | tg128 | 38.98 | 38.99 | 1 |
| gemma2 2B Q4_K - Medium | Data Max 1100 | pp512 | 7825.8 | 7798.84 | 1 |
| gemma2 2B Q4_K - Medium | Data Max 1100 | tg128 | 82.58 | 82.4 | 1 |
| phi3 3B Q4_0 | Data Max 1100 | pp512 | 2818.08 | 2825.46 | 1 |
| phi3 3B Q4_0 | Data Max 1100 | tg128 | 108.64 | 109.39 | 1.01 |
| phi3 3B Q4_K - Medium | Data Max 1100 | pp512 | 4505.99 | 4626.3 | 1.03 |
| phi3 3B Q4_K - Medium | Data Max 1100 | tg128 | 65.41 | 65.43 | 1 |
| llama 34B Q6_K | Data Max 1100 | pp512 | 1755.48 | 1763.75 | 1 |
| llama 34B Q6_K | Data Max 1100 | tg128 | 23.53 | 23.3 | 0.99 |
| llama 8B Q4_K - Medium | Data Max 1100 | pp512 | 3383.7 | 3338.83 | 0.99 |
| llama 8B Q4_K - Medium | Data Max 1100 | tg128 | 36.26 | 35.93 | 0.99 |
| qwen2 1.5B Q4_0 | Arc B580 | pp512 | 5959.04 | 6052.54 | 1.02 |
| qwen2 1.5B Q4_0 | Arc B580 | tg128 | 138.28 | 138.52 | 1 |
| qwen2 1.5B Q4_K - Medium | Arc B580 | pp512 | 7564.61 | 7502.2 | 0.99 |
| qwen2 1.5B Q4_K - Medium | Arc B580 | tg128 | 106.54 | 100.88 | 0.95 |
| llama 7B Q4_0 | Arc B580 | pp512 | 1624.43 | 1624.43 | 1 |
| llama 7B Q4_0 | Arc B580 | tg128 | 65.43 | 65.09 | 0.99 |
| llama 7B Q4_K - Medium | Arc B580 | pp512 | 2206.23 | 2201.93 | 1 |
| llama 7B Q4_K - Medium | Arc B580 | tg128 | 38.3 | 38.2 | 1 |
| gemma2 2B Q4_K - Medium | Arc B580 | pp512 | 5752.78 | 5722.25 | 0.99 |
| gemma2 2B Q4_K - Medium | Arc B580 | tg128 | 75.26 | 72.35 | 0.96 |
| phi3 3B Q4_0 | Arc B580 | pp512 | 2410.01 | 2408.05 | 1 |
| phi3 3B Q4_0 | Arc B580 | tg128 | 100.07 | 95.97 | 0.96 |
| phi3 3B Q4_K - Medium | Arc B580 | pp512 | 3167.63 | 3165.56 | 1 |
| phi3 3B Q4_K - Medium | Arc B580 | tg128 | 55.43 | 54.65 | 0.99 |
| llama 34B Q6_K | Arc B580 | pp512 | 1486.23 | 1483.83 | 1 |
| llama 34B Q6_K | Arc B580 | tg128 | 20.43 | 20.27 | 0.99 |
| llama 8B Q4_K - Medium | Arc B580 | pp512 | 2075.02 | 2067.28 | 1 |
| llama 8B Q4_K - Medium | Arc B580 | tg128 | 34.68 | 34.06 | 0.98 |
| qwen2 1.5B Q4_0 | Arc A770 | pp512 | 4082.11 | 4071.16 | 1 |
| qwen2 1.5B Q4_0 | Arc A770 | tg128 | 44.44 | 44.5 | 1 |
| qwen2 1.5B Q4_K - Medium | Arc A770 | pp512 | 4460.65 | 4458.04 | 1 |
| qwen2 1.5B Q4_K - Medium | Arc A770 | tg128 | 43.85 | 44.37 | 1.01 |
| llama 7B Q4_0 | Arc A770 | pp512 | 1468.23 | 1468.99 | 1 |
| llama 7B Q4_0 | Arc A770 | tg128 | 33.55 | 33.42 | 1 |
| llama 7B Q4_K - Medium | Arc A770 | pp512 | 1727.55 | 1727.02 | 1 |
| llama 7B Q4_K - Medium | Arc A770 | tg128 | 25.85 | 25.72 | 0.99 |
| gemma2 2B Q4_K - Medium | Arc A770 | pp512 | 3724.92 | 3623.25 | 0.97 |
| gemma2 2B Q4_K - Medium | Arc A770 | tg128 | 35.11 | 35.06 | 1 |
| phi3 3B Q4_0 | Arc A770 | pp512 | 2163.5 | 2165.57 | 1 |
| phi3 3B Q4_0 | Arc A770 | tg128 | 37.23 | 37.48 | 1.01 |
| phi3 3B Q4_K - Medium | Arc A770 | pp512 | 2502.81 | 2499.9 | 1 |
| phi3 3B Q4_K - Medium | Arc A770 | tg128 | 29.86 | 29.91 | 1 |
| llama 34B Q6_K | Arc A770 | pp512 | 1063.01 | 1029.61 | 0.97 |
| llama 34B Q6_K | Arc A770 | tg128 | 15.42 | 14.95 | 0.97 |
| llama 8B Q4_K - Medium | Arc A770 | pp512 | 1676.83 | 1620.34 | 0.97 |
| llama 8B Q4_K - Medium | Arc A770 | tg128 | 23.19 | 22.64 | 0.98 |
| qwen2 1.5B Q4_0 | Arc V140 | pp512 | 1100.25 | 1086.64 | 0.99 |
| qwen2 1.5B Q4_0 | Arc V140 | tg128 | 47.9 | 47.55 | 0.99 |
| qwen2 1.5B Q4_K - Medium | Arc V140 | pp512 | 1490 | 1512.93 | 1.02 |
| qwen2 1.5B Q4_K - Medium | Arc V140 | tg128 | 37.24 | 37.26 | 1 |
| llama 7B Q4_0 | Arc V140 | pp512 | 323.82 | 323.29 | 1 |
| llama 7B Q4_0 | Arc V140 | tg128 | 20.49 | 20.94 | 1.02 |
| llama 7B Q4_K - Medium | Arc V140 | pp512 | 548.41 | 549.69 | 1 |
| llama 7B Q4_K - Medium | Arc V140 | tg128 | 13.35 | 13.37 | 1 |
| gemma2 2B Q4_K - Medium | Arc V140 | pp512 | 647.98 | 594.44 | 0.92 |
| gemma2 2B Q4_K - Medium | Arc V140 | tg128 | 25.2 | 25.2 | 1 |
| phi3 3B Q4_0 | Arc V140 | pp512 | 517.41 | 517.47 | 1 |
| phi3 3B Q4_0 | Arc V140 | tg128 | 31.41 | 31.41 | 1 |
| phi3 3B Q4_K - Medium | Arc V140 | pp512 | 841.13 | 843.35 | 1 |
| phi3 3B Q4_K - Medium | Arc V140 | tg128 | 21.19 | 21.21 | 1 |
| llama 34B Q6_K | Arc V140 | pp512 | 368.28 | 367.2 | 1 |
| llama 34B Q6_K | Arc V140 | tg128 | 7.34 | 7.33 | 1 |
| llama 8B Q4_K - Medium | Arc V140 | pp512 | 507.21 | 504.61 | 0.99 |
| llama 8B Q4_K - Medium | Arc V140 | tg128 | 11.77 | 11.73 | 1 |

Edit:

llama-bench command:

```sh
export GGML_SYCL_PRIORITIZE_DMMV=0
export GGML_SYCL_DISABLE_OPT=0
${LLAMA_BUILD_DIR}/bin/llama-bench \
  -m /home/shared/llama.cpp-models/DeepSeek-R1-Distill-Qwen-1.5B-Q4_0.gguf \
  -m /home/shared/llama.cpp-models/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf \
  -m /home/shared/llama.cpp-models/emma-500-llama2-7b-Q4_0.gguf \
  -m /home/shared/llama.cpp-models/emma-500-llama2-7b-Q4_K_M.gguf \
  -m /home/shared/llama.cpp-models/gemma-2-2b-it-Q4_K_M.gguf \
  -m /home/shared/llama.cpp-models/Phi-3.5-mini-instruct-Q4_0.gguf \
  -m /home/shared/llama.cpp-models/Phi-3.5-mini-instruct-Q4_K_M.gguf \
  -m /home/shared/llama.cpp-models/solar-10.7b-instruct-v1.0.Q6_K.gguf \
  -m /home/shared/llama.cpp-models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -p 512 \
  -n 128 \
  -pg 0,0 \
  -mmp 0 \
  -t 8 \
  -r 5 \
  -sm none \
  -ngl 99 \
  -o md
```

@Rbiessy (Collaborator) left a comment

That looks good to me. To my understanding this patch should not impact TG. There were some concerns about potential regressions in PP, but overall it looks the same.

@lslusarczyk (Contributor, Author)

All tests have passed. @Alcpz, @Rbiessy, could you please merge?

@Alcpz (Collaborator) left a comment

LGTM as well. Thanks for your contribution.

@Rbiessy merged commit 9c404ed into ggml-org:master on May 15, 2025
44 checks passed
Labels: documentation, ggml, SYCL, testing
5 participants