sycl: use oneDNN for matrices multiplication #12972
The branch was force-pushed from e711ab9 to ac04335.
Sure. I've updated the description of this PR to explain this. There are no new limitations added. The new assertions just document already existing assumptions or dependencies in the code. Such an assertion is self-verifying documentation of the code.
I'll get some performance numbers and post them here.
@@ -741,6 +742,7 @@ use 1 SYCL GPUs: [0] with Max compute units:512
| GGML_SYCL_DEBUG | 0 (default) or 1 | Enable log function by macro: GGML_SYCL_DEBUG |
| GGML_SYCL_DISABLE_OPT | 0 (default) or 1 | Disable optimize features based on Intel GPU type, to compare the performance increase |
| GGML_SYCL_DISABLE_GRAPH | 0 or 1 (default) | Disable running computations through SYCL Graphs feature. Disabled by default because graph performance isn't yet better than non-graph performance. |
| GGML_SYCL_DISABLE_DNN | 0 (default) or 1 | Disable running computations through oneDNN and always use oneMKL. |
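For illustration only, a minimal sketch of how the new runtime switch could be used; the binary and model paths are placeholders, and only the GGML_SYCL_DISABLE_DNN variable itself comes from this PR:

```sh
# Default: matrix multiplications go through oneDNN when llama.cpp was built with it.
./build/bin/llama-cli -m model.gguf -ngl 99 -p "Hello"

# Fall back to the oneMKL path at runtime, without rebuilding.
GGML_SYCL_DISABLE_DNN=1 ./build/bin/llama-cli -m model.gguf -ngl 99 -p "Hello"
```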
I think this is fine for now for development purposes, but I would reconsider exposing this to users as a backend option.
Agreed. The simpler the better. Once the oneDNN implementation is tuned, I would opt for removing the MKL path completely, together with these options. But we will decide that in the future.
@lslusarczyk Mind rebasing on top of master? I'm seeing very bad performance in this PR, but it seems to be related to not having #13343 in this branch. A previous PR disabled non-contiguous src1 mul_mats due to some bugs, which translates to very poor performance in prompt processing. Numbers are not reliable until you merge it. (It's caused by these lines: llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp, lines 3935 to 3937 in ac04335.)
Yes. I've spent the last few days resolving conflicts and fixing test cases that started to fail after rebasing onto recent changes. Everything is rebased and works for me locally now. I will test performance on my own again, and if all looks good I will ask you here to test performance again.
LGTM overall!
@@ -731,6 +731,7 @@ use 1 SYCL GPUs: [0] with Max compute units:512
| GGML_SYCL_DEVICE_ARCH | Optional (except for AMD) | Set the SYCL device architecture, optional except for AMD. Setting the device architecture can improve the performance. See the table [--offload-arch](https://github.com/intel/llvm/blob/sycl/sycl/doc/design/OffloadDesign.md#--offload-arch) for a list of valid architectures. |
| GGML_SYCL_F16 | OFF *(default)* \|ON *(optional)* | Enable FP16 build with SYCL code path. |
| GGML_SYCL_GRAPH | ON *(default)* \|OFF *(Optional)* | Enable build with [SYCL Graph extension](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc). |
| GGML_SYCL_DNN | ON *(default)* \|OFF *(Optional)* | Enable build with oneDNN. |
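As a hedged sketch, assuming the usual Intel oneAPI SYCL build flow documented for this backend, the new build option would be set like this (only GGML_SYCL_DNN is new in this PR):

```sh
# Typical SYCL build with Intel oneAPI compilers, with the oneDNN path disabled at build time.
source /opt/intel/oneapi/setvars.sh
cmake -B build -DGGML_SYCL=ON -DGGML_SYCL_DNN=OFF \
      -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j
```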
What is the reason for adding this flag? Currently oneDNN is enabled if CMake is able to find it. Given that we want to prioritize oneDNN and will probably remove the option to use oneMKL/oneMath, I feel that we will remove this option soon.
Until oneDNN is the only option, there should be a way to control which one, MKL or DNN, is used, e.g. for us to compare performance. Not having this option means anyone who does not want to use oneDNN with llama.cpp needs to uninstall oneDNN or resort to ugly hacks to keep it from being detected.
This option will probably be removed soon, together with oneMKL/oneMath, and that is OK. But we may also need it a bit longer in case of problems with oneDNN. Let's have it in the meantime.
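To illustrate that comparison workflow, a rough sketch of an A/B run using the runtime switch; the model path and build directory are placeholders:

```sh
# Same build, same model; only the oneDNN/oneMKL toggle differs between the two runs.
for disable_dnn in 0 1; do
  GGML_SYCL_DISABLE_DNN=$disable_dnn ./build/bin/llama-bench \
    -m model.gguf -p 512 -n 128 -ngl 99 -sm none -o md
done
```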
I think using the environment variable would be enough for users who want to test disabling oneDNN. It seems to me this just adds more noise to this PR and the one that is going to remove oneMKL/oneMath.
It's not a big concern so feel free to ignore if no one else has an opinion.
Here's the performance I got compared to master using FP32:
Speed-up is calculated by dividing b5352 / b5376 (this PR / master). I'm re-running using FP16 and some more models/quantizations, but so far it looks good.
Results for FP16: So far the results are roughly equivalent, even better in some cases. We have a couple of cases that may require looking into a bit more.
Benchmark Results
Edit: llama-bench command:

```sh
export GGML_SYCL_PRIORITIZE_DMMV=0
export GGML_SYCL_DISABLE_OPT=0
${LLAMA_BUILD_DIR}/bin/llama-bench \
  -m /home/shared/llama.cpp-models/DeepSeek-R1-Distill-Qwen-1.5B-Q4_0.gguf \
  -m /home/shared/llama.cpp-models/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf \
  -m /home/shared/llama.cpp-models/emma-500-llama2-7b-Q4_0.gguf \
  -m /home/shared/llama.cpp-models/emma-500-llama2-7b-Q4_K_M.gguf \
  -m /home/shared/llama.cpp-models/gemma-2-2b-it-Q4_K_M.gguf \
  -m /home/shared/llama.cpp-models/Phi-3.5-mini-instruct-Q4_0.gguf \
  -m /home/shared/llama.cpp-models/Phi-3.5-mini-instruct-Q4_K_M.gguf \
  -m /home/shared/llama.cpp-models/solar-10.7b-instruct-v1.0.Q6_K.gguf \
  -m /home/shared/llama.cpp-models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -p 512 \
  -n 128 \
  -pg 0,0 \
  -mmp 0 \
  -t 8 \
  -r 5 \
  -sm none \
  -ngl 99 \
  -o md
```
That looks good to me. To my understanding, this patch should not impact TG (text generation). There were some concerns about potential regressions in PP (prompt processing), but overall it looks the same.
LGTM as well. Thanks for your contribution.
Summary:
This is the first change towards making SYCL graphs work by avoiding calls that issue sycl::wait inside the recording phase. MKL uses waits a lot, while oneDNN does not. Also, oneDNN should be preferred over MKL for AI workloads, hence the change.
The runtime flag is meant to ease performance comparison between MKL and oneDNN and is a safeguard for customers in case someone finds that oneDNN is slower in a particular case.
The compile-time flag makes configuring SYCL with oneDNN cleaner.