[BLAS] SYCL-Graph integration for native-command #669
Conversation
Currently on a CUDA backend to SYCL, when running `GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0` I see crashes from three operations:

1) `-o MUL_MAT`: an issue arising from recording the oneMath `ext_codeplay_enqueue_native_command`.
2) `-o CONCAT`: use of blocking waits on a queue that is being recorded (https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/concat.cpp#L185-L187); can these wait calls just be removed?
3) `-o MUL_MAT_ID`: a blocking wait on a recording queue for a copy to host memory (https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/ggml-sycl.cpp#L3072-L3074); could the host work be wrapped in a host-task? (See the sketch below.)

For 1) I have come up with a oneMath fix in uxlfoundation/oneMath#669. I've put a provisional git tag to pull in this PR for testing, but will update to the upstream commit once merged.

For 2) and 3) we've noticed that `ggml-cuda.cu` has the [check_node_graph_compatibility_and_refresh_copy_ops](https://github.com/ggml-org/llama.cpp/blob/39e73ae0d69f882d7e29cecc6dd8f5052fca6731/ggml/src/ggml-cuda/ggml-cuda.cu#L2458-L2458) method for checking whether a graph can be used even when graphs are enabled. I've taken a similar approach in this PR by adding a method to `ggml-sycl.cpp` that checks whether a graph can be used for the operations, even if the user has asked for graphs to be enabled.
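For 3), a rough, hypothetical sketch of the host-task idea (not the actual llama.cpp code): replace the blocking `wait()` after a device-to-host copy with a `host_task` that depends on the copy, so the host work becomes an ordinary node that is safe to record into a graph.

```cpp
#include <sycl/sycl.hpp>
#include <cstddef>

// Illustrative only: dev_src/host_dst and the processing step are
// placeholders, not the real MUL_MAT_ID code.
void copy_then_process(sycl::queue& q, const float* dev_src, float* host_dst,
                       std::size_t n) {
    // Asynchronous device-to-host copy; note there is no q.wait() afterwards,
    // which would be illegal on a queue that is recording to a SYCL-Graph.
    sycl::event copy_ev = q.memcpy(host_dst, dev_src, n * sizeof(float));

    // The host-side work runs once the copy completes, without blocking the
    // (possibly recording) queue on the calling thread.
    q.submit([&](sycl::handler& cgh) {
        cgh.depends_on(copy_ev);
        cgh.host_task([=] {
            // ... use host_dst here ...
        });
    });
}
```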
Force-pushed from 671d9bc to 3c06934.
To answer your questions:
- Yes, we need to ensure this can build with the latest public oneAPI release.
- Yes, it would be best to have a test with SYCL-Graph if this is something we want to support. I'm thinking it could be a separate test file. I don't think we would need to test every operation; some sort of example using a single oneMath operation could be enough? (A rough sketch follows below.)
Let me know if you think you will still need this oneMath change. The llama PR using oneDNN looks promising and if it works well we could remove the dependency on oneMath.
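As a point of reference, a minimal sketch of what such a test could look like: record a single oneMath GEMM into a SYCL-Graph, then finalize and execute it. This is an assumption-laden illustration, not code from this PR; the `oneapi/math.hpp` header, the `oneapi::math` namespace, and the exact `gemm` signature should be checked against the actual tree.

```cpp
#include <sycl/sycl.hpp>
#include <oneapi/math.hpp>  // assumed oneMath umbrella header
#include <cstdint>

namespace sycl_exp = sycl::ext::oneapi::experimental;

// Record one BLAS call into a graph; a single operation should be enough to
// exercise the ext_codeplay_enqueue_native_command recording path.
void record_gemm_into_graph(sycl::queue& q, const float* a, const float* b,
                            float* c, std::int64_t n) {
    sycl_exp::command_graph graph{q.get_context(), q.get_device()};

    graph.begin_recording(q);
    oneapi::math::blas::column_major::gemm(
        q, oneapi::math::transpose::nontrans, oneapi::math::transpose::nontrans,
        n, n, n, 1.0f, a, n, b, n, 0.0f, c, n);
    graph.end_recording(q);

    auto exec = graph.finalize();
    q.ext_oneapi_graph(exec).wait();  // replay the recorded GEMM
}
```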
#ifdef SYCL_EXT_ONEAPI_ENQUEUE_NATIVE_COMMAND
    sc.begin_recording_if_graph(queue);
Why is this added only for `gemm_batch`? I understand llama only uses oneMath for batch gemm (if oneDNN is found), but from the oneMath perspective it would look odd. I think this could be supported for all operations by calling `begin_recording_if_graph` and `end_recording_if_graph` in `static inline void host_task_internal(H& cgh, sycl::queue queue, F f)` (sketched below).
Also note that there is a llama PR to start using oneDNN for batch gemm too, which could be another way to avoid the issue; see ggml-org/llama.cpp#12972.
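To make the suggestion concrete, a loose sketch of what hoisting the guards into the shared helper might look like. This is illustrative only: the real `host_task_internal` in oneMath differs (in particular, `f` receives a backend scoped handle rather than the raw interop handle), and the no-op `begin/end_recording_if_graph` placeholders merely stand in for the helpers this PR adds.

```cpp
#include <sycl/sycl.hpp>

// Placeholders so the sketch is self-contained; the PR's real helpers would
// start/stop native stream capture when the queue records to a SYCL-Graph.
inline void begin_recording_if_graph(sycl::interop_handle, sycl::queue) {}
inline void end_recording_if_graph(sycl::interop_handle, sycl::queue) {}

// Every oneMath operation funnels its backend work through this helper, so
// guarding here would cover all operations rather than just gemm_batch.
template <typename H, typename F>
static inline void host_task_internal(H& cgh, sycl::queue queue, F f) {
#ifdef SYCL_EXT_ONEAPI_ENQUEUE_NATIVE_COMMAND
    cgh.ext_codeplay_enqueue_native_command([f, queue](sycl::interop_handle ih) {
        begin_recording_if_graph(ih, queue);  // begin capture if recording
        f(ih);                                // backend work (e.g. cuBLAS calls)
        end_recording_if_graph(ih, queue);    // end capture, splice graph nodes
    });
#else
    cgh.host_task(f);  // fallback when the native-command extension is absent
#endif
}
```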
Force-pushed from 5240f70 to a332a53.
In order to support applications calling the library with a SYCL queue that is recording to a SYCL-Graph, check if the `ext_codeplay_enqueue_native_command` command-group is being recorded to a graph object. If so, use the native stream recording APIs to add the BLAS calls as nodes in the graph.

In particular this fixes the llama.cpp unit test `MUL_MAT(type_a=f16,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[2,1],per=[0,1,2,3],v=0)` on CUDA with SYCL-Graph enabled. Previously this would throw an error:

```sh
$ GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0 -o MUL_MAT -p type_a=f16,type_b=f32,m=16,n=1,k=256,bs=\\[1,1\\],nr=\\[2

UR CUDA ERROR:
        Value:           700
        Name:            CUDA_ERROR_ILLEGAL_ADDRESS
        Description:     an illegal memory access was encountered
        Function:        operator()
        Source Location: $HOME/dpcpp/unified-runtime/source/adapters/cuda/queue.cpp:154

Native API failed. Native API returns: 2147483646 (UR_RESULT_ERROR_UNKNOWN)
Exception caught at file:$HOME/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp, line:3598, func:operator()
SYCL error: CHECK_TRY_ERROR((stream)->wait()): Meet error in this line code!
  in function ggml_backend_sycl_synchronize at $HOME/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:3598
$HOME/llama.cpp/ggml/src/ggml-sycl/../ggml-sycl/common.hpp:118: SYCL error
Could not attach to process. If your uid matches the uid of the target process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
```
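For context, a loose sketch of what "use the native stream recording APIs" can mean on the CUDA backend, pieced together from the sycl_ext_codeplay_enqueue_native_command extension and CUDA's graph-capture API; the actual oneMath change may be structured differently, and error handling is omitted.

```cpp
#include <sycl/sycl.hpp>
#include <cuda.h>

// Body of a native command: if the enclosing queue is recording to a
// SYCL-Graph, capture the stream work into the backend graph instead of
// executing it eagerly. Error checking omitted for brevity.
void native_command_body(sycl::interop_handle ih, CUstream stream) {
    if (ih.ext_codeplay_has_graph()) {
        CUgraph native_graph =
            ih.ext_codeplay_get_native_graph<sycl::backend::ext_oneapi_cuda>();
        cuStreamBeginCaptureToGraph(stream, native_graph,
                                    /*dependencies=*/nullptr,
                                    /*dependencyData=*/nullptr,
                                    /*numDependencies=*/0,
                                    CU_STREAM_CAPTURE_MODE_GLOBAL);
        // ... enqueue the cuBLAS calls on `stream`; they become graph nodes ...
        CUgraph capture_result;
        cuStreamEndCapture(stream, &capture_result);
    } else {
        // ... enqueue the cuBLAS calls on `stream` directly ...
    }
}
```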
Force-pushed from a332a53 to 8d153ca.
TODO
Before taking out of draft:
- Decide if we need [SYCL] Bump native enqueue extension version intel/llvm#18321 so as to guard against breakages when building against the 2025.1 oneAPI release. Update: now has a guard on version two of the feature macro.
- Decide whether we need to extend the `tests/unit_tests/blas/batch/gemm_batch_usm.cpp` testing to have a mechanism that stresses recording the `gemm_batch` USM API to a queue that is recording to a graph. Update: added test.