
[BLAS] SYCL-Graph integration for native-command #669


Draft: EwanC wants to merge 1 commit into develop from sycl-graph_native-command

Conversation

@EwanC (Contributor) commented May 7, 2025

To support applications that call the library with a SYCL queue that is recording to a SYCL-Graph, check whether the `ext_codeplay_enqueue_native_command` command-group is being recorded to a graph object. If so, use the native stream-recording APIs to add the BLAS calls as nodes in the graph.
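As an illustration of the mechanism, here is a minimal sketch of the graph check inside a native-command command-group on the CUDA backend. The graph queries (`ext_codeplay_has_graph`, `ext_codeplay_get_native_graph`) assume version 2 of the extension's feature macro, `cuStreamBeginCaptureToGraph` requires CUDA 12.3+, and `enqueue_cublas_work` is a hypothetical stand-in for the library's cuBLAS submission; none of this is taken verbatim from the PR.

```cpp
#include <sycl/sycl.hpp>
#include <cuda.h>

// Hypothetical stand-in for the library's cuBLAS call on a native stream.
void enqueue_cublas_work(CUstream stream);

void submit_native(sycl::queue &q) {
  q.submit([&](sycl::handler &cgh) {
    cgh.ext_codeplay_enqueue_native_command([=](sycl::interop_handle ih) {
      CUstream stream = ih.get_native_queue<sycl::backend::ext_oneapi_cuda>();
#if SYCL_EXT_ONEAPI_ENQUEUE_NATIVE_COMMAND >= 2
      if (ih.ext_codeplay_has_graph()) {
        // The queue is recording to a SYCL-Graph: capture the native stream
        // so the enqueued BLAS work becomes nodes in the native CUDA graph.
        CUgraph graph =
            ih.ext_codeplay_get_native_graph<sycl::backend::ext_oneapi_cuda>();
        cuStreamBeginCaptureToGraph(stream, graph, nullptr, nullptr, 0,
                                    CU_STREAM_CAPTURE_MODE_GLOBAL);
        enqueue_cublas_work(stream);
        cuStreamEndCapture(stream, &graph);
        return;
      }
#endif
      enqueue_cublas_work(stream);  // eager path, unchanged
    });
  });
}
```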

In particular, this fixes the llama.cpp unit test
`MUL_MAT(type_a=f16,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[2,1],per=[0,1,2,3],v=0)` on CUDA with SYCL-Graph enabled. Previously this would throw an error:

```sh
$ GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0 -o MUL_MAT -p type_a=f16,type_b=f32,m=16,n=1,k=256,bs=\\[1,1\\],nr=\\[2

UR CUDA ERROR:
        Value:           700
        Name:            CUDA_ERROR_ILLEGAL_ADDRESS
        Description:     an illegal memory access was encountered
        Function:        operator()
        Source Location: $HOME/dpcpp/unified-runtime/source/adapters/cuda/queue.cpp:154

Native API failed. Native API returns: 2147483646 (UR_RESULT_ERROR_UNKNOWN)
Exception caught at file:$HOME/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp, line:3598, func:operator()
SYCL error: CHECK_TRY_ERROR((stream)->wait()): Meet error in this line code!
  in function ggml_backend_sycl_synchronize at $HOME/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:3598
$HOME/llama.cpp/ggml/src/ggml-sycl/../ggml-sycl/common.hpp:118: SYCL error
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
```

TODO

Before taking out of draft:

  • Decide if we need [SYCL] Bump native enqueue extension version intel/llvm#18321 to guard against breakages when building against the 2025.1 oneAPI release. Update: now guarded on version two of the feature macro.
  • Decide whether we need to extend tests/unit_tests/blas/batch/gemm_batch_usm.cpp with a mechanism that stresses the gemm_batch USM API while the queue is recording to a graph. Update: test added; see the sketch after this list.
  • Make the solution generic enough to work with more operations.
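One possible shape for such a test is sketched below. It assumes the `sycl_ext_oneapi_graph` recording API; the BLAS arguments are elided and shown only in a comment, and the merged test may differ.

```cpp
#include <sycl/sycl.hpp>
#include <oneapi/math.hpp>  // oneMath headers (assumed name)

namespace sycl_exp = sycl::ext::oneapi::experimental;

// Record one USM group-API gemm_batch into a SYCL-Graph, then finalize and
// replay it, instead of executing eagerly.
void record_and_replay(sycl::queue &q) {
  sycl_exp::command_graph graph(q.get_context(), q.get_device());

  graph.begin_recording(q);
  // The gemm_batch submission goes through the native-command path, which
  // now detects the recording queue and adds graph nodes, e.g.:
  //   oneapi::math::blas::column_major::gemm_batch(q, transa, transb, m, n,
  //       k, alpha, a, lda, b, ldb, beta, c, ldc, group_count, group_size);
  graph.end_recording(q);

  auto exec = graph.finalize();
  q.ext_oneapi_graph(exec).wait();  // replay the captured work
}
```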

EwanC added a commit to EwanC/llama.cpp that referenced this pull request May 7, 2025
Currently, running `GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0` on a CUDA backend to SYCL crashes in three operations:

1) `-o MUL_MAT`: an issue arising from recording the oneMath `ext_codeplay_enqueue_native_command`.
2) `-o CONCAT`: blocking waits on a queue that is being recorded (https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/concat.cpp#L185-L187); can these wait calls simply be removed?
3) `-o MUL_MAT_ID`: a blocking wait on a recording queue for a copy to host memory (https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/ggml-sycl.cpp#L3072-L3074); could the host work be wrapped in a host-task?

For 1) I have come up with a oneMath fix in uxlfoundation/oneMath#669. I've put a provisional git tag to pull in this PR for testing, and will update to the upstream commit once it is merged.

For 2) and 3), we've noticed that `ggml-cuda.cu` has the [check_node_graph_compatibility_and_refresh_copy_ops](https://github.com/ggml-org/llama.cpp/blob/39e73ae0d69f882d7e29cecc6dd8f5052fca6731/ggml/src/ggml-cuda/ggml-cuda.cu#L2458-L2458) method, which checks whether a graph can be used even when graphs are enabled. I've taken a similar approach in this PR by adding a method to `ggml-sycl.cpp` that checks whether a graph can be used for the operations, even if the user has asked for graphs to be enabled.
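A sketch of what such a check could look like on the SYCL side (illustrative only: the function name and the exact set of rejected operations are placeholders, not the code from the llama.cpp PR):

```cpp
#include "ggml.h"

// Walk the ggml graph and refuse SYCL-Graph capture when a node uses an
// operation known to block on the recording queue (illustrative list).
static bool ggml_sycl_graph_compatible(const struct ggml_cgraph * cgraph) {
    for (int i = 0; i < cgraph->n_nodes; ++i) {
        switch (cgraph->nodes[i]->op) {
            case GGML_OP_CONCAT:      // blocking waits while recording
            case GGML_OP_MUL_MAT_ID:  // blocking copy to host memory
                return false;         // fall back to eager submission
            default:
                break;
        }
    }
    return true;
}
```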
@EwanC EwanC force-pushed the sycl-graph_native-command branch from 671d9bc to 3c06934 on May 9, 2025 09:40
@Rbiessy (Contributor) left a comment


To answer your questions:

  1. Yes, we need to ensure this can build with the latest public oneAPI release.
  2. Yes, it would be best to have a test with SYCL-Graph if this is something we want to support. I'm thinking it could be a separate test file. I don't think we would need to test every operation; some sort of example using a single oneMath operation could be enough.

Let me know if you think you will still need this oneMath change. The llama PR using oneDNN looks promising, and if it works well we could remove the dependency on oneMath.

```cpp
#ifdef SYCL_EXT_ONEAPI_ENQUEUE_NATIVE_COMMAND
sc.begin_recording_if_graph(queue);
```
Why is this added only for gemm_batch? I understand llama only uses oneMath for batch gemm (if oneDNN is found), but from the oneMath perspective it would look odd. I think this could be supported for all operations by calling begin_recording_if_graph and end_recording_if_graph in

```cpp
static inline void host_task_internal(H& cgh, sycl::queue queue, F f) {
```

Also note that there is a llama PR to start using oneDNN for batch gemm too, which could be another way to avoid the issue. See ggml-org/llama.cpp#12972.
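To make the suggestion concrete, here is one way the helper could be generalized. This is a sketch, not the merged code: it assumes an `end_recording_if_graph` counterpart to the PR's `begin_recording_if_graph`, and `scoped_context_handler` is a placeholder for whatever backend context object provides those helpers.

```cpp
template <typename H, typename F>
static inline void host_task_internal(H &cgh, sycl::queue queue, F f) {
#ifdef SYCL_EXT_ONEAPI_ENQUEUE_NATIVE_COMMAND
    cgh.ext_codeplay_enqueue_native_command([f, queue](sycl::interop_handle ih) {
        auto sc = scoped_context_handler(ih);  // placeholder backend context
        sc.begin_recording_if_graph(queue);    // begin native capture if recording
        f(ih);                                 // the backend BLAS call
        sc.end_recording_if_graph(queue);      // assumed counterpart
    });
#else
    cgh.host_task(f);
#endif
}
```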

@EwanC EwanC force-pushed the sycl-graph_native-command branch 2 times, most recently from 5240f70 to a332a53 on May 15, 2025 10:54
@EwanC EwanC force-pushed the sycl-graph_native-command branch from a332a53 to 8d153ca on May 15, 2025 11:24