Misc. bug: Compute pipeline creation failed when using Flash Attention on macOS/Vulkan #13450
Comments
Hmm, nothing obviously wrong in the validation logs. Are you able to capture the Metal shader that is failing?
Sure, I ran test-backend-ops with only this test enabled:
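(For reference, something like the following restricts test-backend-ops to a single op; the -o filter and the FLASH_ATTN_EXT op name are assumptions for illustration, not the command from the original run.)

# run only the flash-attention tests (op name assumed)
./test-backend-ops -o FLASH_ATTN_EXT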
Is there source for the translated Metal shader in there? I don't know how to decode that.
You're right, the trace seems to be corrupted. Maybe because test-backend-ops is crashing too soon? The trace was done through MoltenVK like this: I'm currently trying to do it through Xcode by attaching the debugger to the process, but that also seems to crash before it can capture anything useful. I can see in the terminal that the pipeline creation error has already happened, but there is nothing to capture in Xcode: nothing in the "FPS" tab, and the "Capture GPU Workload" option is greyed out.
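(One possible culprit, assumed here rather than confirmed in the thread: Metal only allows GPU captures triggered programmatically, which is how MoltenVK initiates them, when the captured process has MTL_CAPTURE_ENABLED set in its environment.)

# assumed prerequisite for GPU captures triggered outside Xcode's own launch flow
export MTL_CAPTURE_ENABLED=1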
Is there a way to enable debug output for MoltenVK? On our side it's successfully reporting:
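(For reference, MoltenVK also reads logging-related configuration from environment variables; the variable names and values below are assumptions based on MoltenVK's documented configuration options, not something confirmed in this thread.)

# assumed MoltenVK configuration variables; higher log levels are more verbose
export MVK_CONFIG_LOG_LEVEL=3
export MVK_CONFIG_DEBUG=1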
I think I got something more useful by installing the debug version of MoltenVK. The trace is still corrupted, but the logs are more verbose and there is what looks like shader code in there. It's quite long, so I'm attaching it as a text file.
Thanks, there are several errors related to use of gl_WorkGroupSize.x as a constant expression:
Maybe we need to declare another variable using the same spec id and use that for the shared memory variables?
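A minimal sketch of that idea (illustrative names, not the actual patch): tie local_size_x to a specialization constant id, declare a constant with the same id, and size the shared arrays with that constant instead of gl_WorkGroupSize.x:

// bind the workgroup size and a named constant to the same spec id (id 0 assumed)
layout(local_size_x_id = 0) in;
layout(constant_id = 0) const uint WorkGroupSize = 128; // default, overridden at pipeline creation
// size shared memory with the spec constant rather than gl_WorkGroupSize.x,
// which the SPIR-V -> MSL translation rejects as a constant expression
shared float tmpsh[WorkGroupSize];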
@soerenkampschroer please try this change: e1c331f
Your commit seems to have fixed the issue, no more errors:
❯ ./llama-bench -ngl 99 -m ~/.cache/sanctum/models/bartowski/Qwen2.5-Coder-3B-GGUF/Qwen2.5-Coder-3B-Q4_K_M.gguf -fa 0,1
Just wanted to clear up that the performance gap is not as big with larger models. I haven't had time to test it thoroughly, but the numbers for larger models are much better:
./llama-bench -ngl 30 -m ~/models/bartowski/Qwen_Qwen3-30B-A3B-GGUF/Qwen_Qwen3-30B-A3B-Q4_K_M.gguf -fa 0,1
./llama-bench -ngl 100 -m ~/models/bartowski/Qwen_Qwen3-14B-GGUF/Qwen_Qwen3-14B-Q4_K_M.gguf -fa 0,1
Name and Version
version: 5335 (d891942)
built with Apple clang version 17.0.0 (clang-1700.0.13.3) for x86_64-apple-darwin24.4.0
Operating systems
Mac
Which llama.cpp modules do you know to be affected?
llama-server, llama-bench, llama-cli
Command line
VK_LOADER_DEBUG=all ./llama-bench -ngl 99 -m ~/models/bartowski/Qwen2.5-Coder-3B-GGUF/Qwen2.5-Coder-3B-Q4_K_M.gguf -fa 1
Problem description & steps to reproduce
Flash Attention is not working on macOS / Vulkan. Trying to use it (-fa 1) results in the following error:
This is running on:
@jeffbolznv I've attached the full logs of a llama-bench run with validation layers enabled below.
First Bad Commit
dc1d2ad
#13324
Relevant log output