vulkan: scalar flash attention implementation #13324
Umm, unfortunately it doesn't look like there's an improvement for my setup, even for the RTX 2070 GPU. Short version:
With Flash Attention (-fa)
GPU | Prompt Eval Time (ms) | Prompt ms/token | Prompt tokens/sec | Eval Time (ms) | Eval ms/token | Eval tokens/sec |
---|---|---|---|---|---|---|
RX 480 | 7402 | 5.5 | 182 | 19191 | 48 | 21 |
RTX 2070 | 4953 | 3.7 | 272 | 11552 | 31 | 33 |
Without Flash Attention
GPU | Prompt Eval Time (ms) | Prompt ms/token | Prompt tokens/sec | Eval Time (ms) | Eval ms/token | Eval tokens/sec |
---|---|---|---|---|---|---|
RX 480 | 4892 | 3.64 | 275 | 17451 | 48 | 21 |
RTX 2070 | 4907 | 3.65 | 274 | 8579 | 29 | 34 |
Long version:
Hardware Configuration
- CPU: Intel Xeon E5-2620 v3 (6 cores, 12 threads, Haswell)
- Memory: Quad-channel DDR4 @ 1866 MHz
- GPUs:
- CUDA0: NVIDIA RTX 2070 (CUDA backend)
- VULKAN0: AMD RX 480 8GB (Vulkan backend)
- VULKAN1: NVIDIA RTX 2070 (Vulkan backend)
Repository Status (Sanity Check)
git status
# On branch master
# Your branch is ahead of 'origin/master' by 1 commit.
git log -1
# commit 6c7443cbcfc34c9247166a3f9ed9cfe762441a43 (HEAD -> master)
# vulkan: scalar flash attention implementation
Test Prompt
- Length: 1984 tokens
- Source: Default sillytavern conversation prompt
Performance Results
1. Normal Setup
1.1 With Flash Attention (-fa)
Command | GPU | Prompt Eval Time | Prompt Tokens | Prompt ms/token | Prompt tokens/sec | Eval Time | Eval Tokens | Eval ms/token | Eval tokens/sec |
---|---|---|---|---|---|---|---|---|---|
./build/bin/llama-server -m /share/Qwen3-4B-UD-Q4_K_XL.gguf -dev Vulkan0 -ngl 99 -c 8192 --host :: -fa | RX 480 (VULKAN0) | 7402.49 ms | 1345 | 5.50 | 181.70 | 19191.17 ms | 400 | 47.98 | 20.84 |
./build/bin/llama-server -m /share/Qwen3-4B-UD-Q4_K_XL.gguf -dev Vulkan0 -ngl 99 -c 8192 --host :: -fa -ctk q4_0 -ctv q4_0 | RX 480 (VULKAN0) | 37511.76 ms | 1345 | 27.89 | 35.86 | 45465.21 ms | 391 | 116.28 | 8.60 |
./build/bin/llama-server -m /share/Qwen3-4B-UD-Q4_K_XL.gguf -dev VULKAN1 -ngl 99 -c 8192 --host :: -fa | RTX 2070 (VULKAN1) | 4952.98 ms | 1345 | 3.68 | 271.55 | 11552.00 ms | 377 | 30.64 | 32.64 |
./build/bin/llama-server -m /share/Qwen3-4B-UD-Q4_K_XL.gguf -dev VULKAN1 -ngl 99 -c 8192 --host :: -fa -ctk q4_0 -ctv q4_0 | RTX 2070 (VULKAN1) | 41505.56 ms | 1345 | 30.86 | 32.41 | 58276.78 ms | 400 | 145.69 | 6.86 |
1.2 Without Flash Attention
Command | GPU | Prompt Eval Time | Prompt Tokens | Prompt ms/token | Prompt tokens/sec | Eval Time | Eval Tokens | Eval ms/token | Eval tokens/sec |
---|---|---|---|---|---|---|---|---|---|
./build/bin/llama-server -m /share/Qwen -dev Vulkan0 -ngl 99 -c 8192 --host :: | RX 480 (VULKAN0) | 4891.92 ms | 1345 | 3.64 | 274.94 | 17451.06 ms | 362 | 48.21 | 20.74 |
./build/bin/llama-server -m /share/Qwen3-4B-UD-Q4_K_XL.gguf -dev VULKAN1 -ngl 99 -c 8192 --host :: | RTX 2070 (VULKAN1) | 4906.50 ms | 1345 | 3.65 | 274.13 | 8579.09 ms | 292 | 29.38 | 34.04 |
2. Experimental Setup (Patch for Issue #13164)
- Patch applied to increase matrix multiplication size limit from 3072 to 8192
--- a/ggml/src/ggml-vulkan/ggml-vulkan.cpp
+++ b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
- GGML_ASSERT(nei0 * nei1 <= 3072);
+ GGML_ASSERT(nei0 * nei1 <= 8192);
--- a/ggml/src/ggml-vulkan/vulkan-shaders/mul_mm.comp
+++ b/ggml/src/ggml-vulkan/vulkan-shaders/mul_mm.comp
- shared u16vec2 row_ids[3072];
+ shared u16vec2 row_ids[8192];
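For reference, after applying a patch like this the Vulkan backend and its shaders need to be rebuilt; a minimal sketch using the same configure command that appears later in this thread (the parallel job count is an arbitrary choice):
# reconfigure and rebuild with the Vulkan backend enabled
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j 8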
2.1 With Flash Attention (-fa)
Command | GPU | Prompt Eval Time | Prompt Tokens | Prompt ms/token | Prompt tokens/sec | Eval Time | Eval Tokens | Eval ms/token | Eval tokens/sec |
---|---|---|---|---|---|---|---|---|---|
./build/bin/llama-server -dev CUDA0,Vulkan0 -ngl 99 -c 8192 -m /share/Qwen3-30B-A3B-UD-Q3_K_XL.gguf -fa --batch-size 1200 --host :: | RTX 2070 (CUDA0) + RX 480 (VULKAN0) | 47275.80 ms | 1306 | 36.20 | 27.63 | 19982.30 ms | 400 | 49.96 | 20.02 |
./build/bin/llama-server -m /share/Qwen3-4B-UD-Q4_K_XL.gguf -dev VULKAN0 -ngl 99 -c 8192 --host :: -fa | RX 480 (VULKAN0) | 7390.74 ms | 1345 | 5.49 | 181.98 | 18178.16 ms | 380 | 47.84 | 20.90 |
2.2 Without Flash Attention
Command | GPU | Prompt Eval Time | Prompt Tokens | Prompt ms/token | Prompt tokens/sec | Eval Time | Eval Tokens | Eval ms/token | Eval tokens/sec |
---|---|---|---|---|---|---|---|---|---|
./build/bin/llama-server -dev CUDA0,Vulkan0 -ngl 99 -c 8192 -m /share/Qwen3-30B-A3B-UD-Q3_K_XL.gguf --batch-size 1200 --host :: | RTX 2070 (CUDA0) + RX 480 (VULKAN0) | 46707.90 ms | 1345 | 34.73 | 28.80 | 17597.43 ms | 400 | 43.99 | 22.73 |
./build/bin/llama-server -m /share/Qwen3-4B-UD-Q4_K_XL.gguf -dev VULKAN0 -ngl 99 -c 8192 --host :: | RX 480 (VULKAN0) | 4915.57 ms | 1345 | 3.65 | 273.62 | 11767.99 ms | 244 | 48.23 | 20.73 |
On the 2070 system, does it report coopmat1 or coopmat2 support? If it's coopmat2 then FA is already accelerated.
I didn't add support for quantized KV yet (it's probably not a ton of work, just didn't think it was critical for the first version), so these tests will continue to fall back to the CPU.
Not sure how to confirm this. Coopmat is not mentioned in stdout when running llama-server. llama-cpp was built with
Ah, I see, good idea.
llama-server should print something like this when using the Vulkan backend:
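For reference, a representative line of this kind, as reported elsewhere in this thread, looks like the following; the relevant field is the "matrix cores" entry at the end:
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2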
What does yours say for matrix cores?
I've just tested it on both the Radeon RX 7800 XT and the Radeon RX 5700 XT and the performance is pretty close to non-FA.
RX 7800 XT
ggml_vulkan: 0 = AMD Radeon RX 7800 XT (RADV NAVI32) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
ggml_vulkan: 0 = AMD Radeon RX 7800 XT (AMD open-source driver) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
ggml_vulkan: 0 = AMD Radeon RX 7800 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
RX 5700 XT
ggml_vulkan: 0 = AMD Radeon RX 5700 XT (RADV NAVI10) (radv) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 0 = AMD Radeon RX 5700 XT (AMD open-source driver) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 32768 | int dot: 0 | matrix cores: none
ggml_vulkan: 0 = AMD Radeon RX 5700 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 32768 | int dot: 0 | matrix cores: none
I'm not sure if there are specific FA tests in
It looks like it's not enabled:
Looking a bit closer, I think that's coming from the build config. I'm using a standard Debian server installation. From what I understand, the libvulkan version shipped with Debian Bookworm (1.3.239) is probably too old to support these "new" extensions.
In any case, I see that it's working on my desktop with Arch Linux and a 3080 Ti:
Full debian server cmake -B build -DGGML_VULKAN=ON output:
cmake -B build -DGGML_VULKAN=ON
-- The C compiler identification is GNU 12.2.0
-- The CXX compiler identification is GNU 12.2.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.39.5")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- ccache found, compilation results will be cached. Disable with GGML_CCACHE=OFF.
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- Including CPU backend
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- x86 detected
-- Adding CPU backend variant ggml-cpu: -march=native
-- Found Vulkan: /usr/lib/x86_64-linux-gnu/libvulkan.so (found version "1.3.239") found components: glslc glslangValidator
-- Vulkan found
-- GL_KHR_cooperative_matrix not supported by glslc
-- GL_NV_cooperative_matrix2 not supported by glslc
-- GL_EXT_integer_dot_product not supported by glslc
-- GL_EXT_bfloat16 not supported by glslc
-- Including Vulkan backend
-- Found CURL: /usr/lib/x86_64-linux-gnu/libcurl.so (found version "7.88.1")
-- Configuring done
-- Generating done
-- Build files have been written to: /home/joe/ai/temp/llama.cpp/build
Full debian server llama-server output:
./build/bin/llama-server -m /share/Qwen3-4B-UD-Q4_K_XL.gguf -dev VULKAN1 -ngl 99 -c 8192
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 480 Graphics (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 1 = NVIDIA GeForce RTX 2070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: none
build: 5288 (6c7443cb) with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
system info: n_threads = 6, n_threads_batch = 6, total_threads = 12
system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CUDA : ARCHS = 750 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
main: binding port with default address family
OK, so your RTX 2070 should have been using flash attention. Were all your test results with this change? Did you try using flash attention without this change? It would have fallen back to the CPU.
Yes, all tests were made with this change, more specifically with this commit cherry-picked on top of the latest commit at the time of writing (9070365).
Just did the same test again without this commit and indeed it falls back to the CPU and is very slow:
./build/bin/llama-server -m /share/Qwen3-4B-UD-Q4_K_XL.gguf -dev Vulkan0 -ngl 99 -c 8192 --host :: -fa
359a92f691ff74f7fc89cf12cac744bb18ab98df (this PR commit):
./build/bin/llama-server -m /share/Qwen3-4B-UD-Q4_K_XL.gguf -dev Vulkan0 -ngl 99 -c 8192 --host :: -fa
@nalf3in I think there's something wrong with your setup, as your numbers already don't make sense for the non-FA case. First of all, even if you have no DP4A and no matrix cores, the 2070 should easily beat the 480 in prompt processing. Your inference speeds are also really low for a Q4 4B model. Can you run a regular llama-bench?
I didn't use llama-bench previously because it doesn't support the -dev option, which lets me specify which GPU I want to use. From what I understand it isn't possible using llama-bench command line arguments, but I was able to do it anyway using bwrap (see below for the full command line args). Short version of the results:
It seems that the RTX 2070 is still around 1.6x faster than the RX 480 using Vulkan. CUDA is much faster for prompt ingestion (Vulkan doesn't use KHR_coopmat there, though). CUDA without flash attention for reference:
Long version
Commit 141a908
~/ai/temp/llama.cpp$ bwrap
build: 141a908 (5298)
~/ai/temp/llama.cpp$ bwrap
build: 141a908 (5298)
~/ai/temp/llama.cpp$ bwrap
build: 141a908 (5298)
~/ai/temp/llama.cpp$ bwrap
build: 141a908 (5298)
Commit 005756a
~/ai/temp/llama.cpp$ bwrap
build: bd417ee8 (5299)
~/ai/temp/llama.cpp$ bwrap
build: bd417ee8 (5299)
~/ai/temp/llama.cpp$ bwrap
build: bd417ee8 (5299)
~/ai/temp/llama.cpp$ bwrap
build: bd417ee8 (5299)
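The general idea, as a rough hypothetical sketch (not the exact commands used above): mask the unwanted GPU's device node inside the bubblewrap sandbox so the driver only sees the card you want to benchmark. The renderD129 path and model path below are assumptions; check /dev/dri on your system, and note that NVIDIA's proprietary driver exposes /dev/nvidia* nodes instead of a DRM render node.
# Hypothetical sketch: hide the second GPU's render node behind /dev/null,
# then run llama-bench with flash attention enabled inside the sandbox.
bwrap --dev-bind / / \
  --dev-bind /dev/null /dev/dri/renderD129 \
  ./build/bin/llama-bench -m /share/Qwen3-4B-UD-Q4_K_XL.gguf -fa 1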
Anyways I went and tried this out on my RX 470. With FA turned on prompt processing becomes slower and inference becomes faster when I make it generate a lot of text. I guess there's a tradeoff here and this should be quite useful for those new thinking models.
The FA tests are passing on my RX 470 but they're failing on my W8100 when
Oh you can just use the
Yeah, those numbers make more sense now 😉. If you get coopmat2 working it should be much closer to CUDA, but I think it's still going to be a bit slower.
You can also set the env var
I hadn't realized this was happening; it's the leftover ACC_TYPE in the shader that's barely used. I've changed the logic to always select the f32 variant for scalar.
Set to draft, I have a bit more perf tuning I want to try.
This is very exciting. I'll test it across my devices within the next few days.
I have some 7900XTX results to share with the RADV Vulkan driver and the Qwen3 32B Q4_K_S model. I am seeing very nice speedups for token generation at longer context depths, but unfortunately prompt processing drops off a cliff:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
That is expected for you, since the new flash attention shader doesn't use coopmat1 for matrix core acceleration, which your GPU supports and uses for non-FA prompt processing; that's why it's slower. I'll look into a coopmat1 version that would fix this at some point, if nobody else gets to it first.
ROCm numbers for reference, in case they are useful:
ggml_cuda_init: found 1 ROCm devices:
That is honestly great to hear, thank you! I think once KV cache quantization is in place and the prompt processing performance has been resolved, there really isn't much reason left to use ROCm over Vulkan. Vulkan has shown great token generation performance compared to ROCm.
Thanks, it's passing now!
Here are my new results on my 7900XTX. Prompt processing has gotten a really nice performance boost, especially at higher depths, so that's really nice! Unfortunately, token generation has seen a pretty noticeable regression at low to medium depths, for example going from 34.27 tokens/sec at 1024 depth to 32.24 tokens/sec, a roughly 6% performance regression. Interestingly, at high depths the regression disappears and turns into a slight performance lead, going from 23.11 tokens/sec to 24.45 tokens/sec at 16k depth. With prompt caching, token generation is arguably more important than prompt processing, so I'm really hopeful the cause of the regression at low to medium depths can be identified and fixed. Full result here:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
It was probably the tile size change, going from 4 to 8 rows when tg only needs one row. I've pushed a change that only uses 1 row when that's all that's needed. I verified this fixed a small regression when running llama-2-7b.Q4_0.gguf. When I ran Qwen3 I hit #13164; were you working around that in your tests? I also fixed an issue where the last round of optimizations had reintroduced usage of Float16.
Perfect! Recompiling now to retest on my 7900XTX as well. Will let you know as soon as I have the results.
I am testing with the dense 32B model, which is unaffected by that bug. It only impacts the 30B MoE model.
The benchmark has only just started running and will take a while to fully complete, but the initial tests show worse performance for both prompt processing and token generation compared to the previous build. So the token generation regression seems to have gotten worse, and the prompt processing improvements have been reduced. Will edit this post with the full result as soon as it's done, but just wanted to share some initial results:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
Hmm, I don't know what's going on there. I tried Qwen3-14B-Q4_K_M.gguf on my RTX 4070 using the KHR_coopmat path, and I see an improvement vs yesterday with both pp512 @ d1024 and tg128 @ d1024.
On my Ryzen 5 3400G iGPU, most tests get a little bit slower, a few improve slightly; the difference seems to be less than the variation between consecutive runs.
ggml_vulkan: 0 = AMD Radeon Vega 11 Graphics (RADV RAVEN) (radv) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
Qwen3-30B-A3B-UD-Q4_K_XL at e660942:
llama-3.2-1b-instruct-q8_0 at 20a6246:
Updated my table above with the full results. Observations:
The differences between my setup and yours:
I've just retested the latest changes on my cards:
Radeon RX 5700 XT
ggml_vulkan: 0 = AMD Radeon RX 5700 XT (RADV NAVI10) (radv) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 0 = AMD Radeon RX 5700 XT (AMD open-source driver) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 32768 | int dot: 0 | matrix cores: none
ggml_vulkan: 0 = AMD Radeon RX 5700 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 32768 | int dot: 0 | matrix cores: none
Radeon RX 7800 XT
ggml_vulkan: 0 = AMD Radeon RX 7800 XT (RADV NAVI32) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
ggml_vulkan: 0 = AMD Radeon RX 7800 XT (AMD open-source driver) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
ggml_vulkan: 0 = AMD Radeon RX 7800 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
Token generation has improved in almost every scenario; however, there seems to be a constant performance penalty in prompt processing on amdvlk and vulkan_pro. This hardly affects Linux, but since those drivers are based on the Windows ones, the regression may be present there.
Do you just mean that the scalar FA is slower than the KHR_coopmat alternative? This is expected. I'm going to have very limited availability over the next week, and I don't think anybody has reported a serious performance problem. So I suggest we merge this as-is (after any review fixes) and further tuning can happen later.
What I mean is that I compared my first results with today's, and prompt processing performance on both amdvlk and vulkan_pro on Linux got worse. I'm just pointing this out since these drivers behave almost identically to the AMD driver on Windows (Linux uses RADV by default, so it's not an issue there). This is the result from 005756a:
ggml_vulkan: 0 = AMD Radeon RX 7800 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
And this is from 20a6246:
ggml_vulkan: 0 = AMD Radeon RX 7800 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
The performance hit seems to have started with a6c940b and got worse with further commits. Overall it's amazing that we finally have a flash attention implementation on Vulkan for non-coopmat2 hardware. I'm just commenting about this so there's initial data for some future tuning.
ggml_vulkan: 0 = AMD Radeon (TM) Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none
ggml_vulkan: 0 = Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
Performance is good in my tests.
It would be cool if we could figure out the performance regression on AMD non-Mesa drivers, but I wouldn't hold up the PR with it. They constantly cause issues. At least performance with them seems pretty good at this point, apart from this problem.
Your AMD tests are also showing a token generation performance drop after enabling FA. I see the same with the latest build, but that wasn't the case in the earlier version; see here:
Latest build with regressions included:
Yes, but my initial concern is just that there are no issues with the output and that performance is roughly in line with expected numbers. Performance tuning can happen in follow-up PRs.
Fair enough. It does seem better not to have any further holdup, since even with this regression this is a massive improvement and finally allows me to drop ROCm completely, as KV cache quantization was the only thing preventing me from moving over to Vulkan. Just hoping we can get back to the same token generation performance as the earlier version of this PR in a follow-up PR :) Great job on this massive step forward for the Vulkan backend!
Seems to be working very well. Thanks!
Can't wait to test this out once merged!
Compared to my last run, pp2000 is around 30% faster with these new changes, while everything else is pretty close to before. As the others mentioned, optimizations will come eventually, and I think this is good enough to merge.
ggml_vulkan: 0 = AMD Radeon RX 6800 XT (RADV NAVI21) (radv) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
Thanks for fixing #12526! Tests with Qwen2.5-Coder-1.5B-Instruct-Q8_0.gguf and Qwen2.5-Coder-14B-Instruct-Q4_K_L.gguf:
ggml_vulkan: 0 = AMD Radeon RX 6700 XT (RADV NAVI22) (radv) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
Just wanted to let you know that on macOS (Intel CPU/AMD GPU) this doesn't seem to work. I tried using flash attention and I'm getting the following error:
I'm using MoltenVK v1.3.0 and Vulkan SDK v1.4.313 with an RX 6800, macOS 15.4.1.
I can provide more logs here or open a separate issue if you want me to.
Yeah, please file a new issue to track this. Do the validation layers report anything?
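For reference, the Khronos validation layer can usually be enabled through the Vulkan loader without rebuilding; a minimal sketch, assuming the SDK's validation layer is installed (the model path is a placeholder):
# enable validation layers for a llama-server run with flash attention
VK_INSTANCE_LAYERS=VK_LAYER_KHRONOS_validation ./build/bin/llama-server -m model.gguf -ngl 99 -c 8192 -fa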
With so many issues like #13217 being due to lack of FA support, I ported the FA shader to use scalar math. Perf is pretty good for cases where there are few rows (e.g. during token gen), but it will still be slower than -fa 0 for cases where -fa 0 uses KHR_coopmat.
I'd appreciate some help testing (including perf testing) on non-NVIDIA GPUs. And if anybody knows a good placeholder value for shader_core_count for Intel or how to query it, that would be good too.
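If it helps, one way to exercise just the flash attention op is through test-backend-ops; a sketch (the op filter name here is my assumption and may need adjusting for your build):
# correctness checks for the flash attention op across available backends
./build/bin/test-backend-ops test -o FLASH_ATTN_EXT
# rough per-op throughput numbers
./build/bin/test-backend-ops perf -o FLASH_ATTN_EXT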