Eval bug: GGML_ASSERT(q_to_vec_dot && "fattn: unsupported K-type") failed with Vulkan #12815
Comments
The crash comes from the CPU backend; apparently the combination of a quantized K cache and flash attention is unsupported there. It fell back to the CPU because flash attention on Vulkan is only supported on Nvidia GPUs with a beta driver. Without flash attention, only K-cache quantization is supported, so maybe give it a try without it.
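Based on the suggestion above, a possible workaround is to launch the server without flash attention and quantize only the K cache. This is a sketch assuming llama.cpp's standard `llama-server` flags (`-ctk`/`--cache-type-k`, `-fa`/`--flash-attn`); the model path is a placeholder.

```shell
# Workaround sketch: omit -fa / --flash-attn and quantize only the K cache.
# model.gguf is a placeholder path; flags assumed from llama-server --help.
llama-server -m model.gguf \
  --cache-type-k q8_0
# Do not combine --flash-attn with a quantized K cache on this setup,
# and leave --cache-type-v at its default (f16).
```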
Hm, that's unexpected. The CPU supports FA with quantized types - it is the reference implementation. I am not able to reproduce this assert on my Mac after forcing
@0cc4m oh, I didn't know that. I thought the latest changes to FA in the Vulkan backend made it work for AMD as well. I've just noticed that CPU usage also rises when using FA, so yeah, it looks like it uses the CPU fallback indeed. Using just @ggerganov here's the log before the crash. I run the server with
Ok, thanks, I can reproduce the bug.
@deiteris Could you test if branch #12825 works? Btw, thanks for testing and reporting your usage of
@ggerganov looking good so far! I've checked on a few large projects and don't see crashes anymore with the same parameters. Thanks! Sure, you're welcome :) I'll see if there's anything else, but this is the only major issue I've seen for a while.
Name and Version
version: 5061 (916c83b)
built with MSVC 19.38.33134.0 for x64
Operating systems
Windows
GGML backends
Vulkan
Hardware
Ryzen 7 5800H + AMD Radeon RX 6600M 8GB
Models
Snowflake Arctic Embed L v2.0 Q8_0 (https://huggingface.co/Casual-Autopsy/snowflake-arctic-embed-l-v2.0-gguf/tree/main)
BGE Reranker v2 M3 Q8_0 (https://huggingface.co/gpustack/bge-reranker-v2-m3-GGUF/tree/main)
Problem description & steps to reproduce
When running an embedding model, the llama.cpp server randomly crashes with
GGML_ASSERT(q_to_vec_dot && "fattn: unsupported K-type") failed
- sometimes after the first task, sometimes after several tasks.
First Bad Commit
No response
Relevant log output