Misc. bug: Qwen3 30B A3B Q4_K_M loads on server but quickly dies after requesting inference through Llama.cpp web UI #13164
Comments
I'm getting the same issue. I did a little digging and gave Google Gemini the method the assertion is thrown in, and based on what I understood, I believe that as a temporary measure, using the --batch-size flag to reduce the batch size from the default 2048 tokens down to 365 tokens or fewer prevents the assertion from occurring. Testing involved submitting 6644 tokens in a single go with various --batch-size values: everything from 2048 down to 512 failed, while 365 and below worked, as did several small multi-turn conversations using small batch sizes (16 to 64 tokens processed at any one time). EDIT: I'm using the IQ3_XS quant running on an Arc A770 16GB.
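A minimal sketch of that workaround (model path, -ngl, and context size are placeholder values, not taken from this thread):

```
rem Hypothetical example: cap both the logical and physical batch size below
rem the point where the Vulkan assert fires (the default for --batch-size is 2048).
llama-server.exe -m Qwen3-30B-A3B-Q4_K_M.gguf -ngl 99 -c 8192 --batch-size 364 --ubatch-size 364
```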
Thanks, comrade!
I can confirm. Dense Qwen3 models work fine, but the MoE crashes. I'm also using Vulkan, and the crash still happens with no GPU offloading (edit: it doesn't happen on a CPU-only build). Changing the batch size is a valid workaround. It seems I can go up to a batch size of 384, but 385 crashes.
This seems Vulkan-related. At least with Metal, both 30B-A3B and 235B-A22B work without issues.
Same issue.
Command line start:
Configuration:
If you need more info, reply to my comment; I'm ready to test some weird stuff if needed.
Same thing happens on my 6900 XT, but only with Vulkan; ROCm works fine, but Vulkan is faster.
I have the same problem, mostly a crash at the second prompt, which --batch-size fixes. Using b5220 Vulkan with a 9070 XT.
Same thing happens on a 780M APU, using Vulkan.
Isn't it similar to GLM-4-32B, which still only works if I use
Same thing happens on a 780M APU, using Vulkan, +1
Same with a Vega 56. Another thing I've noticed, which I don't know whether it has anything to do with it, is that the Vulkan version runs 10%-20% slower than the AVX2 version. This is something I've only seen with this model and the other Qwen 235B MoE.
My HW specs: Tested model: Unsloth/Qwen3-30B-A3B-Q3_K_M.gguf. I have the same problem with the Vulkan version (I haven't tried the CPU version yet). A couple of notes from my own observations, hopefully they will be of some use to the developers for fixing this issue:
A little anecdote from testing: the cURL command for testing the API as given by LM Studio is not ready for use in the Windows command line or in a Windows batch file (.bat), so while I still had Qwen3-30B-A3B loaded in LM Studio, I asked the model in the UI itself to change the cURL command so it would work in a Windows batch script. It did a perfect job, so I was then able to test the API simply using cURL, which is how I found out that it works that way. 😂
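For illustration, a sketch of what such a Windows-friendly cURL call might look like (endpoint, port, and model name are assumptions; LM Studio's OpenAI-compatible server listens on localhost:1234 by default):

```
@echo off
rem Hypothetical example: inner double quotes in the JSON body are escaped
rem as \" so the command survives cmd.exe / .bat parsing.
curl http://localhost:1234/v1/chat/completions ^
  -H "Content-Type: application/json" ^
  -d "{\"model\": \"qwen3-30b-a3b\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello\"}]}"
```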
Running the latest build 5233, it does happen to me as well. The ROCm and CPU-only backends don't do that, and --flash-attn has no effect on the bug.
Taking a look, the assert bound itself was changed from 2048 to 3072 in commit 751fcfc 10 months ago. Ironically, I tracked it down with the help of Qwen3 30B-A3B. I am not a developer of this project; this is my first post here. I think this could be fixed by raising that limit, but it would require 16384 to support the 2048 batch size, which comes to 65536 in shared memory, the limit for my GPU (RX 6800). That is too much, so a rewrite of how this is handled might be needed.
@Marcuss2 I hope this gets fixed for this great model. I don't know about the internals, but it also seems slower than it needs to be due to such a small batch size. Thanks for trying to move this forward <3
Build b5242 workaround example (16GB VRAM):
llamacpp-b5242>llama-server.exe -m F:\models\Qwen3-30B-A3B-UD-Q3_K_XL.gguf -ngl 95 -c 16000 --ubatch-size 300 --batch-size 300
Trying to set the batch size to 300 in LM Studio seems to work for me too. The server doesn't crash that way, but naturally it also makes prompt ingestion much slower, which takes away much of the speed advantage of the MoE. I hope we can do something to fix it. :/
A batch size of 365-380 works for me. But yes, it's too slow.
I pray this will patch this bug :( The model is so good, but it's impossible to use it :(
Naive question: wouldn't it be an easy fix to just use 16384 instead of 3072? I have put zero effort into understanding the code; I'm just curious what it is supposed to assert, why those numbers, and why it is needed.
From what I understood, no. It would no longer fit into the shared memory of most devices.
Where is this issue documented? I mean for the GLM-4-32B model: where is the report, and what's the status of the issue? Is it being looked at at all? I searched the issues tab with no success. I have the same issue with the GLM-4 model, and it all started acting up for me when the fixes for the GLM-4-32B model were merged. Running this model with a batch size of 8 is unacceptable.
I was just made aware of this. I think #13326 should fix it. Did anybody see a tensor size that wouldn't fit in 4096 elements? |
#13326 fixes the issue for me on macOS w/ Vulkan 👍 |
#13326 seems to have fixed this for me. I tested going up to ~8k context length and it worked correctly. |
#13326 fixed it 👍 |
@sidran, I'd say it's fixed. The fundamental issue causing it was the number of tokens processed as a batch during prompt processing, and it's presently working for me with "--batch-size 2048", which is the default value, so I can simply drop that command-line flag.
Seems fixed on my end too (though I advise keeping small batch sizes for MoE models anyway, as the performance is better).
What do you mean? My subjective impression is that even my puny AMD 6600 8GB destroys context processing with the default 2048 batch compared to smaller ones.

And since we are on the subject of speed: I know this is not the place for it, but I'll post it and later close the whole thing, since I have to share something with you as an official contributor. I was playing with tensor overriding for hours yesterday. I over-focused on speed and completely ignored the other consequences. To cut a long story short, this was my starting and ending point, after a lot of back and forth, using the Vulkan backend (32GB RAM, AMD 6600 8GB):

Start: 15/48 layers loaded to VRAM, 12288 max context, generation speed ~12 t/s (empty context). When the batch (2048) is full, processing speed hovers around 50 t/s.

Using --override-tensor ".ffn_(down|gate|up)_exps.weight=CPU" I didn't gain speed (didn't lose any either), but I got 50% of VRAM freed for a much larger context. This tensor juggling could be automated, as I'm sure you know better than me; I'm just amazed by the potential. I also suspect that something around Vulkan operations might be bottlenecking as well.
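For context, a sketch of the kind of invocation being described (model path, -ngl, and context size are hypothetical; the regex keeps the MoE expert FFN tensors in system RAM while everything else is offloaded):

```
rem Hypothetical example: offload all layers, but override the expert FFN
rem tensors (ffn_down/gate/up_exps) so they stay in the CPU buffer type.
llama-server.exe -m Qwen3-30B-A3B-Q4_K_M.gguf -ngl 99 -c 24576 ^
  --override-tensor ".ffn_(down|gate|up)_exps.weight=CPU"
```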
Maybe. But here are the llama-bench results on my system with an RX 6800 16GB (PCIe Gen3 x16) + RX 5700 XT 8GB (PCIe Gen2 x4 😬), varying the batch size:
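As a sketch, that kind of batch-size sweep can be reproduced with llama-bench's comma-separated parameter lists (model path is a placeholder):

```
rem Hypothetical example: -b takes a comma-separated list of batch sizes,
rem -p sets the prompt length to process, and -n 0 skips text generation.
llama-bench.exe -m Qwen3-30B-A3B-Q4_K_M.gguf -p 512 -n 0 -b 128,256,512,1024,2048
```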
My peak prompt processing throughput is also at a batch size of 128, with a single 7900 XTX.
With --override-tensor ".ffn_(down|gate|up)_exps.weight=CPU" and -sm none (main GPU only), peak prompt processing is at a batch size of 256.
Same with partial offloading (-ngl 15).
|
But did you try 2048? Maybe the difference is due to quantization (IQ4_XS vs Q4_K_XL)?

Edit: And another thing: maybe batch 128 has the raw speed, but a long run of them processes more slowly than 2048-token chunks? Just like you got top speed with 256. I'm not sure what that's called among programmers, but there must be some overhead from switching between many 128-token chunks compared to a maybe more streamlined 2048 (probably fused into 512-token parts internally, physically).
Yes, it's within the margin of error of 512 (even with -p 2048).
This could actually be it; I can't test it right now.
@stduhpf If you've seen my edit and have no comment, I'll close this report.
I don't think so. llama-bench processes the same number of tokens regardless of batch size (512 tokens by default), but I still get peak performance at a batch size of 128 even when processing 2048 tokens.
Name and Version
Version (release): B5215 Windows Vulkan x64
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
No response
Command line
Problem description & steps to reproduce
Edit: GGUF was downloaded from ggml's HF repository
(https://huggingface.co/ggml-org/Qwen3-30B-A3B-GGUF/blob/main/Qwen3-30B-A3B-Q4_K_M.gguf)
It loads and everything seems OK, but as soon as I request inference through llama.cpp's web UI, I get this error:
First Bad Commit
No response
Relevant log output