
Misc. bug: Qwen3 30B A3B Q4_K_M loads on server but quickly dies after requesting inference through Llama.cpp web UI #13164


Closed
sidran opened this issue Apr 29, 2025 · 37 comments

Comments

@sidran

sidran commented Apr 29, 2025

Name and Version

Version (release): B5215 Windows Vulkan x64

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

No response

Command line

echo Running Qwen3 30B MoE server 12 layers 12288 context

llama-server.exe ^
--model "D:\LLMs\Qwen3-30B-A3B-Q4_K_M.gguf" ^
--gpu-layers 12 ^
--ctx-size 12288 ^
--samplers top_k;dry;min_p;temperature;typ_p;xtc ^
--top-k 40 ^
--dry-multiplier 0.5 ^
--min-p 0.00 ^
--temp 0.6 ^
--top-p 0.95 ^
--repeat-penalty 1.1

Problem description & steps to reproduce

Edit: The GGUF was downloaded from the ggml-org HF repository
(https://huggingface.co/ggml-org/Qwen3-30B-A3B-GGUF/blob/main/Qwen3-30B-A3B-Q4_K_M.gguf)

It loads and everything seems OK, but as soon as I request inference through llama.cpp's web UI, I get the error shown below.

Image

First Bad Commit

No response

Relevant log output

@Thellton

Thellton commented Apr 29, 2025

I'm getting the same issue. I did a little digging and gave Google Gemini the method in which the assertion is thrown, and based on what I understood, I believe that, as a temporary measure, using the --batch-size flag to reduce the batch size from the default 2048 tokens to 365 tokens or fewer will prevent the assertion from triggering.

Testing involved submitting 6644 tokens in a single go with various --batch-size values: everything from 2048 down to 512 failed, while 365 and below worked. I also ran several small multi-turn conversations using small batch sizes (16 to 64 tokens processed at any one time).

EDIT: I'm using the IQ3_XS quant running on an Arc A770 16GB.

@sidran
Author

sidran commented Apr 29, 2025

I'm getting the same issue. I did a little digging and gave Google Gemini the method in which the assertion is thrown, and based on what I understood, I believe that, as a temporary measure, using the --batch-size flag to reduce the batch size from the default 2048 tokens to 365 tokens or fewer will prevent the assertion from triggering.

Testing involved submitting 6644 tokens in a single go with various --batch-size values: everything from 2048 down to 512 failed, while 365 and below worked. I also ran several small multi-turn conversations using small batch sizes (16 to 64 tokens processed at any one time).

EDIT: I'm using the IQ3_XS quant running on an Arc A770 16GB.

Thanks, comrade!
I confirm that this actually works. The model seems awesome. I have 32 GB RAM and an AMD 6600 8 GB; offloading 14/48 layers to VRAM with a 12288 context length, I get ~10.7 t/s on this 30B model.
Life is good 😸
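
For reference, the working command on my side is just the one from the report with the batch-size flag added; a sketch, not the exact command (using the value from the comment above, anything at or below it seems fine):

llama-server.exe --model "D:\LLMs\Qwen3-30B-A3B-Q4_K_M.gguf" --gpu-layers 14 --ctx-size 12288 --batch-size 365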

@stduhpf
Contributor

stduhpf commented Apr 29, 2025

I can confirm. Dense Qwen3 models work fine, but the MoE crashes. I'm also using Vulkan; the crash still happens with no GPU offloading. (Edit: it doesn't happen on a CPU-only build.)

Changing the batch size is a valid workaround. It seems I can go up to a batch size of 384, but 385 crashes.
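
That cutoff lines up with the assertion bound if you assume 8 active experts per token for this model (an assumption on my part, but it matches the numbers exactly):

384 * 8 = 3072  (equal to the bound in GGML_ASSERT(nei0 * nei1 <= 3072), so it passes)
385 * 8 = 3080  (over the bound, so the assertion fires)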

@ggerganov
Member

Seems Vulkan-related. At least with Metal, both 30B-A3B and 235B-A22B work without issues.

@TheNexter

TheNexter commented Apr 29, 2025

Same issue.

srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 16384, n_keep = 0, n_prompt_tokens = 414
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 414, n_tokens = 414, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 414, n_tokens = 414
/app/ggml/src/ggml-vulkan/ggml-vulkan.cpp:5052: GGML_ASSERT(nei0 * nei1 <= 3072) failed

Startup command:

llama-server -m /Models/Qwen_Qwen3-30B-A3B-Q4_K_M.gguf --port 9999 --ctx-size 16384 --gpu-layers 12 --temp 0.6 --top-k 20 --top-p 0.95 --flash-attn

Configuration:

Ubuntu 24.04 host
Docker container for the runtime (llama-swap; I replaced the server binary with b5218)
GPU: AMD 6600 XT 8 GB
RAM: 46 GB

If you need more info, reply to my comment; I'm ready to test some weird stuff if needed.

@Ayandaftary

Same thing happens on my 6900 XT, but only with Vulkan. ROCm works fine, but Vulkan is faster.

@stan4cb

stan4cb commented Apr 29, 2025

I have the same problem; it mostly crashes on the second prompt with
GGML_ASSERT(nei0 * nei1 <= 3072) failed
Qwen3-30B-A3B-Q4_K_M.gguf or Qwen3-30B-A3B-UD-Q4_K_XL.gguf, it doesn't matter.
Running llama-server -m Qwen3-30B-A3B-UD-Q4_K_XL.gguf with or without -ngl 100 doesn't matter either.

--batch-size fixes it.

Using b5220 Vulkan with a 9070 XT.

@DD3Boh

DD3Boh commented Apr 29, 2025

Same thing happens on a 780M APU, using Vulkan.

@mlsterpr0

mlsterpr0 commented Apr 29, 2025

Isn't this similar to GLM-4-32B, which still only works for me if I use -b 8 -ub 8? Otherwise I get gibberish output... (Vulkan version, not CPU-only.)

@ZUIcat

ZUIcat commented Apr 30, 2025

Same thing happens on a 780M APU using Vulkan. +1

@Dominguero

Same with a Vega 56.

Another thing I've noticed, which I don't know if it has anything to do with this, is that the Vulkan version runs 10-20% slower than the AVX2 version.

This is something I've only seen with this model and the other Qwen3 MoE, the 235B.

@mrdevolver

mrdevolver commented Apr 30, 2025

My HW Specs:
OS: Windows 10 64bit
CPU: AMD Ryzen 7 2700X (8c / 16t)
RAM: 16 GB
GPU: Radeon RX Vega 56
VRAM: 8 GB

Tested model: Unsloth/Qwen3-30B-A3B-Q3_K_M.gguf

I have the same problem with the Vulkan version. (I haven't tried the CPU version yet.)

A couple of notes from my own observations; hopefully they will be of some use to the developers in fixing this issue:

  • Tested in LM Studio, text-generation-webui, and KoboldCpp (all based on llama.cpp). The model works in LM Studio's own UI (I haven't tested the built-in UIs of the others), but it crashes in all of these apps as soon as I call it through the llama.cpp OpenAI-compatible API server they provide (the calls are made by third-party UIs).
  • In the LM Studio UI I can use a batch size of even 4096 and it still works.
  • Perhaps most interesting: LM Studio provides a quick cURL command for testing the API server. Interestingly, when I called the same API server through that test cURL command, it did NOT crash and responded as expected.

A little anecdote from testing: the cURL command for testing the API as given by LM Studio is not directly usable in the Windows command line or in a Windows batch file (.bat). So, while I still had Qwen3-30B-A3B loaded in LM Studio, I asked the model in the UI itself to adapt the cURL command so it would work in a Windows batch script. It did a perfect job, so I was then able to test the API simply using cURL, which is how I found out it works that way. 😂
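
In case it helps anyone reproduce the cURL test, something along these lines should work from cmd or a .bat file (the port and model name here are only examples; use whatever LM Studio shows for your server and loaded model):

curl http://localhost:1234/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"qwen3-30b-a3b\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello\"}]}"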

@Marcuss2

Marcuss2 commented Apr 30, 2025

Running with: llama-server --model Qwen3-30B-A3B-UD-Q4_K_XL.gguf --n-gpu-layers 99 -ot ".ffn_.*_exps.=CPU" --ctx-size 32772 --flash-attn

It happens to me as well. Latest build b5233.
GPU: RX 6800

The ROCm and CPU-only backends don't do this; --flash-attn has no effect on the bug.

@Marcuss2

Marcuss2 commented May 1, 2025

Taking a look, the assert bound itself was changed from 2048 to 3072 in commit 751fcfc 10 months ago. Ironically, I dug into this with the help of Qwen3 30B-A3B itself: nei0 is the batch size and nei1 is the number of used experts, and Qwen3 MoE breaks new ground by using a higher number of experts per token.

I am not a developer of this project and this is my first post here, but I think this could be fixed by raising the size of row_ids in ggml/src/vulkan-shaders/mul_mm.comp along with the assert, though that will put pressure on device shared memory.

Supporting the default batch size of 2048 would require 16384 row_ids entries, which works out to 65536 bytes of shared memory; that is already the limit on my GPU (RX 6800), so it is too much. A rewrite of how this is handled might be needed.
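
Rough numbers behind that, assuming 8 active experts per token and a 4-byte u16vec2 per row_ids entry (the element type is my assumption):

2048 tokens per batch * 8 experts = 16384 row_ids entries
16384 entries * 4 bytes           = 65536 bytes of shared memory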

@sidran
Author

sidran commented May 1, 2025

@Marcuss2 I hope this gets fixed for this great model. I don't know about the internals, but it also seems slower than it needs to be due to such a small batch size.

Thanks for trying to move this forward <3

@bennmann

bennmann commented May 1, 2025

Build b5242 workaround example (16 GB VRAM):

llamacpp-b5242>llama-server.exe -m F:\models\Qwen3-30B-A3B-UD-Q3_K_XL.gguf -ngl 95 -c 16000 --ubatch-size 300 --batch-size 300

@mrdevolver

Build b5242 workaround example (16 GB VRAM):

llamacpp-b5242>llama-server.exe -m F:\models\Qwen3-30B-A3B-UD-Q3_K_XL.gguf -ngl 95 -c 16000 --ubatch-size 300 --batch-size 300

Trying to set the batch size to 300 in LM Studio seems to work for me too. The server doesn't crash that way, but naturally it also makes prompt ingestion much slower, which kind of takes the speed advantage of the MoE away. I hope we can do something to fix it. :/

@zghnwsq

zghnwsq commented May 2, 2025

A batch size of 365-380 works for me.
llama-b5257-bin-ubuntu-vulkan-x64
7940H, 24 GB RAM, 8 GB VRAM
Model: Qwen3-30B-A3B-Q4_K_M
GPU layers: 23

But yes, too slow.

@TheNexter

I pray this bug gets patched :(

The model is so good, but it's impossible to use it like this :(

@nasrally

nasrally commented May 3, 2025

Naive question: wouldn't it be an easy fix to just use 16384 instead of 3072? I have put zero effort into understanding the code; I'm just curious what it is supposed to assert, why these particular numbers, and why it is needed.

@Marcuss2

Marcuss2 commented May 3, 2025

Naive question: wouldn't it be an easy fix to just use 16384 instead of 3072? I have put zero effort into understanding the code; I'm just curious what it is supposed to assert, why these particular numbers, and why it is needed.

From what I understood, no. It would no longer fit into the shared memory of most devices.

@mrdevolver

Isn't this similar to GLM-4-32B, which still only works for me if I use -b 8 -ub 8? Otherwise I get gibberish output... (Vulkan version, not CPU-only.)

Where is this issue documented? I mean for the GLM-4-32B model.

Where is the report, and what's the status of that issue? Is it being looked at at all?

I searched the issues tab with no success.

I have the same issue with the GLM-4 model, and it all started acting up for me when the fixes for the GLM-4-32B model were merged.
Ironically, before those fixes I was able to run GLM-4-32B quants created with the latest llama.cpp and its already-fixed converter. But once LM Studio received the updated runtime containing those fixes, the model started acting up.

Running this model with a batch size of 8 is unacceptable.

@jeffbolznv
Collaborator

I was just made aware of this. I think #13326 should fix it. Did anybody see a tensor size that wouldn't fit in 4096 elements?

@soerenkampschroer

#13326 fixes the issue for me on macOS w/ Vulkan 👍

@sidran
Author

sidran commented May 10, 2025

#13326 seems to have fixed this for me. I tested going up to ~8k context length and it worked correctly.
Does anyone have anything to add before I close this?

@stan4cb

stan4cb commented May 10, 2025

#13326 fixed it 👍

@Thellton

@sidran, I'd say it's fixed. The fundamental issue was the number of tokens processed as a batch during prompt processing, and it's presently working for me with "--batch-size 2048", which is the default value, so I can simply drop that command-line flag.

@stduhpf
Contributor

stduhpf commented May 10, 2025

Seems fixed on my end too. (Though I'd advise keeping smaller batch sizes for MoE models anyway, as the performance is better.)

@sidran
Author

sidran commented May 10, 2025

Seems fixed on my end too. (Though I'd advise keeping smaller batch sizes for MoE models anyway, as the performance is better.)

What do you mean? My subjective impression is that even my puny AMD 6600 8 GB processes context much faster with the default batch of 2048 than with smaller ones.

And since we're on the subject of speed: I know this is not the place for it, but I'll post it and close the whole thing later, because I have to share something with you as an official contributor.

I was playing with tensor overriding for hours yesterday. I over-focused on speed and completely ignored the other consequences. To cut a long story short, this was my starting and ending point after a lot of back and forth, using the Vulkan backend (32 GB RAM, AMD 6600 8 GB):

start: 15/48 layers loaded to VRAM, 12288 max context, generation speed ~12 t/s (empty context)
end: 48/48 layers loaded to VRAM, 30720 max context, generation speed ~12.5 t/s (empty context)

When the batch (2048) is full, processing speed hovers around 50 t/s.

Using --override-tensor ".ffn_(down|gate|up)_exps.weight=CPU" I didn't gain speed (didn't lose any either), but I freed up about 50% of VRAM for a much larger context. This tensor juggling could probably be automated, as I'm sure you know better than I do; I'm just amazed by the potential.
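
For reference, a sketch of the kind of invocation behind those end-point numbers (same model file as in the original report; values taken from the figures above, not the exact command):

llama-server.exe --model "D:\LLMs\Qwen3-30B-A3B-Q4_K_M.gguf" --gpu-layers 48 --ctx-size 30720 --override-tensor ".ffn_(down|gate|up)_exps.weight=CPU"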

I also suspect that something around the Vulkan operations might be a bottleneck as well.

@stduhpf
Contributor

stduhpf commented May 10, 2025

Maybe --override-tensor changes things; I can offload this model entirely to my GPUs, so I haven't played around with that argument yet.

But here are the llama-bench results on my system with an RX 6800 16GB (PCIe gen3 x16) + RX 5700 XT 8GB (PCIe gen2 x4 😬), varying the batch size:

| model | size | params | backend | ngl | n_batch | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 99 | 16 | row | pp512 @ d512 | 85.83 ± 0.80 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 99 | 16 | row | tg128 @ d512 | 60.13 ± 0.61 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 99 | 32 | row | pp512 @ d512 | 138.50 ± 2.04 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 99 | 32 | row | tg128 @ d512 | 60.72 ± 0.07 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 99 | 64 | row | pp512 @ d512 | 148.52 ± 1.58 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 99 | 64 | row | tg128 @ d512 | 60.67 ± 0.05 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 99 | 128 | row | pp512 @ d512 | 168.17 ± 1.08 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 99 | 128 | row | tg128 @ d512 | 60.76 ± 0.06 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 99 | 256 | row | pp512 @ d512 | 128.38 ± 0.22 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 99 | 256 | row | tg128 @ d512 | 60.45 ± 0.27 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 99 | 512 | row | pp512 @ d512 | 81.01 ± 0.10 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 99 | 512 | row | tg128 @ d512 | 60.64 ± 0.14 |

@Mushoz

Mushoz commented May 10, 2025

My peak prompt processing throughput is also at a batch size of 128, with a single 7900 XTX.

@sidran
Author

sidran commented May 10, 2025

@stduhpf @Mushoz
Hmm, our observations diverge on this. I just ran a test using exactly the same context:

Image

But I also just noticed another thing. When I use batch 128 my VRAM usage is significantly lower:

Image

Since I know almost nothing about the internals, this remains just an observation.

@stduhpf
Contributor

stduhpf commented May 10, 2025

With --override-tensor ".ffn_(down|gate|up)_exps.weight=CPU" and -sm none (main GPU only), peak pp is at a batch size of 256.

| model | size | params | backend | ngl | n_batch | sm | ot | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 99 | 64 | none | .ffn_.*_exps.weight=CPU | pp512 @ d512 | 35.72 ± 0.09 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 99 | 128 | none | .ffn_.*_exps.weight=CPU | pp512 @ d512 | 62.93 ± 0.09 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 99 | 256 | none | .ffn_.*_exps.weight=CPU | pp512 @ d512 | 83.94 ± 0.09 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 99 | 512 | none | .ffn_.*_exps.weight=CPU | pp512 @ d512 | 77.46 ± 0.10 |

Same with partial offloading (--ngl 15)

| model | size | params | backend | ngl | n_batch | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 15 | 64 | none | pp512 @ d512 | 44.90 ± 0.12 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 15 | 128 | none | pp512 @ d512 | 75.11 ± 0.18 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 15 | 256 | none | pp512 @ d512 | 94.14 ± 0.20 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 15 | 512 | none | pp512 @ d512 | 82.00 ± 0.24 |

@sidran
Author

sidran commented May 10, 2025

But did you try 2048?

Maybe the difference is due to quantization? (IQ4_XS vs Q4_K_XL)

Edit: And another thing: maybe batch 128 has the best raw speed, but a sequence of them ends up processing slower than 2048-token chunks? Just like you got top speed with 256. I'm not sure what programmers call this, but there must be some overhead from switching between many 128-token chunks compared to a more streamlined 2048 (probably fused into 512-token parts internally).

@stduhpf
Contributor

stduhpf commented May 10, 2025

But did you try 2048?

Yes, it's within the margin of error of 512 (even with -p 2048).

Maybe the difference is due to quantization? (IQ4_XS vs Q4_K_XL)

This could actually be it; I can't test it right now.

@sidran
Author

sidran commented May 10, 2025

@stduhpf If you've seen my edit and have no comment, I'll close this report.

@stduhpf
Contributor

stduhpf commented May 10, 2025

Maybe batch 128 has the best raw speed, but a sequence of them ends up processing slower than 2048-token chunks? Just like you got top speed with 256. I'm not sure what programmers call this, but there must be some overhead from switching between many 128-token chunks compared to a more streamlined 2048 (probably fused into 512-token parts internally).

I don't think so. llama-bench processes the same number of tokens regardless of batch size (512 tokens by default), but I still get peak performance at a batch size of 128 even when processing 2048 tokens.
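
For anyone who wants to reproduce the sweep, a llama-bench invocation along these lines should work (the model filename is just an example, and I'm leaving out the additional flag that produced the depth column in the tables above):

llama-bench -m Qwen3-30B-A3B-IQ4_XS.gguf -ngl 99 -b 16,32,64,128,256,512 -p 512,2048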

@sidran sidran closed this as completed May 10, 2025