
Misc. bug: Qwen3 30B A3B Q4_K_M loads on server but quickly dies after requesting inference through Llama.cpp web UI #13164


Closed
sidran opened this issue Apr 29, 2025 · 37 comments

Comments

@sidran

sidran commented Apr 29, 2025

Name and Version

Version (release): B5215 Windows Vulkan x64

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

No response

Command line

echo Running Qwen3 30B MoE server 12 layers 12288 context

llama-server.exe ^
--model "D:\LLMs\Qwen3-30B-A3B-Q4_K_M.gguf" ^
--gpu-layers 12 ^
--ctx-size 12288 ^
--samplers top_k;dry;min_p;temperature;typ_p;xtc ^
--top-k 40 ^
--dry-multiplier 0.5 ^
--min-p 0.00 ^
--temp 0.6 ^
--top-p 0.95 ^
--repeat-penalty 1.1

Problem description & steps to reproduce

Edit: The GGUF was downloaded from the ggml-org HF repository
(https://huggingface.co/ggml-org/Qwen3-30B-A3B-GGUF/blob/main/Qwen3-30B-A3B-Q4_K_M.gguf)

It loads and everything seems OK, but as soon as I request inference through llama.cpp's web UI, I get the error shown below.

Image

First Bad Commit

No response

Relevant log output

@Thellton

Thellton commented Apr 29, 2025

I'm getting the same issue. I did a little digging and gave Google Gemini the method in which the assertion is thrown, and based on what I understood, I believe that, as a temporary measure, using the --batch-size flag to reduce the batch size from the default 2048 tokens to 365 tokens or fewer will prevent the assertion from triggering.

Testing involved submitting 6644 tokens in a single go with various --batch-size values: everything from 2048 down to 512 failed, while 365 and below worked. I also ran several small multi-turn conversations using small batch sizes (16 to 64 tokens processed at any one time).

EDIT: I'm using the IQ3_XS quant running on an Arc A770 16GB.

@sidran
Author

sidran commented Apr 29, 2025

I'm getting the same issue. I did a little digging and gave Google Gemini the method in which the assertion is thrown, and based on what I understood, I believe that, as a temporary measure, using the --batch-size flag to reduce the batch size from the default 2048 tokens to 365 tokens or fewer will prevent the assertion from triggering.

Testing involved submitting 6644 tokens in a single go with various --batch-size values: everything from 2048 down to 512 failed, while 365 and below worked. I also ran several small multi-turn conversations using small batch sizes (16 to 64 tokens processed at any one time).

EDIT: I'm using the IQ3_XS quant running on an Arc A770 16GB.

Thanks, comrade!
I confirm that this actually works. The model seems awesome. I have 32 GB RAM and an AMD 6600 8 GB; offloading 14/48 layers to VRAM with a 12288 context length, I get ~10.7 t/s on this 30B model.
Life is good 😸
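
For reference, the working command on my side is just the one from the report with the batch-size flag added; a sketch, not the exact command (using the value from the comment above, anything at or below it seems fine):

llama-server.exe --model "D:\LLMs\Qwen3-30B-A3B-Q4_K_M.gguf" --gpu-layers 14 --ctx-size 12288 --batch-size 365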

@stduhpf
Contributor

stduhpf commented Apr 29, 2025

I can confirm. Dense Qwen3 models work fine, but the MoE crashes. I'm also using Vulkan; the crash still happens with no GPU offloading. (Edit: it doesn't happen on a CPU-only build.)

Changing the batch size is a valid workaround. It seems I can go up to a batch size of 384, but 385 crashes.
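
That cutoff lines up with the assertion bound if you assume 8 active experts per token for this model (an assumption on my part, but it matches the numbers exactly):

384 * 8 = 3072  (equal to the bound in GGML_ASSERT(nei0 * nei1 <= 3072), so it passes)
385 * 8 = 3080  (over the bound, so the assertion fires)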

@ggerganov
Member

Seems Vulkan-related. At least with Metal, both 30B-A3B and 235B-A22B work without issues.

@TheNexter

TheNexter commented Apr 29, 2025

Same issue.

srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 16384, n_keep = 0, n_prompt_tokens = 414
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 414, n_tokens = 414, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 414, n_tokens = 414
/app/ggml/src/ggml-vulkan/ggml-vulkan.cpp:5052: GGML_ASSERT(nei0 * nei1 <= 3072) failed

Startup command:

llama-server -m /Models/Qwen_Qwen3-30B-A3B-Q4_K_M.gguf --port 9999 --ctx-size 16384 --gpu-layers 12 --temp 0.6 --top-k 20 --top-p 0.95 --flash-attn

Configuration:

Ubuntu 24.04 host
Docker container for the runtime (llama-swap; I replaced the server binary with b5218)
GPU: AMD 6600 XT 8 GB
RAM: 46 GB

If you need more info, reply to my comment; I'm ready to test some weird stuff if needed.

@Ayandaftary

Same thing happens on my 6900 XT, but only with Vulkan. ROCm works fine, but Vulkan is faster.

@stan4cb

stan4cb commented Apr 29, 2025

I have the same problem; it mostly crashes on the second prompt with
GGML_ASSERT(nei0 * nei1 <= 3072) failed
Qwen3-30B-A3B-Q4_K_M.gguf or Qwen3-30B-A3B-UD-Q4_K_XL.gguf, it doesn't matter.
Running llama-server -m Qwen3-30B-A3B-UD-Q4_K_XL.gguf with or without -ngl 100 doesn't matter either.

--batch-size fixes it.

Using b5220 Vulkan with a 9070 XT.

@DD3Boh

DD3Boh commented Apr 29, 2025

Same thing happens on a 780M APU, using Vulkan.

@mlsterpr0

mlsterpr0 commented Apr 29, 2025

Isn't this similar to GLM-4-32B, which still only works for me if I use -b 8 -ub 8? Otherwise I get gibberish output... (Vulkan version, not CPU-only.)

@ZUIcat

ZUIcat commented Apr 30, 2025

Same thing happens on a 780M APU using Vulkan. +1

@Dominguero

Same with a Vega 56.

Another thing I've noticed, which I don't know if it has anything to do with this, is that the Vulkan version runs 10-20% slower than the AVX2 version.

This is something I've only seen with this model and the other Qwen3 MoE, the 235B.

@mrdevolver

mrdevolver commented Apr 30, 2025

My HW Specs:
OS: Windows 10 64bit
CPU: AMD Ryzen 7 2700X (8c / 16t)
RAM: 16 GB
GPU: Radeon RX Vega 56
VRAM: 8 GB

Tested model: Unsloth/Qwen3-30B-A3B-Q3_K_M.gguf

I have the same problem with the Vulkan version. (I haven't tried the CPU version yet.)

A couple of notes from my own observations; hopefully they will be of some use to the developers in fixing this issue:

  • Tested in LM Studio, text-generation-webui, and KoboldCpp (all based on llama.cpp). The model works in LM Studio's own UI (I haven't tested the built-in UIs of the others), but it crashes in all of these apps as soon as I call it through the llama.cpp OpenAI-compatible API server they provide (the calls are made by third-party UIs).
  • In the LM Studio UI I can use a batch size of even 4096 and it still works.
  • Perhaps most interesting: LM Studio provides a quick cURL command for testing the API server. Interestingly, when I called the same API server through that test cURL command, it did NOT crash and responded as expected.

A little anecdote from testing: the cURL command for testing the API as given by LM Studio is not directly usable in the Windows command line or in a Windows batch file (.bat). So, while I still had Qwen3-30B-A3B loaded in LM Studio, I asked the model in the UI itself to adapt the cURL command so it would work in a Windows batch script. It did a perfect job, so I was then able to test the API simply using cURL, which is how I found out it works that way. 😂
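
In case it helps anyone reproduce the cURL test, something along these lines should work from cmd or a .bat file (the port and model name here are only examples; use whatever LM Studio shows for your server and loaded model):

curl http://localhost:1234/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"qwen3-30b-a3b\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello\"}]}"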

@Marcuss2

Marcuss2 commented Apr 30, 2025

Running with: llama-server --model Qwen3-30B-A3B-UD-Q4_K_XL.gguf --n-gpu-layers 99 -ot ".ffn_.*_exps.=CPU" --ctx-size 32772 --flash-attn

It happens to me as well. Latest build b5233.
GPU: RX 6800

The ROCm and CPU-only backends don't do this; --flash-attn has no effect on the bug.

@Marcuss2

Marcuss2 commented May 1, 2025

Taking a look, the assert bound itself was changed from 2048 to 3072 in commit 751fcfc 10 months ago. Ironically, I dug into this with the help of Qwen3 30B-A3B itself: nei0 is the batch size and nei1 is the number of used experts, and Qwen3 MoE breaks new ground by using a higher number of experts per token.

I am not a developer of this project and this is my first post here, but I think this could be fixed by raising the size of row_ids in ggml/src/vulkan-shaders/mul_mm.comp along with the assert, though that will put pressure on device shared memory.

Supporting the default batch size of 2048 would require 16384 row_ids entries, which works out to 65536 bytes of shared memory; that is already the limit on my GPU (RX 6800), so it is too much. A rewrite of how this is handled might be needed.
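
Rough numbers behind that, assuming 8 active experts per token and a 4-byte u16vec2 per row_ids entry (the element type is my assumption):

2048 tokens per batch * 8 experts = 16384 row_ids entries
16384 entries * 4 bytes           = 65536 bytes of shared memory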

@sidran
Author

sidran commented May 1, 2025

@Marcuss2 I hope this gets fixed for this great model. I don't know about the internals, but it also seems slower than it needs to be due to such a small batch size.

Thanks for trying to move this forward <3

@bennmann

bennmann commented May 1, 2025

Build b5242 workaround example (16 GB VRAM):

llamacpp-b5242>llama-server.exe -m F:\models\Qwen3-30B-A3B-UD-Q3_K_XL.gguf -ngl 95 -c 16000 --ubatch-size 300 --batch-size 300

@mrdevolver

Build b5242 workaround example (16 GB VRAM):

llamacpp-b5242>llama-server.exe -m F:\models\Qwen3-30B-A3B-UD-Q3_K_XL.gguf -ngl 95 -c 16000 --ubatch-size 300 --batch-size 300

Trying to set the batch size to 300 in LM Studio seems to work for me too. The server doesn't crash that way, but naturally it also makes prompt ingestion much slower, which kind of takes the speed advantage of the MoE away. I hope we can do something to fix it. :/

@zghnwsq

zghnwsq commented May 2, 2025

A batch size of 365-380 works for me.
llama-b5257-bin-ubuntu-vulkan-x64
7940H, 24 GB RAM, 8 GB VRAM
Model: Qwen3-30B-A3B-Q4_K_M
GPU layers: 23

But yes, too slow.

@TheNexter

I pray this bug gets patched :(

The model is so good, but it's impossible to use it like this :(

@nasrally

nasrally commented May 3, 2025

Naive question: wouldn't it be an easy fix to just use 16384 instead of 3072? I have put zero effort into understanding the code; I'm just curious what it is supposed to assert, why these particular numbers, and why it is needed.

@Marcuss2

Marcuss2 commented May 3, 2025

Naive question: wouldn't it be an easy fix to just use 16384 instead of 3072? I have put zero effort into understanding the code; I'm just curious what it is supposed to assert, why these particular numbers, and why it is needed.

From what I understood, no. It would no longer fit into the shared memory of most devices.

@mrdevolver

Isn't this similar to GLM-4-32B, which still only works for me if I use -b 8 -ub 8? Otherwise I get gibberish output... (Vulkan version, not CPU-only.)

Where is this issue documented? I mean for the GLM-4-32B model.

Where is the report, and what's the status of that issue? Is it being looked at at all?

I searched the issues tab with no success.

I have the same issue with the GLM-4 model, and it all started acting up for me when the fixes for the GLM-4-32B model were merged.
Ironically, before those fixes I was able to run GLM-4-32B quants created with the latest llama.cpp and its already-fixed converter. But once LM Studio received the updated runtime containing those fixes, the model started acting up.

Running this model with a batch size of 8 is unacceptable.

@jeffbolznv
Collaborator

I was just made aware of this. I think #13326 should fix it. Did anybody see a tensor size that wouldn't fit in 4096 elements?

@soerenkampschroer

#13326 fixes the issue for me on macOS w/ Vulkan 👍

@sidran
Author

sidran commented May 10, 2025

#13326 seems to have fixed this for me. I tested going up to ~8k context length and it worked correctly.
Does anyone have anything to add before I close this?

@stan4cb

stan4cb commented May 10, 2025

#13326 fixed it 👍

@Thellton

@sidran, I'd say it's fixed. The fundamental issue was the number of tokens processed as a batch during prompt processing, and it's presently working for me with "--batch-size 2048", which is the default value, so I can simply drop that command-line flag.

@stduhpf
Contributor

stduhpf commented May 10, 2025

Seems fixed on my end too. (Though I'd advise keeping smaller batch sizes for MoE models anyway, as the performance is better.)

@sidran
Author

sidran commented May 10, 2025

Seems fixed on my end too. (Though I'd advise keeping smaller batch sizes for MoE models anyway, as the performance is better.)

What do you mean? My subjective impression is that even my puny AMD 6600 8 GB processes context much faster with the default batch of 2048 than with smaller ones.

And since we're on the subject of speed: I know this is not the place for it, but I'll post it and close the whole thing later, because I have to share something with you as an official contributor.

I was playing with tensor overriding for hours yesterday. I over-focused on speed and completely ignored the other consequences. To cut a long story short, this was my starting and ending point after a lot of back and forth, using the Vulkan backend (32 GB RAM, AMD 6600 8 GB):

start: 15/48 layers loaded to VRAM, 12288 max context, generation speed ~12 t/s (empty context)
end: 48/48 layers loaded to VRAM, 30720 max context, generation speed ~12.5 t/s (empty context)

When the batch (2048) is full, processing speed hovers around 50 t/s.

Using --override-tensor ".ffn_(down|gate|up)_exps.weight=CPU" I didn't gain speed (didn't lose any either), but I freed up about 50% of VRAM for a much larger context. This tensor juggling could probably be automated, as I'm sure you know better than I do; I'm just amazed by the potential.
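
For reference, a sketch of the kind of invocation behind those end-point numbers (same model file as in the original report; values taken from the figures above, not the exact command):

llama-server.exe --model "D:\LLMs\Qwen3-30B-A3B-Q4_K_M.gguf" --gpu-layers 48 --ctx-size 30720 --override-tensor ".ffn_(down|gate|up)_exps.weight=CPU"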

I also suspect that something around the Vulkan operations might be a bottleneck as well.

@stduhpf
Contributor

stduhpf commented May 10, 2025

Maybe --override-tensor changes things; I can offload this model entirely to my GPUs, so I haven't played around with that argument yet.

But here are the llama-bench results on my system with an RX 6800 16GB (PCIe gen3 x16) + RX 5700 XT 8GB (PCIe gen2 x4 😬), varying the batch size:

| model | size | params | backend | ngl | n_batch | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 99 | 16 | row | pp512 @ d512 | 85.83 ± 0.80 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 99 | 16 | row | tg128 @ d512 | 60.13 ± 0.61 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 99 | 32 | row | pp512 @ d512 | 138.50 ± 2.04 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 99 | 32 | row | tg128 @ d512 | 60.72 ± 0.07 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 99 | 64 | row | pp512 @ d512 | 148.52 ± 1.58 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 99 | 64 | row | tg128 @ d512 | 60.67 ± 0.05 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 99 | 128 | row | pp512 @ d512 | 168.17 ± 1.08 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 99 | 128 | row | tg128 @ d512 | 60.76 ± 0.06 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 99 | 256 | row | pp512 @ d512 | 128.38 ± 0.22 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 99 | 256 | row | tg128 @ d512 | 60.45 ± 0.27 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 99 | 512 | row | pp512 @ d512 | 81.01 ± 0.10 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 99 | 512 | row | tg128 @ d512 | 60.64 ± 0.14 |

@Mushoz

Mushoz commented May 10, 2025

My peak prompt processing throughput is also at a batch size of 128, with a single 7900 XTX.

@sidran
Author

sidran commented May 10, 2025

@stduhpf @Mushoz
Hmm, our observations diverge on this. I just ran a test using exactly the same context:

Image

But I also just noticed another thing. When I use batch 128 my VRAM usage is significantly lower:

Image

Since I know almost nothing about the internals, this remains just an observation.

@stduhpf
Contributor

stduhpf commented May 10, 2025

With --override-tensor ".ffn_(down|gate|up)_exps.weight=CPU" and -sm none (main GPU only), peak pp is at a batch size of 256.

| model | size | params | backend | ngl | n_batch | sm | ot | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 99 | 64 | none | .ffn_.*_exps.weight=CPU | pp512 @ d512 | 35.72 ± 0.09 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 99 | 128 | none | .ffn_.*_exps.weight=CPU | pp512 @ d512 | 62.93 ± 0.09 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 99 | 256 | none | .ffn_.*_exps.weight=CPU | pp512 @ d512 | 83.94 ± 0.09 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 99 | 512 | none | .ffn_.*_exps.weight=CPU | pp512 @ d512 | 77.46 ± 0.10 |

Same with partial offloading (--ngl 15)

| model | size | params | backend | ngl | n_batch | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 15 | 64 | none | pp512 @ d512 | 44.90 ± 0.12 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 15 | 128 | none | pp512 @ d512 | 75.11 ± 0.18 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 15 | 256 | none | pp512 @ d512 | 94.14 ± 0.20 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.25 GiB | 30.53 B | Vulkan,RPC | 15 | 512 | none | pp512 @ d512 | 82.00 ± 0.24 |

@sidran
Author

sidran commented May 10, 2025

But did you try 2048?

Maybe the difference is due to quantization? (IQ4_XS vs Q4_K_XL)

Edit: And another thing: maybe batch 128 has the best raw speed, but a sequence of them ends up processing slower than 2048-token chunks? Just like you got top speed with 256. I'm not sure what programmers call this, but there must be some overhead from switching between many 128-token chunks compared to a more streamlined 2048 (probably fused into 512-token parts internally).

@stduhpf
Contributor

stduhpf commented May 10, 2025

But did you try 2048?

Yes, it's within the margin of error of 512 (even with -p 2048).

Maybe the difference is due to quantization? (IQ4_XS vs Q4_K_XL)

This could actually be it; I can't test it right now.

@sidran
Author

sidran commented May 10, 2025

@stduhpf If you've seen my edit and have no comment, I'll close this report.

@stduhpf
Contributor

stduhpf commented May 10, 2025

Maybe batch 128 has the best raw speed, but a sequence of them ends up processing slower than 2048-token chunks? Just like you got top speed with 256. I'm not sure what programmers call this, but there must be some overhead from switching between many 128-token chunks compared to a more streamlined 2048 (probably fused into 512-token parts internally).

I don't think so. llama-bench processes the same number of tokens regardless of batch size (512 tokens by default), but I still get peak performance at a batch size of 128 even when processing 2048 tokens.
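
For anyone who wants to reproduce the sweep, a llama-bench invocation along these lines should work (the model filename is just an example, and I'm leaving out the additional flag that produced the depth column in the tables above):

llama-bench -m Qwen3-30B-A3B-IQ4_XS.gguf -ngl 99 -b 16,32,64,128,256,512 -p 512,2048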

@sidran sidran closed this as completed May 10, 2025