I'm not sure if this qualifies as a bug, but I noticed something unusual.
When using the Qwen3-30B-A3B-Q4_K_M.gguf model, if /no_think is enabled in the first query, the second response becomes significantly slower. The llama-server web UI shows that the second response's Prompt Tokens are nearly equal to the sum of the first response and the second query.
With /no_think, the model still generates the opening and closing think tags in the first response. These tags are not sent back with the second query, so the first response has to be reprocessed.
You can fix this by adding --cache-reuse 256 to your llama-server command:
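Roughly what happens on the wire against the OpenAI-compatible /v1/chat/completions endpoint (the prompts, port, and JSON bodies below are illustrative, not the exact requests the web UI makes):

```sh
# First request: /no_think in the user turn; the model still emits empty <think></think> tags.
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "messages": [
    {"role": "user", "content": "Hello /no_think"}
  ]
}'
# Reply content looks roughly like: "<think>\n\n</think>\n\nHi! How can I help?"

# Second request: the history is resent with the think tags stripped, so the tokens
# no longer match what is in the KV cache and the whole first turn is reprocessed.
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "messages": [
    {"role": "user", "content": "Hello /no_think"},
    {"role": "assistant", "content": "Hi! How can I help?"},
    {"role": "user", "content": "And a follow-up question"}
  ]
}'
```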
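For example (the model path and other options here are placeholders; the relevant addition is --cache-reuse 256):

```sh
llama-server -m Qwen_Qwen3-30B-A3B-Q4_K_M.gguf --cache-reuse 256
```

--cache-reuse sets the minimum chunk size the server will attempt to reuse from the KV cache via shifting, which lets it avoid reprocessing unchanged parts of the prompt.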
Yes, as mentioned above, without --cache-reuse 256 it runs slowly regardless of /no_think. But with --cache-reuse 256 it is fast when /no_think is not used, while with /no_think it remains slow.
Sorry, I missed that you already used --cache-reuse. In that case, this is most likely another instance of the different tokenization bug: #11970 (comment)
If you could do some extra analysis by printing the tokens and confirming that this is the same issue, it would be helpful.
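One way to do that (just a sketch; this assumes the llama-tokenize tool that ships with the same llama.cpp build) is to tokenize the first response text exactly as the web UI resends it and compare the resulting tokens with the prompt tokens the server processes for the second request:

```sh
# Illustrative only: the prompt string is a placeholder for the actual first response
# as it is resent by the web UI (i.e. without the <think></think> tags).
llama-tokenize -m Qwen_Qwen3-30B-A3B-Q4_K_M.gguf -p "Hi! How can I help?"
```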
Name and Version
build: 5335 (d891942) with MSVC 19.43.34810.0 for x64
Operating systems
Windows
GGML backends
Vulkan
Hardware
AMD 8840u (780m)
Models
Qwen3-30B-A3B-Q4_K_M.gguf
https://huggingface.co/bartowski/Qwen_Qwen3-30B-A3B-GGUF/blob/main/Qwen_Qwen3-30B-A3B-Q4_K_M.gguf
Problem description & steps to reproduce
Here are the command lines I used:
I'm not sure if this qualifies as a bug, but I noticed something unusual.

When using the Qwen3-30B-A3B-Q4_K_M.gguf model, if /no_think is enabled in the first query, the second response becomes significantly slower. The llama-server web UI shows that the second response's Prompt Tokens are nearly equal to the sum of the first response and the second query.

First response:
Second response:

However, if /no_think is not used in the first query, the second response remains fast, with Prompt Tokens only reflecting the second query.

First response:
Second response:

Additional notes:
Without --cache-reuse 256, all second responses show high Prompt Tokens and become slow, regardless of /no_think.

Could this be a caching-related issue?
First Bad Commit
No response
Relevant log output