
Misc. bug: llama-sampling.cpp:204: GGML_ASSERT(cur_p->size > 0) failed #13405


Open

bjodah opened this issue May 9, 2025 · 3 comments


bjodah commented May 9, 2025

Name and Version

$./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 5329 (611aa914)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-cli

Command line

llama-cli \
    --log-file /tmp/llamacpp-Qwen3-30B-A3B-Q8_K_XL.log \
    --hf-repo unsloth/Qwen3-30B-A3B-GGUF:Q8_K_XL \
    --override-tensor '([0-9]+).ffn_.*_exps.=CPU' \
    --n-gpu-layers 48 \
    --jinja \
    --cache-type-k q8_0 \
    --ctx-size 32768 \
    --samplers "top_k;dry;min_p;temperature;top_p" \
    --min-p 0.005 \
    --top-p 0.97 \
    --top-k 40 \
    --temp 0.7 \
    --dry-multiplier 0.7 \
    --dry-allowed-length 4 \
    --dry-penalty-last-n 2048 \
    --presence-penalty 0.05 \
    --frequency-penalty 0.005 \
    --repeat-penalty 1.01 \
    --repeat-last-n 16 \
    --verbose \
    --file generic-prompt-for-testing-1906words.txt

Problem description & steps to reproduce

The log file of the output, together with what I hope is all the relevant information, can be found in this ephemeral repo I put up for this bug report:
https://github.com/bjodah/bug-reproducer-llamacpp-assert-triggering/tree/main

It might very well be that I'm doing something awfully wrong here, but since it's an assert that is triggering, I figure you might be interested in a bug report.
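For anyone not familiar with that assert: here is a minimal sketch, as I understand it, of the invariant it guards (hypothetical code on my part, not the actual llama-sampling.cpp). Each sampler in the chain may shrink the candidate token list, and the stage that finally picks a token requires at least one candidate to survive.

// Hypothetical sketch of the invariant behind GGML_ASSERT(cur_p->size > 0);
// this is my reconstruction, not the actual llama.cpp source.
#include <cassert>
#include <vector>

struct token_data { int id; float p; };

// A filtering sampler erases candidates below some probability threshold...
void filter_candidates(std::vector<token_data> & cur, float threshold) {
    std::erase_if(cur, [&](const token_data & t) { return t.p < threshold; });
}

// ...and the next stage asserts that something survived. If an earlier
// sampler (or an out-of-range parameter) removed everything, it aborts here.
int pick_token(const std::vector<token_data> & cur) {
    assert(!cur.empty() && "no candidates left to sample from");
    return cur.front().id; // stand-in for the real softmax/dist sampling
}

int main() {
    std::vector<token_data> cur = {{42, 0.6f}, {7, 0.4f}};
    filter_candidates(cur, 0.7f); // too aggressive: empties the list
    return pick_token(cur);       // the assert fires here
}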

I first observed this error using llama-server on my laptop (Ubuntu 24.04, GeForce 1050 Mobile), but everything in this bug report was reproduced on a more modern system (Debian, GeForce RTX 3090).

First Bad Commit

Qwen 3 support is pretty recent, so I haven't figured out the oldest relevant commit for a bisection.

Relevant log output

/... lots of output, see log file in repo linked in issue description .../ 
eval: [ 'G':38 ]
Gn_past = 2620
/home/bjorn/vc/llama.cpp/src/llama-sampling.cpp:204: GGML_ASSERT(cur_p->size > 0) failed
/home/bjorn/.gdbinit:2: Error in sourced command file:
/home/bjorn/dotfiles/per-file/.gdbinit:22: Error in sourced command file:
Scripting in the "Python" language is not supported in this copy of GDB.
ptrace: Operation not permitted.
No stack.
The program is not being run.

bjodah commented May 9, 2025

...I should have added a --seed flag, but the issue is reproducible for me with all seeds I've tried so far.

The issue has to do with --dry-allowed-length 4:

...
Now finish your task according to taskDefinition, only write the poem, add no commentary.
assistant
GGGG/home/bjorn/vc/llama.cpp/src/llama-sampling.cpp:204: GGML_ASSERT(cur_p->size > 0) failed

If I adjust this to --dry-allowed-length 9, we see nine capital Gs before the assert:

...
Now finish your task according to taskDefinition, only write the poem, add no commentary.
assistant
GGGGGGGGG/home/bjorn/vc/llama.cpp/src/llama-sampling.cpp:204: GGML_ASSERT(cur_p->size > 0) failed
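For what it's worth, my mental model of DRY (a rough sketch under my own assumptions, not the actual implementation) is that repeats up to --dry-allowed-length are unpenalized and the penalty then grows exponentially with the excess length, which would explain why the number of Gs tracks the flag exactly:

// Rough mental model of the DRY penalty, not the actual llama.cpp code:
// repeats shorter than allowed_length are free, then the penalty ramps up.
#include <cmath>
#include <cstdio>

float dry_penalty(int repeat_len, int allowed_length,
                  float multiplier, float base = 1.75f) {
    if (repeat_len < allowed_length) {
        return 0.0f; // e.g. the first 4 'G's with --dry-allowed-length 4
    }
    return multiplier * std::pow(base, float(repeat_len - allowed_length));
}

int main() {
    for (int len = 1; len <= 8; ++len) {
        std::printf("repeat_len=%d penalty=%.3f\n",
                    len, dry_penalty(len, /*allowed_length=*/4, /*multiplier=*/0.7f));
    }
}

If that model is roughly right, the crash coincides with the penalty first kicking in, which is why I suspect the DRY flags.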


segmond commented May 12, 2025

I'm seeing this bug as well, and I'm not passing in --dry-allowed-length 4.

main: server is listening on http://0.0.0.0:8089 - starting the main loop
srv update_slots: all slots are idle
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 16128, n_keep = 0, n_prompt_tokens = 88
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 88, n_tokens = 88, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 88, n_tokens = 88
/home/seg/llama.cpp/src/llama-sampling.cpp:204: GGML_ASSERT(cur_p->size > 0) failed
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
dsv3b.sh: line 8: 68577 Aborted (core dumped) ~/llama.cpp/build/bin/llama-server -ngl 62 --host 0.0.0.0 --path ~/llama.cpp/examples/server/public -m /llmzoo/models/DeepSeek-V3-0324-UD-Q3_K_XL.gguf --port 8089 --override-tensor "blk.([0-4]).ffn_(up|down)_exp.=CUDA0,blk.([1][0257]|[5]).ffn_(up|down)_exp.=CUDA1,blk.([2][0257]|[6]).ffn_(up|down)_exp.=CUDA2,blk.([3][0257]|[7]).ffn_(up|down)_exp.=CUDA3,blk.([4][0257]|[6][01]).ffn_(up|down)_exp.=CUDA4,blk.([5][02579]|[6][2]).ffn_(up|down)_exp.=CUDA5,blk.([8-9]|[1-9][0-9]).ffn.exp.=CPU" -md ~/models/draft/DeepSeek-V3-0324-DRAFT-0.5B-Q8_0.gguf -ngld 127 -devd CUDA2 -cd 16000 -fa -mg 4 --no-mmap -c 16000


michmill1970 commented May 13, 2025

I can confirm the same behavior on macOS.
Version: llama-b5353-bin-macos-arm64.zip
macOS: 15.4.1 (24E263)

Error:

que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 1
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 2, front = 0
slot update_slots: id 0 | task 0 | kv cache rm [347, end)
srv process_chun: processing image...
image/slice encoded in 21169 ms
decoding image batch 1/1, n_tokens_batch = 256
set_causal_attn: value = 0
image decoded (batch 1/1) in 6587 ms
set_causal_attn: value = 1
srv process_chun: image processed in 27757 ms
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 609, n_tokens = 6, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 609, n_tokens = 6
srv update_slots: decoding batch, n_tokens = 6
set_embeddings: value = 0
clear_adapter_lora: call
/Users/runner/work/llama.cpp/llama.cpp/src/llama-sampling.cpp:204: GGML_ASSERT(cur_p->size > 0) failed
zsh: abort /Users/myuserdir/Projects/llamacpp/bin/llama-server --model --mmproj 4096

Command line used to start llama-server:

/Users/myuserdir/Projects/llamacpp/bin/llama-server
--model /Users/myuserdir/Projects/ImageIndexer/resources/Qwen2-VL-2B-Instruct-Q6_K.gguf
--mmproj /Users/myuserdir/Projects/ImageIndexer/resources/mmproj-Qwen2-VL-2B-Instruct-f16.gguf
--ctx-size 4096
-v

JSON payload:

request: {
  "max_tokens": 250,
  "messages": [
    {
      "content": "You describe the image and generate keywords.",
      "role": "system"
    },
    {
      "content": [
        {
          "text": "The tasks are to describe the image and to come up with a large set of keyword tags for it.\n\nWrite the Description using the active voice.\n\nThe Keywords must be one or two words each. Generate as many Keywords as possible using a controlled and consistent vocabulary.\n\nFor both Description and Keywords, make sure to include:\n\n - Themes, concepts\n - Items, animals, objects\n - Structures, landmarks, setting\n - Foreground and background elements\n - Notable colors, textures, styles\n - Actions, activities\n\nIf humans are present, include: \n - Physical appearance\n - Gender\n - Clothing\n - Age range\n - Visibly apparent ancestry\n - Occupation/role\n - Relationships between individuals\n - Emotions, expressions, body language\n\nUse ENGLISH only. Generate ONLY a JSON object with the keys Description and Keywords as follows {"Description": str, "Keywords": []}\n\nThe example input would be a stock photo of two apples, one red and one green, against a white backdrop and is a hypothetical Description and Keyword for a non-existent image.\nOUTPUT=json{\"Description\": \"Two apples next to each other, one green and one red, placed side by side against a white background. There is even and diffuse studio lighting. The fruit is glossy and covered with dropplets of water indicating they are fresh and recently washed. The image emphasizes the cleanliness and appetizing nature of the food\", \"Keywords\": [\"studio shot\",\"green\",\"fruit\",\"red\",\"apple\",\"stock image\",\"health food\",\"appetizing\",\"empty background\",\"grocery\",\"food\",\"snack\"]}\n ",
          "type": "text"
        },
        {
          "image_url": {
            "url": "data:image/jpeg;base64,...image content in base64 here..."
          },
          "type": "image_url"
        }
      ],
      "role": "user"
    }
  ],
  "min_p": 1.05,
  "temperature": 0.1,
  "top_k": 0,
  "top_p": 1
}
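Side note: min_p is 1.05 in that payload, which is above the usual [0, 1] range. Assuming min-p keeps only tokens whose probability is at least min_p times the top probability (my assumption about the filter, not the server's actual code), a value above 1 rejects even the most likely token, which on its own would leave an empty candidate list:

// Sketch under the assumption that min-p keeps tokens with
// p >= min_p * max_p; not the actual llama.cpp implementation.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> probs = {0.60f, 0.25f, 0.10f, 0.05f};
    const float min_p  = 1.05f; // from the JSON payload above
    const float max_p  = *std::max_element(probs.begin(), probs.end());
    const float cutoff = min_p * max_p; // 0.63: even the top token fails

    int kept = 0;
    for (float p : probs) {
        if (p >= cutoff) ++kept;
    }
    std::printf("kept %d of %zu candidates\n", kept, probs.size()); // kept 0 of 4
}

If that is what happens here, clamping or rejecting out-of-range sampling parameters would be a friendlier failure mode than the assert.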
