Replies: 1 comment
Likely you are running out of context, because you set just --ctx-size 1024. Try replacing this argument with the following in order to utilize the full context size of the model:
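For example, assuming the usual llama.cpp convention where --ctx-size 0 means the context length is loaded from the model's metadata, the command would become something like:

./llama-server --model "C:\Users\admin\.lmstudio\models\lmstudio-community\DeepSeek-R1-0528-Qwen3-8B-GGUF\DeepSeek-R1-0528-Qwen3-8B-Q4_K_M.gguf" --port 10000 --ctx-size 0 --n-gpu-layers 40 --alias "DeepSeek-R1-8B"

Keep in mind that a larger context also means a larger KV cache, so GPU memory usage will grow accordingly.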
Forgive me, I'm a total noob to LLMs and llama.cpp.
I'm trying to couple llama-server.exe with an Open WebUI frontend. I have it working, but after the first prompt goes through and finishes, my GPU/CPU usage stays maxed out as though it's still generating.
This is what I run:
./llama-server --model "C:\Users\admin\.lmstudio\models\lmstudio-community\DeepSeek-R1-0528-Qwen3-8B-GGUF\DeepSeek-R1-0528-Qwen3-8B-Q4_K_M.gguf" --port 10000 --ctx-size 1024 --n-gpu-layers 40 --alias "DeepSeek-R1-8B"
This is the relevant part of the response after the prompt and before I manually terminate: