Name and Version

llama-cli --version
version: 5336 (053367d)
built with gcc-12.4 (GCC) 12.4.0 for x86_64-redhat-linux

Operating systems

Linux

GGML backends

Vulkan

Hardware

AMD RX 7600

Models

Phi-4-mini-reasoning-Q8_0.gguf

Problem description & steps to reproduce

The server crashes after a few minutes of inference.

./llama-server -m /models/Phi-4-mini-reasoning-Q8_0.gguf -t 8 --batch-size 2048 --ubatch-size 1024 -fa -ctk q8_0 -ctv q8_0 --gpu-layers 99 -c 32768 --temp 0.8 --top-p 0.95 --min-p 0 --jinja
First Bad Commit

No response

Relevant log output

/media/build/llama/ggml/src/ggml-backend.cpp:748: pre-allocated tensor (cache_k_l0 (view) (copy of cache_k_l0 (view))) in a buffer (Vulkan0) that cannot run the operation (CPY)
[New LWP 11936]
[New LWP 11937]
[New LWP 11938]
[New LWP 11939]
[New LWP 11940]
[New LWP 11941]
[New LWP 11942]
[New LWP 11943]
[New LWP 11944]
[New LWP 11945]
[New LWP 11946]
[New LWP 11947]
[New LWP 20707]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f51be9db5a6 in waitpid () from /lib64/libpthread.so.0
#0  0x00007f51be9db5a6 in waitpid () from /lib64/libpthread.so.0
#1  0x000000000070a5e8 in ggml_abort ()
#2  0x000000000071dfff in ggml_backend_sched_backend_id_from_cur(ggml_backend_sched*, ggml_tensor*) ()
#3  0x000000000071ee1a in ggml_backend_sched_split_graph(ggml_backend_sched*, ggml_cgraph*) [clone .part.0] ()
#4  0x00000000007227d1 in ggml_backend_sched_alloc_graph ()
#5  0x00000000005089ce in llama_kv_cache_unified::update(llama_context&) ()
#6  0x00000000004e146f in llama_context::kv_self_update() ()
#7  0x00000000004e487e in llama_context::decode(llama_batch&) ()
#8  0x00000000004e62ea in llama_decode ()
#9  0x00000000003676da in server_context::update_slots() ()
#10 0x00000000003323dc in server_queue::start_loop() ()
#11 0x00000000003a1420 in main ()
[Inferior 1 (process 11925) detached]
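Note: the abort fires while the scheduler tries to run a CPY on the quantized K cache (cache_k_l0) during llama_kv_cache_unified::update. One way to narrow this down, offered here only as a suggestion and not something reported in this issue, is to rerun with the same settings but without -ctk q8_0 -ctv q8_0, so the KV cache stays in the default f16 type:

./llama-server -m /models/Phi-4-mini-reasoning-Q8_0.gguf -t 8 --batch-size 2048 --ubatch-size 1024 -fa --gpu-layers 99 -c 32768 --temp 0.8 --top-p 0.95 --min-p 0 --jinja

If that run survives the point where the quantized-cache configuration aborts, the problem is more likely in the q8_0 cache copy path on the Vulkan backend than in the model or the context size.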
This seems like a possible out-of-memory or 4 GB allocation limit issue.
The card has 8 GB of VRAM and I am able to run other 8 GB models that fully occupy the VRAM, so I am not sure this is due to OOM.
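As a rough sanity check on the memory theory, the quantized KV-cache footprint at -c 32768 can be estimated as below. This is only a sketch: the layer and head counts are assumptions taken from the published Phi-4-mini configuration, not read from this GGUF, and it ignores the model weights, activations, and Vulkan staging buffers.

# Back-of-the-envelope KV-cache size for -c 32768 with -ctk q8_0 -ctv q8_0.
# NOTE: n_layers, n_kv_heads and head_dim are assumptions from the published
# Phi-4-mini config, not values confirmed from this GGUF.
n_layers   = 32
n_kv_heads = 8
head_dim   = 128
n_ctx      = 32768                 # from -c 32768
bytes_per_elem_q8_0 = 34 / 32      # q8_0 block: 32 values stored in 34 bytes

kv_bytes = 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem_q8_0
print(f"KV cache ~= {kv_bytes / 1024**3:.2f} GiB")   # ~2.1 GiB with these assumptions

With roughly 4 GB of Q8_0 weights on top of that (assuming a ~3.8 B-parameter model), the total stays under 8 GB, so an outright OOM is plausible but not obvious; whether any single buffer crosses a driver-side 4 GB allocation limit would be a separate question.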