Eval bug: inference of 32B eats too much memory on ROCM HIP (5x AMD Radeon Instinct Mi50 (gfx906)) #12369
Comments
Tried to build llama.cpp with Vulkan; on the same 5x AMD Radeon Instinct Mi50 the same inference command works fine (though about 2x slower), and VRAM does not grow during prompt reading.
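A Vulkan build of this kind is assumed here (GGML_VULKAN is the standard llama.cpp CMake option; the reporter's exact command is not shown):

```shell
# Assumed Vulkan build sketch, not the reporter's exact command.
cmake -S . -B build-vulkan -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build-vulkan --config Release -j
```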
I cannot reproduce this with a trio of gfx908 devices, nor with gfx1030, on ROCm 6.3.2 and latest master llama.cpp. There is no difference in the amount allocated regardless of whether there is no prompt or I start with a 16K-long prompt via -f. It's possible that the issue is in your ROCm environment, or in a code path not hit by gfx908 or gfx1030. Could you get an HSA trace with rocprof? That would show the allocations.
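For reference, such a trace could be captured along these lines; this is a sketch assuming the classic rocprof CLI, with the model path, prompt file and inference flags as placeholders:

```shell
# Sketch of an HSA trace capture with rocprof (placeholder paths and flags).
# --hsa-trace records HSA API activity, including memory allocations.
rocprof --hsa-trace --stats -o rocprof_trace.csv \
    ./llama-cli -m Qwen2.5-32B-Instruct-Q4_K_M.gguf -c 32768 -f long_prompt.txt
```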
I use rocm-5.7.3 and the llama.cpp head commit is 10f2e81. I launched rocprof (see attachment).
I will take a look, but 5.7.3 is very old and I am aware of several issues with this version; I would strongly suggest upgrading to at least 6.2.
Hi @IMbackK, rocprof files attached.
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Name and Version
Operating systems
Linux
GGML backends
HIP
Hardware
8 * Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
5 * AMD Radeon Instinct Mi50 (gfx906)
Models
Qwen2.5-32B-Instruct-Q4_K_M.gguf
Problem description & steps to reproduce
When I run ./llama-cli with context size -c 32768 and a big prompt, it eats too much VRAM and fails with an out-of-memory error. The same run on NVIDIA GPUs works fine.
This is my exact command.
What happens: first the model is loaded and occupies about 50% of the VRAM on all 5 GPUs.
Then it starts reading the prompt, VRAM gradually grows until it is exhausted, and the run ends with an error.
The same behaviour is observed when using llama-server.
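The exact command is not reproduced above; an invocation along these lines matches the setup described (model, context size and prompt flag are taken from this report, the remaining values are placeholders):

```shell
# Illustrative invocation only, not the reporter's exact command.
./llama-cli -m Qwen2.5-32B-Instruct-Q4_K_M.gguf \
    -c 32768 \
    -ngl 99 \
    -f long_prompt.txt
```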
Build command
I also add `LLAMA_CUDA_NO_PEER_COPY=1` to prevent gibberish output, as mentioned in #3051, but it has no effect on the VRAM consumption problem. I tried to run the same inference on NVIDIA GPUs and it worked fine: during prompt reading VRAM only grows by 10-15% compared to the model loading stage, and then stops growing.
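The build command itself is not shown above; a HIP build along these lines is assumed (the CMake option names follow the llama.cpp build documentation, gfx906 targets the Mi50, and the peer-copy option is taken to be the CMake counterpart of the flag mentioned above):

```shell
# Assumed HIP/ROCm build sketch, not the reporter's exact command.
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS=gfx906 \
    -DGGML_CUDA_NO_PEER_COPY=ON \
    -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
```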
First Bad Commit
No response
Relevant log output