Eval bug: Unusually high RAM usage on Windows when running DeepSeek V3 Q2_K_XL/IQ2_XXS on hybrid CPU+GPU (vs. Linux)
Name and Version
Operating systems
Windows
GGML backends
CPU + CUDA
Hardware
AMD 7800X3D + 192GB RAM + RTX 5090 + RTX 4090 x2 + RTX A6000
Models
https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF/tree/main/UD-Q2_K_XL
https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF/tree/main/UD-IQ2_XXS
Problem description & steps to reproduce
Hi there, many thanks for all your work!
I have been testing a hybrid CPU+GPU setup lately, with the hardware listed above.
The issue is that loading the same model on Windows and on Linux results in different behavior. On Linux (Ubuntu 24.04, Fedora 41) I run:
./llama-server -m '/DeepSeek-V3-0324-UD-Q2_K_XL-merged.gguf' -c 4096 -ngl 25 -ts 16,20,25,41 --no-warmup
and on Windows 11:
.\llama-server.exe -m 'DeepSeek-V3-0324-UD-Q2_K_XL-merged.gguf' -c 4096 -ngl 25 -ts 16,20,25,41 --no-warmup
On Linux: it starts correctly, loading the model up to the point where it is ready to generate, and RAM usage ends up near ~150GB. When doing inference (via SillyTavern or http://127.0.0.1:8080), preprocessing takes a while, but generation starts fairly soon after. RAM usage hovers around that level and doesn't go up. ~3 t/s.
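For anyone reproducing the Linux-side numbers, an equivalent terminal check is easy to run alongside the server (just a convenience; `watch` and `free` are standard tools, nothing llama.cpp-specific):

```shell
# Print used/free system RAM in GiB every 2 seconds while llama-server runs
watch -n 2 free -g
```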
On Windows: it starts correctly, loading the model up to the point where it is ready to generate, with RAM usage starting at ~120-130GB. Then, when doing inference, RAM usage climbs slowly until it reaches ~190GB, at which point Windows starts to use swap. This makes preprocessing really slow, and generation is also very slow (~0.8 t/s). The SSD holding the model also shows constant read activity while RAM usage increases.
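The constant SSD reads while RAM climbs look consistent with the model being demand-paged in from the memory-mapped GGUF file, and with Windows accounting those pages differently than Linux. That is only a guess, but one way to test it is to disable memory mapping via the existing --no-mmap flag (same command as above otherwise):

```shell
# If the slow RAM growth on Windows disappears with this (at the cost of a
# longer upfront load), the difference likely lies in how each OS handles
# and accounts for mmap-ed model pages.
.\llama-server.exe -m 'DeepSeek-V3-0324-UD-Q2_K_XL-merged.gguf' -c 4096 -ngl 25 -ts 16,20,25,41 --no-warmup --no-mmap
```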
Note that the CUDA VRAM usage seems to be the same on both Windows and Linux.
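For completeness, VRAM can be compared with the same query on both operating systems (standard nvidia-smi options, identical invocation on Windows and Linux):

```shell
# Report per-GPU VRAM usage every 2 seconds across all four cards
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv -l 2
```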
Screenshot: RAM usage after loading the model.
Screenshot: RAM usage when sending a generation request (first seconds; note the read activity on the I: drive, where the model is located).
Screenshot: RAM usage once swap kicks in (and now the C: drive shows activity).
First Bad Commit
N/A
Relevant log output