My Journey to Building llama-cpp-python with CUDA on an RTX 5060 Ti (Blackwell Architecture)
This guide details the steps I took to successfully install llama-cpp-python
with full CUDA acceleration on my system, specifically targeting an NVIDIA RTX 5060 Ti (Blackwell architecture). The standard installation methods failed due to various detection and compatibility issues, requiring a fully manual and controlled build process.
My Setup:
- GPU: NVIDIA GeForce RTX 5060 Ti (Blackwell architecture; compiled for `compute_90` as a workaround, see the GPU architecture problem below)
- CUDA Toolkit Version: 12.x (pre-installed on my system at `/usr/lib/cuda`)
- Operating System: Linux (Ubuntu/Debian-based, as indicated by `apt-get` usage)
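For reference, the same details can be confirmed from a terminal. This is a minimal sketch, assuming `nvidia-smi`, `nvcc`, and `lsb_release` are on the PATH; adjust the toolkit path if yours differs from mine:

```bash
# GPU model and driver version
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
# CUDA compiler release (should report 12.x)
nvcc --version
# Confirm the toolkit directory referenced throughout this guide actually exists
ls /usr/lib/cuda
# Distribution details
lsb_release -a
```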
The Problems Encountered and Their Solutions
- Problem: Missing System Libraries (`libgomp`)
  - Error: Initial CPU builds failed, complaining about `libgomp`.
  - Fix: Installed the necessary library: `sudo apt-get install libgomp1`
- Problem: Incorrect CUDA Toolkit Path Detection
  - Error: The build process incorrectly identified `/usr/include` as the CUDA Toolkit location (`-- Found CUDAToolkit: /usr/include (found version "12.0.140")`).
  - Fix: Manually specified the correct path (`/usr/lib/cuda`) using a CMake flag.
- Problem: C++ Compiler (GCC/G++) Mismatch with CUDA Toolkit
  - Error: My system's default C++ compiler (g++ 13) was too new and incompatible with the CUDA Toolkit (version 12.x). This issue manifested in two places:
    - During the `llama.cpp` compilation.
    - Crucially, with the Python interpreter itself, which was initially built with GCC 13 by conda-forge. This caused silent issues or very slow prompt evaluation even after the GPU build succeeded.
  - Fix:
    - Installed a compatible compiler version (`g++-12`).
    - Forced the `llama.cpp` build to use `g++-12`.
    - Most importantly, created a Conda environment with a Python version built using a compatible GCC (GCC 11 in my case).
- Problem: Missing `llama.cpp` Source Code (Git Submodules)
  - Error: The build failed because it couldn't find the core `llama.cpp` source files in the `vendor/` directory.
  - Fix: Realized `llama-cpp-python` uses Git submodules and downloaded them: `git submodule update --init --recursive`
- Problem: Incorrect GPU Architecture Detection (Blackwell)
  - Error: The final build error was `Unsupported gpu architecture 'compute_120'`. Auto-detection failed for my new RTX 5060 Ti.
  - Fix: Set the architecture manually to 90 using a CMake flag, a value my installed CUDA 12.x toolkit accepts even though the card itself reports compute capability 12.0 (a quick sanity check covering the toolkit path, compiler versions, and GPU architecture is sketched after this list).
- Problem: GPU Enabled but Very Slow Prompt Evaluation
  - Symptom: After successfully building and offloading layers to the GPU, the initial prompt evaluation was incredibly slow (e.g., 0.27 tokens/sec), though subsequent token generation was fast.
  - Fix: Forced the use of cuBLAS for certain operations during the build (via `GGML_CUDA_FORCE_CUBLAS`). This resolved the slow prompt evaluation bottleneck.
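Before moving on to the recipe, the detection problems above (toolkit path, compiler versions, GPU architecture) can be sanity-checked up front. A minimal sketch, assuming `nvidia-smi` is available and the `g++-12` package is installed; the Python check applies to whichever interpreter is currently active:

```bash
# Compute capability reported by the driver (my RTX 5060 Ti reports 12.0)
nvidia-smi --query-gpu=compute_cap --format=csv,noheader
# Default system compiler vs. the CUDA-compatible one
g++ --version | head -n 1
g++-12 --version | head -n 1
# Compiler the active Python interpreter was built with (should not be GCC 13)
python -c "import platform; print(platform.python_compiler())"
# The toolkit path that will be passed to CMake explicitly
ls /usr/lib/cuda
```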
The Final, Working Recipe (Step-by-Step)
Here's the exact sequence of commands that ultimately got `llama-cpp-python` working optimally with CUDA on my RTX 5060 Ti:
- Set up the Python Environment with a Compatible GCC
  - Create a new Conda environment with Python 3.11. This specific Python version from conda-forge often comes pre-built with GCC 11, which is compatible with CUDA 12.x.
  - Install `ipython` and `huggingface-hub` for convenience.

    ```bash
    conda create -n llama_cpp python=3.11 -y
    conda activate llama_cpp
    conda install -c conda-forge gxx_linux-64=12  # Ensure GCC 12 is available/preferred for direct use by cmake
    pip install ipython huggingface-hub
    ```

  - Verify the Python interpreter's GCC:

    ```bash
    python -c "import platform; print(platform.python_compiler())"
    # Expected output: GCC 11.x.x
    ```

  - Install the necessary build tools:

    ```bash
    sudo apt-get install cmake libgomp1  # cmake is crucial for the build
    ```
- Clone the `llama-cpp-python` Repository and Download Submodules

  ```bash
  git clone https://github.com/abetlen/llama-cpp-python.git
  cd llama-cpp-python
  git submodule update --init --recursive
  ```
- Compile and Install `llama-cpp-python` with Full CUDA Support
  - This is the critical step. We pass all the specific CMake flags directly to `pip` to ensure our custom build configuration is used.
    - `GGML_CUDA=on`: Enables CUDA support.
    - `GGML_CUDA_FORCE_CUBLAS=on`: Resolves the slow prompt evaluation issue.
    - `CUDAToolkit_ROOT=/usr/lib/cuda`: Specifies the correct CUDA Toolkit path.
    - `CMAKE_C_COMPILER=gcc-12` and `CMAKE_CXX_COMPILER=g++-12`: Force specific GCC versions for compilation.
    - `CMAKE_CUDA_ARCHITECTURES=90`: The architecture value my CUDA 12.x toolkit accepts for the RTX 5060 Ti (see the GPU architecture problem above).
    - `FORCE_CMAKE=1`: Ensures `pip` re-runs CMake even if it thinks it's not necessary.
    - `--upgrade --force-reinstall --no-cache-dir`: Ensures a clean build.

    ```bash
    CMAKE_ARGS="-DGGML_CUDA=on -DGGML_CUDA_FORCE_CUBLAS=on -DCUDAToolkit_ROOT=/usr/lib/cuda -DCMAKE_C_COMPILER=gcc-12 -DCMAKE_CXX_COMPILER=g++-12 -DCMAKE_CUDA_ARCHITECTURES=90" \
    FORCE_CMAKE=1 pip install . --upgrade --force-reinstall --no-cache-dir
    ```

  - This step will take some time as it compiles the entire library.
- Test for GPU Acceleration
  - Use the following Python script to load a GGUF model (e.g., Gemma) and confirm GPU layer offloading and performance.

    ```python
    from llama_cpp import Llama

    llm = Llama.from_pretrained(
        repo_id="google/gemma-3-4b-it-qat-q4_0-gguf",
        filename="gemma-3-4b-it-q4_0.gguf",
        n_gpu_layers=-1,  # Offload all possible layers to GPU
        verbose=True,     # Important for seeing detailed loading info
    )

    def chat(text: str, messages: list):
        user_message = {"role": "user", "content": text}
        messages.append(user_message)
        response = llm.create_chat_completion(messages=messages)
        messages.append(response["choices"][0]["message"])
        return response, messages

    print("\nFirst chat (expect longer prompt eval time due to initial setup):")
    response1, messages1 = chat("hello, tell me a joke", [])
    print("\nResponse 1:", response1["choices"][0]["message"]["content"])

    print("\nSecond chat (expect much faster prompt eval):")
    response2, messages2 = chat("hello, tell me another joke", messages1)  # Using previous context
    print("\nResponse 2:", response2["choices"][0]["message"]["content"])
    ```

  - Look for these lines in the output to confirm success:

    ```
    ggml_cuda_init: found 1 CUDA devices:
      Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
    ```

    ```
    llama_model_loader: offloaded X/Y layers to GPU
    ```

    (where X should equal Y when `n_gpu_layers=-1`)
  - Crucially, check that the `prompt eval time` in the performance metrics is very low for subsequent calls (e.g., milliseconds per token, hundreds of tokens per second). A quick command-line check of the installed build is sketched after these steps.
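As a final check from the shell, the installed package can be queried for GPU offload support before loading any model. A minimal sketch; `llama_supports_gpu_offload` is part of llama-cpp-python's low-level bindings, and if your version does not expose it, the `ggml_cuda_init` log line above is the authoritative signal:

```bash
# Should print True for a CUDA-enabled build (low-level binding; availability may vary by version)
python -c "from llama_cpp import llama_supports_gpu_offload; print(llama_supports_gpu_offload())"
# While the test script runs, the model weights should show up as GPU memory usage
nvidia-smi --query-gpu=memory.used --format=csv
```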
This detailed walkthrough should equip others facing similar challenges with the knowledge and exact steps needed to get `llama-cpp-python` running smoothly on their cutting-edge hardware. Good job, Jonathan!