
Building and installing llama_cpp from source for RTX 50 Blackwell GPU #2028

Description

@Johnnyboycurtis

My Journey to Building llama-cpp-python with CUDA on an RTX 5060 Ti (Blackwell Architecture)

This guide details the steps I took to successfully install llama-cpp-python with full CUDA acceleration on my system, specifically targeting an NVIDIA RTX 5060 Ti (Blackwell architecture). The standard installation methods failed due to various detection and compatibility issues, requiring a fully manual and controlled build process.

My Setup:

  • GPU: NVIDIA GeForce RTX 5060 Ti (Blackwell architecture, compute capability 12.0)
  • CUDA Toolkit Version: 12.0 (12.0.140, pre-installed on my system at /usr/lib/cuda)
  • Operating System: Linux (Ubuntu/Debian-based, as indicated by apt-get usage)

The Problems Encountered and Their Solutions

  1. Problem: Missing System Libraries (libgomp)

    • Error: Initial CPU builds failed, complaining about libgomp.
    • Fix: Installed the necessary library:
      sudo apt-get install libgomp1
  2. Problem: Incorrect CUDA Toolkit Path Detection

    • Error: The build process incorrectly identified /usr/include as the CUDA Toolkit location (-- Found CUDAToolkit: /usr/include (found version "12.0.140")).
    • Fix: Manually specified the correct path (/usr/lib/cuda) using a CMake flag.
  3. Problem: C++ Compiler (GCC/G++) Mismatch with CUDA Toolkit

    • Error: My system's default C++ compiler (g++ 13) was too new and incompatible with the CUDA Toolkit (version 12.x). This issue manifested in two places:
      • During the llama.cpp compilation.
      • Crucially, with the Python interpreter itself, which was initially built with GCC 13 by conda-forge. This caused silent issues or very slow prompt evaluation even after the GPU build succeeded.
    • Fix:
      • Installed a compatible compiler version (g++-12).
      • Forced the llama.cpp build to use g++-12.
      • Most importantly, created a Conda environment with a Python version built using a compatible GCC (GCC 11 in my case).
  4. Problem: Missing llama.cpp Source Code (Git Submodules)

    • Error: The build failed because it couldn't find the core llama.cpp source files in the vendor/ directory.
    • Fix: Realized llama-cpp-python uses Git submodules and downloaded them:
      git submodule update --init --recursive
  5. Problem: GPU Architecture Not Supported by the Toolkit (Blackwell)

    • Error: The final build error was Unsupported gpu architecture 'compute_120'. Auto-detection actually worked: the RTX 5060 Ti reports compute capability 12.0 (compute_120), but my CUDA 12.0 toolkit predates Blackwell support (added in CUDA 12.8) and cannot compile for it.
    • Fix: Set the architecture manually to 90, the newest architecture my toolkit supports, using a CMake flag; the driver can still JIT-compile the embedded PTX for the newer GPU. (A diagnostic sketch for checking what your GPU and toolkit support follows this list.)
  6. Problem: GPU Enabled but Very Slow Prompt Evaluation

    • Symptom: After successfully building and offloading layers to the GPU, the initial prompt evaluation was incredibly slow (e.g., 0.27 tokens/sec), though subsequent token generation was fast.
    • Fix: Forced the build to use cuBLAS for matrix multiplications (GGML_CUDA_FORCE_CUBLAS=on) instead of the custom CUDA kernels. This resolved the slow prompt evaluation bottleneck.
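
To diagnose Problems 2, 3, and 5 up front, it helps to compare what the GPU reports against what the installed toolchain can handle. The Python sketch below is a minimal example, not part of the original recipe; it assumes nvidia-smi is new enough to support the compute_cap query field, that nvcc supports --list-gpu-arch, and that the fallback nvcc path matches your system.

    import platform
    import shutil
    import subprocess

    def run(cmd):
        """Run a command and return its stdout, or None if it is missing/fails."""
        try:
            result = subprocess.run(cmd, capture_output=True, text=True, check=True)
            return result.stdout.strip()
        except (OSError, subprocess.CalledProcessError):
            return None

    # Problem 5: what compute capability does the GPU actually report?
    # An RTX 5060 Ti (Blackwell) should report 12.0.
    print("GPU compute capability:",
          run(["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"]))

    # Problems 2 and 5: where is nvcc, and which architectures can it target?
    # If compute_120 is missing from this list, the toolkit predates Blackwell and
    # the build ends in "Unsupported gpu architecture 'compute_120'".
    nvcc = shutil.which("nvcc") or "/usr/lib/cuda/bin/nvcc"  # fallback path is an assumption
    print("nvcc found at:", nvcc)
    print("nvcc-supported architectures:", run([nvcc, "--list-gpu-arch"]))

    # Problem 3: which GCC built this Python, and what is the default g++?
    # Both should be versions the installed CUDA toolkit accepts (g++ 13 was too new here).
    print("Python built with:", platform.python_compiler())
    gxx = run(["g++", "--version"])
    print("Default g++:", gxx.splitlines()[0] if gxx else None)

If compute_120 is absent from the nvcc list, either upgrade the CUDA toolkit or pin CMAKE_CUDA_ARCHITECTURES to an architecture the toolkit does support, as the recipe below does.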

The Final, Working Recipe (Step-by-Step)

Here's the exact sequence of commands that ultimately got llama-cpp-python working optimally with CUDA on my RTX 5060 Ti:

  1. Set up the Python Environment with a Compatible GCC

    • Create a new Conda environment with Python 3.11. This specific Python version from conda-forge often comes pre-built with GCC 11, which is compatible with CUDA 12.x.
    • Install ipython and huggingface-hub for convenience.
    conda create -n llama_cpp python=3.11 -y
    conda activate llama_cpp
    conda install -c conda-forge gxx_linux-64=12 # Ensure GCC 12 is available/preferred for direct use by cmake
    pip install ipython huggingface-hub
    • Verify the Python interpreter's GCC:
      python -c "import platform; print(platform.python_compiler())"
      # Expected output: ... [GCC 11.x.x] on linux
    • Install the necessary build tools (g++-12 comes from apt here, matching the fix for Problem 3):
      sudo apt-get install cmake libgomp1 gcc-12 g++-12 # cmake drives the build; gcc-12/g++-12 match the CUDA 12.x toolkit
  2. Clone the llama-cpp-python Repository and Download Submodules

    git clone https://github.com/abetlen/llama-cpp-python.git
    cd llama-cpp-python
    git submodule update --init --recursive
  3. Compile and Install llama-cpp-python with Full CUDA Support

    • This is the critical step. We pass all the specific CMake flags directly to pip to ensure our custom build configuration is used.
    • GGML_CUDA=on: Enables CUDA support.
    • GGML_CUDA_FORCE_CUBLAS=on: Forces cuBLAS for matrix multiplications; this resolved the slow prompt evaluation issue.
    • CUDAToolkit_ROOT=/usr/lib/cuda: Specifies the correct CUDA Toolkit path.
    • CMAKE_C_COMPILER=gcc-12 and CMAKE_CXX_COMPILER=g++-12: Force specific GCC versions for compilation.
    • CMAKE_CUDA_ARCHITECTURES=90: Targets compute capability 9.0, the newest architecture my CUDA 12.0 toolkit can compile for; the RTX 5060 Ti (compute capability 12.0) then runs the embedded PTX via the driver. With a toolkit that supports Blackwell (CUDA 12.8 or newer), 120 is the native value.
    • FORCE_CMAKE=1: Ensures pip re-runs CMake even if it thinks it's not necessary.
    • --upgrade --force-reinstall --no-cache-dir: Ensures a clean build.
    CMAKE_ARGS="-DGGML_CUDA=on -DGGML_CUDA_FORCE_CUBLAS=on -DCUDAToolkit_ROOT=/usr/lib/cuda -DCMAKE_C_COMPILER=gcc-12 -DCMAKE_CXX_COMPILER=g++-12 -DCMAKE_CUDA_ARCHITECTURES=90" \
    FORCE_CMAKE=1 pip install . --upgrade --force-reinstall --no-cache-dir
    • This step will take some time as it compiles the entire library. (A quick post-install sanity check is sketched after this recipe.)
  4. Test for GPU Acceleration

    • Use the following Python script to load a GGUF model (e.g., Gemma) and confirm GPU layer offloading and performance.
    from llama_cpp import Llama
    
    llm = Llama.from_pretrained(
        repo_id="google/gemma-3-4b-it-qat-q4_0-gguf",
        filename="gemma-3-4b-it-q4_0.gguf",
        n_gpu_layers=-1, # Offload all possible layers to GPU
        verbose=True # Important for seeing detailed loading info
    )
    
    def chat(text: str, messages: list):
        user_message = {'role': 'user', 'content': text}
        messages.append(user_message)
        response = llm.create_chat_completion(messages = messages)
        messages.append(response["choices"][0]["message"])
        return response, messages
    
    print("\nFirst chat (expect longer prompt eval time due to initial setup):")
    response1, messages1 = chat("hello, tell me a joke", [])
    print("\nResponse 1:", response1["choices"][0]["message"]["content"])
    
    print("\nSecond chat (expect much faster prompt eval):")
    response2, messages2 = chat("hello, tell me another joke", messages1) # Using previous context
    print("\nResponse 2:", response2["choices"][0]["message"]["content"])
    • Look for these lines in the output to confirm success:
      • ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
      • llama_model_loader: offloaded X/Y layers to GPU (where X should be equal to Y for -1 n_gpu_layers)
      • Crucially, check that the prompt eval time in the performance metrics for subsequent calls is very low (e.g., milliseconds per token, hundreds of tokens per second). A small end-to-end timing helper follows the sanity check below.
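
Before loading any model, a quick import-level check can confirm that the wheel you just built was actually compiled with CUDA. This is a minimal sketch; it assumes your installed llama-cpp-python version re-exports the low-level llama_supports_gpu_offload binding at the package top level (in some versions it lives under llama_cpp.llama_cpp).

    import llama_cpp

    print("llama-cpp-python version:", llama_cpp.__version__)
    # Expect True for a build compiled with -DGGML_CUDA=on; False means the
    # CMake flags did not take effect and a CPU-only library was built.
    print("GPU offload supported:", llama_cpp.llama_supports_gpu_offload())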

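For a coarse end-to-end number in addition to llama.cpp's own verbose timings, the sketch below wraps create_chat_completion with wall-clock timing and reads the OpenAI-style usage token counts from the response. The per-call tok/s figure mixes prompt evaluation and generation, so the detailed timings printed with verbose=True remain the authoritative breakdown.

    import time
    from llama_cpp import Llama

    llm = Llama.from_pretrained(
        repo_id="google/gemma-3-4b-it-qat-q4_0-gguf",
        filename="gemma-3-4b-it-q4_0.gguf",
        n_gpu_layers=-1,   # offload everything, as in the test script above
        verbose=False,     # set True to see the detailed per-phase timings
    )

    def timed_chat(text: str):
        start = time.perf_counter()
        response = llm.create_chat_completion(
            messages=[{"role": "user", "content": text}]
        )
        elapsed = time.perf_counter() - start
        usage = response["usage"]  # prompt_tokens / completion_tokens / total_tokens
        print(f"{usage['prompt_tokens']} prompt + {usage['completion_tokens']} generated "
              f"tokens in {elapsed:.2f}s (~{usage['total_tokens'] / elapsed:.1f} tok/s overall)")
        return response

    timed_chat("hello, tell me a joke")        # first call includes warm-up cost
    timed_chat("hello, tell me another joke")  # should be noticeably faster
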
This detailed walkthrough should equip others facing similar challenges with the knowledge and exact steps needed to get llama-cpp-python running smoothly on their cutting-edge hardware. Good job, Jonathan!
