
Building and installing llama_cpp from source for RTX 50 Blackwell GPU #2028

Description

@Johnnyboycurtis

My Journey to Building llama-cpp-python with CUDA on an RTX 5060 Ti (Blackwell Architecture)

This guide details the steps I took to successfully install llama-cpp-python with full CUDA acceleration on my system, specifically targeting an NVIDIA RTX 5060 Ti (Blackwell architecture). The standard installation methods failed due to various detection and compatibility issues, requiring a fully manual and controlled build process.

My Setup:

  • GPU: NVIDIA GeForce RTX 5060 Ti (Blackwell architecture, compute capability 12.0)
  • CUDA Toolkit Version: 12.0 (12.0.140, pre-installed on my system at /usr/lib/cuda)
  • Operating System: Linux (Ubuntu/Debian-based, as indicated by apt-get usage)

The Problems Encountered and Their Solutions

  1. Problem: Missing System Libraries (libgomp)

    • Error: Initial CPU builds failed, complaining about libgomp.
    • Fix: Installed the necessary library:
      sudo apt-get install libgomp1
  2. Problem: Incorrect CUDA Toolkit Path Detection

    • Error: The build process incorrectly identified /usr/include as the CUDA Toolkit location (-- Found CUDAToolkit: /usr/include (found version "12.0.140")).
    • Fix: Manually specified the correct path (/usr/lib/cuda) using a CMake flag.
  3. Problem: C++ Compiler (GCC/G++) Mismatch with CUDA Toolkit

    • Error: My system's default C++ compiler (g++ 13) was too new and incompatible with the CUDA Toolkit (version 12.x). This issue manifested in two places:
      • During the llama.cpp compilation.
      • Crucially, with the Python interpreter itself, which was initially built with GCC 13 by conda-forge. This caused silent issues or very slow prompt evaluation even after the GPU build succeeded.
    • Fix:
      • Installed a compatible compiler version (g++-12).
      • Forced the llama.cpp build to use g++-12.
      • Most importantly, created a Conda environment with a Python version built using a compatible GCC (GCC 11 in my case).
  4. Problem: Missing llama.cpp Source Code (Git Submodules)

    • Error: The build failed because it couldn't find the core llama.cpp source files in the vendor/ directory.
    • Fix: Realized llama-cpp-python uses Git submodules and downloaded them:
      git submodule update --init --recursive
  5. Problem: GPU Architecture Not Supported by the Toolkit (Blackwell)

    • Error: The final build error was Unsupported gpu architecture 'compute_120'. Auto-detection actually worked: the RTX 5060 Ti reports compute capability 12.0 (compute_120), but my CUDA 12.0 toolkit predates Blackwell support (added in CUDA 12.8) and cannot compile for it.
    • Fix: Set the architecture manually to 90, the newest architecture my toolkit supports, using a CMake flag; the driver can still JIT-compile the embedded PTX for the newer GPU. (A diagnostic sketch for checking what your GPU and toolkit support follows this list.)
  6. Problem: GPU Enabled but Very Slow Prompt Evaluation

    • Symptom: After successfully building and offloading layers to the GPU, the initial prompt evaluation was incredibly slow (e.g., 0.27 tokens/sec), though subsequent token generation was fast.
    • Fix: Forced the build to use cuBLAS for matrix multiplications (GGML_CUDA_FORCE_CUBLAS=on) instead of the custom CUDA kernels. This resolved the slow prompt evaluation bottleneck.
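
To diagnose Problems 2, 3, and 5 up front, it helps to compare what the GPU reports against what the installed toolchain can handle. The Python sketch below is a minimal example, not part of the original recipe; it assumes nvidia-smi is new enough to support the compute_cap query field, that nvcc supports --list-gpu-arch, and that the fallback nvcc path matches your system.

    import platform
    import shutil
    import subprocess

    def run(cmd):
        """Run a command and return its stdout, or None if it is missing/fails."""
        try:
            result = subprocess.run(cmd, capture_output=True, text=True, check=True)
            return result.stdout.strip()
        except (OSError, subprocess.CalledProcessError):
            return None

    # Problem 5: what compute capability does the GPU actually report?
    # An RTX 5060 Ti (Blackwell) should report 12.0.
    print("GPU compute capability:",
          run(["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"]))

    # Problems 2 and 5: where is nvcc, and which architectures can it target?
    # If compute_120 is missing from this list, the toolkit predates Blackwell and
    # the build ends in "Unsupported gpu architecture 'compute_120'".
    nvcc = shutil.which("nvcc") or "/usr/lib/cuda/bin/nvcc"  # fallback path is an assumption
    print("nvcc found at:", nvcc)
    print("nvcc-supported architectures:", run([nvcc, "--list-gpu-arch"]))

    # Problem 3: which GCC built this Python, and what is the default g++?
    # Both should be versions the installed CUDA toolkit accepts (g++ 13 was too new here).
    print("Python built with:", platform.python_compiler())
    gxx = run(["g++", "--version"])
    print("Default g++:", gxx.splitlines()[0] if gxx else None)

If compute_120 is absent from the nvcc list, either upgrade the CUDA toolkit or pin CMAKE_CUDA_ARCHITECTURES to an architecture the toolkit does support, as the recipe below does.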

The Final, Working Recipe (Step-by-Step)

Here's the exact sequence of commands that ultimately got llama-cpp-python working optimally with CUDA on my RTX 5060 Ti:

  1. Set up the Python Environment with a Compatible GCC

    • Create a new Conda environment with Python 3.11. This specific Python version from conda-forge often comes pre-built with GCC 11, which is compatible with CUDA 12.x.
    • Install ipython and huggingface-hub for convenience.
    conda create -n llama_cpp python=3.11 -y
    conda activate llama_cpp
    conda install -c conda-forge gxx_linux-64=12 # Ensure GCC 12 is available/preferred for direct use by cmake
    pip install ipython huggingface-hub
    • Verify the Python interpreter's GCC:
      python -c "import platform; print(platform.python_compiler())"
      # Expected output: ... [GCC 11.x.x] on linux
    • Install the necessary build tools (g++-12 comes from apt here, matching the fix for Problem 3):
      sudo apt-get install cmake libgomp1 gcc-12 g++-12 # cmake drives the build; gcc-12/g++-12 match the CUDA 12.x toolkit
  2. Clone the llama-cpp-python Repository and Download Submodules

    git clone https://github.com/abetlen/llama-cpp-python.git
    cd llama-cpp-python
    git submodule update --init --recursive
  3. Compile and Install llama-cpp-python with Full CUDA Support

    • This is the critical step. We pass all the specific CMake flags directly to pip to ensure our custom build configuration is used.
    • GGML_CUDA=on: Enables CUDA support.
    • GGML_CUDA_FORCE_CUBLAS=on: Forces cuBLAS for matrix multiplications; this resolved the slow prompt evaluation issue.
    • CUDAToolkit_ROOT=/usr/lib/cuda: Specifies the correct CUDA Toolkit path.
    • CMAKE_C_COMPILER=gcc-12 and CMAKE_CXX_COMPILER=g++-12: Force specific GCC versions for compilation.
    • CMAKE_CUDA_ARCHITECTURES=90: Targets compute capability 9.0, the newest architecture my CUDA 12.0 toolkit can compile for; the RTX 5060 Ti (compute capability 12.0) then runs the embedded PTX via the driver. With a toolkit that supports Blackwell (CUDA 12.8 or newer), 120 is the native value.
    • FORCE_CMAKE=1: Ensures pip re-runs CMake even if it thinks it's not necessary.
    • --upgrade --force-reinstall --no-cache-dir: Ensures a clean build.
    CMAKE_ARGS="-DGGML_CUDA=on -DGGML_CUDA_FORCE_CUBLAS=on -DCUDAToolkit_ROOT=/usr/lib/cuda -DCMAKE_C_COMPILER=gcc-12 -DCMAKE_CXX_COMPILER=g++-12 -DCMAKE_CUDA_ARCHITECTURES=90" \
    FORCE_CMAKE=1 pip install . --upgrade --force-reinstall --no-cache-dir
    • This step will take some time as it compiles the entire library. (A quick post-install sanity check is sketched after this recipe.)
  4. Test for GPU Acceleration

    • Use the following Python script to load a GGUF model (e.g., Gemma) and confirm GPU layer offloading and performance.
    from llama_cpp import Llama
    
    llm = Llama.from_pretrained(
        repo_id="google/gemma-3-4b-it-qat-q4_0-gguf",
        filename="gemma-3-4b-it-q4_0.gguf",
        n_gpu_layers=-1, # Offload all possible layers to GPU
        verbose=True # Important for seeing detailed loading info
    )
    
    def chat(text: str, messages: list):
        user_message = {'role': 'user', 'content': text}
        messages.append(user_message)
        response = llm.create_chat_completion(messages = messages)
        messages.append(response["choices"][0]["message"])
        return response, messages
    
    print("\nFirst chat (expect longer prompt eval time due to initial setup):")
    response1, messages1 = chat("hello, tell me a joke", [])
    print("\nResponse 1:", response1["choices"][0]["message"]["content"])
    
    print("\nSecond chat (expect much faster prompt eval):")
    response2, messages2 = chat("hello, tell me another joke", messages1) # Using previous context
    print("\nResponse 2:", response2["choices"][0]["message"]["content"])
    • Look for these lines in the output to confirm success:
      • ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
      • llama_model_loader: offloaded X/Y layers to GPU (where X should be equal to Y for -1 n_gpu_layers)
      • Crucially, check that the prompt eval time in the performance metrics for subsequent calls is very low (e.g., milliseconds per token, hundreds of tokens per second). A small end-to-end timing helper follows the sanity check below.
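
Before loading any model, a quick import-level check can confirm that the wheel you just built was actually compiled with CUDA. This is a minimal sketch; it assumes your installed llama-cpp-python version re-exports the low-level llama_supports_gpu_offload binding at the package top level (in some versions it lives under llama_cpp.llama_cpp).

    import llama_cpp

    print("llama-cpp-python version:", llama_cpp.__version__)
    # Expect True for a build compiled with -DGGML_CUDA=on; False means the
    # CMake flags did not take effect and a CPU-only library was built.
    print("GPU offload supported:", llama_cpp.llama_supports_gpu_offload())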

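For a coarse end-to-end number in addition to llama.cpp's own verbose timings, the sketch below wraps create_chat_completion with wall-clock timing and reads the OpenAI-style usage token counts from the response. The per-call tok/s figure mixes prompt evaluation and generation, so the detailed timings printed with verbose=True remain the authoritative breakdown.

    import time
    from llama_cpp import Llama

    llm = Llama.from_pretrained(
        repo_id="google/gemma-3-4b-it-qat-q4_0-gguf",
        filename="gemma-3-4b-it-q4_0.gguf",
        n_gpu_layers=-1,   # offload everything, as in the test script above
        verbose=False,     # set True to see the detailed per-phase timings
    )

    def timed_chat(text: str):
        start = time.perf_counter()
        response = llm.create_chat_completion(
            messages=[{"role": "user", "content": text}]
        )
        elapsed = time.perf_counter() - start
        usage = response["usage"]  # prompt_tokens / completion_tokens / total_tokens
        print(f"{usage['prompt_tokens']} prompt + {usage['completion_tokens']} generated "
              f"tokens in {elapsed:.2f}s (~{usage['total_tokens'] / elapsed:.1f} tok/s overall)")
        return response

    timed_chat("hello, tell me a joke")        # first call includes warm-up cost
    timed_chat("hello, tell me another joke")  # should be noticeably faster
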
This detailed walkthrough should equip others facing similar challenges with the knowledge and exact steps needed to get llama-cpp-python running smoothly on their cutting-edge hardware. Good job, Jonathan!
