Inference of Meta's LLaMA model (and others) in pure C/C++.
The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware.
- Plain C/C++ implementation without any dependencies
- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads MTT GPUs via MUSA)
- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity (see the example after this list)
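As a brief illustration of the hybrid CPU+GPU mode, layer offloading is controlled with the `-ngl` (`--n-gpu-layers`) option of the command-line tools; the model path, prompt and layer count below are placeholders.

```bash
# offload 20 transformer layers to the GPU and keep the remaining layers on the CPU
llama-cli -m model.gguf -p "Hello" -ngl 20
```

The fewer layers are offloaded, the less VRAM is needed, at the cost of generation speed; this is how models larger than the available VRAM can still run.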
This fork was created using these instructions, based on gcc 8.5 and nvcc 10.2. To use it, you will need the following software packages installed. The section "Install prerequisites" describes the process in detail, and a sketch of the basic install commands for the smaller packages follows the list below. Of these, the installation of gcc 8.5 and cmake 3.27 might take several hours.
- Nvidia CUDA Compiler nvcc 10.2 (check with `nvcc --version`)
- GCC and CXX (g++) 8.5 (check with `gcc --version`)
- cmake >= 3.14 (check with `cmake --version`)
- nano, curl, libcurl4-openssl-dev, python3-pip and jtop
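On a Debian/Ubuntu-based system such as NVIDIA JetPack, the smaller packages from the list can be installed roughly as follows; this is only a sketch, gcc 8.5, CUDA and cmake are covered separately in "Install prerequisites", and jtop is assumed to come from the jetson-stats pip package, which is how it is usually distributed.

```bash
# editor, curl tool and headers, and pip
sudo apt update
sudo apt install -y nano curl libcurl4-openssl-dev python3-pip

# jtop (Jetson monitoring tool) ships in the jetson-stats package
sudo pip3 install -U jetson-stats
```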
We need to add a few extra flags to the recommended first instruction `cmake -B build`, otherwise the compilation stops with several errors like `Target "ggml-cuda" requires the language dialect "CUDA17" (with compiler extensions).`. There will also be a few `warning: constexpr if statements are a C++17 feature` messages during the second instruction, but we can ignore them. Let's start with the first one:
```bash
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON -DCMAKE_CUDA_STANDARD=14 -DCMAKE_CUDA_STANDARD_REQUIRED=true -DGGML_CPU_ARM_ARCH=armv8-a -DGGML_NATIVE=off
```

And 15 seconds later we're ready for the last step, the instruction that will take 85 minutes to have llama.cpp compiled:
```bash
cmake --build build --config Release
```

Now you can use the binaries from the `build/bin` folder. If you want to make the binaries globally available, add this to your `~/.bashrc` file:
```bash
export PATH="$PATH:$HOME/Llama.cpp/build/bin"
```

llama-server is a lightweight, OpenAI API compatible HTTP server for serving LLMs.
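Before trying the server examples below, a quick sanity check that the new binaries are picked up from the PATH set above can look like this (a sketch; `--version` prints the version and build info and exits):

```bash
# reload the shell configuration and confirm the build is on the PATH
source ~/.bashrc
llama-server --version
```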
- Start a local HTTP server with default configuration on port 8080

  ```bash
  llama-server -m model.gguf --port 8080

  # Basic web UI can be accessed via browser: http://localhost:8080
  # Chat completion endpoint: http://localhost:8080/v1/chat/completions
  ```

- Support multiple-users and parallel decoding

  ```bash
  # up to 4 concurrent requests, each with 4096 max context
  llama-server -m model.gguf -c 16384 -np 4
  ```

- Enable speculative decoding

  ```bash
  # the draft.gguf model should be a small variant of the target model.gguf
  llama-server -m model.gguf -md draft.gguf
  ```

- Serve an embedding model

  ```bash
  # use the /embedding endpoint
  llama-server -m model.gguf --embedding --pooling cls -ub 8192
  ```

- Serve a reranking model

  ```bash
  # use the /reranking endpoint
  llama-server -m model.gguf --reranking
  ```

- Constrain all outputs with a grammar

  ```bash
  # custom grammar
  llama-server -m model.gguf --grammar-file grammar.gbnf

  # JSON
  llama-server -m model.gguf --grammar-file grammars/json.gbnf
  ```
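Since the server exposes an OpenAI-compatible API, the chat completion endpoint mentioned in the first example above can be exercised with a plain `curl` call against a running llama-server; this is a sketch, and the prompt text is just a placeholder.

```bash
# send a chat request to the OpenAI-compatible endpoint of a running llama-server
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user",   "content": "Hello, what can you do?"}
        ]
      }'
```

For the grammar example, the file passed to `--grammar-file` uses llama.cpp's GBNF notation; a minimal sketch that restricts every reply to "yes" or "no" could look like this (the file name matches the placeholder above):

```bash
# write a tiny GBNF grammar and start the server constrained by it
cat > grammar.gbnf <<'EOF'
root ::= "yes" | "no"
EOF
llama-server -m model.gguf --grammar-file grammar.gbnf
```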
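The bundled `grammars/json.gbnf` used in the last example forces all output to be valid JSON, which is handy when the responses are parsed by other tools.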
