Inference of Meta's LLaMA model (and others) in pure C/C++.
The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware.
- Plain C/C++ implementation without any dependencies
- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads MTT GPUs via MUSA)
- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity (see the example after this list)
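As a brief illustration of the hybrid CPU+GPU mode, layer offloading is controlled with the `-ngl` (`--n-gpu-layers`) option of the command-line tools; the model path, prompt and layer count below are placeholders.

```bash
# offload 20 transformer layers to the GPU and keep the remaining layers on the CPU
llama-cli -m model.gguf -p "Hello" -ngl 20
```

The fewer layers are offloaded, the less VRAM is needed, at the cost of generation speed; this is how models larger than the available VRAM can still run.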
This fork was created using these instructions, based on gcc 8.5 and nvcc 10.2. To use it, you will need the following software packages installed. The section "Install prerequisites" describes the process in detail, and a sketch of the basic install commands for the smaller packages follows the list below. Of these, the installation of gcc 8.5 and cmake 3.27 might take several hours.
- Nvidia CUDA Compiler nvcc 10.2 (check with `nvcc --version`)
- GCC and CXX (g++) 8.5 (check with `gcc --version`)
- cmake >= 3.14 (check with `cmake --version`)
- nano, curl, libcurl4-openssl-dev, python3-pip and jtop
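On a Debian/Ubuntu-based system such as NVIDIA JetPack, the smaller packages from the list can be installed roughly as follows; this is only a sketch, gcc 8.5, CUDA and cmake are covered separately in "Install prerequisites", and jtop is assumed to come from the jetson-stats pip package, which is how it is usually distributed.

```bash
# editor, curl tool and headers, and pip
sudo apt update
sudo apt install -y nano curl libcurl4-openssl-dev python3-pip

# jtop (Jetson monitoring tool) ships in the jetson-stats package
sudo pip3 install -U jetson-stats
```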
We need to add a few extra flags to the recommended first instruction `cmake -B build`, otherwise the compilation stops with several errors like `Target "ggml-cuda" requires the language dialect "CUDA17" (with compiler extensions).`. There will also be a few `warning: constexpr if statements are a C++17 feature` messages during the second instruction, but we can ignore them. Let's start with the first one:
```bash
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON -DCMAKE_CUDA_STANDARD=14 -DCMAKE_CUDA_STANDARD_REQUIRED=true -DGGML_CPU_ARM_ARCH=armv8-a -DGGML_NATIVE=off
```

And 15 seconds later we're ready for the last step, the instruction that will take 85 minutes to have llama.cpp compiled:
```bash
cmake --build build --config Release
```

Now you can use the binaries from the `build/bin` folder. If you want to make the binaries globally available, add this to your `~/.bashrc` file:
```bash
export PATH="$PATH:$HOME/Llama.cpp/build/bin"
```

llama-server is a lightweight, OpenAI API compatible HTTP server for serving LLMs.
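Before trying the server examples below, a quick sanity check that the new binaries are picked up from the PATH set above can look like this (a sketch; `--version` prints the version and build info and exits):

```bash
# reload the shell configuration and confirm the build is on the PATH
source ~/.bashrc
llama-server --version
```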
- Start a local HTTP server with default configuration on port 8080

  ```bash
  llama-server -m model.gguf --port 8080

  # Basic web UI can be accessed via browser: http://localhost:8080
  # Chat completion endpoint: http://localhost:8080/v1/chat/completions
  ```

- Support multiple-users and parallel decoding

  ```bash
  # up to 4 concurrent requests, each with 4096 max context
  llama-server -m model.gguf -c 16384 -np 4
  ```

- Enable speculative decoding

  ```bash
  # the draft.gguf model should be a small variant of the target model.gguf
  llama-server -m model.gguf -md draft.gguf
  ```

- Serve an embedding model

  ```bash
  # use the /embedding endpoint
  llama-server -m model.gguf --embedding --pooling cls -ub 8192
  ```

- Serve a reranking model

  ```bash
  # use the /reranking endpoint
  llama-server -m model.gguf --reranking
  ```

- Constrain all outputs with a grammar

  ```bash
  # custom grammar
  llama-server -m model.gguf --grammar-file grammar.gbnf

  # JSON
  llama-server -m model.gguf --grammar-file grammars/json.gbnf
  ```
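Since the server exposes an OpenAI-compatible API, the chat completion endpoint mentioned in the first example above can be exercised with a plain `curl` call against a running llama-server; this is a sketch, and the prompt text is just a placeholder.

```bash
# send a chat request to the OpenAI-compatible endpoint of a running llama-server
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user",   "content": "Hello, what can you do?"}
        ]
      }'
```

For the grammar example, the file passed to `--grammar-file` uses llama.cpp's GBNF notation; a minimal sketch that restricts every reply to "yes" or "no" could look like this (the file name matches the placeholder above):

```bash
# write a tiny GBNF grammar and start the server constrained by it
cat > grammar.gbnf <<'EOF'
root ::= "yes" | "no"
EOF
llama-server -m model.gguf --grammar-file grammar.gbnf
```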
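The bundled `grammars/json.gbnf` used in the last example forces all output to be valid JSON, which is handy when the responses are parsed by other tools.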
