Nano-vLLM

A lightweight vLLM implementation built from scratch.

Key Features

  • 🚀 Fast offline inference - Inference speeds comparable to vLLM
  • 📖 Readable codebase - Clean implementation under 1,200 lines of Python code
  • ⚡ Optimization Suite - Prefix caching, Torch compilation, CUDA graphs, etc.

Installation

pip install git+https://github.com/GeeeekExplorer/nano-vllm.git

Quick Start

See example.py for usage. The API mirrors vLLM's interface with minor differences in the LLM.generate method.
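
The snippet below is a minimal usage sketch, assuming the package exposes vLLM-style LLM and SamplingParams classes and that generate returns plain dictionaries; the exact parameter names and return format should be confirmed against example.py.

```python
# Minimal usage sketch (vLLM-style API assumed; see example.py for the exact interface).
from nanovllm import LLM, SamplingParams

# Load a local model checkpoint (path and flags are illustrative).
llm = LLM("/path/to/Qwen3-0.6B", enforce_eager=True)

# Sampling settings mirror vLLM's SamplingParams.
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)

prompts = ["Explain prefix caching in one sentence."]
outputs = llm.generate(prompts, sampling_params)

# Assumed to return one dict per prompt containing the generated text.
print(outputs[0]["text"])
```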

Benchmark

See bench.py for the benchmark script; a simplified sketch of the setup follows the test configuration below.

Test Configuration:

  • Hardware: RTX 4070
  • Model: Qwen3-0.6B
  • Total Requests: 256 sequences
  • Input Length: Randomly sampled between 100–1024 tokens
  • Output Length: Randomly sampled between 100–1024 tokens
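
The sketch below shows how a run with this configuration could be timed, assuming generate accepts token-id prompts and a per-request list of SamplingParams, as vLLM does; bench.py is the authoritative script, and the model path and token values here are illustrative.

```python
# Simplified benchmark sketch (bench.py is the reference; names and values are illustrative).
import time
from random import randint, seed

from nanovllm import LLM, SamplingParams

seed(0)
num_seqs = 256  # total requests, as in the test configuration

llm = LLM("/path/to/Qwen3-0.6B")

# Input and output lengths randomly sampled between 100 and 1024 tokens.
prompt_token_ids = [
    [randint(0, 10000) for _ in range(randint(100, 1024))] for _ in range(num_seqs)
]
sampling_params = [
    # ignore_eos is assumed here so each request emits exactly max_tokens tokens.
    SamplingParams(temperature=0.6, ignore_eos=True, max_tokens=randint(100, 1024))
    for _ in range(num_seqs)
]

start = time.time()
llm.generate(prompt_token_ids, sampling_params)
elapsed = time.time() - start

total_output_tokens = sum(sp.max_tokens for sp in sampling_params)
print(f"{total_output_tokens} tokens in {elapsed:.2f} s "
      f"({total_output_tokens / elapsed:.2f} tokens/s)")
```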

Performance Results:

Inference Engine   Output Tokens   Time (s)   Throughput (tokens/s)
vLLM               133,966         98.95      1353.86
Nano-vLLM          133,966         101.90     1314.65
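
Throughput is output tokens divided by wall-clock time: for vLLM, 133,966 tokens / 98.95 s ≈ 1354 tokens/s, and for Nano-vLLM, 133,966 / 101.90 ≈ 1315 tokens/s, a gap of roughly 3%.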
