AI-Powered Development Environment with Local vLLM Inference
👨‍💻 Author: Shlomo Kashani | 🏫 Affiliation: Johns Hopkins University, Maryland, U.S.A. | 🏢 Organization: QNeura.ai
This project provides a complete Docker-based development environment combining vLLM (high-performance LLM inference), LMCache (KV cache optimization), and OpenCode (AI coding assistant) - all optimized for NVIDIA RTX 5090 on WSL2/Windows and Linux.
The unique feature of this setup is OpenCode running inside the same Docker container as vLLM. This enables:
- ✅ Zero-latency AI coding - OpenCode connects to vLLM via localhost
- ✅ Complete privacy - All inference happens locally, no cloud API calls
- ✅ Cost-free AI assistance - Use your own models (Qwen, Gemma, DeepSeek, etc.)
- ✅ Full control - Switch models, tune parameters, monitor GPU usage
- ✅ RTX 5090 Optimized - Tuned for 32GB GDDR7 and the Blackwell architecture
┌─────────────────────────────────────────────────┐
│                Docker Container                 │
│                                                 │
│  ┌──────────────┐             ┌──────────────┐  │
│  │   OpenCode   │    ←───→    │     vLLM     │  │
│  │              │  localhost  │    Server    │  │
│  │ (AI Coding)  │    :8000    │ (Inference)  │  │
│  └──────────────┘             └──────────────┘  │
│                                      ↓          │
│                             NVIDIA RTX 5090     │
│                               32GB GDDR7        │
└─────────────────────────────────────────────────┘
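Because both processes share the container, OpenCode talks to vLLM over the standard OpenAI-compatible API on localhost:8000. Once the server is up (see the quick start below), any OpenAI-style client can use it the same way; a minimal sketch with the `openai` Python package (the model name must match whatever vLLM is serving):

```python
# Minimal sketch: query the local vLLM server through its OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # vLLM ignores the key

resp = client.chat.completions.create(
    model="Qwen/Qwen3-4B",  # must match the model vLLM is serving
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```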
# Windows
.\build.bat
# Linux/Mac
docker build -t lmcache-vllm:latest --target image-build -f docker/Dockerfile .
.\run-d.bat
bash /workspace/oc-vllm.sh
opencode -m vllm/Qwen/Qwen3-4B

Running large language models on cutting-edge GPUs like the RTX 5090 in WSL2 can be challenging due to dependency management and environment conflicts. This project provides a ready-to-use Docker container that:
- Eliminates setup headaches - Pre-configured with all dependencies
- Enables local AI coding - OpenCode integrated with vLLM for private, fast AI assistance
- Optimizes for RTX 5090 - Tuned settings for 32GB GDDR7 memory
- Works on WSL2/Linux - Seamless support for both platforms
- Includes training tools - Complete data science stack with unified scripts
- CUDA 12.8 runtime optimized for RTX 5090
- Python 3.12 with system-wide packages (no venv restrictions)
- vLLM (nightly) with runai, tensorizer, and flashinfer support
- LMCache for KV cache optimization
- Node.js v18.19.1 and npm 9.2.0
- opencode-ai - AI coding assistant (pre-installed)
- @anthropic-ai/claude-code - Claude Code CLI (pre-installed)
- TypeScript and ts-node (pre-installed)
- Training: peft, trl, accelerate, bitsandbytes
- Core ML: pandas, numpy, scipy, scikit-learn
- Deep Learning: torch, transformers, xformers
- Utilities: jupyter, jupyterlab, gradio, matplotlib, seaborn
- GPU: NVIDIA RTX 5090 with 32GB GDDR7 VRAM (or compatible GPU)
- RAM: Minimum 32GB system RAM (64GB recommended)
- Storage: 100GB+ free space for models and Docker images
- Windows 11 (Build 22000 or later)
- WSL2 with Ubuntu 20.04+ or compatible distribution
- NVIDIA GPU drivers with WSL2 support (550+ series)
- Docker Desktop with WSL2 backend
- Ubuntu 20.04+ or compatible distribution
- NVIDIA GPU drivers (550+ series recommended)
- Docker CE/EE
- Docker 24.0+
- NVIDIA Container Toolkit (nvidia-docker2)
- Python 3.12
- CUDA 12.8
- PyTorch 2.7.0 (installed during build)
- Node.js 18.19.1 & npm 9.2.0
# Windows
.\build.bat
# Linux/Mac
docker build -t lmcache-vllm:latest --target image-build -f docker/Dockerfile .
Build time: ~15-20 minutes
Final image size: ~33.9 GB
Image name: lmcache/vllm-openai:build-latest
# Windows
.\run-d.bat
# Linux/Mac
docker run --gpus all --rm -it \
--shm-size=8gb --ipc=host \
--ulimit memlock=-1 --ulimit stack=67108864 \
--memory=16g --env=DISPLAY \
-p 8000:8000 -p 8078:7842 -p 7861:7860 -p 8502:8501 \
-v $(pwd):/root/sharedfolder \
-v ~/cache:/root/.cache \
  lmcache/vllm-openai:build-latest

Port mappings:
- 8000:8000 - vLLM OpenAI API
- 7861:7860 - Gradio apps
- 8078:7842 - Additional services
- 8502:8501 - Streamlit apps
Inside the container, use the pre-configured script:
bash /workspace/oc-vllm.sh

Or start manually with custom settings:
vllm serve Qwen/Qwen3-4B \
--port 8000 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.8 \
--max-model-len 20000 \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--trust-remote-codeVerify it's running:
curl http://localhost:8000/v1/modelsStart coding with AI:
opencode --model vllm/qwen3-4bThat's it! OpenCode now uses your local vLLM for AI assistance. No cloud, no costs, complete privacy!
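The --enable-auto-tool-choice and --tool-call-parser hermes flags also let the served model emit OpenAI-style tool calls, which is what coding assistants rely on. A sketch of a tool-calling request against the local server (the `run_shell` tool definition is purely illustrative):

```python
# Illustrative tool-calling request against the local vLLM server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "run_shell",  # hypothetical tool, defined only for this example
        "description": "Run a shell command and return its output",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Qwen/Qwen3-4B",
    messages=[{"role": "user", "content": "List the Python files in the current directory."}],
    tools=tools,
)
# With auto tool choice enabled, the model may answer with a structured tool call.
print(resp.choices[0].message.tool_calls)
```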
Locations in Docker:
- /workspace/opencode.json
- /root/.config/opencode/opencode.json
Pre-configured Models:
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"vllm": {
"npm": "@ai-sdk/openai-compatible",
"name": "vLLM (local)",
"options": {
"baseURL": "http://localhost:8000/v1"
},
"models": {
"qwen3-4b": {
"name": "Qwen3 4B (vLLM)",
"contextWindow": 32768,
"maxTokens": 4096
},
"qwen-0.5b": { "name": "Qwen 0.5B", "contextWindow": 32768, "maxTokens": 2048 },
"qwen-14b": { "name": "Qwen 14B", "contextWindow": 32768, "maxTokens": 4096 },
"gemma-4b": { "name": "Gemma 4B", "contextWindow": 8192, "maxTokens": 2048 },
"gemma-9b": { "name": "Gemma 9B", "contextWindow": 8192, "maxTokens": 2048 },
"deepseek-7b": { "name": "DeepSeek 7B", "contextWindow": 16384, "maxTokens": 4096 }
}
}
}
}

The configuration points OpenCode to http://localhost:8000/v1 - your local vLLM server!
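To expose an additional model you serve later, extend this file directly. A minimal sketch (the `qwen3-8b` entry is illustrative, not one of the pre-configured models; remember to keep the workspace copy in sync if you edit only one location):

```python
# Add a model entry to the OpenCode config for the local vLLM provider.
import json
import pathlib

cfg_path = pathlib.Path("/root/.config/opencode/opencode.json")
cfg = json.loads(cfg_path.read_text())

cfg["provider"]["vllm"]["models"]["qwen3-8b"] = {
    "name": "Qwen3 8B (vLLM)",   # illustrative entry, not shipped with the image
    "contextWindow": 32768,
    "maxTokens": 4096,
}

cfg_path.write_text(json.dumps(cfg, indent=2))
```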
Location: /workspace/oc-vllm.sh
Contents:
vllm serve Qwen/Qwen3-4B \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.8 \
--max-model-len 20000 \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--trust-remote-codeUsage:
bash /workspace/oc-vllm.shUse the unified training script:
cd /workspace
# LoRA fine-tuning (Qwen/Gemma/DeepSeek)
python scripts/train.py \
--model_path /root/sharedfolder/mdls/qwen3-4b \
--output_path /root/sharedfolder/output \
--epochs 1
# DeBERTa training (auto-detected)
python scripts/train.py \
--model_path microsoft/deberta-v3-base \
--output_path /root/sharedfolder/output
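Under the hood, LoRA fine-tuning of a causal LM generally follows the pattern sketched below with peft and transformers. This is a hypothetical illustration of the approach, not the actual contents of scripts/train.py:

```python
# Hypothetical minimal LoRA fine-tune; the real scripts/train.py may differ.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_path = "/root/sharedfolder/mdls/qwen3-4b"  # path from the example above
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto", device_map="auto")

# Wrap the base model with a small LoRA adapter (rank and target modules are illustrative).
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM",
                                         target_modules=["q_proj", "v_proj"]))

texts = ["def add(a, b):\n    return a + b", "print('hello world')"]  # toy training data
ds = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512), remove_columns=["text"])

Trainer(
    model=model,
    args=TrainingArguments(output_dir="/root/sharedfolder/output", num_train_epochs=1,
                           per_device_train_batch_size=1, logging_steps=1),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()

model.save_pretrained("/root/sharedfolder/output")  # saves adapter weights only
```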
Use the unified inference script:

cd /workspace
# Single GPU inference
python scripts/infer.py \
--model_path /root/sharedfolder/mdls/qwen3-4b \
--lora_path /root/sharedfolder/output \
--output_file submission.csv
# Multi-GPU inference (auto-configured)
python scripts/infer.py \
--model_path /root/sharedfolder/mdls/qwen-14b \
--output_file submission.csv
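For reference, LoRA-aware batch inference can also be driven directly from vLLM's Python API. The sketch below is illustrative and may differ from how scripts/infer.py is organized:

```python
# Hypothetical offline inference with a LoRA adapter via vLLM's Python API.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="/root/sharedfolder/mdls/qwen3-4b",
          enable_lora=True, gpu_memory_utilization=0.8, max_model_len=20000)
params = SamplingParams(temperature=0.2, max_tokens=256)

prompts = ["### Question: What does this error mean?\nKeyError: 'id'\n### Answer:"]
outputs = llm.generate(prompts, params,
                       lora_request=LoRARequest("my-adapter", 1, "/root/sharedfolder/output"))
for out in outputs:
    print(out.outputs[0].text)
```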
jupyter lab --ip=0.0.0.0 --port=8888 --allow-root --no-browser

Access at: http://localhost:8888

python app-vllm-gradio.py

Access at: http://localhost:7861/ (mapped from container's 7860)
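If you want to build your own front-end, a minimal Gradio app over the local server might look like the sketch below (the bundled app-vllm-gradio.py may be structured differently; assumes `gradio` and `openai` are installed):

```python
# Minimal Gradio chat front-end over the local vLLM server (illustrative sketch).
import gradio as gr
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def chat(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-4B",  # must match the served model
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
    )
    return resp.choices[0].message.content

# Bind to 0.0.0.0:7860 inside the container; the host reaches it on port 7861.
gr.Interface(fn=chat, inputs="text", outputs="text", title="Local vLLM Chat").launch(
    server_name="0.0.0.0", server_port=7860)
```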
Python packages:
pip install <package>
# or
uv pip install --system <package>

npm packages:

npm install -g <package>

No restrictions - install freely!
d-git-vllm/
├── README.md # This file
├── build.bat # Build Docker image (Windows)
├── run-d.bat # Run container with bash (Windows)
├── run_qwen4b.bat # Run container with Qwen-4B
├── oc-vllm.sh # Quick vLLM launcher script
├── opencode.json # OpenCode configuration for vLLM
│
├── assets/ # Logo and screenshots
│ ├── light-mode-logo.svg
│ ├── 01.png, 02.png, 03.png
│
├── docker/
│ ├── Dockerfile # Main Dockerfile
│ ├── BUILD.md # Docker build guide
│ └── CONTAINER_USAGE.md # Container usage reference
│
├── requirements/
│ ├── common.txt # Common Python dependencies
│ ├── cuda.txt # CUDA dependencies (torch, etc.)
│ ├── ds-stack.txt # Data science stack
│ └── build.txt # Build-time dependencies
│
├── scripts/
│ ├── train.py # Unified training script
│ ├── infer.py # Unified inference script
│ ├── README.md # Scripts documentation
│ ├── constants.py # Constants and configuration
│ ├── utils.py # Utility functions
│ └── legacy/ # Archived model-specific scripts
│
├── app-*.py # Gradio/Streamlit apps
│
├── lmcache/ # LMCache Python source
├── csrc/ # CUDA C++ extensions
├── setup.py # LMCache setup
└── pyproject.toml # Python project metadata
Documentation:
├── OPENCODE_SETUP.md # Detailed OpenCode setup guide
├── NPM_AI_TOOLS.md # AI tools reference
└── DOCKER_CHANGES.md # Summary of Docker changes
| Path | Description |
|---|---|
| /workspace/ | Main working directory |
| /workspace/opencode.json | OpenCode configuration |
| /workspace/oc-vllm.sh | vLLM launcher script |
| /workspace/scripts/ | Unified training/inference scripts |
| /workspace/app-*.py | Gradio/Streamlit apps |
| /root/.config/opencode/ | OpenCode config directory |
| /root/sharedfolder/ | Mounted host directory (%USERPROFILE%/dev/) |
| /root/.cache/ | Hugging Face cache |
- OpenCode config: /root/.config/opencode/opencode.json
- OpenCode workspace copy: /workspace/opencode.json
- vLLM launcher: /workspace/oc-vllm.sh

- Host models: %USERPROFILE%/dev/mdls/ → /root/sharedfolder/mdls/
- Host data: %USERPROFILE%/dev/ → /root/sharedfolder/
- HF cache: %USERPROFILE%/dev/mdls/.cache → /root/.cache/
Pre-configured in opencode.json:
| Model | Context | Max Tokens | Best For | Memory (Est.) |
|---|---|---|---|---|
| Qwen3 4B | 32k | 4k | General coding, balanced performance | ~8 GB |
| Qwen 0.5B | 32k | 2k | Fast responses, simple tasks | ~2 GB |
| Qwen 14B | 32k | 4k | Complex reasoning, multi-GPU | ~28 GB |
| Gemma 4B | 8k | 2k | Efficient coding, small footprint | ~8 GB |
| Gemma 9B | 8k | 2k | Advanced tasks, Google ecosystem | ~18 GB |
| DeepSeek 7B | 16k | 4k | Code-focused, debugging | ~14 GB |
This setup is optimized for NVIDIA RTX 5090 (32GB GDDR7):
- CUDA Version: 12.8 (latest)
- Flash Attention: Enabled with latest optimizations
- Tensor Cores: Automatically utilized
- Architecture List: 10.0+PTX 12.0+PTX (Blackwell, including the RTX 5090's compute capability 12.0)
- CUDA Arch (Flash Attn): 120
--gpu-memory-utilization 0.8 # Use 80% of 32GB (25.6GB)
--tensor-parallel-size 1 # Single GPU (can use 2-4 for larger models)
--max-model-len 20000 # Optimize for RTX 5090 memory
--enforce-eager                 # Predictable memory usage

| Model Size | Tensor Parallel | GPU Memory | Recommended |
|---|---|---|---|
| 0.5B - 7B | 1 | ~2-14 GB | ✅ Single RTX 5090 |
| 7B - 13B | 1 | ~14-26 GB | ✅ Single RTX 5090 |
| 14B - 30B | 2 | ~28-60 GB | 🔄 Needs 2x RTX 5090 |
| 30B+ | 4 | 60GB+ | 🔄 Needs 4x RTX 5090 |
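The GPU-memory columns in the tables above are rough estimates. For a quick back-of-the-envelope check, add weight memory to per-sequence KV-cache memory; the helper below uses placeholder layer/head counts, not the exact configs of the listed models:

```python
# Rough VRAM estimate: weights + KV cache. Real usage also includes activations,
# the CUDA context, and vLLM's preallocated cache, so treat this as a lower bound.
def estimate_vram_gb(params_b: float, ctx_len: int, n_layers: int,
                     n_kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> float:
    weights = params_b * 1e9 * bytes_per_elem                                    # fp16/bf16 weights
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem   # K and V, one sequence
    return (weights + kv_cache) / 1e9

# Illustrative call for a ~4B-parameter GQA model at a 20k context
# (layer/head counts here are placeholders, not an exact model config).
print(f"~{estimate_vram_gb(4.0, 20_000, 36, 8, 128):.1f} GB")
```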
| Variable | Default | Description |
|---|---|---|
| MAX_JOBS | 2 | Number of parallel build jobs |
| NVCC_THREADS | 8 | CUDA compilation threads |
| TORCH_CUDA_ARCH_LIST | '10.0+PTX 12.0+PTX' | Target CUDA architectures |
| FLASH_ATTN_CUDA_ARCHS | 120 | FlashAttention architectures |
| GPU_MEMORY_UTILIZATION | 0.8 | GPU memory usage ratio |
# 1. Start container
.\run-d.bat
# 2. Inside container, start vLLM
bash /workspace/oc-vllm.sh
# 3. Open new terminal, exec into same container
docker exec -it <container-id> bash
# 4. Authenticate OpenCode (first time only)
opencode auth login
# 5. Start coding with AI
opencode --provider vllm --model qwen3-4b
# Now you have AI-powered coding with your local model!

# 1. Train a LoRA adapter
cd /workspace
python scripts/train.py \
--model_path /root/sharedfolder/mdls/qwen3-4b \
--output_path /root/sharedfolder/lora-output
# 2. Run inference with trained LoRA
python scripts/infer.py \
--model_path /root/sharedfolder/mdls/qwen3-4b \
--lora_path /root/sharedfolder/lora-output \
--output_file results.csv
# 3. Deploy with vLLM
vllm serve /root/sharedfolder/mdls/qwen3-4b \
--port 8000 \
--enable-lora \
--lora-modules lora-1=/root/sharedfolder/lora-output

# Terminal 1: Run Qwen3-4B for coding
vllm serve Qwen/Qwen3-4B --port 8000
# Terminal 2: Use OpenCode
opencode --provider vllm --model qwen3-4b
# Terminal 3: Run Gemma-9B on different port
vllm serve google/gemma-2-9b --port 8001
# Terminal 4: Test both models
curl http://localhost:8000/v1/chat/completions -d '{...}'
curl http://localhost:8001/v1/chat/completions -d '{...}'

| Model | Memory Usage | Tokens/s | Max Context | Batch Size |
|---|---|---|---|---|
| Qwen3 4B | ~8 GB | 80-120 | 32K | 1-4 |
| Gemma 9B | ~18 GB | 50-80 | 8K | 1-2 |
| Qwen 14B | ~28 GB | 40-60 | 32K | 1 |
| DeepSeek 7B | ~14 GB | 60-90 | 16K | 1-2 |
Benchmarks are estimates for RTX 5090. Actual performance may vary based on specific use case, context length, and batch size.
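To sanity-check a multi-model setup like the one in the workflow above, you can send the same prompt to both servers and compare the responses; a small sketch (the model names must match what each vLLM instance is serving):

```python
# Send the same prompt to two vLLM instances (ports 8000 and 8001) and compare.
from openai import OpenAI

for port, model in [(8000, "Qwen/Qwen3-4B"), (8001, "google/gemma-2-9b")]:
    client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="not-needed")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "In one sentence, what does a KV cache store?"}],
        max_tokens=64,
    )
    print(f"port {port} ({model}): {resp.choices[0].message.content}")
```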
Problem: Connection refused or ECONNREFUSED
Solution:
# 1. Verify vLLM is running
curl http://localhost:8000/v1/models
# 2. Check if port 8000 is listening
netstat -tlnp | grep 8000
# 3. Check opencode.json has correct URL
cat ~/.config/opencode/opencode.json
# Should show: "baseURL": "http://localhost:8000/v1"

Problem: CUDA not available or No GPU found
Solution:
# Check GPU is visible
nvidia-smi
# Verify Docker has GPU access
docker run --gpus all nvidia/cuda:12.8.0-base-ubuntu22.04 nvidia-smi
# For WSL2: Update WSL2 kernel
wsl --update --web-download
# Restart container with --gpus all flag
.\run-d.bat

Problem: CUDA out of memory during inference
Solution:
# Reduce GPU memory utilization
vllm serve <model> --gpu-memory-utilization 0.6
# Reduce max model length
vllm serve <model> --max-model-len 10000
# Use smaller batch sizes
vllm serve <model> --max-num-seqs 1
# Enable memory defragmentation
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
# Use smaller model
vllm serve Qwen/Qwen3-0.5B  # Instead of 4B

Problem: Port already in use
Solution:
# Change default ports
-p 8001:8000 # API server
-p 7862:7860 # Gradio UI
# Or kill process using the port (Windows)
netstat -ano | findstr :8000
taskkill /PID <process-id> /F

Problem: Can't download models from HuggingFace
Solution:
# Set HuggingFace cache
export HF_HOME=/root/.cache/huggingface
# Login to HuggingFace (if needed)
huggingface-cli login
# Use mirror (for regions with slow access)
export HF_ENDPOINT=https://hf-mirror.com

Problem: Can't install packages
Solution:
# This Docker has NO restrictions - you can install freely
pip install <any-package>
npm install -g <any-package>
# No virtual env needed!
# All packages install system-wide

- OPENCODE_SETUP.md - Complete OpenCode setup guide
- NPM_AI_TOOLS.md - AI coding tools reference
- DOCKER_CHANGES.md - Summary of Docker modifications
- docker/BUILD.md - Docker build documentation
- docker/CONTAINER_USAGE.md - Container usage guide
- scripts/README.md - Training & inference scripts guide
We welcome contributions! Please follow these guidelines:
- Fork the repository
- Create a feature branch
- Test thoroughly on both WSL2 and Linux
- Submit a pull request with detailed description
# Install development dependencies
pip install -r requirements/build.txt
# Run tests
python -m pytest tests/

Feel free to:

- Add more models to opencode.json
- Create custom training scripts in scripts/
- Build Gradio/Streamlit apps as app-*.py
- Share your optimizations for RTX 5090
This project combines multiple open-source components:
- vLLM: Apache 2.0 License
- LMCache: Apache 2.0
- PyTorch: BSD-style License
- NVIDIA Components: NVIDIA Software License
- OpenCode: Check npm package license
- Individual models: Check model cards (Qwen, Gemma, DeepSeek)
- vLLM Team - High-performance LLM inference (GitHub)
- LMCache Team - KV cache optimization
- OpenCode - AI coding assistant
- Anthropic - Claude Code CLI
- Model Authors:
- Qwen (Alibaba Cloud)
- Gemma (Google DeepMind)
- DeepSeek (DeepSeek AI)
Your Code → Cloud API → $$$$ → Response
(Slow, Expensive, Privacy Concerns)
Your Code → OpenCode → localhost:8000 → vLLM → RTX 5090 → Response
(Fast, Free, Private, Unlimited)
You get enterprise-grade AI coding on your local machine with zero ongoing costs and complete privacy. All powered by your RTX 5090!
- 💰 Zero Cost: No API fees, no subscriptions - unlimited use
- 🔒 Complete Privacy: Your code never leaves your machine
- ⚡ Zero Latency: Localhost connection = instant responses
- 🎛️ Full Control: Switch models, tune parameters, customize everything
- 🚀 High Performance: RTX 5090's 32GB GDDR7 enables large context windows
- 🛠️ Dev-Friendly: Jupyter, training scripts, Gradio apps all included
If you use this project or vLLM in your research, please cite:
@inproceedings{kwon2023efficient,
title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
year={2023}
}

- ✨ NEW: OpenCode AI coding assistant integrated
- ✨ NEW: LMCache for KV cache optimization
- ✨ NEW: Unified training and inference scripts
- ✨ NEW: Complete data science stack (jupyter, gradio, peft, trl)
- ✨ NEW: Pre-configured for 6 models
- 🔧 Removed pip restrictions (free package installation)
- 🔧 Node.js 18.19 + npm 9.2 added
- 🔧 TypeScript and development tools included
- 🚀 Docker container for RTX 5090 WSL2 deployment
- 🚀 vLLM optimized for 5090 architecture
- 🚀 Seamless WSL2 integration
- 🚀 CUDA 12.8, PyTorch 2.7.0
- 🚀 Video captioning demo applications
Built with ❤️ for high-performance AI inference on cutting-edge hardware.
Happy Coding with Local AI! 🎉
For questions or issues, check the documentation files or open an issue.