A comprehensive course on building and improving Retrieval-Augmented Generation (RAG) systems through systematic evaluation and optimization. This repository contains course materials that supplement the popular Maven course. For additional free materials, visit improvingrag.com.
This course teaches you to move beyond trial-and-error RAG development through a data-driven approach. You'll learn to:
- Build robust evaluation frameworks to measure RAG performance objectively
- Fine-tune embedding models for 15-30% performance improvements
- Understand user query patterns through topic modeling and classification
- Enhance retrieval with structured metadata and SQL integration
- Implement sophisticated tool selection and orchestration
RAG systems often fail to meet user needs because developers lack systematic approaches to improvement. This course provides:
- Objective Measurement: Learn to distinguish real improvements from random variation
- Targeted Optimization: Identify exactly where your system fails and why
- Production-Ready Techniques: Apply methods proven in real-world applications
- End-to-End Coverage: From basic retrieval to complex multi-tool orchestration
Learn the fundamental tools for the course: Jupyter Notebooks, LanceDB for vector search, and Pydantic Evals for systematic evaluation.
Key Skills: Vector databases, hybrid search, evaluation frameworks
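As a quick preview, a minimal LanceDB round trip looks like the sketch below; the database path, table name, and records are illustrative rather than taken from the course notebooks:

```python
import lancedb

# Connect to a local LanceDB database (the directory is created if missing)
db = lancedb.connect("data/lancedb")

# Create a table from plain dicts; "vector" holds the embedding
table = db.create_table(
    "docs",
    data=[
        {"vector": [0.1, 0.2, 0.3], "text": "How do refunds work?"},
        {"vector": [0.9, 0.1, 0.0], "text": "Shipping takes 3-5 days."},
    ],
)

# Nearest-neighbor search against a query embedding
results = table.search([0.1, 0.2, 0.25]).limit(1).to_pandas()
print(results["text"].iloc[0])
```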
Build a comprehensive evaluation framework using synthetic data generation, retrieval metrics, and statistical validation.
Key Skills: Synthetic question generation, recall@k, MRR@k, bootstrapping, statistical significance testing
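To make the metrics concrete, here is a minimal sketch of recall@k, MRR@k, and a bootstrapped confidence interval for a mean score; the function names are our own, not an API from the course:

```python
import random

def recall_at_k(relevant_id, retrieved_ids, k):
    """1.0 if the relevant chunk appears in the top-k results, else 0.0."""
    return 1.0 if relevant_id in retrieved_ids[:k] else 0.0

def mrr_at_k(relevant_id, retrieved_ids, k):
    """Reciprocal rank of the relevant chunk within the top-k, else 0.0."""
    for rank, doc_id in enumerate(retrieved_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def bootstrap_ci(scores, n_resamples=1000, seed=42):
    """95% bootstrap confidence interval for the mean of per-query scores."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    return means[int(0.025 * n_resamples)], means[int(0.975 * n_resamples)]
```

Comparing the confidence intervals of two retrieval configurations is what lets you tell a real improvement from random variation.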
Fine-tune embedding models using both managed services (Cohere) and open-source approaches (sentence-transformers) for significant performance gains.
Key Skills: Hard negative mining, triplet loss training, model deployment
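For the open-source path, a fine-tuning loop with the classic sentence-transformers `model.fit` API might look like the sketch below; the base model and the (anchor, positive, hard-negative) triplets are placeholders for your own data:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

# Placeholder triplets: (query, relevant chunk, hard negative)
triplets = [
    ("reset my password", "How to reset a forgotten password", "Password policy rules"),
    # ... mined from your evaluation data
]

model = SentenceTransformer("all-MiniLM-L6-v2")
train_examples = [InputExample(texts=[a, p, n]) for a, p, n in triplets]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model)

# A single epoch is enough to smoke-test the pipeline; tune for real gains
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
model.save("models/finetuned-minilm")
```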
Apply topic modeling to discover user query patterns and build classification systems for ongoing monitoring.
Key Skills: BERTopic, query classification, pattern discovery, satisfaction analysis
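A minimal BERTopic run over logged user queries looks like the sketch below; the queries are placeholders, and in practice BERTopic needs hundreds of documents to form stable topics:

```python
from bertopic import BERTopic

# Placeholder queries; substitute a large sample from production logs
queries = [
    "How do I return an item?",
    "Where is my refund?",
    "Track my order status",
    # ... many more
]

topic_model = BERTopic(min_topic_size=10)
topics, probs = topic_model.fit_transform(queries)

# Inspect discovered topics and their top keywords
print(topic_model.get_topic_info())
```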
Enhance RAG with structured metadata filtering, SQL integration, and PDF parsing for handling complex queries.
Key Skills: Metadata extraction, hybrid retrieval, Text-to-SQL, document parsing
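As one example of hybrid retrieval, LanceDB can combine vector similarity with a SQL-style predicate over metadata; the `products` table, `category` column, and `embed` helper below are assumptions for illustration:

```python
import lancedb

db = lancedb.connect("data/lancedb")
table = db.open_table("products")  # assumed table with a "category" column

query_embedding = embed("red summer dress")  # assumed embedding helper

# Combine vector similarity with a structured metadata filter
results = (
    table.search(query_embedding)
    .where("category = 'dresses'")  # SQL-style metadata predicate
    .limit(5)
    .to_pandas()
)
```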
Evaluate and improve tool selection in multi-tool RAG systems through systematic testing and prompting strategies.
Key Skills: Tool orchestration, precision/recall for tools, few-shot prompting
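A straightforward way to score tool selection is set-based precision and recall over the tools chosen for each query; this helper is a sketch, not code from the course:

```python
def tool_precision_recall(expected, predicted):
    """Precision/recall for one query's tool selection.

    expected:  set of tool names the query actually requires
    predicted: set of tool names the model selected
    """
    true_positives = len(expected & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall

# The model picked one correct tool and one spurious one
p, r = tool_precision_recall({"search_docs", "query_sql"}, {"search_docs", "web_search"})
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.50
```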
Before starting, please read:
- Notebook Versions Guide - Explains the different notebook versions (standard, logfire, modal)
- Setup Verification - Run this first to verify your environment
- Python 3.9 (required for BERTopic dependency)
- Basic knowledge of Python, machine learning concepts, and APIs
- API keys for various services (see Environment Setup)
First, install uv if you haven't already:
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
Then create a virtual environment and install dependencies:
```bash
# Create a virtual environment with Python 3.9
uv venv --python 3.9

# Activate the virtual environment
source .venv/bin/activate  # On macOS/Linux
# or
.venv\Scripts\activate     # On Windows

# Install all dependencies
uv sync

# Install the course package in editable mode
pip install -e .
```
Copy `.env.example` to `.env` and add your API keys:

```bash
cp .env.example .env
```
Required API keys:
- `COHERE_API_KEY`: Production key (not trial) from Cohere
- `OPENAI_API_KEY`: From OpenAI
- `HF_TOKEN`: Write-enabled token from Hugging Face
- `LOGFIRE_TOKEN`: From Pydantic Logfire
- `BRAINTRUST_API_KEY`: From Braintrust
Load environment variables in notebooks:
```python
from dotenv import find_dotenv, load_dotenv

load_dotenv(find_dotenv())
```
By completing this course, you'll be able to:
- Measure What Matters: Build evaluation frameworks that objectively measure RAG performance
- Improve Systematically: Apply data-driven techniques instead of random experimentation
- Handle Complex Queries: Support queries requiring metadata filtering, SQL access, and multi-tool coordination
- Deploy with Confidence: Verify improvements are statistically significant before production
- Scale Effectively: Apply these techniques to any RAG application or domain
Each week contains 2-4 Jupyter notebooks with hands-on exercises. Notebooks include:
- Detailed explanations of concepts
- Working code examples
- Visualization of results
- Best practices and tips
- Bird-Bench Text-to-SQL dataset for evaluation
- Synthetic transaction data for embedding fine-tuning
- Klarna FAQ pages for query understanding
- Clothing dataset for metadata extraction
- 70+ commands for tool selection evaluation
The `office_hours/` directory contains transcripts and summaries from live sessions, providing additional insights and Q&A content.
For contributors, install pre-commit hooks:
```bash
pip install pre-commit
pre-commit install
```
These hooks ensure code quality through:
- Black formatting
- Ruff linting with auto-fixes
- YAML validation
- Large file prevention
When running code in Python files instead of notebooks, wrap async calls with `asyncio.run()`:

```python
import asyncio

asyncio.run(main())  # main() is your async entry point
```
Notebooks include built-in visualizations. If running outside Jupyter, you may need to call `plt.show()` explicitly for matplotlib plots.
- Report issues at GitHub Issues
- Visit improvingrag.com for additional resources
- Join the Maven course for live instruction and community access
Special thanks to Dmitry Labazkin for frequent feedback and contributions to improve the notebooks.
Note: This is an advanced course assuming familiarity with LLMs and basic RAG concepts. For beginners, we recommend starting with introductory materials on vector databases and semantic search before diving into systematic improvement techniques.