Raw RAG: Retrieval-Augmented Generation from Scratch

Introduction

Retrieval-Augmented Generation (RAG) is a powerful technique that enhances language models by retrieving relevant information from a knowledge base before generating responses. This repository demonstrates how to implement RAG using simple Python functions and libraries, without relying on complex frameworks like LangChain or LlamaIndex.
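
The retrieve-then-generate flow described above can be sketched in a few lines of plain Python. This is a toy illustration, not code from the notebooks: `tokenize`, `retrieve`, `generate_stub`, and `rag_answer` are hypothetical names, retrieval is simple word overlap rather than embeddings, and the generator is a stand-in for a real language model call.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query, documents, k=2):
    """Rank documents by how many distinct query words they share (toy retriever)."""
    query_terms = set(tokenize(query))
    scored = [(len(query_terms & set(tokenize(doc))), i) for i, doc in enumerate(documents)]
    # Highest overlap first; ties keep the original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [documents[i] for score, i in scored[:k] if score > 0]


def generate_stub(prompt):
    """Stand-in for a language model call (assumption: swap in a real client here)."""
    return f"[model response to a prompt of {len(prompt)} characters]"


def rag_answer(query, documents, generate=generate_stub, k=2):
    """Retrieve context first, then pass context plus question to the generator."""
    context = retrieve(query, documents, k=k)
    prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}\nAnswer:"
    return context, generate(prompt)


docs = [
    "BM25 is a ranking function based on term frequency.",
    "Embeddings map text to dense vectors.",
    "Jupyter notebooks mix code and prose.",
]
context, answer = rag_answer("What is BM25 ranking?", docs)
```

Passing the generator in as a plain function keeps the retrieval step testable on its own, which is the kind of control the rest of this README argues for.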

The goal of this project is to educate developers on building a RAG system from the ground up, providing a deeper understanding of the underlying processes and greater control over the implementation.

Why "Raw" RAG?

While libraries like LangChain and LlamaIndex offer quick setup and implementation of RAG systems, there are several advantages to building a RAG pipeline from scratch:

Deterministic Results: Controlling each step of the process makes outcomes more consistent and reproducible.
Enhanced Control: Understanding and implementing each component allows for fine-tuned adjustments and optimizations.
Reduced Dependencies: Minimizing external libraries reduces potential security vulnerabilities and version conflicts.
Transparency: A "raw" approach makes the entire pipeline more inspectable and understandable.
Customization: Easily modify and extend the system to fit specific use cases without library constraints.
Learning Opportunity: Building from scratch provides invaluable insights into the RAG process and its components.

Updates

I've made another simple app for RAG evaluation: Rag Eval. Please check it out. Drop in a JSON output file and it will evaluate the results.

Repository Contents

This repository contains several Jupyter notebooks demonstrating different aspects of building a RAG system:

raw_rag_01_basics.ipynb: Basic RAG implementation with embedding and retrieval
raw_rag_02_no_embed.ipynb: Retrieval techniques, such as BM25, without embeddings
raw_rag_03_clearer_ocr.ipynb: Document preprocessing and OCR enhancement
raw_rag_04_summarize.ipynb: Text summarization techniques
raw_rag_05_pydantic_is_all_you_need.ipynb: Implementing JSON parsing for structured output
raw_rag_06_metadata.ipynb: Metadata extraction
raw_rag_07_memory.ipynb: Memory implementation
raw_rag_08_evaluation.ipynb: Evaluation metrics for RAG
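
To give a flavor of embedding-free retrieval, the topic of raw_rag_02_no_embed.ipynb, here is a minimal BM25 sketch in plain Python. It is not taken from the notebook. The function names and the defaults `k1=1.5` and `b=0.75` are assumptions (commonly used values), and the IDF term uses the smoothed, non-negative variant `ln((N - n + 0.5) / (n + 0.5) + 1)`.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    if not documents:
        return []
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Average document length; fall back to 1 if every document is empty.
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs or 1.0
    # Number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = tokenize(query)

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        # Longer-than-average documents are penalized through this term.
        length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            n = doc_freq[term]
            idf = math.log((n_docs - n + 0.5) / (n + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


documents = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "bm25 ranks documents by term frequency",
]
scores = bm25_scores("cat mat", documents)
best = max(range(len(documents)), key=lambda i: scores[i])
```

Note that this sketch does no stemming, so "cats" in the second document does not match the query term "cat".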

More to come ...

Getting Started

Prerequisites

Python 3.9+
Jupyter Notebook or JupyterLab

Installation

Clone this repository:

git clone https://github.com/yourusername/raw-rag.git
cd raw-rag

Create a virtual environment:

python -m venv venv
source venv/bin/activate

Install the required dependencies. Although the notebooks contain their own pip install cells, installing everything at once is recommended:
```
pip install -r requirements.txt
```
Create a .env file in the root directory by copying .env.example, then fill in the environment variables the notebooks need:
```
cp .env.example .env
```
Start Jupyter Notebook or JupyterLab:
```
jupyter notebook
```
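
If you want to check that the values in your .env file are readable from Python, a small standard-library loader is enough. This is a sketch under assumptions, not the mechanism the notebooks use: `load_env_file` is a hypothetical helper, it only handles simple `KEY=VALUE` lines, and `EXAMPLE_API_KEY` is a made-up variable name. A dedicated package such as python-dotenv handles more edge cases.

```python
import os
import tempfile


def load_env_file(path=".env", override=False):
    """Read KEY=VALUE lines from a file into os.environ and return them as a dict."""
    loaded = {}
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            # Skip blank lines, comments, and anything without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            key = key.strip()
            value = value.strip().strip("'\"")  # drop surrounding quotes
            loaded[key] = value
            # Existing environment values win unless override is requested.
            if override or key not in os.environ:
                os.environ[key] = value
    return loaded


# Demo with a throwaway file; EXAMPLE_API_KEY is a made-up variable name.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, ".env")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("# comment line\nEXAMPLE_API_KEY='abc123'\n")
    settings = load_env_file(demo_path)
```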

Usage

Each notebook in the repository is self-contained and includes detailed explanations. To get started, open any notebook and run the cells sequentially. For example:

# Example from raw_rag.ipynb
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def simple_retrieval(query_embedding, document_embeddings, k=5):
    similarities = cosine_similarity([query_embedding], document_embeddings)[0]
    top_k_indices = np.argsort(similarities)[-k:][::-1]
    return top_k_indices, similarities[top_k_indices]

# Use this function in your RAG pipeline
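
Once a retrieval function like `simple_retrieval` has returned the top-k indices and their similarities, the matching chunks still have to be placed into a prompt. The sketch below shows one way to do that in plain Python. `build_prompt`, its prompt wording, and the sample data are assumptions for illustration, not code from the notebooks.

```python
def build_prompt(query, documents, top_k_indices, scores=None):
    """Assemble a prompt from retrieved chunks, in the order the indices are given."""
    lines = ["Answer the question using only the context below.", "", "Context:"]
    for rank, idx in enumerate(top_k_indices, start=1):
        idx = int(idx)  # accepts NumPy integers as well as plain ints
        label = f"[{rank}]"
        if scores is not None:
            label += f" (similarity {float(scores[rank - 1]):.2f})"
        lines.append(f"{label} {documents[idx]}")
    lines += ["", f"Question: {query}", "Answer:"]
    return "\n".join(lines)


documents = [
    "Paris is the capital of France.",
    "BM25 is a ranking function.",
    "Cats sleep a lot.",
]
# Indices and scores are hard-coded here; in a pipeline they would come from retrieval.
prompt = build_prompt(
    "What is the capital of France?", documents, [0, 2], scores=[0.91, 0.12]
)
```

Numbering the chunks makes it easy to ask the model to cite which passage supported its answer.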

Benefits over Library-based Approaches

Full Control: Understand and modify every aspect of the RAG pipeline.
Performance Optimization: Fine-tune each component for your specific use case.
Minimal Overhead: Avoid unnecessary features and dependencies of larger libraries.
Easy Debugging: Quickly identify and fix issues in your implementation.
Flexible Integration: Easily incorporate the RAG system into existing projects.
Educational Value: Gain a deep understanding of RAG principles and implementation details.

Todo

Add additional advanced memory techniques
Implement more agentic approaches to RAG
More advanced document pre-processing techniques
GraphRAG implementation
If you have any suggestions, please open an issue or submit a pull request.

Contributing

Contributions to this project are welcome! Please follow these steps:

Fork the repository
Create a new branch for your feature
Implement your changes
Write or update tests as necessary
Submit a pull request with a clear description of your changes

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

All contributors to RAG research and development
The open-source community for providing valuable tools and libraries

Remember, while this "raw" approach offers many advantages, libraries like LangChain and LlamaIndex still have their place in rapid prototyping and development. The goal here is to provide an alternative that promotes understanding and control over the RAG process.

yudataguy/RawRAG
