A Decentralized Retrieval Augmented Generation System with Source Reliabilities Secured on Blockchain
TL;DR: We built dRAG, a decentralized RAG system that addresses data reliability challenges in real-world settings
Paper on arXiv
Our dRAG system can be abstracted into three components: Decentralized Data Sources, LLM Service, and Decentralized Blockchain Network. Details can be found in Section 5 of the paper.
We provide detailed instructions for starting the retrieval service, LLM service, and interacting with the smart contract. For your convenience, we also provide a one-line command to launch the entire dRAG system.
To start all services using Docker Compose:
```bash
docker compose up -d
```

This will:
- Start the contract service (a smart contract deployed on local Hardhat node for testing)
- Start all three data sources simultaneously (data-source-0, data-source-20, data-source-100), each assigned a testing private key provided by Hardhat
- Start the LLM service once all data sources are healthy. The LLM service is also assigned a private key.
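The startup order described above corresponds roughly to a Compose file like the following sketch. The service names come from this repo's logs; the build paths, port, and healthcheck details are assumptions, so check the repo's actual docker-compose.yml:

```yaml
services:
  contract:                # local Hardhat node with the DragScores contract
    build: ./drag_contract
  data-source-0:           # data-source-20 and data-source-100 follow the same pattern
    build: ./drag_data_source
    healthcheck:           # hypothetical health endpoint and port
      test: ["CMD", "curl", "-f", "http://localhost:8001/health"]
      interval: 5s
  llm-service:
    build: ./drag_llm_service
    depends_on:
      data-source-0:
        condition: service_healthy   # LLM service waits for healthy data sources
```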
To view logs:
```bash
docker compose logs -f
```

To stop all services:
```bash
docker compose down
```

Once all Docker containers are up and running, you can use the test.ipynb Jupyter notebook to interact with the system. The notebook contains:
- Health check tests for all services
- Example queries to test the RAG system
- Integration tests to verify end-to-end functionality (see Reliable-dRAG-anonymous/drag_llm_service/README.md for endpoint descriptions)
To use the notebook:
```bash
jupyter notebook test.ipynb
```

Make sure all services are healthy before running the tests. You can check the service status with:
```bash
docker compose ps
```

If you wish to test the system against a public smart contract, there is an example deployment on Sepolia (a public Ethereum testnet) at 0x5F67901BC1A22010BA438EDa426A70d0B5eA17Be. A Sepolia wallet private key and an Infura API key are needed to connect to the testnet.
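Wiring up the Sepolia credentials might look like the sketch below. The environment-variable names are assumptions (the repo may read credentials differently); the contract address is the example deployment quoted above, and the URL follows Infura's standard Sepolia endpoint format:

```python
import os
import re

# Hypothetical env-var names; the repo may expect a .env file or CLI flags instead.
INFURA_API_KEY = os.environ.get("INFURA_API_KEY", "")
SEPOLIA_PRIVATE_KEY = os.environ.get("SEPOLIA_PRIVATE_KEY", "")

# Example DragScores deployment on Sepolia (from this README).
DRAG_SCORES_ADDRESS = "0x5F67901BC1A22010BA438EDa426A70d0B5eA17Be"

def sepolia_rpc_url(api_key: str) -> str:
    """Build the Infura JSON-RPC endpoint URL for the Sepolia testnet."""
    return f"https://sepolia.infura.io/v3/{api_key}"

def looks_like_private_key(key: str) -> bool:
    """Cheap sanity check: a 32-byte hex string, with optional 0x prefix."""
    return re.fullmatch(r"(0x)?[0-9a-fA-F]{64}", key) is not None
```

A web3 client (e.g. web3.py) pointed at `sepolia_rpc_url(INFURA_API_KEY)` can then sign transactions to DragScores with the wallet key; never commit a funded private key to the repo.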
The dRAG system provides real-time monitoring of source scores (usefulness and reliability) over queries.
The visualization demonstrates how the system learns and adapts to source quality over time, with sources showing different performance characteristics based on their pollution levels and content quality.
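To make "learns and adapts to source quality" concrete, here is an illustrative update rule. This is not the paper's actual scoring formula (see Section 5 for that); it is a generic exponential moving average, shown only to convey how repeated query outcomes could push a reliable source's score up and a polluted source's score down:

```python
def update_score(prev: float, observed: float, alpha: float = 0.1) -> float:
    """Exponential moving average of per-query outcomes, clipped to [0, 1].

    `observed` is 1.0 when the source contributed useful, correct context
    for a query and 0.0 otherwise (an illustrative convention, not the
    paper's definition).
    """
    new = (1 - alpha) * prev + alpha * observed
    return min(1.0, max(0.0, new))

# Both sources start at a neutral 0.5; the reliable one answers correctly
# on every query, the polluted one only once.
reliable, polluted = 0.5, 0.5
for outcome_reliable, outcome_polluted in [(1.0, 0.0), (1.0, 0.0), (1.0, 1.0), (1.0, 0.0)]:
    reliable = update_score(reliable, outcome_reliable)
    polluted = update_score(polluted, outcome_polluted)
# After these queries, `reliable` has drifted above 0.5 and `polluted` below it.
```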
```
data/                 // Synthetic polluted datasets for experiments
drag_contract/        // Hardhat project + Solidity DragScores contract
drag_data_source/     // Dockerized retrieval service
drag_llm_service/     // Dockerized LLM orchestrator service
drag_python_client/   // Minimal Python client for DragScores
result/               // Result example for live visualization
```

