A Retrieval-Augmented Generation (RAG) system designed to help analysts and customers query consumer financial complaints from the CFPB dataset using natural language. Built with LangChain, ChromaDB, Sentence Transformers, and Streamlit.
- End-to-end pipeline from raw complaint data to interactive semantic search.
- Data cleaning, narrative filtering, and chunking for optimized embeddings.
- Vector search with `ChromaDB`, using metadata on each chunk for traceability (see the indexing sketch below).
- Question answering using `MiniLM` for embeddings and `Mistral-7B-Instruct` for LLM generation.
- Clean, interactive chat interface built with `Streamlit`.
- Modular design with reproducible scripts and Docker support.
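The core of the pipeline is a chunk → embed → index step. Below is a minimal sketch of that flow, assuming the raw CFPB column names (`Consumer complaint narrative`, `Complaint ID`, `Product`), a `complaints` collection name, and current `chromadb`/LangChain APIs; the shipped logic in `src/chunking_embedding_indexing.py` may differ in chunk size, batching, and naming.

```python
# Hedged sketch of Task 2 (chunk -> embed -> index); not the shipped script.
import pandas as pd
import chromadb
from langchain_text_splitters import RecursiveCharacterTextSplitter  # import path varies by LangChain version
from sentence_transformers import SentenceTransformer

df = pd.read_csv("data/filtered_complaints.csv")

# Overlapping chunks keep long narratives retrievable without losing context.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

client = chromadb.PersistentClient(path="vector_store/chroma_db")
collection = client.get_or_create_collection("complaints")  # assumed collection name

for _, row in df.iterrows():
    chunks = splitter.split_text(row["Consumer complaint narrative"])  # assumed column name
    if not chunks:
        continue
    collection.add(
        ids=[f"{row['Complaint ID']}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embedder.encode(chunks).tolist(),
        # Metadata makes every retrieved chunk traceable back to its complaint.
        metadatas=[{"complaint_id": str(row["Complaint ID"]), "product": row["Product"]}
                   for _ in chunks],
    )
```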
```
rag_project/
├── app.py                              # Streamlit interface (Task 4)
├── Dockerfile                          # Docker container setup
├── requirements.txt                    # Python dependencies
├── data/
│   └── filtered_complaints.csv         # Cleaned data ready for chunking
├── vector_store/
│   └── chroma_db/                      # Persistent ChromaDB store
├── notebooks/
│   └── eda_preprocessing.ipynb         # Task 1: EDA + cleaning notebook
├── src/
│   ├── chunking_embedding_indexing.py  # Task 2: chunk + embed + index
│   ├── rag_pipeline.py                 # Task 3: retrieval + generation logic (sketched below)
│   └── utils.py                        # Optional helpers
└── .dockerignore
```
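For orientation, here is a hedged sketch of the retrieval-plus-generation flow that `src/rag_pipeline.py` implements; the function names, prompt wording, and the commented-out Mistral loading are illustrative assumptions, not the shipped code.

```python
# Hedged sketch of Task 3 (retrieve relevant chunks, then generate an answer).
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path="vector_store/chroma_db")
collection = client.get_or_create_collection("complaints")  # assumed collection name


def retrieve(question: str, k: int = 5):
    """Return the top-k complaint chunks (and their metadata) for a question."""
    query_embedding = embedder.encode([question]).tolist()
    results = collection.query(query_embeddings=query_embedding, n_results=k)
    return results["documents"][0], results["metadatas"][0]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Ground the LLM in the retrieved complaint text only."""
    context = "\n\n".join(chunks)
    return (
        "You are a financial-complaints analyst. Answer the question using only "
        f"the context below.\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Generation with Mistral-7B-Instruct (requires a GPU or a hosted endpoint):
# from transformers import pipeline
# generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")
# chunks, sources = retrieve("Why are customers unhappy with credit card fees?")
# answer = generator(build_prompt("Why are customers unhappy with credit card fees?", chunks),
#                    max_new_tokens=256)[0]["generated_text"]
```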
```bash
# Install dependencies
pip install -r requirements.txt

# Task 1: run the EDA and preprocessing notebook
jupyter notebook notebooks/eda_preprocessing.ipynb

# Task 2: chunk, embed, and index the complaints
python src/chunking_embedding_indexing.py

# Task 3: exercise the retrieval + generation pipeline
python src/rag_pipeline.py

# Task 4: launch the Streamlit interface
streamlit run app.py

# Or run everything in Docker
docker build -t creditrust-rag .
docker run -p 8501:8501 creditrust-rag
```

The app will be available at http://localhost:8501.
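A minimal sketch of what the Streamlit interface in `app.py` might look like; `answer_question` is a hypothetical entry point exposed by the RAG pipeline, not necessarily the real function name.

```python
# Hedged sketch of the Task 4 chat interface; the shipped app.py may differ.
import streamlit as st

from src.rag_pipeline import answer_question  # hypothetical entry point

st.title("Consumer Complaint Assistant")

question = st.text_input("Ask a question about consumer financial complaints")
if question:
    with st.spinner("Retrieving relevant complaints..."):
        answer, sources = answer_question(question)  # assumed (answer, chunks) return shape
    st.write(answer)
    with st.expander("Retrieved complaint chunks"):
        for chunk in sources:
            st.markdown(f"> {chunk}")
```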
Check `evaluation.md` or the final report for the following (a sketch of a helper that can generate the table appears after this list):
- A table of 5–10 test questions
- Quality scores from 1–5
- Example retrieved chunks
- Generation analysis
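As one way to produce that table, here is a small hedged helper that runs the test questions through the pipeline and prints a markdown skeleton for manual 1–5 scoring; it again assumes the hypothetical `answer_question` entry point, and the questions shown are placeholders.

```python
# Hedged sketch of an evaluation helper; scores in evaluation.md are assigned manually.
from src.rag_pipeline import answer_question  # hypothetical entry point

test_questions = [
    "What issues do customers report most often with credit card billing?",
    "How do consumers describe problems with mortgage servicing?",
    # ... 5-10 questions in total
]

print("| Question | Answer (truncated) | Top retrieved chunk (truncated) | Score (1-5) |")
print("|---|---|---|---|")
for q in test_questions:
    answer, sources = answer_question(q)  # assumed (answer, chunks) return shape
    print(f"| {q} | {answer[:80]}... | {sources[0][:80]}... |  |")
```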
- CFPB Public Complaint Dataset
- Hugging Face Transformers
- LangChain
- ChromaDB
- Streamlit
- Add support for feedback-driven retraining
- Swap in larger LLMs via LangChain integration (e.g., Llama 3)
- Add support for citation highlighting
- Deploy as a cloud API