The aim of this project is to enable efficient exploration of a corpus of documents through semantic search and visualization.
Currently, Searchica supports only email archives in mbox format. Semantic search is implemented with BERT embeddings, and the stack is Flask (backend) and React (frontend). Sample emails are available for testing, taken from the public release of the Enron emails.
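As a rough illustration of what reading an mbox archive involves, here is a minimal sketch using only Python's standard `mailbox` module. It is not Searchica's actual parser: the function names `read_mbox` and `extract_text` and the dictionary keys are hypothetical, and the sketch handles only plain-text parts (no HTML extraction).

```python
import mailbox
from email.message import Message


def extract_text(msg: Message) -> str:
    """Return the first text/plain body found in the message, or ""."""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            if payload is not None:
                # Fall back to UTF-8 when the part declares no charset.
                charset = part.get_content_charset() or "utf-8"
                return payload.decode(charset, errors="replace")
    return ""


def read_mbox(path: str) -> list[dict]:
    """Parse an mbox file into a list of plain dictionaries."""
    # create=False makes a missing file an error instead of creating it.
    box = mailbox.mbox(path, create=False)
    try:
        return [
            {
                "subject": msg.get("Subject", ""),
                "sender": msg.get("From", ""),
                "body": extract_text(msg),
            }
            for msg in box
        ]
    finally:
        box.close()
```

Calling `read_mbox("sample.mbox")` on an archive (the file name here is a placeholder) returns one dictionary per message, which is a convenient shape for storing fields separately before computing embeddings.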
- Utilizes the msmarco-MiniLM-L6-cos-v5 BERT model
- Implements lazy loading pattern for vector computation
- Caches embeddings in SQLite with JSON serialization
- Cosine similarity computation between query and document vectors
- Weighted field scoring across email components
- Result ranking based on similarity score
- PCA dimensionality reduction for visualization mapping
- Normalized score used to color nodes on plot
- Plotly data structure generation
- Search interface
- Plotly.js visualization integration
- REST API integration
- Email parsing (mbox format)
- Content extraction from plain text and HTML
- Vector embedding computation
- SQLite storage with vector caching
- Query processing and similarity scoring
- 2D projection for visualization
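The caching and scoring steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: the table layout, the `DocumentVector` class, the helper names, and the field weights are all hypothetical, and `embed` stands in for whatever function wraps the msmarco-MiniLM-L6-cos-v5 model. Only the standard library is used, so vectors are plain lists of floats.

```python
import json
import math
import sqlite3
from typing import Callable, Optional

Vector = list[float]


def init_cache(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) cache table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS embeddings ("
        "doc_id TEXT, field TEXT, vector TEXT, PRIMARY KEY (doc_id, field))"
    )


class DocumentVector:
    """Embeds one field lazily: nothing is computed until .vector is read."""

    def __init__(self, conn: sqlite3.Connection, doc_id: str, field: str,
                 text: str, embed: Callable[[str], Vector]) -> None:
        self.conn, self.doc_id, self.field = conn, doc_id, field
        self.text, self.embed = text, embed
        self._vector: Optional[Vector] = None

    @property
    def vector(self) -> Vector:
        if self._vector is None:
            row = self.conn.execute(
                "SELECT vector FROM embeddings WHERE doc_id = ? AND field = ?",
                (self.doc_id, self.field),
            ).fetchone()
            if row is not None:
                # Cache hit: decode the JSON-serialized vector.
                self._vector = json.loads(row[0])
            else:
                # Cache miss: compute once, then store as JSON text.
                self._vector = list(self.embed(self.text))
                self.conn.execute(
                    "INSERT INTO embeddings VALUES (?, ?, ?)",
                    (self.doc_id, self.field, json.dumps(self._vector)),
                )
                self.conn.commit()
        return self._vector


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between a and b; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def weighted_score(query: Vector, fields: dict[str, Vector],
                   weights: dict[str, float]) -> float:
    """Weighted average of per-field similarities (weights are assumptions)."""
    total = sum(weights[name] for name in fields)
    return sum(
        weights[name] * cosine_similarity(query, vec)
        for name, vec in fields.items()
    ) / total


def rank(scores: dict[str, float]) -> list[tuple[str, float]]:
    """Order document ids by score, highest first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Storing vectors as JSON text keeps the cache readable and portable at the cost of some space and parsing time compared with a binary format. Dividing by the total weight keeps the combined score in the same range as a single cosine similarity, which makes it straightforward to normalize for node coloring.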
Pytest suite covering:
- Document vector operations
- Email parsing and storage
- Search functionality
- Visualization processing
git clone [repository]
pip install -r requirements.txt
# For production
# Build frontend
cd frontend
npm run build
# Start backend (the environment variable is set with PowerShell syntax)
$env:FLASK_ENV = "production"
cd backend
python app.py
# For development
# Start frontend
cd frontend
npm run dev
# Start backend (the environment variable is cleared with PowerShell syntax)
$env:FLASK_ENV = ""
cd backend
python app.py
# For deployment (Docker)
# Build frontend
cd frontend
npm run build
# Build Docker
docker build -t searchica .
# Deploy to Google Cloud
docker tag searchica [full Google artifact registry address]:latest
docker push [full Google artifact registry address]:latest
# Update the Google Cloud Run service to use the newly pushed image