A powerful tool for semantically searching through your Telegram chat history using natural language processing and vector embeddings.
Telegram Semantic Search allows you to import your Telegram chat export files and perform semantic (meaning-based) searches through your message history. Unlike traditional keyword search, semantic search understands the meaning behind your query and returns messages that are conceptually similar, even if they don't contain the exact keywords.
- Import Telegram chat export files (JSON format)
- Generate vector embeddings for all messages using transformer models
- Perform semantic searches with adjustable similarity thresholds
- View conversation context around search results
- Filter results by specific contacts
- Modern web interface with responsive design
- The application uses transformer models (like BERT variants) to convert messages into high-dimensional vector embeddings
- These embeddings capture the semantic meaning of each message
- When you search, your query is converted to a vector using the same model
- PostgreSQL with pgvector extension finds messages with similar vectors using cosine similarity
- Results are ranked by similarity and displayed in the web interface (see the sketch below)
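Conceptually, the embedding step looks like the minimal sketch below. This is illustrative only, not the application's actual code; the model name matches the default from the configuration section, and the example messages are made up:

```python
# Minimal sketch of the embedding step (illustrative, not the app's actual code)
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("ai-forever/ru-en-RoSBERTa")  # default model in .env

messages = ["Let's meet at the station at 6", "The deployment failed again"]
query = "when are we meeting?"

message_vectors = model.encode(messages)  # one vector per message, stored in the DB
query_vector = model.encode(query)        # the query is embedded with the same model

# Cosine similarity between the query and every message; higher means more similar
print(util.cos_sim(query_vector, message_vectors))
```

In the application itself, the message vectors live in PostgreSQL and the comparison is done by pgvector, as shown in the SQL query further down.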
- Python 3.8+
- Node.js 14+ and npm
- PostgreSQL 12+ with pgvector extension
- GPU support is optional but recommended for faster processing
git clone https://github.com/ryletko/telegram-semantic-search.git
cd telegram-semantic-search
# Windows
python -m venv venv
venv\Scripts\activate
# Linux/macOS
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Ensure C++ support in Visual Studio is installed, and run:
call "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars64.bat"
Note: The exact path will vary depending on your Visual Studio version and edition
Then use `nmake` to build:
set "PGROOT=C:\Program Files\PostgreSQL\16"
cd %TEMP%
git clone --branch v0.8.0 https://github.com/pgvector/pgvector.git
cd pgvector
nmake /F Makefile.win
nmake /F Makefile.win install
See the pgvector installation notes if you run into issues. You can also install pgvector with Docker or conda-forge.
# Install PostgreSQL
sudo apt update
sudo apt install postgresql postgresql-contrib
# Install pgvector from source
git clone --branch v0.8.0 https://github.com/pgvector/pgvector.git
cd pgvector
make
sudo make install
# Enable the extension
sudo -u postgres psql -c "CREATE EXTENSION vector;"
# Using Homebrew
brew install postgresql
# Install pgvector
brew install pgvector
# Start PostgreSQL service
brew services start postgresql
# Enable the extension
psql postgres -c "CREATE EXTENSION vector;"
# Connect to PostgreSQL
psql -U postgres
# Create the database
CREATE DATABASE telegram_search;
# Enable the pgvector extension in the new database (extensions are per-database)
\c telegram_search
CREATE EXTENSION IF NOT EXISTS vector;
\q
Copy the example environment file and update it with your settings:
cp .env.example .env
Edit the `.env` file with your database credentials and other settings:
# Database configuration
DB_HOST=localhost
DB_PORT=5432
DB_NAME=telegram_search
DB_USER=postgres
DB_PASSWORD=your_password
# Flask configuration
FLASK_ENV=development
FLASK_DEBUG=1
# Application settings
UPLOAD_FOLDER=uploads
DEFAULT_MODEL=ai-forever/ru-en-RoSBERTa
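For reference, a minimal sketch of how the backend could read these variables with python-dotenv; the variable names are the ones listed above, but the project's actual loading code may differ:

```python
# Sketch of reading the .env settings (the project's actual code may differ)
import os
from dotenv import load_dotenv

load_dotenv()  # loads .env from the project root

DATABASE_URL = (
    f"postgresql://{os.getenv('DB_USER')}:{os.getenv('DB_PASSWORD')}"
    f"@{os.getenv('DB_HOST')}:{os.getenv('DB_PORT')}/{os.getenv('DB_NAME')}"
)
UPLOAD_FOLDER = os.getenv("UPLOAD_FOLDER", "uploads")
DEFAULT_MODEL = os.getenv("DEFAULT_MODEL", "ai-forever/ru-en-RoSBERTa")
```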
The application consists of a Flask backend and a Vue.js frontend. You can start both simultaneously using the provided start script:
# Activate virtual environment if not already activated
# Windows: venv\Scripts\activate
# Linux/macOS: source venv/bin/activate
# Start the application
python start.py
This will:
- Start the Flask backend server on port 5000
- Start the Vue.js development server on port 5173
- Open your default web browser to the application
Alternatively, you can start the components separately:
# Start backend
python app.py
# In a separate terminal, start frontend
cd frontend
npm install # Only needed first time
npm run dev
- Open Telegram Desktop
- Go to Settings > Advanced > Export Telegram data
- Select "JSON" as the format and choose which chats to export
- Download the export file (its structure is sketched below)
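The export is a single JSON file (typically `result.json`). A hedged sketch of reading it follows; the field names are the ones commonly found in Telegram Desktop exports and should be treated as assumptions, not as part of this project's API:

```python
# Sketch of reading a Telegram Desktop JSON export (field names are typical
# for such exports, not guaranteed by this project)
import json

with open("result.json", encoding="utf-8") as f:
    export = json.load(f)

print(export.get("name"), export.get("type"))  # chat name and chat type

for msg in export.get("messages", []):
    text = msg.get("text", "")
    # "text" can be a list mixing plain strings and entity dicts; flatten it
    if isinstance(text, list):
        text = "".join(p if isinstance(p, str) else p.get("text", "") for p in text)
    if text:
        print(msg.get("date"), msg.get("from"), text[:60])
```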
- Open the Telegram Semantic Search application in your browser
- Click "Import Chat"
- Select your Telegram export JSON file
- Wait for the import to complete (this may take some time for large chats)
- Enter a search query in the search box
- Adjust similarity threshold if needed (lower values return more results)
- View results ranked by semantic similarity
- Click on a result to see the conversation context
- Backend: Flask (Python) with SQLAlchemy
- Frontend: Vue.js with Tailwind CSS
- Database: PostgreSQL with pgvector extension
- Embedding Models: Sentence Transformers (BERT variants)
The application uses three main tables:
- `imports`: Stores metadata about imported chat exports
  - id (UUID): Primary key
  - timestamp: Import time
  - chat_name: Name of the chat
  - chat_id: Telegram chat ID
  - type: Type of chat (private, group, etc.)
  - model_name: Embedding model used
- `messages`: Stores individual messages with embeddings
  - id: Message ID
  - import_id: Foreign key to the imports table
  - text: Message content
  - date: Message timestamp
  - is_self: Whether the message is from the user
  - embedding: Vector representation (1024 dimensions)
  - from_id: Sender ID
  - from_name: Sender name
- `message_chunks`: Stores chunks of messages for more granular embedding
  - id: Chunk ID (auto-incremented)
  - message_id: Foreign key to the messages table
  - import_id: Foreign key to the imports table
  - text: Chunk content
  - embedding: Vector representation (1024 dimensions)
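The tables above map naturally onto SQLAlchemy models with the pgvector column type. A minimal sketch of the `messages` table follows; the actual models in the codebase may differ in details such as keys and column types:

```python
# Sketch of the messages table as a SQLAlchemy model (details may differ from
# the project's actual models)
from sqlalchemy import BigInteger, Boolean, Column, DateTime, ForeignKey, Text
from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.orm import declarative_base
from pgvector.sqlalchemy import Vector

Base = declarative_base()

class Message(Base):
    __tablename__ = "messages"

    id = Column(BigInteger, primary_key=True)                         # Telegram message ID
    import_id = Column(UUID(as_uuid=True), ForeignKey("imports.id"))  # owning import
    text = Column(Text)
    date = Column(DateTime)
    is_self = Column(Boolean)
    embedding = Column(Vector(1024))  # 1024-dimensional embedding
    from_id = Column(Text)
    from_name = Column(Text)
```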
The application uses pgvector's cosine distance operator (`<=>`) to find semantically similar messages; similarity is computed as `1 - distance`. The SQL query looks like this, with `query_vector`, `import_id`, and `min_similarity` supplied as bound parameters:
SELECT
m.id,
m.text,
m.date,
m.from_id,
m.from_name,
1 - (m.embedding <=> query_vector) as similarity,
m.is_self
FROM
messages m
WHERE
m.import_id = import_id
AND 1 - (m.embedding <=> query_vector) > min_similarity
ORDER BY
similarity DESC
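A hedged sketch of executing that query from Python with psycopg and the pgvector adapter; the connection details and the helper function are placeholders, not the application's actual code:

```python
# Sketch of running the vector search with bound parameters (placeholders only)
import psycopg
from pgvector.psycopg import register_vector

def search(conn, query_vector, import_id, min_similarity=0.5, limit=20):
    register_vector(conn)  # adapt numpy arrays to the pgvector type
    return conn.execute(
        """
        SELECT m.id, m.text, m.date, m.from_id, m.from_name,
               1 - (m.embedding <=> %(q)s) AS similarity, m.is_self
        FROM messages m
        WHERE m.import_id = %(import_id)s
          AND 1 - (m.embedding <=> %(q)s) > %(min_similarity)s
        ORDER BY similarity DESC
        LIMIT %(limit)s
        """,
        {"q": query_vector, "import_id": import_id,
         "min_similarity": min_similarity, "limit": limit},
    ).fetchall()

# Example: conn = psycopg.connect("dbname=telegram_search user=postgres")
```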
- Database connection errors:
  - Verify PostgreSQL is running
  - Check your database credentials in the `.env` file
  - Ensure the pgvector extension is installed
- Import failures:
  - Verify your Telegram export is in JSON format
  - Check that the file is not corrupted
  - Ensure you have sufficient disk space
- Slow performance:
  - Consider using a GPU for faster embedding generation
  - Adjust the batch size in the import service (see the sketch below)
  - Optimize PostgreSQL settings for your hardware
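For the batch size and GPU points, a minimal sketch of the relevant Sentence Transformers knobs; the import service's actual code may differ:

```python
# Sketch of speeding up embedding generation with a GPU and larger batches
from sentence_transformers import SentenceTransformer

texts = ["message one", "message two"]  # in practice, all imported messages

model = SentenceTransformer("ai-forever/ru-en-RoSBERTa", device="cuda")  # or "cpu"
embeddings = model.encode(
    texts,
    batch_size=64,            # larger batches are faster if memory allows
    show_progress_bar=True,
)
```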
- Sentence Transformers for embedding models
- pgvector for vector similarity search in PostgreSQL
- Flask and Vue.js for the backend and frontend frameworks