SparkBERT: Sparse representations using k-means and BERT
This repository contains a Rust implementation of the algorithm proposed in Efficient sparse retrieval through embedding-based inverted index construction.
Naming: SparkBERT stands for Sparse representations using k-means and BERT.
⚠️ Note: This is an early version of the project and the code is still a work in progress.
TL;DR
Vector search underpins many modern search engines thanks to its strong retrieval quality, and it is widely used when building LLM agents for Retrieval-Augmented Generation (RAG). The most commonly used algorithm for vector indexing is Hierarchical Navigable Small World (HNSW) (see Table 3 in this paper). Its main drawback is high memory consumption. Moreover, Google DeepMind has shown that for a fixed vector dimensionality, once the collection grows beyond a certain number of documents, retrieving all relevant documents becomes mathematically impossible (see this paper).
This work proposes a hybrid retrieval algorithm that combines the quality of deep neural networks such as BERT with the resource efficiency of an inverted index. To achieve this, it uses a so‑called vector vocabulary. Its construction is based on the idea that each word can appear in different usage contexts, which can be captured in the embedding of that word produced by BERT in a given context. By collecting as many such context-aware embeddings as possible, we can cluster them and obtain a set of semantic meanings for the word. The centroids representing these semantic meanings become the elements of the vector vocabulary.
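To make the construction step concrete, here is a minimal sketch of clustering the contextual embeddings of one word into centroids. It uses a naive hand-rolled k-means purely for illustration; the actual implementation clusters with faiss, and all names below are hypothetical:

```rust
// Illustrative sketch only: a tiny k-means over contextual token embeddings.
// The real pipeline uses faiss; `Embedding` and `kmeans` are hypothetical names.

type Embedding = Vec<f32>;

fn squared_dist(a: &Embedding, b: &Embedding) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Cluster all contextual embeddings collected for one word into `k` centroids.
/// The returned centroids become that word's entries in the vector vocabulary.
fn kmeans(embeddings: &[Embedding], k: usize, iters: usize) -> Vec<Embedding> {
    let dim = embeddings[0].len();
    // Naive init: take the first k points as initial centroids.
    let mut centroids: Vec<Embedding> = embeddings.iter().take(k).cloned().collect();
    for _ in 0..iters {
        let mut sums = vec![vec![0.0f32; dim]; k];
        let mut counts = vec![0usize; k];
        for e in embeddings {
            // Assign each embedding to its nearest centroid.
            let nearest = (0..k)
                .min_by(|&i, &j| {
                    squared_dist(e, &centroids[i])
                        .partial_cmp(&squared_dist(e, &centroids[j]))
                        .unwrap()
                })
                .unwrap();
            for (s, x) in sums[nearest].iter_mut().zip(e) {
                *s += x;
            }
            counts[nearest] += 1;
        }
        // Move each centroid to the mean of its assigned embeddings.
        for i in 0..k {
            if counts[i] > 0 {
                centroids[i] = sums[i].iter().map(|s| s / counts[i] as f32).collect();
            }
        }
    }
    centroids
}

fn main() {
    // Toy "contextual embeddings" of one word in 2-D; real ones come from BERT.
    let embeddings = vec![
        vec![0.9, 0.1], vec![1.0, 0.0], vec![0.1, 0.9], vec![0.0, 1.0],
    ];
    let centroids = kmeans(&embeddings, 2, 10);
    println!("{centroids:?}");
}
```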
Figure 1: Score calculation. t_i: token, E_i: embedding, C_i: centroid.
The resulting vector vocabulary is used both during indexing and during search. Note that relevance scoring uses the MaxSim function proposed by the authors of ColBERT, which improves scoring quality (see Figure 1).
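For reference, MaxSim sums, over the query tokens, the best similarity each token achieves against the document-side vectors. A minimal sketch (using a plain dot product as the similarity; function names are illustrative, not the crate's API):

```rust
// Illustrative sketch of MaxSim late-interaction scoring as introduced by ColBERT:
// for every query token embedding, take its maximum similarity against all
// document-side vectors (here, vocabulary centroids) and sum those maxima.

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// MaxSim: sum over query tokens of the max similarity to any document-side vector.
fn max_sim(query_embs: &[Vec<f32>], doc_vecs: &[Vec<f32>]) -> f32 {
    query_embs
        .iter()
        .map(|q| {
            doc_vecs
                .iter()
                .map(|d| dot(q, d))
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .sum()
}

fn main() {
    let query = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let doc = vec![vec![0.9, 0.1], vec![0.2, 0.8]];
    println!("MaxSim score = {}", max_sim(&query, &doc));
}
```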
As a result, the algorithm delivers search quality comparable to HNSW while requiring less memory. In addition, because the index is built from vector vocabulary tokens (which effectively represent each document as a sparse vector in the space of all vocabulary tokens) rather than from a single dense vector, it addresses the limitation described by the Google DeepMind team. Moreover, the use of an inverted index significantly improves the scalability of semantic search.
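The sketch below illustrates the inverted-index side of this idea: each document's sparse representation over vocabulary-token (centroid) ids is stored as posting lists, so a query only needs to touch documents that share at least one token with it. All types and names here are hypothetical and only show the data layout, not the crate's actual implementation:

```rust
// Illustrative sketch only (hypothetical types): an inverted index that maps each
// vocabulary-token (centroid) id to the documents in which it occurs.
use std::collections::HashMap;

type DocId = u64;
type CentroidId = u32;

#[derive(Default)]
struct InvertedIndex {
    // Posting lists: centroid id -> (doc id, weight) pairs.
    postings: HashMap<CentroidId, Vec<(DocId, f32)>>,
}

impl InvertedIndex {
    /// Index one document given its sparse representation over vocabulary tokens.
    fn add(&mut self, doc_id: DocId, sparse: &[(CentroidId, f32)]) {
        for &(centroid, weight) in sparse {
            self.postings.entry(centroid).or_default().push((doc_id, weight));
        }
    }

    /// Collect candidate documents that share at least one vocabulary token with
    /// the query; exact re-scoring (e.g. MaxSim) would then run on these candidates.
    fn candidates(&self, query_tokens: &[CentroidId]) -> Vec<DocId> {
        let mut seen: Vec<DocId> = query_tokens
            .iter()
            .filter_map(|c| self.postings.get(c))
            .flatten()
            .map(|&(doc, _)| doc)
            .collect();
        seen.sort_unstable();
        seen.dedup();
        seen
    }
}

fn main() {
    let mut index = InvertedIndex::default();
    index.add(1, &[(10, 0.7), (42, 0.3)]);
    index.add(2, &[(42, 0.9)]);
    println!("{:?}", index.candidates(&[42]));
}
```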
Usage
Quick start
Since this algorithm uses faiss-rs bindings, which require some additional installation steps, the easiest way to run it is via Docker:
```bash
docker pull ghcr.io/viacheslav-dobrynin/spark-bert:latest
docker run -d -p 8000:8000 --name spark-bert -v $YOUR_PATH:/data/index ghcr.io/viacheslav-dobrynin/spark-bert:latest
```
You will need to wait a bit while the BERT model and vector vocabulary are downloaded from HuggingFace.
After that you can index documents:
```bash
curl -X POST http://localhost:8000/index \
  -H 'Content-Type: application/json' \
  -d '{
    "docs": [
      { "doc_id": 1, "text": "First document text" },
      { "doc_id": 2, "text": "Another document to index" }
    ]
  }'
```
And then search over them:
```bash
curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "text",
    "search_n_neighbors": 3,
    "top_k": 5
  }'
# fields search_n_neighbors and top_k are optional (defaults: 3 and 10)
```
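The same /index and /search endpoints can also be called from Rust. Below is a minimal client sketch assuming the reqwest (blocking, with the json feature) and serde_json crates; these crates are choices made for this example only, not requirements of the server:

```rust
// Minimal sketch: call the /index and /search HTTP endpoints from Rust.
// Assumes reqwest = { version = "0.12", features = ["blocking", "json"] }
// and serde_json in Cargo.toml; these are illustrative choices, not part of spark-bert.
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();

    // Index two documents.
    let resp = client
        .post("http://localhost:8000/index")
        .json(&json!({
            "docs": [
                { "doc_id": 1, "text": "First document text" },
                { "doc_id": 2, "text": "Another document to index" }
            ]
        }))
        .send()?;
    println!("index status: {}", resp.status());

    // Search over them; search_n_neighbors and top_k are optional.
    let results: serde_json::Value = client
        .post("http://localhost:8000/search")
        .json(&json!({ "query": "text", "search_n_neighbors": 3, "top_k": 5 }))
        .send()?
        .json()?;
    println!("{results}");
    Ok(())
}
```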
As Rust lib
Add the spark-bert dependency:
```toml
[dependencies]
spark-bert = "*"
```
The current configuration automatically builds the required faiss version for faiss-rs from source, but this requires additional dependencies listed in INSTALL.md.
Original Implementation
The original implementation used in the paper was written in Python; see the source code repository for details. It also describes the process of building the vector vocabulary.
The model used for both SparkBERT and HNSW is: sentence-transformers/all-MiniLM-L6-v2.
Python+PyLucene Implementation Search Quality
For the MS MARCO dataset:
| Algorithm | NDCG@10 | MRR@10 |
|---|---|---|
| SparkBERT (Lucene based) | 0.25 | 0.20 |
| BM25 (Lucene based) | 0.22 | 0.18 |
| HNSW (Lucene based) | 0.17 | 0.13 |
For the SciFact dataset:
| Algorithm | NDCG@10 | MRR@10 |
|---|---|---|
| SparkBERT (Lucene based) | 0.63 | 0.60 |
| HNSW (Lucene based) | 0.64 | 0.60 |
| ColBERTv2 | 0.69 | 0.66 |
Python+PyLucene Implementation Memory Usage
Memory usage for the MS MARCO dataset:
Figure 2: Comparison of algorithms by memory usage
Current Rust Implementation Benchmarks
Coming soon 👀