A Modular Asymmetric Vector Similarity Search Approach with Redis as the Vector Database
Semantic search seeks to improve search accuracy by understanding the semantic meaning of the search query and the corpus to search over. Semantic search can also perform well given synonyms, abbreviations, and misspellings, unlike keyword search engines that can only find documents based on lexical matches.
The idea behind semantic search is to embed all entries in your corpus, whether they be sentences, paragraphs, or documents, into a vector space. At search time, the query is embedded into the same vector space and the closest embeddings from your corpus are found. These entries should have a high semantic similarity with the query.
- For symmetric semantic search, your query and the entries in your corpus are of about the same length and have the same amount of content.
- For asymmetric semantic search, you usually have a short query (like a question or some keywords) and you want to find a longer paragraph answering the query, as in the sketch below.
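To make the asymmetric case concrete, here is a minimal sketch that scores a short question against longer passages. It assumes the multi-qa-MiniLM-L6-cos-v1 checkpoint, a Sentence Transformers model trained for query/passage retrieval; the sentences are illustrative only.

from sentence_transformers import SentenceTransformer, util

# multi-qa-MiniLM-L6-cos-v1 is trained for asymmetric (query -> passage) retrieval
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')

query = "How do vector databases store data?"  # short query
passages = [
    "A vector database stores, manages and indexes high-dimensional vector data as arrays of numbers clustered by similarity.",
    "Keyword search engines can only find documents based on lexical matches.",
]

query_emb = model.encode(query)
passage_embs = model.encode(passages)

# util.cos_sim returns a 1 x len(passages) tensor of cosine scores
print(util.cos_sim(query_emb, passage_embs))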
A vector database stores, manages and indexes high-dimensional vector data. Data points are stored as arrays of numbers called “vectors,” which are clustered based on similarity. This design enables low-latency queries, making it ideal for AI applications.
Unlike traditional relational databases with rows and columns, data points in a vector database are represented by vectors with a fixed number of dimensions. Because they use high-dimensional vector embeddings, vector databases are better able to handle unstructured datasets.
The nature of data has undergone a profound transformation. It's no longer confined to structured information easily stored in traditional databases. Unstructured data—including social media posts, images, videos, audio clips and more—is growing 30% to 60% year over year.
In contrast to the sparse representations that keyword search relies on, vector search represents data as dense vectors: vectors in which most or all elements are nonzero. These vectors live in a continuous vector space, the mathematical space in which data is represented as vectors.
Vector representations enable similarity search. For example, a vector search for “smartphone” might also return results for “cellphone” and “mobile devices.” Each dimension of the dense vector corresponds to a latent feature or aspect of the data. A latent feature is an underlying characteristic or attribute that is not directly observed but inferred from the data through mathematical models or algorithms.
Latent features capture the hidden patterns and relationships in the data, enabling more meaningful and accurate representations of items as vectors in a high-dimensional space.
Let's say you have an image of a building, for example, the city hall of some midsize city whose name you forgot, and you'd like to find all other images of this building in the image collection. A key/value query of the kind typically used in SQL doesn't help, because you've forgotten the name of the city. This is where similarity search kicks in: the vector representation for images is designed to produce similar vectors for similar images, where similar vectors are defined as those that are nearby in Euclidean space.
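As a toy illustration (with made-up three-dimensional vectors; real image embeddings have hundreds or thousands of dimensions), "nearby in Euclidean space" simply means a small distance value:

import numpy as np

a = np.array([0.9, 0.1, 0.3])    # embedding of the query image
b = np.array([0.85, 0.15, 0.28]) # a similar image -> nearby vector
c = np.array([-0.4, 0.7, 0.9])   # a dissimilar image -> distant vector

print(np.linalg.norm(a - b))  # small Euclidean distance
print(np.linalg.norm(a - c))  # much larger distance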
similarity-search/
│
├── src/ # Application source code
│ ├── __init__.py
│ ├── data/
│ ├── models/
│ ├── app/
│ ├── pipelines/
│ ├── utils/
│ ├── tests/
│ ├── README.md # Overview of similarity search with SentenceTransformers and Redis as the vector database
│ └── ...
│
├── k8s/ # Kubernetes configuration files
│ ├── deployment.yml # Kubernetes Deployment resource
│ ├── service.yml # Kubernetes Service resource
│ ├── ingress.yml # (Optional) Ingress configuration
│ └── secrets.yml # (Optional) Kubernetes Secrets
│
├── .github/ # CI/CD workflows (e.g., GitHub Actions)
│ └── workflows/
│ └── vss-pipeline.yml # CI/CD pipeline script for building and deploying
│
├── .env # Environment variables file
├── Dockerfile # Dockerfile to build the app
├── requirements.txt # Python dependencies
├── README.md # Documentation
└── LICENSE # License file
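For context on the app/ directory: it would hold the FastAPI service that exposes the search query (see the requirements list below). The following is a hypothetical sketch only; the run_vss_query helper is a stand-in for the real pipeline code.

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Vector Similarity Search")

class SearchRequest(BaseModel):
    query: str
    top_k: int = 3

def run_vss_query(query: str, top_k: int) -> list:
    # Placeholder: the real implementation would embed the query and
    # run a RediSearch KNN lookup (sketched later in this document).
    return []

@app.post("/search")
def search(req: SearchRequest):
    return {"query": req.query, "results": run_vss_query(req.query, req.top_k)}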
The full implementation requires a few packages and some initial setup:
- Redis (redis-py)
- Sentence Transformers model (sentence_transformers from Hugging Face; easy-to-use models for tasks such as semantic similarity search, visual search, and many others)
- FastAPI (exposes the search query in Swagger UI; see the sketch above)
- Kubernetes cluster (Minikube) -> (Optional[😉])
- Setup of and access to a running Redis Stack instance
- Redis connection configuration credentials (secrets)
- Alternatively, run Redis Stack locally using Docker (e.g. docker run -d -p 6379:6379 -p 8001:8001 redis/redis-stack:latest); a minimal connection check is sketched below
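Before any indexing work, a quick connectivity check along these lines helps; the environment variable names here are assumptions to be aligned with your .env file or Kubernetes secrets:

import os
import redis

# Hypothetical variable names; match them to your .env / k8s secrets
r = redis.Redis(
    host=os.getenv("REDIS_HOST", "localhost"),
    port=int(os.getenv("REDIS_PORT", "6379")),
    password=os.getenv("REDIS_PASSWORD") or None,
)
print(r.ping())  # True if the Redis Stack instance is reachable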
Upon code execution:
- Sample JSON data is loaded and inspected.
- A connection to the Redis Stack instance is established.
- Vector embeddings are generated for the text descriptions.
- The JSON data is saved into Redis along with its embeddings.
- A RediSearch index is created on the data.
- Vector similarity search queries are executed (a condensed sketch of these steps follows this list).
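Below is a condensed sketch of those steps using redis-py's search commands. The index name (idx:docs), key prefix (doc:), and field names are assumptions, the sample records are abbreviated, and DIM=384 matches the output size of all-MiniLM-L6-v2.

import numpy as np
import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
r = redis.Redis(host="localhost", port=6379)

docs = [
    {"description": "A monkey is playing drums."},
    {"description": "A man is riding a horse."},
]

# Embed each description and store it in a Redis hash with the raw bytes
for i, doc in enumerate(docs):
    emb = model.encode(doc["description"]).astype(np.float32)
    r.hset(f"doc:{i}", mapping={
        "description": doc["description"],
        "embedding": emb.tobytes(),
    })

# Create a RediSearch index over all keys with the doc: prefix
r.ft("idx:docs").create_index(
    fields=[
        TextField("description"),
        VectorField("embedding", "FLAT", {
            "TYPE": "FLOAT32",
            "DIM": 384,  # output dimension of all-MiniLM-L6-v2
            "DISTANCE_METRIC": "COSINE",
        }),
    ],
    definition=IndexDefinition(prefix=["doc:"], index_type=IndexType.HASH),
)

# Run a KNN vector similarity query against the index
q_emb = model.encode("Someone is playing drums.").astype(np.float32)
q = (
    Query("*=>[KNN 2 @embedding $vec AS score]")
    .sort_by("score")
    .return_fields("description", "score")
    .dialect(2)
)
for d in r.ft("idx:docs").search(q, query_params={"vec": q_emb.tobytes()}).docs:
    print(d.description, d.score)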
A simple implementation of cosine similarity, used here as the distance metric: it scores two vectors A and B by the cosine of the angle between them, A·B / (‖A‖ ‖B‖). (For more details, refer to: Distance Metrics in Vector Similarity Search.)
pip3 install sentence_transformers --quiet
import numpy as np
from numpy.linalg import norm
from sentence_transformers import SentenceTransformer
# Define the model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
# sample data
sentences = [
"A monkey is playing drums.",
"A cheetah is running behind its prey.",
"A man is riding a white horse on an enclosed ground.",
"A man is eating a piece of bread.",
"A man is riding a horse.",
"A woman is playing violin."
]
# vector embeddings created from dataset
embeddings = model.encode(sentences)
# encode the query vector embedding
query_embedding = model.encode("Someone in a gorilla costume is playing a set of drums.")
# Define the distance metric (Cosine similarity)
def cosine_similarity(A, B):
    score = np.dot(A, B) / (norm(A) * norm(B))
    return f"{score:.4f}"

# Run semantic similarity search
print("Query: Someone in a gorilla costume is playing a set of drums.")
for e, s in zip(embeddings, sentences):
    print(s, " -> similarity score = ",
          cosine_similarity(e, query_embedding))
- Code output (semantic similarity score of each sentence against the provided query):
Query: Someone in a gorilla costume is playing a set of drums.
A monkey is playing drums. -> similarity score = 0.6433
A cheetah is running behind its prey. -> similarity score = 0.1080
A man is riding a white horse on an enclosed ground. -> similarity score = 0.1191
A man is eating a piece of bread. -> similarity score = 0.0216
A man is riding a horse. -> similarity score = 0.1389
A woman is playing violin. -> similarity score = 0.2564
The core concepts covered include:
- Using pre-trained NLP models like SentenceTransformers to generate semantic vector representations of text
- Storing and indexing vectors along with structured data in Redis (vector database)
- Utilizing vector similarity KNN search and other query types in RediSearch (a hybrid-query sketch follows this list)
- Ranking and retrieving results by semantic similarity
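As an example of the "other query types", RediSearch supports hybrid queries that apply a full-text or tag filter before the KNN ranking. The field and index names below are the same assumptions as in the earlier pipeline sketch:

from redis.commands.search.query import Query

# Hybrid query: only documents matching "drums" in the description
# field become candidates for the vector KNN ranking.
hybrid_q = (
    Query('(@description:drums)=>[KNN 3 @embedding $vec AS score]')
    .sort_by("score")
    .return_fields("description", "score")
    .dialect(2)
)
# r.ft("idx:docs").search(hybrid_q, query_params={"vec": q_emb.tobytes()})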
The techniques presented allow for building powerful semantic search experiences over unstructured data with Redis.