Using OpenAI Embeddings Search With SingleStoreDB

In this article, we'll test the ability of SingleStoreDB to store and query the OpenAI Wikipedia Vector Database dataset.

By Akmal Chaudhri · May. 24, 23 · Tutorial


In this article, we'll test the ability of SingleStoreDB to store and query the OpenAI Wikipedia Vector Database dataset. We'll see that SingleStoreDB can manage this dataset with ease. SingleStoreDB has supported a range of vector functions for some time, and these functions are ideally suited for modern applications using GPT technology.

The notebook file used in this article is available on GitHub.

Introduction

In several previous articles, we have used some of the vector capabilities built into SingleStoreDB:

  • Quick tip: SingleStoreDB's EUCLIDEAN_DISTANCE and JSON_ARRAY_PACK functions
  • Using SingleStoreDB, Spark, and Alternating Least Squares (ALS) to build a Movie Recommender System

In this article, we'll test the JSON_ARRAY_PACK and DOT_PRODUCT vector functions with the OpenAI Wikipedia Vector Database dataset.

There is an OpenAI notebook available on GitHub under an MIT License that tests several vector database systems. The tests can be run using local clients or in the cloud. In this article, we'll use a local installation of SingleStoreDB.

Install SingleStoreDB

In this article, we'll install SingleStoreDB in a Virtual Machine (VM) environment. It takes just a few minutes. A previous article described the steps. Alternatively, we could use a Docker image.

For this article, we'll only need two tarball files for the VM installation:

  1. singlestoredb-toolbox
  2. singlestoredb-server

Assuming a two-node cluster was correctly deployed and using the same variable names from the previous article, we can connect to our cluster from a MySQL CLI Client as follows:

Shell
 
mysql -u root -h ubuntu -P 3306 --default-auth=mysql_native_password -p


Once connected to our cluster, we'll create a new database as follows:

SQL
 
CREATE DATABASE IF NOT EXISTS openai_demo;


Install Jupyter

From the command line, we'll install the classic Jupyter Notebook as follows:

Shell
 
pip install notebook


OpenAI API Key

Before launching Jupyter, we must create an account on the OpenAI website. This provides some free credits. Since we will use embeddings, the cost will be minimal. We'll also need to create an OpenAI API Key. This can be created from USER > API keys in our OpenAI account.

From the command line, we'll export the OPENAI_API_KEY variable in our environment, as follows:

Shell
 
export OPENAI_API_KEY=<OpenAI API Key>


Replace <OpenAI API Key> with your key.
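
Alternatively, here is a minimal sketch of supplying the key inside the notebook itself, in which case the environment-variable check later in this article can be skipped:

Python
 
import getpass
import openai

# Prompt for the key interactively rather than reading an environment variable
openai.api_key = getpass.getpass("OpenAI API Key: ")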

Next, we'll launch Jupyter as follows:

Shell
 
jupyter notebook


Fill Out the Notebook

Let's now create a new notebook. We'll adhere to the flow and structure of the OpenAI notebook and use some small code sections directly from the notebook, where required.

Setup

First, some libraries:

Python
 
!pip install matplotlib
!pip install openai
!pip install plotly.express
!pip install scikit-learn
!pip install singlestoredb
!pip install tabulate
!pip install wget


Next, some imports:

Python
 
import openai

import pandas as pd
import os
import wget
from ast import literal_eval


Then, the embedding model:

Python
 
EMBEDDING_MODEL = "text-embedding-ada-002"


Load Data

We'll now obtain the Wikipedia dataset:

Python
 
embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'

# The file is ~700 MB so this will take some time
wget.download(embeddings_url)


And unpack it:

Python
 
import zipfile

with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip", "r") as zip_ref:
    zip_ref.extractall("data")


Next, we'll load the file into a Pandas Dataframe:

Python
 
article_df = pd.read_csv(
    "data/vector_database_wikipedia_articles_embedded.csv"
)


And we'll take a look at the first few rows as follows:

Python
 
article_df.head()


The next operation from the OpenAI notebook can take a while:

Python
 
# Read vectors from strings back into a list
article_df['title_vector'] = article_df.title_vector.apply(literal_eval)
article_df['content_vector'] = article_df.content_vector.apply(literal_eval)

# Set vector_id to be a string
article_df['vector_id'] = article_df['vector_id'].apply(str)

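
If this step is slow, a slightly faster variant of the same conversion is to parse the strings with json.loads instead of literal_eval. This sketch assumes the vector strings are plain JSON arrays of floats, which holds for this dataset:

Python
 
import json

# Assumption: the vector strings are plain JSON arrays of floats,
# so json.loads can stand in for ast.literal_eval
article_df['title_vector'] = article_df['title_vector'].apply(json.loads)
article_df['content_vector'] = article_df['content_vector'].apply(json.loads)
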

Next, we'll look at the dataframe info:

Python
 
article_df.info(show_counts=True)


The result should be as follows:

Plain Text
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              25000 non-null  int64 
 1   url             25000 non-null  object
 2   title           25000 non-null  object
 3   text            25000 non-null  object
 4   title_vector    25000 non-null  object
 5   content_vector  25000 non-null  object
 6   vector_id       25000 non-null  object
dtypes: int64(1), object(6)


Create Table

Let's now create a connection to our local installation of SingleStoreDB:

Python
 
import singlestoredb as s2

conn = s2.connect("root:<password>@<host>:3306/openai_demo")

cur = conn.cursor()


Replace <password> and <host> with the values used earlier at installation time.
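
As a minimal sketch, we could also read these values from environment variables instead of hardcoding them in the notebook. The S2_PASSWORD and S2_HOST variable names here are hypothetical:

Python
 
import os
import singlestoredb as s2

# Hypothetical environment variables holding the connection details
s2_password = os.environ["S2_PASSWORD"]
s2_host = os.environ["S2_HOST"]

conn = s2.connect(f"root:{s2_password}@{s2_host}:3306/openai_demo")

cur = conn.cursor()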

We'll now create a table as follows:

Python
 
stmt = """
    CREATE TABLE IF NOT EXISTS wikipedia (
        id INT PRIMARY KEY,
        url VARCHAR(255),
        title VARCHAR(100),
        text TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
        title_vector BLOB,
        content_vector BLOB,
        vector_id INT
    )
"""

cur.execute(stmt)


Populate Table

We can populate our database table as follows:

Python
 
# Prepare the statement
stmt = """
    INSERT INTO wikipedia (
        id,
        url,
        title,
        text,
        title_vector,
        content_vector,
        vector_id
    )
    VALUES (
        %s,
        %s,
        %s,
        %s,
        JSON_ARRAY_PACK_F64(%s),
        JSON_ARRAY_PACK_F64(%s),
        %s
    )
"""

# Convert the DataFrame to a NumPy record array
record_arr = article_df.to_records(index=False)

# Set the batch size
batch_size = 1000

# Iterate over the rows of the record array in batches
for i in range(0, len(record_arr), batch_size):
    batch = record_arr[i:i+batch_size]
    values = [(
        row[0],
        row[1],
        row[2],
        row[3],
        str(row[4]),
        str(row[5]),
        int(row[6])
    ) for row in batch]
    cur.executemany(stmt, values)


We can also use JSON_ARRAY_PACK_F32 (32-bit, IEEE standard format), instead of JSON_ARRAY_PACK_F64 (64-bit, IEEE standard format).
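
As a sketch, switching to 32-bit storage only changes the packing functions in the INSERT statement; note, however, that the query side must then use the matching F32 functions (for example, DOT_PRODUCT_F32) when reading the vectors back:

Python
 
# F32 variant of the INSERT statement: half the vector storage,
# at reduced precision; queries must then use the F32 functions
stmt_f32 = """
    INSERT INTO wikipedia (
        id, url, title, text, title_vector, content_vector, vector_id
    )
    VALUES (
        %s, %s, %s, %s,
        JSON_ARRAY_PACK_F32(%s),
        JSON_ARRAY_PACK_F32(%s),
        %s
    )
"""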

Loading the data should take a short time. We can use other data loading methods, such as pipelines, for larger datasets.
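
A quick sanity check that all 25,000 rows arrived:

Python
 
# Count the rows in the table; this should print 25000
cur.execute("SELECT COUNT(*) FROM wikipedia")
print(cur.fetchone()[0])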

Search Data

First, we'll import the following:

Python
 
from openai.embeddings_utils import get_embedding


Next, we'll check that the OPENAI_API_KEY can be read, as follows:

Python
 
if os.getenv("OPENAI_API_KEY") is not None:
    openai.api_key = os.getenv("OPENAI_API_KEY")
    print("OPENAI_API_KEY is ready")
else:
    print("OPENAI_API_KEY environment variable not found")


We'll now define a Python function that will allow us to use either of the two vector columns in the database:

Python
 
from typing import Tuple, List

def search_wikipedia(
    query: str,
    column1: str,
    column2: str,
    num_rows: int = 10
) -> Tuple[List[str], List[float]]:
    """Searches Wikipedia for the given query and returns the top `num_rows` results.

    Args:
        query: The query to search for.
        column1: The name of the column in the Wikipedia database to return for each result.
        column2: The name of the column in the Wikipedia database to use as the score for each result.
        num_rows: The number of results to return.

    Returns:
        Two lists: the `column1` values and the corresponding scores for the top `num_rows` results.
    """

    # Get the embedding of the query
    embedding = get_embedding(query, EMBEDDING_MODEL)

    # Create the SQL statement
    stmt = f"""
        SELECT
            {column1},
            DOT_PRODUCT_F64(JSON_ARRAY_PACK_F64(%s), {column2}) AS score
        FROM wikipedia
        ORDER BY score DESC
        LIMIT %s
    """.format(column1=column1, column2=column2)

    # Execute the SQL statement
    cur.execute(stmt, [str(embedding), num_rows])

    # Get the results
    results = cur.fetchall()

    # Separate the results into two lists
    values = [row[0] for row in results]
    scores = [row[1] for row in results]

    # Return the results
    return values, scores


We can test SingleStoreDB using the two examples in the OpenAI notebook. Since OpenAI embeddings are normalized to unit length, the dot product score here is equivalent to cosine similarity. First, we'll use title and title_vector:

Python
 
values1, scores1 = search_wikipedia(
    query = "modern art in Europe",
    column1 = "title",
    column2 = "title_vector",
    num_rows = 5
)


We'll format the results using the following:

Python
 
from tabulate import tabulate

# Combine the values and scores lists into a list of tuples
# Each tuple contains a value and its corresponding score
table_data1 = list(zip(values1, scores1))

# Add a rank column to the table data
table_data1 = [(i + 1,) + data for i, data in enumerate(table_data1)]

# Create the table
table1 = tabulate(table_data1, headers=["Rank", "Title", "Score"])

# Print the table
print(table1)


The output should be similar to the following:

Plain Text
 
Rank    Title                    Score
------  --------------------  --------
     1  Museum of Modern Art  0.875081
     2  Western Europe        0.867523
     3  Renaissance art       0.864172
     4  Pop art               0.860346
     5  Northern Europe       0.854755


Next, we'll use text and content_vector:

Python
 
values2, scores2 = search_wikipedia(
    query = "Famous battles in Scottish history",
    column1 = "text",
    column2 = "content_vector",
    num_rows = 5
)


We'll format the results using the following:

Python
 
# Combine the values and scores lists into a list of tuples
# Each tuple contains a value and its corresponding score
table_data2 = list(zip([value[:50] for value in values2], scores2))

# Add a rank column to the table data
table_data2 = [(i + 1,) + data for i, data in enumerate(table_data2)]

# Create the table
table2 = tabulate(table_data2, headers=["Rank", "Text", "Score"])

# Print the table
print(table2)


The output should be similar to the following:

Plain Text
 
  Rank  Text                                                   Score
------  --------------------------------------------------  --------
     1  The Battle of Bannockburn, fought on 23 and 24 Jun  0.869338
     2  The Wars of Scottish Independence were a series of  0.86148
     3  Events                                              0.852533
         January 1 – Charles II crowned King of
     4  The First War of Scottish Independence lasted from  0.849642
     5  Robert I of Scotland (11 July 1274 – 7 June 1329)   0.846184


Summary

In this article, we've seen that SingleStoreDB can store vectors with ease, and that we can store other data types, such as numeric and text, in the same table. With its powerful SQL and multi-model support, SingleStoreDB provides a one-stop solution for modern applications, bringing both technical and business benefits through a single product.


Published at DZone with permission of Akmal Chaudhri. See the original article here.

Opinions expressed by DZone contributors are their own.
