Diving into AI: An Exploration of Embeddings and Vector Databases

Join me on my journey into AI as I learn about embeddings, vectors, and vector databases


6 min read

With the arrival of ChatGPT, AI has officially changed everything we do. It’s only been a few months since ChatGPT was released, and like many, I’ve been exploring how best to use it. I’ve used it as a coding buddy, to brainstorm, to help with writing, and more.

The potential of this new technology blows me away. But as a technology enthusiast, I’d love to understand better how it works, or at least get familiar with some of the common terms and underlying technology. I keep hearing about Large Language Models (LLMs), vector databases, training, models, and so on. What better way to learn than to just dive in, get my hands dirty, and experiment?

So in today’s blog, I’m sharing what I learned about one of these building blocks: embeddings. Why embeddings? Well, originally, I planned to learn more about vector databases, but I quickly realized that in order to understand those, I should start with vectors and embeddings.

What is an embedding?

This is the definition from the OpenAI website:

An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness, and large distances suggest low relatedness.

Hmm, ok, but what does that really mean? Imagine you have a word, say hamburger. In order to use this word in an LLM (large language model) like GPT, the model needs to know what it means. To do that, we can turn the word hamburger into an embedding: essentially a numerical representation of the word that captures its meaning. We call this list of numbers the vector.

With embeddings, we can now represent the words as vectors:

  • dog: [0.2, -0.1, 0.5, …]
  • cat: [0.1, -0.3, 0.4, …]
  • fish: [-0.3, 0.6, -0.1, …]

Notice that it’s a representation of the meaning (semantics) of the word. For example, the word embeddings for “dog” and “puppy” would be close together in the vector space because they share a similar meaning and often appear in similar contexts. In contrast, the embeddings for “dog” and “car” would be farther apart because their meanings and contexts are quite different.

It is this word embedding technology that enables semantic search, which goes beyond simple keyword matching to understand the meaning and context behind a query. “Semantic” refers to similarity in meaning between words or phrases.

For example, traditional string matching would fail to connect the query “searching for something to eat” with the sentence “the mouse is looking for food.” However, with semantic search powered by word embeddings, a search engine recognizes that both phrases share a similar meaning and successfully finds the sentence.

Ok, great. Now that we have a rough idea of what embeddings are, how do we actually create them?

Turning words or sentences into embeddings

A word or sentence can be turned into an embedding (a vector representation) using the OpenAI API. To get an embedding, send your text string to the embeddings API endpoint along with a choice of embedding model ID (e.g., text-embedding-ada-002). The response will contain an embedding you can extract, save, and use.

In my case, I’m using the Python library. With it, the code below is all it takes to turn the word hamburger into an embedding.

import openai
from openai.embeddings_utils import get_embedding  # ships with the pre-1.0 openai Python library

openai.api_key = "<YOUR OPENAI API KEY HERE>"
hamburger_embedding = get_embedding("hamburger", engine="text-embedding-ada-002")

# will look something like [-0.01317964494228363, -0.001876765862107277, …

If you have a text document, you would turn all the words or sentences from that document into embeddings. Once you’ve done that, you essentially have a semantic representation of the document as a series of vectors. These vectors capture the meaning and context of the individual words or sentences.
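As a tiny illustration (my own example sentences, using the same pre-1.0 openai helper as above), embedding a handful of sentences is just a loop over get_embedding:

import openai
from openai.embeddings_utils import get_embedding  # pre-1.0 openai Python library

openai.api_key = "<YOUR OPENAI API KEY HERE>"

# A tiny "document" split into sentences
sentences = [
    "The mouse is looking for food.",
    "The kids are in the house.",
    "I had a hamburger for lunch.",
]

# One embedding (vector) per sentence
sentence_embeddings = [get_embedding(s, engine="text-embedding-ada-002") for s in sentences]

print(len(sentence_embeddings), "embeddings of", len(sentence_embeddings[0]), "dimensions each")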

Finding Similarities

Once you have embeddings for words or sentences, you can use them to find semantic similarities. A common approach to measuring the similarity between two embeddings is by calculating how close the vectors are to each other.

Measuring how close two vectors are is typically done with cosine similarity; if you’re really interested, you can read more about it here: https://en.wikipedia.org/wiki/Cosine_similarity
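Under the hood, cosine similarity is just the dot product of the two vectors divided by the product of their lengths. Here’s a minimal NumPy sketch of the math (not the OpenAI helper itself, and using made-up toy vectors):

import numpy as np

def cosine_sim(a, b):
    # dot product of the two vectors, divided by the product of their magnitudes
    # returns a value between -1 and 1; closer to 1 means more similar
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_sim([0.2, -0.1, 0.5], [0.1, -0.3, 0.4]))  # toy 3-dimensional vectors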

Luckily, the Python module that OpenAI ships includes an implementation of cosine_similarity, and you can simply use it like this:

import openai
from openai.embeddings_utils import get_embedding, cosine_similarity

openai.api_key = "<YOUR OPENAI API KEY HERE>"

embedding1 = get_embedding("the kids are in the house", engine="text-embedding-ada-002")
embedding2 = get_embedding("the children are home", engine="text-embedding-ada-002")

print(cosine_similarity(embedding1, embedding2))

Which prints: 0.9387390865828703, meaning they’re very close.

A real-life example

Below is a slightly longer example. It reads the document called words.csv, which looks like this:

text
"red"
"potatoes"
"soda"
"cheese"
"water"
"blue"
"crispy"
"hamburger"
"coffee"
"green"
"milk"
"la croix"
"yellow"
"chocolate"
"french fries"
"latte"
"cake"
"brown"
"cheeseburger"
"espresso"
"cheesecake"
"black"
"mocha"
"fizzy"
"carbon"
"banana"
"sunshine"
"orange carrot"
"sun"
"hay"
"cookies"
"fish"

The script below calculates the embeddings for all these words and adds them to a pandas DataFrame. Next, it takes a search term (hotdog) and calculates which words are closest to it.

import openai
import pandas as pd
import numpy as np
from openai.embeddings_utils import get_embedding, cosine_similarity

openai.api_key = "<YOUR OPENAI API KEY HERE>"

# read the data
df = pd.read_csv('words.csv')

# Lambda to add the embedding column
df['embedding'] = df['text'].apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))
# Save it to a csv file for caching, so we don't need to call the API every time
# (in a real application, you'd store this in a vector database)
df.to_csv('word_embeddings.csv')
df = pd.read_csv('word_embeddings.csv')

# Convert the string representation of the embedding to a numpy array
# needed since we wrote it to a csv file
df['embedding'] = df['embedding'].apply(eval).apply(np.array)

# Hotdog is not in the CSV. Let's calculate the embedding for it
search_term = "hotdog"
search_term_vector = get_embedding(search_term, engine="text-embedding-ada-002")

# now we can calculate the similarity between the search term and all the words in the CSV 
df["similarities"] = df['embedding'].apply(lambda x: cosine_similarity(x, search_term_vector))
# print the top 5 most similar words
print(df.sort_values("similarities", ascending=False).head(5))

The code above prints this:

Unnamed: 0          text                                          embedding  similarities
7            7     hamburger  [-0.01317964494228363, -0.001876765862107277, ...      0.913613
18          18  cheeseburger  [-0.01824556663632393, 0.00504859397187829, 0....      0.886365
14          14  french fries  [0.0014257068978622556, -0.016548126935958862,...      0.853839
3            3        cheese  [-0.0032112577464431524, -0.0088559715077281, ...      0.838909
13          13     chocolate  [0.0015315973432734609, -0.012976923026144505,...      0.830742

Pretty neat, right?! It calculated that a hotdog is most similar to a hamburger, cheeseburger, and fries!

Let’s do one more thing! In the example below, we add the embeddings for milk and coffee together, just like a simple math addition.

We then again calculate what this new embedding is most similar to (hint, what do you call a drink that adds coffee to milk?).

# Make a copy of the data frame we created earlier, so we can experiment without modifying the original
food_df = df.copy()
milk_vector = food_df['embedding'][10]
coffee_vector = food_df['embedding'][8]

# let's add the two vectors together
milk_coffee_vector = milk_vector + coffee_vector

# now calculate the similarity between the combined vector and all the words in the CSV
food_df["similarities"] = food_df['embedding'].apply(lambda x: cosine_similarity(x, milk_coffee_vector))

print(food_df.sort_values("similarities", ascending=False).head(5))

The result is this:

Unnamed: 0          text                                          embedding  similarities
8            8        coffee  [-0.0007212135824374855, -0.01943901740014553,…      0.959562
10          10          milk  [0.0009238893981091678, -0.019352708011865616,…      0.959562
15          15         latte  [-0.015634406358003616, -0.003936054650694132,…      0.905960
19          19      espresso  [-0.02250547707080841, -0.012807613238692284, …      0.898178
22          22         mocha  [-0.012473775073885918, -0.026152553036808968,…      0.889710

Ha! Yes, it’s obviously similar to coffee and milk, as that’s what we started with, but next up, we see a latte! That’s pretty cool, right? Coffee + Milk = Latte 😀

Vector database

Now that we’ve seen how embeddings work and how they can be used to find semantic similarities, let’s talk about vector databases. In our example, we saw that calculating the embeddings was done using an API call to the OpenAI API. This can be slow and will cost you credits. That’s why, in the example code, we saved the calculated embeddings to a CSV file for caching purposes.

While this approach works for small-scale experiments, it may not be practical for large amounts of data or production environments where performance and scalability are important. This is where vector databases come in.

There are a few popular ones; a well-known one is Pinecone, but even Postgres can be used as a vector database. These vector databases are specifically designed for storing, managing, and efficiently searching through large amounts of embeddings. They are optimized for high-dimensional vector data and can handle operations such as nearest neighbor search, which is crucial for finding the most similar items to a given query.
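To give a feel for what that looks like in practice, here’s a minimal sketch of storing and querying our word embeddings in Postgres with the pgvector extension. The connection details and table name are made up, and it reuses the df and search_term_vector variables from the script above:

import psycopg2  # assumes a Postgres instance with the pgvector extension installed

conn = psycopg2.connect("dbname=embeddings_demo user=postgres")  # hypothetical connection details
cur = conn.cursor()

# One-time setup: a table with a 1536-dimensional vector column
# (1536 is the dimensionality of text-embedding-ada-002 embeddings)
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("CREATE TABLE IF NOT EXISTS words (id bigserial PRIMARY KEY, text text, embedding vector(1536))")

def to_pgvector(vec):
    # pgvector accepts vector literals formatted as '[0.1,0.2,...]'
    return "[" + ",".join(str(x) for x in vec) + "]"

# Store the embeddings we calculated earlier (df from the script above)
for _, row in df.iterrows():
    cur.execute("INSERT INTO words (text, embedding) VALUES (%s, %s::vector)",
                (row["text"], to_pgvector(row["embedding"])))
conn.commit()

# Nearest-neighbour search: '<=>' is pgvector's cosine distance operator,
# so ordering ascending returns the most similar words first
cur.execute("SELECT text FROM words ORDER BY embedding <=> %s::vector LIMIT 5",
            (to_pgvector(search_term_vector),))
print([row[0] for row in cur.fetchall()])

The database now does the nearest-neighbour search for us (and can speed it up with an approximate index), instead of us computing the cosine similarity against every row in a DataFrame.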

Wrap up

In this exploration of the technology behind LLMs and AI, I delved into some of the foundational building blocks that power these advanced systems; specifically, embeddings and vectors. My initial curiosity about vector databases and their potential applications for my own data led me to first understand the underlying principles and the importance of vectors. It’s pretty cool to see how easy it was to get going, thanks to the existing APIs and libraries.

Perhaps in another weekend adventure, I’ll look further into the next logical topic: vector databases. I’d also love to explore LangChain, a fascinating framework for developing applications powered by language models.

That’s it for now; thanks for reading!

Cheers
Andree

