PostgreSQL Tutorial: Semantic search with pgvector

February 20, 2025

Summary: in this tutorial, you will learn how to implement semantic search with pgvector on PostgreSQL.

Table of Contents

Initial setup

Semantic search has become increasingly popular in recent years, thanks to all the progress in the Generative AI and Large Language Model ecosystem.

In this tutorial, we will use the The Movies Dataset on Kaggle, which includes 45,000 movies that go up until 2017, which was more than enough for us.

We will use the pgvector, which is an open-source extension for PostgreSQL that adds support for vector operations and similarity searches. It lets you store, index, and query vector data directly within your PostgreSQL database. First, let’s create a table to store the movies dataset:

CREATE TABLE IF NOT EXISTS movies (
    id integer,
    imdb_id text,
    title text,
    tagline text,
    overview text,
    release_date date,
    vote_average float,
    embedding vector(1536)
);

Now, let’s build a simple semantic search engine using pgvector and OpenAI embeddings!

import openai
import psycopg2
import csv

# Set up OpenAI API (replace with your actual API key)
openai.api_key = "your_openai_api_key"

# Function to get embeddings from OpenAI
def get_embedding(text):
    response = openai.embeddings.create(input=text, model="text-embedding-ada-002")
    return response['data'][0]['embedding']

# Function to add a document
def add_document(cur, row):
    content = """
                 Title: {0}
                 Tagline: {1}
                 Overview: {2}
                 Release date: {3}
                 Genres: {4}""".format(
                 row['title'],
                 row['tagline'],
                 row['overview'],
                 row['release_date'],
                 row['genres'])
    embedding = get_embedding(content)
    cur.execute("""INSERT INTO movies
                   (id, imdb_id, title, tagline, overview, release_date, vote_average, embedding)
                   VALUES (%s, %s, %s, %s, %s, %s, %s, %s)""",
                (row['id'], row['imdb_id'],
                 row['title'], row['tagline'], row['overview'], row['release_date'],
                 row['vote_average'], embedding))

# Connect to the database
conn = psycopg2.connect("dbname=your_database user=your_username")
cur = conn.cursor()

with open("./movies_metadata.csv", "r") as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row['first_name'], row['last_name'])
        add_document(cur, row)
        conn.commit()

# Clean up
cur.close()
conn.close()

It embeds documents using OpenAI’s text embedding model and stores them in a PostgreSQL database with pgvector.

Demonstration

If you search for “old movies”, you get some old movies, since I included the release date in the embedding.

There’s also the interesting; if you search for “virtual reality”, you get Oculus, a horror movie by Mike Flanagan, in the top 20 results. The movie has no relation to VR whatsoever, but it’s the name of a popular VR headset, which indicates the model has some sort of world knowledge.

# Function to search for similar documents
def search_documents(query, limit=5):
    query_embedding = get_embedding(query)
    cur.execute("""
        SELECT title, overview, embedding <-> %s AS distance
        FROM movies
        ORDER BY distance
        LIMIT %s
    """, (query_embedding, limit))
    return cur.fetchall()

# Connect to the database
conn = psycopg2.connect("dbname=your_database user=your_username")
cur = conn.cursor()

# Perform a search
search_query = "wizard school poor boy"
results = search_documents(search_query)
print(f"Search results for: '{search_query}'")
for i, (title, overview, distance) in enumerate(results, 1):
    print(f"{i}. {title} {overview} (Distance: {distance:.4f})")

# Clean up
cur.close()
conn.close()

Conclusion

pgvector brings powerful vector similarity search capabilities to PostgreSQL, making it an excellent choice for developers looking to add AI-powered features to their existing PostgreSQL-based applications.

While pgvector may not offer the same scalability and specialized features as dedicated vector databases like Pinecone or Milvus, its seamless integration with PostgreSQL makes it an attractive option for many use cases.

It’s particularly well-suited for projects that already use PostgreSQL and need to add vector search capabilities without introducing a new database system.

We encourage you to try pgvector in your own projects. Whether you’re building a recommendation system, a semantic search engine, or any other application that requires similarity searches, pgvector can be a valuable tool in your data science toolkit.

See more

PostgreSQL Administration