September 19, 2024
Summary: pg_duckdb is a PostgreSQL extension that embeds DuckDB’s columnar-vectorized analytics engine and features into PostgreSQL.
Introduction
The pg_duckdb extension will be fully capable of querying against data stored in the cloud in DuckDB as if it were local. DuckDB’s “dual execution” capabilities let us join local PostgreSQL data against DuckDB data seamlessly, and we will figure out the best place to run the query. As a user, you don’t really need to care where the computation runs; we’ll just figure out how to make it run fast.
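To make that concrete, here is a minimal sketch of the kind of query dual execution enables: joining a local PostgreSQL table against a Parquet file read through read_parquet. The customers table, bucket path, and column names are hypothetical.

-- Hypothetical example: join a local PostgreSQL table against a Parquet file in S3.
-- DuckDB's engine executes the scan, join, and aggregation.
SET duckdb.execution TO true;

SELECT c.name, SUM(o.amount) AS total_spent
FROM customers c
JOIN read_parquet('s3://your-bucket/orders.parquet') AS o(customer_id int, amount float)
  ON o.customer_id = c.id
GROUP BY c.name
ORDER BY total_spent DESC;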
Moreover, it is common in analytics to want to offload your data from your transactional database into an analytical store. The pg_duckdb extension, along with DuckDB, can help: you can just run a query in PostgreSQL that pulls recent data from your PostgreSQL database and writes it to DuckDB. You don’t need to export and reimport data, or set up CDC.
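A nightly offload might look something like the following sketch; the purchases table, its columns, and the bucket path are hypothetical.

-- Hypothetical offload: write yesterday's rows straight to object storage as Parquet,
-- where DuckDB can query them without any export/reimport step.
COPY (
  SELECT user_id, item_id, price, purchased_at
  FROM purchases
  WHERE purchased_at >= now() - interval '1 day'
) TO 's3://your-bucket/purchases_recent.parquet';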
Finally, there are some downsides to running analytics on the same database that runs your application. Analytics can be resource-hungry in terms of the amount of memory and CPU needed to make it run well. Above a certain size, folks may not want to run this on their production transactional database. DuckDB will help offload this work to the cloud, in a way that doesn’t require people to change the queries they’re running; the queries just get faster.
Installation
To build pg_duckdb, you need:
- PostgreSQL 16 or 17
- Linux or macOS
- Standard set of build tools for building PostgreSQL extensions
- Build tools that are required to build DuckDB
To build and install, run:
make install
Next, load the pg_duckdb extension:
CREATE EXTENSION pg_duckdb;
IMPORTANT: Once the extension is loaded, you can use DuckDB execution by running SET duckdb.execution TO true. This is opt-in to avoid breaking existing queries. To avoid doing that for every session, you can configure it for a specific user with ALTER USER my_analytics_user SET duckdb.execution TO true.
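For example, a quick session-level check might look like this; the table name is hypothetical.

-- Enable DuckDB execution for the current session only.
SET duckdb.execution TO true;

-- This query is now executed by DuckDB's engine, reading the PostgreSQL table directly.
SELECT count(*) FROM my_table;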
Features
- SELECT queries executed by the DuckDB engine can directly read PostgreSQL tables.
  - Able to read data types that exist in both PostgreSQL and DuckDB. The following data types are supported: numeric, character, binary, date/time, boolean, uuid, json, and arrays.
  - If DuckDB cannot support the query for any reason, execution falls back to PostgreSQL.
- Read parquet and CSV files from object storage (AWS S3, Cloudflare R2, or Google GCS).
  SELECT n FROM read_parquet('s3://bucket/file.parquet') AS (n int)
  SELECT n FROM read_csv('s3://bucket/file.csv') AS (n int)
  - You can pass globs and arrays to these functions, just like in DuckDB.
- Enable the DuckDB Iceberg extension using SELECT duckdb.enable_extension('iceberg') and read Iceberg files with iceberg_scan (see the sketch after this list).
- Write a query, or an entire table, to parquet in object storage.
  COPY (SELECT foo, bar FROM baz) TO 's3://...'
  COPY table TO 's3://...'
- Read and write to Parquet format in a single query:
  COPY (
    SELECT count(*), name
    FROM read_parquet('s3://bucket/file.parquet') AS (name text)
    GROUP BY name
    ORDER BY count DESC
  ) TO 's3://bucket/results.parquet';
- Query and JOIN data in object storage with PostgreSQL tables, views, and materialized views.
- Create indexes on PostgreSQL tables to accelerate your DuckDB queries.
- Install DuckDB extensions using SELECT duckdb.install_extension('extension_name');
- Toggle DuckDB execution on/off with a setting: SET duckdb.execution = true|false
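As referenced in the Iceberg item above, enabling the extension and scanning a table might look like the following sketch. The table path and column list are hypothetical, and the assumption that iceberg_scan takes a column definition list the same way read_parquet does may not hold for every version, so treat the exact syntax as illustrative.

-- Enable DuckDB's Iceberg extension, then scan an Iceberg table in S3.
SELECT duckdb.enable_extension('iceberg');

-- Hypothetical table path; the AS (...) column list is assumed to work like read_parquet's.
SELECT status, count(*) AS events
FROM iceberg_scan('s3://your-bucket/events_iceberg') AS (status text)
GROUP BY status;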
Getting Started
The best way to get started is to connect PostgreSQL to a new or existing object storage bucket (AWS S3, Cloudflare R2, or Google GCS) with pg_duckdb. You can query data in Parquet, CSV, and Iceberg format using read_parquet, read_csv, and iceberg_scan respectively.
- Add a credential to enable DuckDB’s httpfs support.
  -- Session Token is Optional
  INSERT INTO duckdb.secrets (type, id, secret, session_token, region)
  VALUES ('S3', 'access_key_id', 'secret_access_key', 'session_token', 'us-east-1');
- Copy data directly to your bucket - no ETL pipeline!
  COPY (SELECT user_id, item_id, price, purchased_at FROM purchases)
  TO 's3://your-bucket/purchases.parquet';
- Perform analytics on your data.
  SELECT SUM(price) AS total, item_id
  FROM read_parquet('s3://your-bucket/purchases.parquet') AS (price float, item_id int)
  GROUP BY item_id
  ORDER BY total DESC
  LIMIT 100;