PostgreSQL Tutorial: Speed up COUNT(DISTINCT) using aggregations

January 8, 2025

Summary: in this tutorial, you will learn how to speed up COUNT(DISTINCT) using aggregations in PostgreSQL.

Introduction

The estimate-based solutions are useful only if we accept that the database is returning a rough estimate. What if we need a precise number at a precise point in time? If the timing is known upfront (for example, every day at 8 AM), we could pre-calculate the counts in dedicated structures.

This option could speed up COUNT(DISTINCT) queries in read-intensive environments with rare updates/writes, since the overhead of maintaining the aggregated data is minimal when writes are scarce. If, for example, we partition our orders table by day, we could perform the aggregated count over all the previous days' partitions, since they will not receive any further updates or deletes.
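To illustrate the idea, here is a minimal sketch of daily partitioning, assuming a hypothetical orders table (the table and column names are illustrative, not part of this tutorial's schema):

CREATE TABLE orders (
    order_id   bigint,
    username   text,
    order_date date NOT NULL
) PARTITION BY RANGE (order_date);

-- One partition per day: closed days receive no further writes,
-- so their aggregated counts can be computed once and cached.
CREATE TABLE orders_2025_01_07 PARTITION OF orders
    FOR VALUES FROM ('2025-01-07') TO ('2025-01-08');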

Before analyzing the aggregates, let’s populate the EXAMPLE_TBL with a bit more data using the following SQL (the product names below are illustrative placeholders):

INSERT INTO EXAMPLE_TBL (Username, Purchase_Date, Product, Qty)
SELECT 'User' || floor(random() * 200 + 1)::int,
    CURRENT_DATE - floor(random() * 5 + 1)::int,
    CASE floor(random() * 3 + 1)::int
        WHEN 1 THEN 'Apple'   -- placeholder product names
        WHEN 2 THEN 'Banana'
        WHEN 3 THEN 'Cherry'
    END,
    floor(random() * 10 + 1)::int
FROM generate_series(1,1000000);

Once run, the above SQL generates 1,000,000 rows in the EXAMPLE_TBL.
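To verify the load, and to measure query timings like the ones reported below, one option is psql's \timing meta-command:

\timing on
SELECT COUNT(*) FROM EXAMPLE_TBL;
-- returns 1000000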

Aggregating via materialized views

Most databases, including PostgreSQL, have the concept of a materialized view: a way to store the result of a SELECT statement in a dedicated table. Note that, unlike some other databases, PostgreSQL does not automatically rewrite matching queries to use the materialized view; you query it directly.

For example, we can create a materialized view over our table, calculating the count of distinct usernames per product, with:

CREATE MATERIALIZED VIEW EXAMPLE_MW AS
SELECT PRODUCT,
       COUNT(DISTINCT USERNAME) AS distinct_users
FROM EXAMPLE_TBL
GROUP BY PRODUCT;

We can then see the new materialized view, called example_mw, by querying the pg_matviews system view:

SELECT * FROM pg_matviews WHERE matviewname = 'example_mw';

Keep in mind that the difference between a normal view and a materialized one is that the results of the latter are actually stored in the database. If we now issue a query like:

SELECT PRODUCT,
       COUNT(DISTINCT USERNAME)
FROM EXAMPLE_TBL
GROUP BY PRODUCT;

The above query takes approximately 6346.786 ms, while if we run a similar query over the materialized view:

SELECT * FROM example_mw;

It returns in 16.683 ms! However, we need to remember that materialized views are populated at creation time and not automatically kept up to date; therefore, to get accurate results, we need to refresh the materialized view with:

REFRESH MATERIALIZED VIEW example_mw;

In the example, the refresh took 6490.287 ms. As mentioned before, using a materialized view can be a good solution when the underlying table doesn’t receive frequent writes: calculating the materialized view once and querying it repeatedly improves performance.
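If the view needs to stay readable while it is being rebuilt, PostgreSQL also offers REFRESH MATERIALIZED VIEW CONCURRENTLY, which requires a unique index on the materialized view. A minimal sketch (the index name is illustrative):

-- CONCURRENTLY requires a unique index on the materialized view
CREATE UNIQUE INDEX example_mw_product_idx ON example_mw (product);

-- Rebuilds the view without blocking concurrent SELECTs
REFRESH MATERIALIZED VIEW CONCURRENTLY example_mw;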

Aggregating via dedicated data structures: HyperLogLog

The Citus postgresql-hll extension introduces a new data type, called hll, which is a HyperLogLog data structure: a fixed-size, set-like structure used for distinct value counting with tunable precision. The beauty of HyperLogLog is that we can feed the data structure once at a detailed granularity (for example, per Purchase_Date) and query it at a more aggregate level (e.g., per purchase month), relying on the ability of the hll structure to estimate the aggregations.

In our example we first need to enable the extension:

CREATE EXTENSION hll;
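If the command fails because the extension is missing, you can check whether the postgresql-hll package is available on the server by querying the standard pg_available_extensions view:

SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name = 'hll';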

Then we can create a dedicated table with:

CREATE TABLE daily_users (
    Purchase_Date date UNIQUE,
    users hll
);

And feed it with:

INSERT INTO daily_users
SELECT Purchase_Date,
       hll_add_agg(hll_hash_text(Username))
FROM EXAMPLE_TBL
GROUP BY 1;

With the hll table loaded (in 614.442 ms), we can now query it to retrieve the distinct number of users per date:

SELECT Purchase_Date,
       hll_cardinality(users)
FROM daily_users;

The result set returns in 18.357 ms. As mentioned above, the beauty of hll is that it can approximate aggregations: if, for example, we want to know the number of distinct users over the last two days, we can query the table and have it aggregate the results with:

SELECT hll_cardinality(hll_union_agg(users))
FROM daily_users
WHERE Purchase_Date >= CURRENT_DATE - 2;

Even if we store the data at a daily level, the hll data structure is able to provide an estimate of the aggregation over multiple days.
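As a sketch of the monthly roll-up mentioned earlier, we can union the daily hll structures per month and estimate the cardinality of each union:

SELECT date_trunc('month', Purchase_Date) AS purchase_month,
       hll_cardinality(hll_union_agg(users)) AS distinct_users
FROM daily_users
GROUP BY 1;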

Where hll gets even more interesting is its ability to perform set computations across the dataset. If, for example, we want to know how many users purchased yesterday but not today, we can do it with the following query:

SELECT Purchase_Date, (#hll_union_agg(users) OVER two_days) - #users AS lost_uniques
FROM daily_users
WINDOW two_days AS (ORDER BY Purchase_Date ASC ROWS 1 PRECEDING);

The above query:

  • Generates a window (WINDOW two_days) of two days containing the current and the previous date (ORDER BY Purchase_Date ASC ROWS 1 PRECEDING)
  • Calculates the aggregate number of distinct users over the two days (#hll_union_agg(users)) and subtracts the number of distinct users of the current day (#users).

The query provides the results in just 157.709 ms. To achieve the same result we could write a query full of self-joins and COUNT(DISTINCT) with far worse performance and impact on the database.
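For comparison, an exact-count equivalent over the raw table would look roughly like the following illustrative sketch, which, for each day, counts the users seen on the previous day that do not reappear on the current day:

-- For each day d, count users who purchased on d - 1 but not on d
SELECT d.Purchase_Date,
       COUNT(DISTINCT prev.Username) AS lost_uniques
FROM (SELECT DISTINCT Purchase_Date FROM EXAMPLE_TBL) d
JOIN EXAMPLE_TBL prev
  ON prev.Purchase_Date = d.Purchase_Date - 1
WHERE NOT EXISTS (
    SELECT 1
    FROM EXAMPLE_TBL cur
    WHERE cur.Purchase_Date = d.Purchase_Date
      AND cur.Username = prev.Username
)
GROUP BY d.Purchase_Date;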

See more

PostgreSQL Optimization