PostgreSQL Tutorial: Speed up COUNT(DISTINCT) using estimates

January 3, 2025

Summary: in this tutorial, you will learn how to speed up COUNT(DISTINCT) using database estimates in PostgreSQL.

Introduction

Sometimes we don’t need full precision in the number we are reporting: what’s the difference between 1,568,121 users and 1,569,200 users if we are showing “1.5 million users” anyway? In such cases we can make use of the database’s cardinality estimates.

The benefit of using estimates is that we can query the database’s catalog tables, which already contain a recent approximate cardinality, instead of having to scan the real table.

Approximate count using the database cardinality estimate tables

The database cardinality estimates are rough counts of the rows and distinct values in each table contained in the database.

These counts help the database engine determine the most effective plan for a specific SQL query: should it scan the entire table or use an index? Should it perform a hash join or a nested loop? All these questions can only be answered properly with a good grasp of the data volumes involved in the query, which the database engine keeps in a dedicated set of catalog tables.
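For distinct values specifically, PostgreSQL exposes a per-column estimate in the pg_stats view. A minimal sketch, assuming the example_tbl table and username column used later in this tutorial:

```sql
-- Planner estimate of the number of distinct usernames, from column statistics.
-- A positive n_distinct is an absolute count; a negative value is the estimated
-- ratio of distinct values to total rows (multiply by -reltuples to get a count).
SELECT n_distinct
FROM pg_stats
WHERE tablename = 'example_tbl'
  AND attname = 'username';
```

Like reltuples in pg_class, this figure is only refreshed when statistics are collected, e.g. by ANALYZE.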

In PostgreSQL we can query pg_class, which contains the most recent row-count estimate for each table. In our example, if we execute:

SELECT reltuples AS estimate 
FROM pg_class 
WHERE relname = 'example_tbl';

The estimate will show 0 rows. Weird, right? This is because we didn’t run ANALYZE after loading the table, so the database doesn’t have up-to-date statistics for example_tbl. If we run:

ANALYZE example_tbl;

And then rerun the same SELECT statement defined above, we’ll get the updated count of 5. We can use the pg_class table to get an estimate of the COUNT(*) for a specific table, but we need to be aware that the number is only refreshed by VACUUM, ANALYZE, and a few DDL commands such as CREATE INDEX.
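The staleness is easy to observe. A sketch, assuming example_tbl has the username and qty columns used elsewhere in this tutorial:

```sql
-- The estimate goes stale until the next ANALYZE (or autovacuum) run
INSERT INTO example_tbl (username, qty) VALUES ('Anna', 7);  -- hypothetical new row

SELECT reltuples::bigint AS estimate
FROM pg_class
WHERE relname = 'example_tbl';   -- still shows 5

ANALYZE example_tbl;

SELECT reltuples::bigint AS estimate
FROM pg_class
WHERE relname = 'example_tbl';   -- now shows 6
```

In practice autovacuum refreshes these statistics periodically, so on a busy table the estimate is usually only slightly behind.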

Note: MySQL exposes similar information in the STATISTICS table within the INFORMATION_SCHEMA.
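For instance, a sketch of the MySQL equivalent, assuming an index exists on the username column (CARDINALITY is MySQL’s estimate of the number of distinct values for an indexed column):

```sql
SELECT CARDINALITY
FROM INFORMATION_SCHEMA.STATISTICS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'example_tbl'
  AND COLUMN_NAME = 'username';
```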

Approximate count using the result of EXPLAIN

The above solution is pretty static and works on the assumption that we are counting (and not counting distinct) the rows of a single table, without specific WHERE conditions applied. The PostgreSQL wiki suggests a better estimate, especially for complex queries, obtained by parsing the result of the EXPLAIN command. EXPLAIN returns the database’s current cardinality estimate and plan for the query, and it takes WHERE conditions into consideration.

We can create a function that parses the result of the EXPLAIN call like the following:

CREATE OR REPLACE FUNCTION count_estimate(
    query text
) RETURNS integer LANGUAGE plpgsql AS $$
DECLARE
    plan jsonb;
BEGIN
    -- Ask the planner for its estimate without executing the query
    EXECUTE 'EXPLAIN (FORMAT JSON) ' || query INTO plan;
    -- Extract "Plan Rows" as text and cast it to integer
    RETURN (plan->0->'Plan'->>'Plan Rows')::integer;
END;
$$;

And then get the database estimate for the specific query with:

SELECT count_estimate('SELECT DISTINCT username FROM example_tbl WHERE qty >= 2');

For reference, the plain-text EXPLAIN output for the inner query is the following:

                              QUERY PLAN
-----------------------------------------------------------------------
 Unique  (cost=1.07..1.08 rows=2 width=5)
   ->  Sort  (cost=1.07..1.08 rows=2 width=5)
         Sort Key: username
         ->  Seq Scan on example_tbl  (cost=0.00..1.06 rows=2 width=5)
               Filter: (qty >= 2)
(5 rows)

And the function correctly retrieves the expected cardinality of 2. Please note again that the result is an estimate: the exact result would be 3 (Fred, Liza, and Mary).
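To see the trade-off side by side, we can compare the estimate with the exact answer on the same query (a sketch against the example_tbl data described above):

```sql
SELECT
    count_estimate('SELECT DISTINCT username FROM example_tbl WHERE qty >= 2') AS estimated,
    (SELECT COUNT(DISTINCT username) FROM example_tbl WHERE qty >= 2) AS exact;
-- estimated | exact
-- ----------+------
--         2 |     3
```

On a five-row table the exact query is obviously faster, but on tables with millions of rows the estimate returns in milliseconds while the exact COUNT(DISTINCT) can take minutes.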

See more

PostgreSQL Optimization