January 3, 2025
Summary: in this tutorial, you will learn how to speed up COUNT(DISTINCT) using database estimates in PostgreSQL.
Introduction
Sometimes we don’t need full precision in the number we are reporting: what’s the difference between 1,568,121 users and 1,569,200 users if we are showing 1.5 million users anyway? In such cases we can make use of the database cardinality estimates.
The benefit of using estimates is that we can query the database’s internal statistics tables, which contain a recent approximate cardinality, instead of having to scan the real table.
Approximate count using the database cardinality estimate tables
The database cardinality estimates are rough counts of the number of rows and distinct values available in the tables that are contained in the database.
These counts help the database engine determine the most effective plan for a specific SQL query: should it scan the entire table or use an index? Should it perform a hash join or a nested loop? All these questions can be answered properly with a good grasp of the data volumes involved in the query, which the database engine keeps in a dedicated set of tables.
In PostgreSQL we can query pg_class, which contains the most recent row count estimate for each table. In our example, if we execute:
SELECT reltuples AS estimate
FROM pg_class
WHERE relname = 'example_tbl';
The estimate will show 0 rows. Weird, right? This is because we didn’t run ANALYZE after loading the table, so the database doesn’t have up-to-date statistics about EXAMPLE_TBL. If we run:
ANALYZE EXAMPLE_TBL;
and then rerun the same SELECT statement we defined above, we’ll get the updated count of 5. We can use the pg_class table to get an estimate of the COUNT(*) for a specific table, but we need to be aware that the number is only updated by VACUUM, ANALYZE, and a few DDL commands such as CREATE INDEX.
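If we want to check how stale that estimate might be, we can look at when the table was last analyzed. A minimal sketch, assuming the same example_tbl table, using the pg_stat_user_tables statistics view:
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'example_tbl';
If both timestamp columns are NULL, the table has never been analyzed and the reltuples estimate should not be trusted.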
Note: A similar view is available for MySQL as well, in the STATISTICS table within the INFORMATION_SCHEMA.
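As an illustrative sketch only (assuming a MySQL table also named example_tbl with an index on username; the STATISTICS view covers indexed columns and its numbers are refreshed by ANALYZE TABLE):
SELECT TABLE_NAME, INDEX_NAME, COLUMN_NAME, CARDINALITY
FROM INFORMATION_SCHEMA.STATISTICS
WHERE TABLE_NAME = 'example_tbl'
  AND COLUMN_NAME = 'username';
Here CARDINALITY is MySQL’s estimate of the number of distinct values in the indexed column.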
Approximate count using the result of EXPLAIN
The above solution is pretty static and works on the assumption that we are counting rows (not distinct values) in a single table, without specific WHERE conditions applied. The PostgreSQL wiki suggests a better estimate, especially for complex queries, obtained by parsing the result of the EXPLAIN command. EXPLAIN returns the database’s current cardinality estimate and plan for the query, and it takes WHERE conditions into consideration.
We can create a function that parses the result of the EXPLAIN
call like the following:
CREATE OR REPLACE FUNCTION count_estimate(
    query text
) RETURNS integer LANGUAGE plpgsql AS $$
DECLARE
    plan jsonb;
BEGIN
    -- Run EXPLAIN in JSON format on the query passed in
    -- (note the trailing space before concatenating the query).
    EXECUTE 'EXPLAIN (FORMAT JSON) ' || query INTO plan;
    -- The estimated row count lives in the "Plan Rows" field of the top-level plan node.
    RETURN (plan->0->'Plan'->>'Plan Rows')::integer;
END;
$$;
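The reason the function navigates plan->0->'Plan'->>'Plan Rows' is the shape of the JSON output: EXPLAIN (FORMAT JSON) returns an array whose first element contains a Plan object holding the estimated row count. Abridged, and with illustrative values taken from the query below, it looks roughly like:
[
  {
    "Plan": {
      "Node Type": "Unique",
      "Plan Rows": 2
    }
  }
]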
And then get the database estimate for the specific query with:
SELECT count_estimate('SELECT DISTINCT username FROM example_tbl WHERE Qty >= 2');
For reference, the result of a plain (text format) EXPLAIN on the same query is the following:
QUERY PLAN
-----------------------------------------------------------------------
 Unique  (cost=1.07..1.08 rows=2 width=5)
   ->  Sort  (cost=1.07..1.08 rows=2 width=5)
         Sort Key: username
         ->  Seq Scan on example_tbl  (cost=0.00..1.06 rows=2 width=5)
               Filter: (qty >= 2)
(5 rows)
And the function correctly retrieves the estimated cardinality of 2. Please note again that the result is an estimate: the exact count would be 3 (Fred, Liza and Mary).
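As a final, purely illustrative check, reusing the count_estimate function and the example_tbl data from above, we can put the estimate and the exact count side by side:
SELECT
  count_estimate('SELECT DISTINCT username FROM example_tbl WHERE qty >= 2') AS estimated,
  (SELECT COUNT(DISTINCT username) FROM example_tbl WHERE qty >= 2) AS exact;
On this tiny dataset the estimate (2) is off by one from the exact count (3); on larger tables the gap is typically small relative to the total, provided the statistics are kept fresh with ANALYZE.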