PostgreSQL Tutorial: Tune selectivity estimation of multivariate matching

March 3, 2024

Summary: in this tutorial, you will learn how to tune selectivity estimation of multivariate matching.

Table of Contents

PostgreSQL 10 introduced the ability to collect statistics from several columns simultaneously, also known as multivariate statistics. This requires generating the necessary extended statistics manually.

There are three types of multivariate statistics.

Functional dependencies between columns

When values in one column are determined (fully or partially) by values in another column, and there are conditions referencing both columns in a query, the resulting cardinality will be underestimated. Let’s start with a little test setup:

CREATE TABLE test3 (id integer, val varchar);

INSERT INTO test3 (id, val)
  SELECT i % 100, 'PG00' || i % 100
  FROM generate_series(1, 10000) AS s(i);

CREATE INDEX idx_test3_id ON test3 (id);

ANALYZE test3;

Here’s an example with two conditions:

SELECT count(*) FROM test3
WHERE val = 'PG0027' AND id = 27;

 count
-------
   100
(1 row)

The estimate is significantly lower than it should be, at just 1 row:

EXPLAIN SELECT * FROM test3
WHERE val = 'PG0027' AND id = 27;

                                 QUERY PLAN
-----------------------------------------------------------------------------
 Bitmap Heap Scan on test3  (cost=5.04..62.45 rows=1 width=10)
   Recheck Cond: (id = 27)
   Filter: ((val)::text = 'PG0027'::text)
   ->  Bitmap Index Scan on idx_test3_id  (cost=0.00..5.04 rows=100 width=0)
         Index Cond: (id = 27)
(5 rows)

This is the infamous correlated predicates problem. The planner expects the predicates to be independent and calculates the resulting selectivity as the product of selectivities of the conditions, combined with AND. The Bitmap Index Scan estimate, calculated for the id condition, drops significantly after the val condition in Bitmap Heap Scan is applied.

Naturally, a id number already unambiguously defines the val value, so the val condition is actually redundant. This is where extended functional dependencies statistics can help improve the estimate.

Let’s create extended functional dependencies statistics for the two columns:

CREATE STATISTICS test3_dep (dependencies) ON id, val FROM test3;

Analyze again, now with the new statistics, and the estimate improves:

ANALYZE test3;

EXPLAIN SELECT * FROM test3
WHERE val = 'PG0027' AND id = 27;

                                 QUERY PLAN
-----------------------------------------------------------------------------
 Bitmap Heap Scan on test3  (cost=5.06..62.48 rows=100 width=10)
   Recheck Cond: (id = 27)
   Filter: ((val)::text = 'PG0027'::text)
   ->  Bitmap Index Scan on idx_test3_id  (cost=0.00..5.04 rows=100 width=0)
         Index Cond: (id = 27)
(5 rows)

The statistics are stored in the system catalog and can be displayed with this command:

SELECT dependencies
FROM pg_stats_ext WHERE statistics_name = 'test3_dep';

               dependencies
------------------------------------------
 {"1 => 2": 1.000000, "2 => 1": 1.000000}
(1 row)

The numbers 1 and 2 are the table column numbers from pg_attribute. The values next to them represent the degree of functional dependency, from 0 (independent) to 1 (values in the second column are entirely defined by values in the first one).

Multivariate number of distinct values

Statistics on the number of distinct combinations of values from several columns will significantly improve the cardinality of a GROUP BY operation over multiple columns. Let’s start with a little test setup:

CREATE TABLE test4 (id integer, val varchar);

INSERT INTO test4 (id, val)
  SELECT i % 100, 'PG00' || (i % 40 + 1)
  FROM generate_series(1, 100000) AS s(i);

ANALYZE test4;

In this example, the planner estimates the number of possible pairs of id and val, as the multiplication of the number of distinct id values and the number of distinct val values. The true number of pairs is significantly lower, however, because there are some correlation betweent id and val:

SELECT count(*) FROM (
  SELECT DISTINCT id, val FROM test4
) t;

 count
-------
   200
(1 row)

EXPLAIN
SELECT DISTINCT id, val FROM test4;

                             QUERY PLAN
--------------------------------------------------------------------
 HashAggregate  (cost=2041.00..2081.00 rows=4000 width=10)
   Group Key: id, val
   ->  Seq Scan on test4  (cost=0.00..1541.00 rows=100000 width=10)
(3 rows)

Let’s create an extended statistics for the number of distinct values:

CREATE STATISTICS test4_nd(ndistinct) ON id, val FROM test4;
ANALYZE test4;

EXPLAIN
SELECT DISTINCT id, val FROM test4;

                             QUERY PLAN
--------------------------------------------------------------------
 HashAggregate  (cost=2041.00..2043.00 rows=200 width=10)
   Group Key: id, val
   ->  Seq Scan on test4  (cost=0.00..1541.00 rows=100000 width=10)
(3 rows)

The statistics are stored in the system catalog:

SELECT n_distinct
FROM pg_stats_ext WHERE statistics_name = 'test4_nd';

  n_distinct
---------------
 {"1, 2": 200}
(1 row)

Multivariate most common values lists

When values are distributed non-uniformly, functional dependency data alone may not suffice, as the estimate will vary significantly depending on specific pairs of values. Let’s start with a little test setup:

CREATE TABLE test5 (id integer, val varchar);

INSERT INTO test5 (id, val)
  SELECT i % 50, 'PG00' || (i % 20 + 1)
  FROM generate_series(1, 10000) AS s(i);

ANALYZE test5;

Consider this example, where the planner inaccurately estimates the number of rows, given id equal to 35 and val equal to ‘PG0016’:

SELECT count(*) FROM test5
WHERE val = 'PG0016' AND id = 35;

 count
-------
   100
(1 row)

EXPLAIN SELECT * FROM test5
WHERE val = 'PG0016' AND id = 35;

                        QUERY PLAN
----------------------------------------------------------
 Seq Scan on test5  (cost=0.00..205.00 rows=10 width=10)
   Filter: (((val)::text = 'PG0016'::text) AND (id = 35))
(2 rows)

We can improve the estimate with multivariate MCV list statistics:

CREATE STATISTICS test5_mcv(mcv) ON id, val FROM test5;
ANALYZE test5;

EXPLAIN SELECT * FROM test5
WHERE val = 'PG0016' AND id = 35;

                        QUERY PLAN
----------------------------------------------------------
 Seq Scan on test5  (cost=0.00..205.00 rows=100 width=10)
   Filter: (((val)::text = 'PG0016'::text) AND (id = 35))
(2 rows)

Now there is frequency data in the system catalog for the planner to use:

SELECT values, frequency
FROM pg_statistic_ext stx
  JOIN pg_statistic_ext_data stxd ON stx.oid = stxd.stxoid,
  pg_mcv_list_items(stxdmcv) m
WHERE stxname = 'test5_mcv'
AND values = '{35,PG0016}';

   values    | frequency
-------------+-----------
 {35,PG0016} |      0.01
(1 row)

A multivariate most common values list stores default_statistics_target values, just like a regular MCV list. If the parameter is defined at the column level, the highest value is used.

As with extended expression statistics, you can change the list size (in PostgreSQL 13 and higher):

ALTER STATISTICS ... SET STATISTICS ...;

In these examples, multivariate statistics are collected for only two columns, but you can collect them for as many columns as you want.

You can also collect different types of statistics into a single extended statistics object. To do this, just list required statistics types separated by commas when creating an object. If no specific statistics types are defined, the system will collect all available statistics at once.

PostgreSQL 14 also allows you to use not just column names, but arbitrary expressions when making multivariate and expression statistics.