By John Doe, February 25, 2024
Summary: Planner statistics play a key role in improving the performance of the database. The query planner uses the statistical data to generate efficient execution plans for queries. The purpose of this post is to explain the types of statistics in PostgreSQL and how to read and understand them.
Data distribution statistics
These statistics are related to the data distribution for each relation. They provide information about the most common values in each column in a relation, average width of the column, number of distinct values in the column, and more. They’re collected when we run ANALYZE or when analyze is triggered by autovacuum, and are stored in the pg_statistic system catalog (whose public readable view is pg_stats).
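If you want to confirm when a table’s statistics were last refreshed, whether by a manual ANALYZE or by autovacuum, one simple check (a minimal sketch; 'your_table' is just a placeholder name) is to query the pg_stat_user_tables view:

-- When was this table last analyzed, manually or by autoanalyze?
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'your_table';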
The following is an example of how you can generate and see these stats:
- Create a table and insert some dummy data:

postgres=# CREATE TABLE test_stats(id INT, name VARCHAR);
CREATE TABLE
postgres=> INSERT INTO test_stats VALUES (generate_series(1,10),'test'||generate_series(1,10));
INSERT 0 10
postgres=> INSERT INTO test_stats VALUES (generate_series(1,10),'test'||generate_series(1,10));
INSERT 0 10
postgres=> INSERT INTO test_stats VALUES (generate_series(1,10),'test'||generate_series(1,10));
INSERT 0 10
postgres=> INSERT INTO test_stats VALUES (generate_series(11,20),'test'||generate_series(11,20));
INSERT 0 10
- Analyze the table to generate the stats for the table:

postgres=> ANALYZE VERBOSE test_stats;
INFO:  analyzing "public.test_stats"
INFO:  "test_stats": scanned 1 of 1 pages, containing 40 live rows and 0 dead rows; 40 rows in sample, 40 estimated total rows
ANALYZE
- Check the stats of the table in the pg_stats view:

postgres=> SELECT * FROM pg_stats WHERE tablename ='test_stats';
-[ RECORD 1 ]----------+------------------------------------------------
schemaname             | public
tablename              | test_stats
attname                | id
inherited              | f
null_frac              | 0
avg_width              | 4
n_distinct             | -0.5
most_common_vals       | {1,2,3,4,5,6,7,8,9,10}
most_common_freqs      | {0.075,0.075,0.075,0.075,0.075,0.075,0.075,0.075,0.075,0.075}
histogram_bounds       | {11,12,13,14,15,16,17,18,19,20}
correlation            | 0.7551595
most_common_elems      |
most_common_elem_freqs |
elem_count_histogram   |
-[ RECORD 2 ]----------+------------------------------------------------
schemaname             | public
tablename              | test_stats
attname                | name
inherited              | f
null_frac              | 0
avg_width              | 6
n_distinct             | -0.5
most_common_vals       | {test1,test10,test2,test3,test4,test5,test6,test7,test8,test9}
most_common_freqs      | {0.075,0.075,0.075,0.075,0.075,0.075,0.075,0.075,0.075,0.075}
histogram_bounds       | {test11,test12,test13,test14,test15,test16,test17,test18,test19,test20}
correlation            | -0.19043152
most_common_elems      |
most_common_elem_freqs |
elem_count_histogram   |
As shown in the output, the pg_stats view has the data distribution stats for each column of the test_stats table. For example, column id has 20 distinct values, but only the 10 duplicated values (1 through 10) appear in most_common_vals; similarly, for the name column, test1 through test10 appear as the most common values. The most common value list is used to help the planner predict the selectivity of equality expressions, such as where name='test5' or where state='TX'. The histogram bounds are used to help the planner predict the selectivity of inequality or range expressions, such as where id between 5000 and 10000.
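For example, on the test_stats table you can see the planner relying on these stats by comparing an equality predicate (estimated from most_common_vals and most_common_freqs) with a range predicate (estimated from histogram_bounds). The following is only a sketch; the exact estimated row counts depend on your data and sample:

-- Equality predicate: selectivity comes from the most common values list
EXPLAIN SELECT * FROM test_stats WHERE name = 'test5';

-- Range predicate: selectivity comes from the histogram bounds
EXPLAIN SELECT * FROM test_stats WHERE id BETWEEN 12 AND 15;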
If you look at the correlation column for the name column, its value is -0.19 (near 0). When the value is near -1 or +1, an index scan on the column is estimated to be cheaper than when it’s near 0, due to the reduction of random access to the disk. Because there are only 40 rows, a value near 0 indicates that an index isn’t required for this column. There are no null values in the table, so null_frac is 0.
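If poor correlation on a column you often scan by range is making index scans look expensive, one option is to physically reorder the table and re-analyze so the planner sees the improved correlation. The following is a sketch with a hypothetical index name; CLUSTER rewrites the table and takes an exclusive lock, so use it with care:

-- Hypothetical index on id; CLUSTER rewrites the table in index order
CREATE INDEX test_stats_id_idx ON test_stats (id);
CLUSTER test_stats USING test_stats_id_idx;
ANALYZE test_stats;

-- The correlation for id should now be close to 1
SELECT attname, correlation FROM pg_stats WHERE tablename = 'test_stats';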
The query planner needs to estimate the number of rows retrieved by a query in order to make good choices of query plans. It uses these data distribution statistics to make those estimates.
The following are a few key things to note about the statistics:
- For large tables, ANALYZE takes a random sample of the table contents, rather than examining every row. This allows even very large tables to be analyzed in a small amount of time.
- The number of rows sampled by ANALYZE depends on the default_statistics_target parameter. Larger values increase the time needed to run ANALYZE, but might improve the quality of the planner’s estimates. The default value is 100, which is sufficient for an accurate plan in most cases; however, default_statistics_target is the global default. If a single column needs more detailed stats, you can raise the target for just that column with ALTER TABLE table_name ALTER [ COLUMN ] column_name SET STATISTICS integer (see the example after this list), at the cost of more CPU, memory, and time during ANALYZE. If the value of this parameter is 100, then 300 * 100 = 30,000 rows are sampled from each table. This sample is used to determine up to 100 most common values to store in the most_common_vals array column, and up to 100 histogram bounds to store in that array, plus a few other scalar statistics, like the number of distinct values.
- There is no command or function for resetting these statistics (as in resetting to 0).
- After you complete an engine major version upgrade, you should run ANALYZE to refresh the pg_statistic table (so the planner has up-to-date statistics to work with).
- For a read replica in PostgreSQL, these stats are the same as on the primary or writer. This is because they are stored in a relation (pg_statistic) on disk (the physical blocks are the same on the replica in PostgreSQL). This is also the reason why it isn’t allowed (and also not logical) to run ANALYZE on a replica or a reader node (both can read from the pg_statistic relation, but can’t update it).
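As mentioned in the list above, you can raise the statistics target for a single column instead of changing the global default_statistics_target. The following is a minimal sketch; the column and the value 500 are only examples:

-- Collect more detailed stats for the id column only (default target is 100)
ALTER TABLE test_stats ALTER COLUMN id SET STATISTICS 500;
ANALYZE test_stats;

-- Revert the column to the global default_statistics_target
ALTER TABLE test_stats ALTER COLUMN id SET STATISTICS -1;
ANALYZE test_stats;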
Extended statistics
By default, the statistics from ANALYZE are stored on a per-column, per-table basis, and therefore can’t capture any knowledge about cross-column correlation. It’s common to see slow queries running with bad execution plans because multiple columns used in the query clauses are correlated. However, with the CREATE STATISTICS command, you can create extended statistics for correlated columns.
You can use CREATE STATISTICS for a single expression (which is called univariate statistics), which provides benefits similar to an expression index without the overhead of index maintenance. You can also use CREATE STATISTICS on multiple columns or expressions (known as multivariate statistics).
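For example, if your queries filter on an expression such as lower(name), a univariate statistics object gives the planner per-expression stats without the maintenance cost of an expression index. This sketch assumes PostgreSQL 14 or later, where CREATE STATISTICS accepts a single expression:

-- Expression statistics (PostgreSQL 14+): helps estimate predicates on lower(name)
CREATE STATISTICS test_stats_lower_name ON (lower(name)) FROM test_stats;
ANALYZE test_stats;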
Consider a common table with a relationship between zone and city in China. For example, we have a query to find the rows for the High-tech zone in the city of Shanghai. There is a zone named High-tech in most cities, and the city of Shanghai has many zones. Without a known relationship between the columns, the planner estimates the number of rows for that combination by simply multiplying the two individual probabilities. Extended statistics gather stats on the two columns together, so the estimate of the number of rows from the High-tech zone in Shanghai is a lot closer to the actual count.
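A sketch of what that could look like, assuming a hypothetical addresses table with city and zone columns:

-- Hypothetical table; gather distinct-count and functional-dependency stats on the pair
CREATE STATISTICS zone_city_stats (ndistinct, dependencies) ON zone, city FROM addresses;
ANALYZE addresses;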
Let’s use a table with two integer columns to demonstrate the usage of extended stats. Complete the following steps:
- Create a table with columns a and b and insert some data:

postgres=> CREATE TABLE ext_tests(a int, b int);
CREATE TABLE
postgres=> INSERT INTO ext_tests SELECT x/1000, x/10000 FROM generate_series(1,1000000) s(x);
INSERT 0 1000000
postgres=> ANALYZE VERBOSE ext_tests;
INFO:  analyzing "public.ext_tests"
INFO:  "ext_tests": scanned 4425 of 4425 pages, containing 1000000 live rows and 0 dead rows; 30000 rows in sample, 1000000 estimated total rows
ANALYZE
Given the data inserted, these columns are functionally dependent: knowing the value in the first column is sufficient to determine the value in the other column.
- Look at the pg_stats values for these columns:

postgres=> select * from pg_stats where tablename='ext_tests';
-[ RECORD 1 ]----------+-----------------------------------------------------------
schemaname             | public
tablename              | ext_tests
attname                | a
inherited              | f
null_frac              | 0
avg_width              | 4
n_distinct             | 1000
most_common_vals       | {698,821,983}
most_common_freqs      | {0.0015,0.0014666667,0.0014666667}
histogram_bounds       | {0,10,20,…..,999}
correlation            | 1
most_common_elems      |
most_common_elem_freqs |
elem_count_histogram   |
-[ RECORD 2 ]----------+-----------------------------------------------------------
schemaname             | public
tablename              | ext_tests
attname                | b
inherited              | f
null_frac              | 0
avg_width              | 4
n_distinct             | 100
most_common_vals       | {84,19,…..,55,5,74,62}
most_common_freqs      | {0.011166667,0.011033333,….0.008733333,0.008433334}
histogram_bounds       |
correlation            | 1
most_common_elems      |
most_common_elem_freqs |
elem_count_histogram   |
You can see the distinct values for column a as 1,000 and for column b as 100.

- Without extended statistics, selecting on one column looks fine: the estimated row count of 999 is almost the same as the actual row count of 1,000:

postgres=> set max_parallel_workers_per_gather = 0;
SET
postgres=> explain analyze select * from ext_tests where a=1;
                                                  QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Seq Scan on ext_tests  (cost=0.00..16925.00 rows=999 width=8) (actual time=0.073..58.429 rows=1000 loops=1)
   Filter: (a = 1)
   Rows Removed by Filter: 999000
 Planning Time: 0.037 ms
 Execution Time: 58.485 ms
(5 rows)
- However, if you add the second column to the predicate, the planner estimates rows very differently, because no dependency statistics are gathered:

postgres=> explain analyze select * from ext_tests where a=1 and b=0;
                                                  QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Seq Scan on ext_tests  (cost=0.00..194247.65 rows=1 width=8) (actual time=0.077..612.787 rows=1000 loops=1)
   Filter: ((a = 1) AND (b = 0))
   Rows Removed by Filter: 9999000
 Planning Time: 0.044 ms
 Execution Time: 612.844 ms
(5 rows)
You can see estimated rows as 1, which is far from the actual rows (1,000).
- Let’s gather extended statistics:

postgres=> create statistics s_ext_depend(dependencies) on a, b from ext_tests;
CREATE STATISTICS
postgres=> analyze VERBOSE ext_tests;
INFO:  analyzing "public.ext_tests"
INFO:  "ext_tests": scanned 30000 of 44248 pages, containing 6779952 live rows and 0 dead rows; 30000 rows in sample, 9999977 estimated total rows
ANALYZE
postgres=> explain analyze select * from ext_tests where a=1 and b=0;
                                                    QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Seq Scan on ext_tests  (cost=0.00..194247.65 rows=1017 width=8) (actual time=0.076..615.219 rows=1000 loops=1)
   Filter: ((a = 1) AND (b = 0))
   Rows Removed by Filter: 9999000
 Planning Time: 0.126 ms
 Execution Time: 615.287 ms
(5 rows)
As you can see, the estimated row count (1,017) is now very close to the actual number of rows.
All queries with filters on both columns will have much better estimates using extended statistics.
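If you want to inspect what ANALYZE gathered for the extended statistics object, you can query the pg_stats_ext view (available since PostgreSQL 12); the dependencies column shows the strength of the functional dependency between the columns:

-- Inspect the extended statistics collected for ext_tests
SELECT statistics_name, attnames, dependencies
FROM pg_stats_ext
WHERE tablename = 'ext_tests';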