May 16, 2024
Summary: In this tutorial, you will learn about the basic relation-level statistics that PostgreSQL keeps in its system catalog.
Getting ready
Let’s start with a little test setup:
CREATE TABLE test1 (id integer, str text);
INSERT INTO test1 (id, str)
SELECT i, repeat('x', 64) as str
FROM generate_series(1, 240000) AS s(i);
VACUUM ANALYZE test1;
Basic statistics
Basic relation-level statistics are stored in the table pg_class
in the system catalog. The statistics include the following data:
- Relation’s row count (reltuples).
- Relation’s size in pages (relpages).
- Number of pages marked all-visible in the relation’s visibility map (relallvisible).
SELECT reltuples, relpages, relallvisible
FROM pg_class WHERE relname = 'test1';
reltuples | relpages | relallvisible
-----------+----------+---------------
240000 | 2963 | 2963
(1 row)
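As a quick cross-check, relpages should agree with the actual file size divided by the block size (8 kB by default):

SELECT pg_relation_size('test1') / current_setting('block_size')::int AS pages;
-- returns 2963 here, matching relpages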
For queries with no conditions (filters), the cardinality estimate will equal reltuples:
EXPLAIN SELECT * FROM test1;
QUERY PLAN
--------------------------------------------------------------
Seq Scan on test1 (cost=0.00..5363.00 rows=240000 width=69)
(1 row)
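As a side note, with the default cost parameters (seq_page_cost = 1.0, cpu_tuple_cost = 0.01) the total cost of this plan is simply the page count plus a per-row CPU charge:

-- 2963 pages × 1.0 + 240,000 rows × 0.01 = 5363.00
SELECT 2963 * 1.0 + 240000 * 0.01 AS seq_scan_total_cost;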
How are statistics collected?
Statistics are collected during automatic or manual analysis. Basic statistics, being vital information, are also calculated when certain operations are performed, such as VACUUM FULL and CLUSTER, or CREATE INDEX and REINDEX. The system also updates statistics during vacuuming.
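For instance, building an index refreshes reltuples and relpages even without an explicit ANALYZE. A quick check (test1_stats_demo is a throwaway name introduced here, with autovacuuming disabled so it cannot interfere):

CREATE TABLE test1_stats_demo (LIKE test1)
WITH (autovacuum_enabled = false);
INSERT INTO test1_stats_demo SELECT * FROM test1;
CREATE INDEX ON test1_stats_demo (id);
SELECT reltuples, relpages
FROM pg_class WHERE relname = 'test1_stats_demo';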
To collect statistics, the analyzer randomly selects 300 × default_statistics_target rows (the default value is 100, so 30,000 rows in total). Table sizes are not taken into account here because overall dataset size has little effect on what sample size would be considered sufficient for accurate statistics.
Random rows are selected from 300 × default_statistics_target random pages. If a table is smaller than the desired sample size, the analyzer just reads the whole table.
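You can compute the sample size for the current configuration directly (300 × 100 = 30,000 with the default setting):

SELECT 300 * current_setting('default_statistics_target')::integer AS sample_rows;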
In large tables, statistics will be imprecise because the analyzer does not scan every single row. Even if it did, the statistics would always be somewhat outdated, because table data keeps changing. We don’t need the statistics to be all that precise anyway: estimates that are off by up to an order of magnitude are usually still good enough to produce an adequate plan.
Example
Let’s create a copy of the test1 table with autovacuuming disabled, so we can control when analysis happens.
CREATE TABLE test1_copy(LIKE test1)
WITH (autovacuum_enabled = false);
There are no statistics for the new table yet:
SELECT reltuples, relpages, relallvisible
FROM pg_class WHERE relname = 'test1_copy';
reltuples | relpages | relallvisible
-----------+----------+---------------
-1 | 0 | 0
(1 row)
The value reltuples = −1 (in PostgreSQL 14 and higher) helps us distinguish between a table that has never had statistics collected for it and a table that just doesn’t have any rows.
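To see the contrast (test1_empty is a name introduced here for illustration), analyzing a table that is genuinely empty sets reltuples to 0 rather than −1:

CREATE TABLE test1_empty (LIKE test1);
ANALYZE test1_empty;
SELECT reltuples FROM pg_class WHERE relname = 'test1_empty';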
More often than not, a newly created table is populated right away. The planner does not know anything about the new table, so it assumes the table to be 10 pages long by default:
EXPLAIN SELECT * FROM test1_copy;
QUERY PLAN
---------------------------------------------------------------
Seq Scan on test1_copy (cost=0.00..22.70 rows=1270 width=36)
(1 row)
The planner calculates the row count based on the width of a single row. The width is usually an average value that’s calculated during analysis. This time, however, there is no analysis data, so the system approximates the width based on column data types.
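The arithmetic behind rows=1270 can be sketched as follows, assuming an 8 kB page with a 24-byte header and roughly 28 bytes of per-tuple overhead on top of the estimated width of 36:

-- (8192 − 24) / (36 + 28) = 127 tuples per page; 10 default pages give 1270 rows
SELECT 10 * ((8192 - 24) / (36 + 28)) AS estimated_rows;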
Let’s copy the data from test1 into the new table and run the analyzer:
INSERT INTO test1_copy SELECT * FROM test1;
ANALYZE test1_copy;
Now the statistics match the actual row count. The table is compact enough for the analyzer to run through every row:
SELECT reltuples, relpages, relallvisible
FROM pg_class WHERE relname = 'test1_copy';
reltuples | relpages | relallvisible
-----------+----------+---------------
240000 | 2963 | 0
(1 row)
The relallvisible value updates after vacuuming:
VACUUM test1_copy;
SELECT relallvisible FROM pg_class WHERE relname = 'test1_copy';
relallvisible
---------------
2963
(1 row)
This value is used when estimating index-only scan cost.
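For instance, with an index on test1 (which was vacuumed during setup, so its visibility map is fully populated), the planner can pick an index-only scan; a lower relallvisible would raise the estimated cost, since more heap pages would have to be visited. A sketch (test1_id_idx is a name introduced here):

CREATE INDEX test1_id_idx ON test1 (id);
EXPLAIN SELECT id FROM test1 WHERE id < 100;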
Let’s double the number of rows while keeping the old statistics and see what cardinality the planner comes up with:
INSERT INTO test1_copy SELECT * FROM test1;
SELECT count(*) FROM test1_copy;
count
--------
480000
(1 row)
EXPLAIN SELECT * FROM test1_copy;
QUERY PLAN
--------------------------------------------------------------------
Seq Scan on test1_copy (cost=0.00..10726.00 rows=480000 width=69)
(1 row)
The estimate is accurate, despite the outdated pg_class data:
SELECT reltuples, relpages
FROM pg_class WHERE relname = 'test1_copy';
reltuples | relpages
-----------+----------
240000 | 2963
(1 row)
The planner notices that the size of the data file no longer matches the old relpages value, so it scales reltuples appropriately in an attempt to increase accuracy. The file size has doubled, so the number of rows is adjusted accordingly (data density is presumed constant):
SELECT reltuples * (pg_relation_size('test1_copy') / 8192) / relpages
FROM pg_class WHERE relname = 'test1_copy';
?column?
----------
480000
(1 row)
This adjustment doesn’t always work: deleting rows, for example, marks them dead but leaves the data file the same size, so the estimate will not change. But when big changes are made, this approach keeps the statistics serviceable until the analyzer comes around.
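A quick sketch of that limitation, reusing the table from above: after deleting half the rows, the file size is unchanged, so the planner still scales reltuples to about 480,000.

DELETE FROM test1_copy WHERE id % 2 = 0;
EXPLAIN SELECT * FROM test1_copy;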