PostgreSQL 18 预览: 分区表与优化器统计信息

由 John Doe 四月 22, 2025

当分区表的数据量相当大的时候，统计信息的更新会出现滞后吗？现在，PostgreSQL 可以单独收集分区表的统计信息了。

一头与 SQL 共舞的大象

介绍

PostgreSQL 的优化器依靠准确的统计信息，为给定的语句找到最高效的计划来执行。错误或不准确的统计信息，通常会导致产生不是最优的执行计划，进而在性能方面表现不佳。这不仅适用于 PostgreSQL，所有使用基于成本的优化器的数据库系统都依赖统计信息。通常情况下，你无需关心统计信息，因为 PostgreSQL 会根据自上次收集统计信息以来表中发生的更改量，自动地收集它们。下面，我们先来了解这一过程的一般原理，然后深入探讨分区表的相关细节。

普通表的统计信息收集

让我们先创建一个简单的表，并向其中插入 100 万行数据：

create table x ( a int, b text );

insert into x select i, 'aaa' from generate_series(1,1000000) i;

在填充完表后，立即查询pg_stats中的可用统计信息，很可能什么都看不到：

select count(*) from pg_stats where tablename = 'x';
 count
-------
     0
(1 row)

原因是，自动清理进程（autovacuum，它也负责自动收集统计信息）会在唤醒之前休眠一段时间，再检查是否有任务需要执行。默认情况下，这段时间是 1 分钟：

show autovacuum_naptime;
 autovacuum_naptime
--------------------
 1min
(1 row)

最多 1 分钟后，你会发现统计信息可用了：

select count(*) from pg_stats where tablename = 'x';
 count
-------
     2
(1 row)

这里，我们不会详细来研究这些统计信息的具体内容，只需记住自动清理进程默认会收集它们。它何时进行收集，取决于以下参数：

show autovacuum_analyze_scale_factor;
 autovacuum_analyze_scale_factor
---------------------------------
 0.1
(1 row)

show autovacuum_analyze_threshold;
 autovacuum_analyze_threshold
------------------------------
 50
(1 row)

对于我们的表x，这意味着大约需要更改表中 10% 的数据，才会触发自动收集统计信息，这一点很容易证明（同样，你可能需要等待长达 1 分钟，才能看到pg_stat_all_tables中的last_autoanalyze更新）：

update x set b = 'bbb' where a < 110000;

-- 等待长达 1 分钟
select last_autoanalyze from pg_stat_all_tables where relname = 'x';
       last_autoanalyze
-------------------------------
 2025-04-15 10:49:06.095124+08
(1 row)

分区表的统计信息收集

对于分区表，分区的统计信息收集机制与普通表相同，但分区表本身并非如此。在探讨原因之前，让我们再进行一个简单的测试：

create table y ( a int, b text ) partition by list (b);

create table y_1 partition of y for values in ('a');

create table y_2 partition of y for values in ('b');

postgres=# \d+ y
                                     Partitioned table "public.y"
 Column |  Type   | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
--------+---------+-----------+----------+---------+----------+-------------+--------------+-------------
 a      | integer |           |          |         | plain    |             |              |
 b      | text    |           |          |         | extended |             |              |
Partition key: LIST (b)
Partitions: y_1 FOR VALUES IN ('a'),
            y_2 FOR VALUES IN ('b')

这是一个简单的列表分区表，包含两个分区。当我们添加一些数据时，PostgreSQL 会负责收集统计信息，但有一个意外情况：

insert into y select i, 'a' from generate_series(1,1000000) i;

insert into y select i, 'b' from generate_series(1,1000000) i;

-- 等待长达 1 分钟
select relname,last_autoanalyze from pg_stat_all_tables where relname like 'y%';
 relname |       last_autoanalyze
---------+-------------------------------
 y       |
 y_1     | 2025-04-15 11:22:07.463249+08
 y_2     | 2025-04-15 11:22:07.180454+08

我们看到，各个分区的统计信息已被收集，但分区表本身却没有。原因是，分区表本身不会跟踪更改量，只有分区会跟踪：

select relname,last_autoanalyze,n_mod_since_analyze from pg_stat_all_tables where relname like 'y%';
 relname |       last_autoanalyze        | n_mod_since_analyze
---------+-------------------------------+---------------------
 y       |                               |                   0
 y_1     | 2025-04-15 11:22:07.463249+08 |                   0
 y_2     | 2025-04-15 11:22:07.180454+08 |                   0
(3 rows)

update y set b = 'a' where a < 1000;

select relname,last_autoanalyze,n_mod_since_analyze from pg_stat_all_tables where relname like 'y%';
 relname |       last_autoanalyze        | n_mod_since_analyze
---------+-------------------------------+---------------------
 y       |                               |                   0
 y_1     | 2025-04-15 11:22:07.463249+08 |                1998
 y_2     | 2025-04-15 11:22:07.180454+08 |                 999
(3 rows)

由于分区表本身不包含任何行，因此无需在此处跟踪更改。结果是，分区表级别没有统计信息，只有分区有：

select tablename,attname,null_frac,n_distinct from pg_stats where tablename like 'y%';
 tablename | attname | null_frac | n_distinct
-----------+---------+-----------+------------
 y_2       | a       |         0 |         -1
 y_2       | b       |         0 |          1
 y_1       | a       |         0 |         -1
 y_1       | b       |         0 |          1
(4 rows)

如果你也想在分区表上获取统计信息，则必须手动进行：

analyze y;

select tablename,attname,null_frac,n_distinct from pg_stats where tablename like 'y%';
 tablename | attname | null_frac | n_distinct
-----------+---------+-----------+-------------
 y         | a       |         0 |   -0.525885
 y         | b       |         0 |           2
 y_1       | a       |         0 | -0.99563634
 y_1       | b       |         0 |           1
 y_2       | a       |         0 |          -1
 y_2       | b       |         0 |           1
(6 rows)

需要注意的是，这不仅会收集分区表的统计信息，还会收集所有分区的统计信息：

select relname,last_analyze from pg_stat_all_tables where relname like 'y%';
 relname |         last_analyze
---------+-------------------------------
 y       | 2025-04-15 11:34:53.503998+08
 y_1     | 2025-04-15 11:34:53.527063+08
 y_2     | 2025-04-15 11:34:53.545629+08
(3 rows)

如果你的分区数据量巨大，这可能会花费相当长的时间，而且对此你能做的并不多。PostgreSQL 18 中已经提交了一个改进，来解决这个问题。到时候你就可以使用analyze only命令，仅收集分区表的统计信息了：

analyze only y;

select relname,last_analyze from pg_stat_all_tables where relname like 'y%';
 relname |         last_analyze
---------+-------------------------------
 y       | 2025-04-15 11:41:00.619087+08
 y_1     | 2025-04-15 11:34:53.527063+08
 y_2     | 2025-04-15 11:34:53.545629+08
(3 rows)

这虽然没有解决分区表缺少统计信息的问题，但避免了重复工作，因为分区的统计信息无论如何都会自动收集。

非常不错的改进。感谢社区的所有相关人员。

参考

提交日志：https://git.postgresql.org/pg/commitdiff/62ddf7ee9a399e0b9624412fc482ed7365e38958

介绍

普通表的统计信息收集

分区表的统计信息收集

参考

搜索

分类

标签