PostgreSQL 18: ANALYZE ONLY 分区表

由 John Doe 十月 21, 2025

你有在 PostgreSQL 中遇到过分区表查询的性能问题吗？现在，PostgreSQL 可以单独分析分区表了。

特性提交日志

为 VACUUM 和 ANALYZE 命令添加 ONLY 关键字支持。

由于自动清理（autovacuum）不会触发分区表的ANALYZE操作，因此用户必须手动执行该操作。然而，对分区表手动执行ANALYZE时，总会递归分析每个分区，而实际上自动清理已经处理了分区的分析工作，这种重复操作往往是不必要的。对于包含大量分区的分区表而言，逐个分析所有分区会耗费过长时间；若表中还包含大量列，耗时问题会更为突出。

在此变更中，我们允许在表名前添加ONLY关键字，使ANALYZE操作可以跳过对分区的处理。该选项同样可用于VACUUM命令，但需注意：若对分区表使用“VACUUM ONLY”，由于分区表本身不存储实际数据，此操作将无任何实际效果。

此提交同时修改了VACUUM和ANALYZE在“继承父表”场景下的行为。在此之前，对继承父表执行这两个命令时，不会处理其继承子表；而现在，默认行为变为会同时处理子表。若需保留旧版“不处理子表”的行为，可通过添加ONLY关键字实现。发布说明中需将此变更标注为“不兼容变更”。需要特别说明的是，分区表的默认行为未发生改变，此前分区表执行这两个命令时就会递归处理所有分区，此次变更后仍保持该逻辑。

讨论：https://postgr.es/m/CADofcAWATx_haD=QkSxHbnTsAe6+e0Aw8Eh4H8cXyogGvn_kOg@mail.gmail.com

示例

autovacuum 不会触发分区表的ANALYZE，用户需手动执行，但手动ANALYZE会递归分析所有分区，若分区数量多（如按日分区的年度表有 365 个分区）、字段多，整个过程会耗时极久，且多数情况下分区已被 autovacuum 处理，无需重复分析。

本次提交允许用户单独对分区表进行VACUUM/ANALYZE。

为演示单独分析分区表的应用场景，我们将使用一个简单示例：

CREATE TABLE tab (
   pk integer NOT NULL
) PARTITION BY LIST ((pk % 2));

CREATE TABLE tab_even
PARTITION OF tab FOR VALUES IN (0);

CREATE TABLE tab_odd
PARTITION OF tab FOR VALUES IN (1);

向分区表插入数据时，两个分区将各接收一半数据：

INSERT INTO tab
SELECT * FROM generate_series(1, 1000000);

PostgreSQL 在估算连接（JOIN）操作的结果行数时，会依赖分区表自身的统计信息。请看如下示例：

EXPLAIN (ANALYZE)
SELECT * FROM generate_series(1, 100) AS gs(id)
  JOIN tab ON gs.id = tab.pk;
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=2.25..28178.25 rows=500000 width=8) (actual time=0.080..360.601 rows=100.00 loops=1)
   Hash Cond: (tab.pk = gs.id)
   Buffers: shared hit=4426
   ->  Append  (cost=0.00..19426.00 rows=1000000 width=4) (actual time=0.017..185.943 rows=1000000.00 loops=1)
         Buffers: shared hit=4426
         ->  Seq Scan on tab_even tab_1  (cost=0.00..7213.00 rows=500000 width=4) (actual time=0.016..53.249 rows=500000.00 loops=1)
               Buffers: shared hit=2213
         ->  Seq Scan on tab_odd tab_2  (cost=0.00..7213.00 rows=500000 width=4) (actual time=0.023..49.753 rows=500000.00 loops=1)
               Buffers: shared hit=2213
   ->  Hash  (cost=1.00..1.00 rows=100 width=4) (actual time=0.057..0.059 rows=100.00 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 12kB
         ->  Function Scan on generate_series gs  (cost=0.00..1.00 rows=100 width=4) (actual time=0.024..0.034 rows=100.00 loops=1)
 Planning:
   Buffers: shared hit=93
 Planning Time: 0.740 ms
 Execution Time: 360.665 ms
(16 rows)

所有行数估算结果都是准确的，除了哈希连接（Hash Join）的估算：实际行数为 100，估算行数却为 500000，偏差高达 5000 倍！让我们对分区表单独执行ANALYZE，看看是否能改善估算结果：

ANALYZE ONLY tab;

EXPLAIN (ANALYZE)
SELECT * FROM generate_series(1, 100) AS gs(id)
  JOIN tab ON gs.id = tab.pk;
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1002.25..14966.87 rows=100 width=8) (actual time=0.779..190.264 rows=100.00 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=4426
   ->  Hash Join  (cost=2.25..13956.87 rows=42 width=8) (actual time=59.993..182.525 rows=33.33 loops=3)
         Hash Cond: (tab.pk = gs.id)
         Buffers: shared hit=4426
         ->  Parallel Append  (cost=0.00..12391.69 rows=416668 width=4) (actual time=0.015..92.782 rows=333333.33 loops=3)
               Buffers: shared hit=4426
               ->  Parallel Seq Scan on tab_even tab_1  (cost=0.00..5154.18 rows=294118 width=4) (actual time=0.009..20.290 rows=166666.67 loops=3)
                     Buffers: shared hit=2213
               ->  Parallel Seq Scan on tab_odd tab_2  (cost=0.00..5154.18 rows=294118 width=4) (actual time=0.017..50.072 rows=250000.00 loops=2)
                     Buffers: shared hit=2213
         ->  Hash  (cost=1.00..1.00 rows=100 width=4) (actual time=0.093..0.093 rows=100.00 loops=3)
               Buckets: 1024  Batches: 1  Memory Usage: 12kB
               ->  Function Scan on generate_series gs  (cost=0.00..1.00 rows=100 width=4) (actual time=0.042..0.055 rows=100.00 loops=3)
 Planning:
   Buffers: shared hit=7
 Planning Time: 0.425 ms
 Execution Time: 190.360 ms
(20 rows)

问题解决了！

非常不错的改进，感谢社区的所有相关人员。

参考

提交日志：https://git.postgresql.org/pg/commitdiff/62ddf7ee9a399e0b9624412fc482ed7365e38958

特性提交日志

示例

参考

搜索

分类

标签