PostgreSQL 教程: 优化表达式的选择率估计

三月 1, 2024

摘要:在本教程中,您将学习如何优化表达式的选择率估计。

介绍

一般来说,只有当条件运算调用到列本身时,才会使用列统计信息,而在将列作为参数的表达式运算中,是不会的。规划器不知道一个函数会如何影响列统计信息,因此像 “函数调用 = 常量” 这样的条件表达式的选择率,总是估算为 0.5%。

示例

让我们开始设置一个小的测试用例:

CREATE TABLE events (id integer, occurred_at timestamp with time zone);

INSERT INTO events (id, occurred_at)
  SELECT i, TIMESTAMP '2024-01-01 00:00:00+08' + i * '2 minutes'::interval
  FROM generate_series(1, 250000) AS s(i);

ANALYZE events;

例如,我们想从表中查询一些记录,通过一个以列occurred_at为参数的表达式进行过滤:

EXPLAIN SELECT * FROM events
WHERE extract(
  month FROM occurred_at AT TIME ZONE 'Asia/Shanghai'
) = 2;

                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
 Seq Scan on events  (cost=0.00..5727.00 rows=1250 width=12)
   Filter: (EXTRACT(month FROM (occurred_at AT TIME ZONE 'Asia/Shanghai'::text)) = '2'::numeric)
(2 rows)

SELECT round(reltuples * 0.005)
FROM pg_class WHERE relname = 'events';

 round
−−−−−−−
  1250
(1 row)

规划器甚至还不能处理标准函数,而对我们来说,很明显,2 月份发生的事件占比约为总事件的 1/12:

SELECT count(*) AS total,
  count(*) FILTER (WHERE extract(
    month FROM occurred_at AT TIME ZONE 'Asia/Shanghai'
  ) = 2) AS february
FROM events;

 total  | february
--------+----------
 250000 |    20880
(1 row)

这就是为什么要引入表达式统计了。

扩展的表达式统计

PostgreSQL 14 引入了一个叫做扩展表达式统计的特性。扩展表达式统计信息不会自动收集。要手动收集它们,请使用CREATE STATISTICS命令创建一个扩展统计的数据库对象。

CREATE STATISTICS events_expr ON (extract(
  month FROM occurred_at AT TIME ZONE 'Asia/Shanghai'
))
FROM events;

新的统计信息会提升估算的精确性:

ANALYZE events;

EXPLAIN SELECT * FROM events
WHERE extract(
  month FROM occurred_at AT TIME ZONE 'Asia/Shanghai'
) = 2;

                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
 Seq Scan on events  (cost=0.00..5727.00 rows=20392 width=12)
   Filter: (EXTRACT(month FROM (occurred_at AT TIME ZONE 'Asia/Shanghai'::text)) = '2'::numeric)
(2 rows)

要使统计信息起作用,统计信息生成命令中的表达式,必须与原始查询中的表达式相同。

扩展的统计信息元数据,存储在系统表pg_statistic_ext中,而统计数据本身存储在一个单独的表pg_statistic_ext_data(PostgreSQL 12 和更高版本才有)。它通过与元数据分开存储,以在必要时限制用户访问敏感信息。

有一些视图,可以用户友好的形式,显示收集的统计数据。可以使用以下命令,来显示扩展表达式统计信息:

SELECT left(expr, 60) || '...' AS expr,
  null_frac, avg_width, n_distinct,
  most_common_vals AS mcv,
  left(most_common_freqs::text, 60) || '...' AS mcf,
  correlation
FROM pg_stats_ext_exprs WHERE statistics_name = 'events_expr' \gx

-[ RECORD 1 ]----------------------------------------------------------------
expr        | EXTRACT(month FROM (occurred_at AT TIME ZONE 'Asia/Shanghai'...
null_frac   | 0
avg_width   | 8
n_distinct  | 12
mcv         | {10,8,9,6,5,1,3,7,11,4,2,12}
mcf         | {0.09046666,0.09036667,0.089933336,0.089666665,0.0894,0.0876...
correlation | 1

可以使用ALTER STATISTICS命令更改收集的统计数据量:

ALTER STATISTICS events_expr SET STATISTICS 50;

了解更多

PostgreSQL 优化