三月 1, 2024
摘要:在本教程中,您将学习如何优化表达式的选择率估计。
目录
介绍
一般来说,只有当条件运算调用到列本身时,才会使用列统计信息,而在将列作为参数的表达式运算中,是不会的。规划器不知道一个函数会如何影响列统计信息,因此像 “函数调用 = 常量” 这样的条件表达式的选择率,总是估算为 0.5%。
示例
让我们开始设置一个小的测试用例:
CREATE TABLE events (id integer, occurred_at timestamp with time zone);
INSERT INTO events (id, occurred_at)
SELECT i, TIMESTAMP '2024-01-01 00:00:00+08' + i * '2 minutes'::interval
FROM generate_series(1, 250000) AS s(i);
ANALYZE events;
例如,我们想从表中查询一些记录,通过一个以列occurred_at
为参数的表达式进行过滤:
EXPLAIN SELECT * FROM events
WHERE extract(
month FROM occurred_at AT TIME ZONE 'Asia/Shanghai'
) = 2;
QUERY PLAN
-------------------------------------------------------------------------------------------------
Seq Scan on events (cost=0.00..5727.00 rows=1250 width=12)
Filter: (EXTRACT(month FROM (occurred_at AT TIME ZONE 'Asia/Shanghai'::text)) = '2'::numeric)
(2 rows)
SELECT round(reltuples * 0.005)
FROM pg_class WHERE relname = 'events';
round
−−−−−−−
1250
(1 row)
规划器甚至还不能处理标准函数,而对我们来说,很明显,2 月份发生的事件占比约为总事件的 1/12:
SELECT count(*) AS total,
count(*) FILTER (WHERE extract(
month FROM occurred_at AT TIME ZONE 'Asia/Shanghai'
) = 2) AS february
FROM events;
total | february
--------+----------
250000 | 20880
(1 row)
这就是为什么要引入表达式统计了。
扩展的表达式统计
PostgreSQL 14 引入了一个叫做扩展表达式统计的特性。扩展表达式统计信息不会自动收集。要手动收集它们,请使用CREATE STATISTICS
命令创建一个扩展统计的数据库对象。
CREATE STATISTICS events_expr ON (extract(
month FROM occurred_at AT TIME ZONE 'Asia/Shanghai'
))
FROM events;
新的统计信息会提升估算的精确性:
ANALYZE events;
EXPLAIN SELECT * FROM events
WHERE extract(
month FROM occurred_at AT TIME ZONE 'Asia/Shanghai'
) = 2;
QUERY PLAN
-------------------------------------------------------------------------------------------------
Seq Scan on events (cost=0.00..5727.00 rows=20392 width=12)
Filter: (EXTRACT(month FROM (occurred_at AT TIME ZONE 'Asia/Shanghai'::text)) = '2'::numeric)
(2 rows)
要使统计信息起作用,统计信息生成命令中的表达式,必须与原始查询中的表达式相同。
扩展的统计信息元数据,存储在系统表pg_statistic_ext
中,而统计数据本身存储在一个单独的表pg_statistic_ext_data
(PostgreSQL 12 和更高版本才有)。它通过与元数据分开存储,以在必要时限制用户访问敏感信息。
有一些视图,可以用户友好的形式,显示收集的统计数据。可以使用以下命令,来显示扩展表达式统计信息:
SELECT left(expr, 60) || '...' AS expr,
null_frac, avg_width, n_distinct,
most_common_vals AS mcv,
left(most_common_freqs::text, 60) || '...' AS mcf,
correlation
FROM pg_stats_ext_exprs WHERE statistics_name = 'events_expr' \gx
-[ RECORD 1 ]----------------------------------------------------------------
expr | EXTRACT(month FROM (occurred_at AT TIME ZONE 'Asia/Shanghai'...
null_frac | 0
avg_width | 8
n_distinct | 12
mcv | {10,8,9,6,5,1,3,7,11,4,2,12}
mcf | {0.09046666,0.09036667,0.089933336,0.089666665,0.0894,0.0876...
correlation | 1
可以使用ALTER STATISTICS
命令更改收集的统计数据量:
ALTER STATISTICS events_expr SET STATISTICS 50;