六月 11, 2024
摘要:在本教程中,您将了解只用索引的扫描(Index Only Scan),及其成本估算。
目录
介绍
当一个索引包含了处理查询所需的所有数据时,我们将其称为查询的覆盖索引。使用覆盖索引,访问方法可以直接从索引中检索数据,而无需进行单个表扫描。这称为只用索引的扫描,可由任何带有 RETURNABLE 属性的访问方法使用。
让我们开始设置一个小的测试用例:
CREATE TABLE events (event_id char(7) PRIMARY KEY, occurred_at timestamptz);
INSERT INTO events (event_id, occurred_at)
SELECT to_hex(i),
TIMESTAMP '2024-01-01 00:00:00+08' + floor(random() * 100000)::int * '1 minutes'::interval
FROM generate_series(x'1000000'::int, x'2000000'::int, 16) AS s(i);
VACUUM ANALYZE events;
该扫描操作在计划中表示为一个 “Index Only Scan” 节点:
EXPLAIN SELECT event_id FROM events WHERE event_id < '1100000';
QUERY PLAN
--------------------------------------------------------------------------------------
Index Only Scan using events_pkey on events (cost=0.43..2090.81 rows=73279 width=8)
Index Cond: (event_id < '1100000'::bpchar)
(2 rows)
该名称表明 “Index Only Scan” 节点不会查询表,但在某些情况下会查询。PostgreSQL 的索引不存储可见性信息,因此访问方法会返回与查询匹配的每个行版本的数据,而不管它们对当前事务的可见性如何。元组的可见性随后会由索引机制来验证。
但是,如果该方法无论如何都必须扫描表以验证可见性,那么它与普通的索引扫描有何不同?诀窍在一个称为可见性映射表的功能。每个堆表都有一个可见性映射表,用于跟踪哪些页面仅包含已知对所有活动事务可见的元组。每当一个索引方法返回指向可见性映射表中标记的页面的行 TID 时,您就可以确定其中的所有数据都对事务可见。
成本估算
只用索引的扫描,其成本由可见性映射表中标记的表页数决定。这是收集的统计信息:
SELECT relpages, relallvisible
FROM pg_class WHERE relname = 'events';
relpages | relallvisible
----------+---------------
5668 | 5668
(1 row)
索引扫描和只用索引的扫描,在成本估算上面的唯一区别是,后者将前者的 I/O 成本乘以可见性映射表中不存在的页面的比例。(处理成本保持不变。)
在此示例中,每个页面上的每个行版本,对每个事务都是可见的,因此在 I/O 上的开销基本上乘以零:
WITH costs(idx_cost, tbl_cost) AS (
SELECT
( SELECT round(
current_setting('random_page_cost')::real * pages +
current_setting('cpu_index_tuple_cost')::real * tuples +
current_setting('cpu_operator_cost')::real * tuples
)
FROM (
SELECT relpages * 0.0700 AS pages, reltuples * 0.0700 AS tuples
FROM pg_class WHERE relname = 'events_pkey'
) c
) AS idx_cost,
( SELECT round(
(1 - frac_visible) * -- the fraction of the pages not in the visibility map
current_setting('seq_page_cost')::real * pages +
current_setting('cpu_tuple_cost')::real * tuples
)
FROM (
SELECT relpages * 0.0700 AS pages,
reltuples * 0.0700 AS tuples,
relallvisible::real/relpages::real AS frac_visible
FROM pg_class WHERE relname = 'events'
) c
) AS tbl_cost
)
SELECT idx_cost, tbl_cost, idx_cost + tbl_cost AS total
FROM costs;
idx_cost | tbl_cost | total
----------+----------+-------
1356 | 734 | 2090
(1 row)
堆表中仍然存在的任何尚未清理的更改,都会增加总的计划成本(并使计划对优化器来说不太理想)。您可以使用EXPLAIN ANALYZE
命令,检查实际所需的堆元组获取次数:
CREATE TEMP TABLE events_tmp WITH (autovacuum_enabled = off)
AS SELECT * FROM events ORDER BY event_id;
ALTER TABLE events_tmp ADD PRIMARY KEY(event_id);
ANALYZE events_tmp;
EXPLAIN (analyze, timing off, summary off)
SELECT event_id FROM events_tmp WHERE event_id < '1100000';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Index Only Scan using events_tmp_pkey on events_tmp (cost=0.43..2490.88 rows=73283 width=8) (actual rows=65536 loops=1)
Index Cond: (event_id < '1100000'::bpchar)
Heap Fetches: 65536
(3 rows)
由于该表禁用了自动 VACUUM 功能,因此规划器必须检查每一行的可见性(Heap Fetches)。但是,在进行 VACUUM 后,就没有必要检查了:
VACUUM events_tmp;
EXPLAIN (analyze, timing off, summary off)
SELECT event_id FROM events_tmp WHERE event_id < '1100000';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Index Only Scan using events_tmp_pkey on events_tmp (cost=0.43..2090.88 rows=73283 width=8) (actual rows=65536 loops=1)
Index Cond: (event_id < '1100000'::bpchar)
Heap Fetches: 0
(3 rows)