PostgreSQL 教程: 只用索引的扫描

六月 11, 2024

摘要:在本教程中,您将了解只用索引的扫描(Index Only Scan),及其成本估算。

目录

介绍

当一个索引包含了处理查询所需的所有数据时,我们将其称为查询的覆盖索引。使用覆盖索引,访问方法可以直接从索引中检索数据,而无需进行单个表扫描。这称为只用索引的扫描,可由任何带有 RETURNABLE 属性的访问方法使用。

让我们开始设置一个小的测试用例:

CREATE TABLE events (event_id char(7) PRIMARY KEY, occurred_at timestamptz);

INSERT INTO events (event_id, occurred_at)
  SELECT to_hex(i),
  TIMESTAMP '2024-01-01 00:00:00+08' + floor(random() * 100000)::int * '1 minutes'::interval
  FROM generate_series(x'1000000'::int, x'2000000'::int, 16) AS s(i);

VACUUM ANALYZE events;

该扫描操作在计划中表示为一个 “Index Only Scan” 节点:

EXPLAIN SELECT event_id FROM events WHERE event_id < '1100000';
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Index Only Scan using events_pkey on events  (cost=0.43..2090.81 rows=73279 width=8)
   Index Cond: (event_id < '1100000'::bpchar)
(2 rows)

该名称表明 “Index Only Scan” 节点不会查询表,但在某些情况下会查询。PostgreSQL 的索引不存储可见性信息,因此访问方法会返回与查询匹配的每个行版本的数据,而不管它们对当前事务的可见性如何。元组的可见性随后会由索引机制来验证。

但是,如果该方法无论如何都必须扫描表以验证可见性,那么它与普通的索引扫描有何不同?诀窍在一个称为可见性映射表的功能。每个堆表都有一个可见性映射表,用于跟踪哪些页面仅包含已知对所有活动事务可见的元组。每当一个索引方法返回指向可见性映射表中标记的页面的行 TID 时,您就可以确定其中的所有数据都对事务可见。

成本估算

只用索引的扫描,其成本由可见性映射表中标记的表页数决定。这是收集的统计信息:

SELECT relpages, relallvisible
FROM pg_class WHERE relname = 'events';
 relpages | relallvisible
----------+---------------
     5668 |          5668
(1 row)

索引扫描和只用索引的扫描,在成本估算上面的唯一区别是,后者将前者的 I/O 成本乘以可见性映射表中不存在的页面的比例。(处理成本保持不变。)

在此示例中,每个页面上的每个行版本,对每个事务都是可见的,因此在 I/O 上的开销基本上乘以零:

WITH costs(idx_cost, tbl_cost) AS (
  SELECT
    ( SELECT round(
        current_setting('random_page_cost')::real * pages +
        current_setting('cpu_index_tuple_cost')::real * tuples +
        current_setting('cpu_operator_cost')::real * tuples
      )
      FROM (
        SELECT relpages * 0.0700 AS pages, reltuples * 0.0700 AS tuples
        FROM pg_class WHERE relname = 'events_pkey'
      ) c
    ) AS idx_cost,
    ( SELECT round(
        (1 - frac_visible) * -- the fraction of the pages not in the visibility map
        current_setting('seq_page_cost')::real * pages +
        current_setting('cpu_tuple_cost')::real * tuples
      )
      FROM (
        SELECT relpages * 0.0700 AS pages,
          reltuples * 0.0700 AS tuples,
          relallvisible::real/relpages::real AS frac_visible
        FROM pg_class WHERE relname = 'events'
      ) c
    ) AS tbl_cost
)
SELECT idx_cost, tbl_cost, idx_cost + tbl_cost AS total
FROM costs;
 idx_cost | tbl_cost | total
----------+----------+-------
     1356 |      734 |  2090
(1 row)

堆表中仍然存在的任何尚未清理的更改,都会增加总的计划成本(并使计划对优化器来说不太理想)。您可以使用EXPLAIN ANALYZE命令,检查实际所需的堆元组获取次数:

CREATE TEMP TABLE events_tmp WITH (autovacuum_enabled = off)
  AS SELECT * FROM events ORDER BY event_id;
ALTER TABLE events_tmp ADD PRIMARY KEY(event_id);
ANALYZE events_tmp;

EXPLAIN (analyze, timing off, summary off)
  SELECT event_id FROM events_tmp WHERE event_id < '1100000';
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using events_tmp_pkey on events_tmp  (cost=0.43..2490.88 rows=73283 width=8) (actual rows=65536 loops=1)
   Index Cond: (event_id < '1100000'::bpchar)
   Heap Fetches: 65536
(3 rows)

由于该表禁用了自动 VACUUM 功能,因此规划器必须检查每一行的可见性(Heap Fetches)。但是,在进行 VACUUM 后,就没有必要检查了:

VACUUM events_tmp;

EXPLAIN (analyze, timing off, summary off)
  SELECT event_id FROM events_tmp WHERE event_id < '1100000';
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using events_tmp_pkey on events_tmp  (cost=0.43..2090.88 rows=73283 width=8) (actual rows=65536 loops=1)
   Index Cond: (event_id < '1100000'::bpchar)
   Heap Fetches: 0
(3 rows)

了解更多

PostgreSQL 优化