PostgreSQL Tutorial: Leverage Index-Only Scans

August 29, 2024

Summary: In this tutorial, you will learn how to leverage index only scans in PostgreSQL.

Table of Contents

Introduction

In the article, Index Scan Types: Bitmap, Index, and Index Only, we reviewed three ways PostgreSQL can use an index to retrieve data as quickly as possible. In the same article, we saw the fastest and most efficient method is the Index-Only Scan because this saves time by avoiding the second read to the heap (table pages) and getting all the data requested by the query from a single read to the index only.

By definition, the queries leveraging on the index-only scan are faster than other methods, but the benefits are more visible when discussing transaction volume. The more query workload your systems generate, the more benefit you will get when going through the index-only scan.

Workload tests

To see this more clearly, consider the same table from the article, this time with 500K rows. We will perform the following query:

SELECT age, register_date, is_active
FROM person
WHERE age = <int>;

We will run it first with a “regular” index on the age column, which will have to perform the index and table reads to retrieve the register_date and is_active columns data. Then, we will repeat the exercise with a covering index, which is declared over the age column but includes the other columns.

The first index looks as follows:

CREATE INDEX idx_person_age ON person(age);

The second one (covering index) has the following definition:

CREATE INDEX idx_person_age_cover ON person(age) INCLUDE (register_date,is_active);

To verify the consistency of the results, we will run the load with different numbers of concurrent users: 20, 100, and 200. The results are plotted in the following graphs:

leverage index-only scans img postgresql index-only scans

We got a consistent result through all three scenarios. In all of them, the workload levering on the index-only scan performed about 2.25x more transactions per second (tps) than the workload going via the index scan.

Index-only scan requirements

Using covering or composite indexes for most of our workload can look like a good idea. However, as we saw in the article, a few conditions have to be met to get these benefits:

  • The index type has to support the index-only scan method. The B-tree indexes always do; the GiST and SP-GiST support the index-only scan operations for some operators, but not all, and any other index type doesn’t support it.
  • The query must select only columns covered by the index. In our previous tests, the query asked for the columns age, register_date, and is_active, which are included in the covering index. If any other column is in the SELECT, the planner won’t choose the index-only scan.
  • PostgreSQL scans must ensure retrieved rows are “visible” to the query’s snapshot.

The visibility condition

The first two conditions are easy to understand; we should be ready if we use an index type that supports the index-only scan feature and our queries select only those columns covered by the index. However, the third condition demands some extra caution to ensure we will get the benefits. Let’s expand on the third condition.

The goal of the index-only scan is to avoid the second read access to the table pages (the heap) to save that access time and retrieve the data directly from the index. Still, PostgreSQL must ensure that the index data is up-to-date with the data in the heap.

The visibility information is not stored in the index; it belongs to the table pages, so the table reads are unavoidable at first glance. However, there is a workaround: the visibility map. If all the rows in a given page are old enough to be visible to the current and future transactions, a bit for that page is marked in the visibility map.

Verifying the visibility of the row pages from the visibility map is much cheaper than reading the heap because it is small and mostly cached all the time.

In other words, the more the data changes in a table, the lower the ratio of visibility in the map it will have. So, to ensure we will benefit from the index-only scan feature, we need to verify this condition.

How to check the table’s page visibility

As described in the PostgreSQL documentation, the visibility map stores only two bits for every table page, one to indicate the visibility and the other to mark if the rows have been frozen. This file can not be examined as a traditional file; luckily, we can count on the pg_visibility extension for this purpose.

This extension is delivered with the contrib modules of any base PostgreSQL installation, so adding it to a database is as easy as performing the following statement:

CREATE EXTENSION pg_visibility;

Working with the same table as the previous examples, we can check the table person pages as follows:

SELECT * FROM pg_visibility_map('person');
 blkno | all_visible | all_frozen
-------+-------------+------------
     0 | t           | t
     1 | t           | t
     2 | t           | t
[truncated]
 10202 | t           | t
 10203 | t           | t
 10204 | t           | t
(10205 rows)

So, we can see the values set for the two bits of every table page, which can be false or true.

As soon as a table starts performing updates in its data, the visibility bit changes for the corresponding page. For example, with the following, we can know the page (blkno) where the row with id 10000 is:

SELECT ctid, id FROM person WHERE id = 10000;
  ctid   |  id
---------+-------
 (204,4) | 10000
(1 row)

So, we can check the visibility map for page 204, then update that row and check again, i.e.:

SELECT * FROM pg_visibility_map('person')
  WHERE blkno = 204;
 blkno | all_visible | all_frozen
-------+-------------+------------
   204 | t           | t
(1 row)

UPDATE person SET register_date = now() WHERE id = 10000;

SELECT * FROM pg_visibility_map('person')
  WHERE blkno = 204;
 blkno | all_visible | all_frozen
-------+-------------+------------
   204 | f           | f
(1 row)

After updating a value in the row with id 10000, the visibility map shows the all_visible column as false for the corresponding page.

As the update operations keep going against the table, the visibility map will change, and the number of pages marked as all_visible will reduce.

Checking the visibility percentage

Knowing why the visibility map changes and how to verify its contents, we can check the percentage of visible pages and thus keep the index-only scan feature working.

The following query will show the percentage of pages marked as all visible or not:

SELECT
    x.pages::text,
    round(avg(x.percent), 1)::int AS percentage
FROM (
    SELECT
        CASE WHEN all_visible THEN 'visible'
        WHEN NOT all_visible THEN 'no_visible'
        END pages,
        100 * COUNT(all_visible) OVER (PARTITION BY all_visible) / COUNT(all_visible) OVER () AS percent
    FROM
        pg_visibility_map ('<table>')) x
GROUP BY
    x.pages
UNION ALL
SELECT
    'no_visible'::text,
    0
WHERE NOT EXISTS (SELECT 1 FROM pg_visibility_map ('<table>') WHERE NOT all_visible);

In a freshly loaded table with no updates or just after vacuuming it (the vacuum is the process in charge of updating the visibility map), we can get an output similar to the following:

   pages    | percentage
------------+------------
 visible    |        100
 no_visible |          0
(2 rows)

When the updates happen, the percentage of pages marked as visible will change as well:

UPDATE person SET register_date = now() WHERE age < 15;

SELECT
    x.pages::text,
    round(avg(x.percent), 1)::int AS percentage
FROM (
    SELECT
        CASE WHEN all_visible THEN 'visible'
        WHEN NOT all_visible THEN 'no_visible'
        END pages,
        100 * COUNT(all_visible) OVER (PARTITION BY all_visible) / COUNT(all_visible) OVER () AS percent
    FROM
        pg_visibility_map ('person')) x
GROUP BY
    x.pages
UNION ALL
SELECT
    'no_visible'::text,
    0
WHERE NOT EXISTS (SELECT 1 FROM pg_visibility_map ('person') WHERE NOT all_visible);
   pages    | percentage
------------+------------
 visible    |         46
 no_visible |         53
(2 rows)

PostgreSQL will keep trying to perform the index-only scan, but when the pages are marked as “no visible” in the visibility map, it will have to switch to a different technique and read from the heap, losing all the performance gains. And, the more pages are marked as “no visible”, the fewer index-only scan attempts will succeed.

In this case, after vacuuming the table, we will see all the pages marked as visible again:

VACUUM person;

SELECT
    x.pages::text,
    round(avg(x.percent), 1)::int AS percentage
FROM (
    SELECT
        CASE WHEN all_visible THEN 'visible'
        WHEN NOT all_visible THEN 'no_visible'
        END pages,
        100 * COUNT(all_visible) OVER (PARTITION BY all_visible) / COUNT(all_visible) OVER () AS percent
    FROM
        pg_visibility_map ('person')) x
GROUP BY
    x.pages
UNION ALL
SELECT
    'no_visible'::text,
    0
WHERE NOT EXISTS (SELECT 1 FROM pg_visibility_map ('person') WHERE NOT all_visible);
   pages    | percentage
------------+------------
 visible    |        100
 no_visible |          0
(2 rows)

Conclusion

An index-only scan is undoubtedly a fantastic approach to getting the best possible performance for our queries; however, before adding composite or covering indexes, we also need to pay attention to how the tables behave regarding their updates and page visibility.

With the pg_visibility extension and some queries like the one we saw in this article, we can monitor our tables and decide if they are good to work with the index-only scan or if some tuning for the vacuum cadence or workload type can be implemented.

See more

PostgreSQL Optimization