PostgreSQL Tutorial: Check replication slots blocking VACUUM

May 28, 2024

Summary: In this tutorial, you will learn how to check replication slots blocking VACUUM in PostgreSQL.

Table of Contents

VACUUM blocked by xmin horizon

A MVCC snapshot in a transaction stores information about transaction ID visibility at a particular moment in time. A snapshot has a component called xmin, it represents the lowest transaction ID that was still active. All transaction IDs less than xmin are either committed and visible, or rolled back and dead. VACUUM is potentially blocked by the xmin horizon.

The xmin horizon tells you up to which point the vacuum process can clean up dead rows. When this value is behind and not advancing, VACUUMs will be blocked and will not be able to clean up dead rows.

When VACUUM is blocked and dead rows can’t be cleaned, it can result in table bloat and slow queries.

The xmin horizon is the oldest xmin across all databases, summarized as a server-wide metric. Note that while some types of xmin horizon delays affect all databases on a server, such as the xmin for each physical replication slots, in other cases not all databases of a server will be impacted, as PostgreSQL can still clean up dead rows of tables on unrelated databases. For example, if the xmin horizon is held back by a long running transaction on database1, PostgreSQL can still clean up dead rows of tables in database2, as the transaction of database1 will never be able to see database2.

Lagging or stale physical replication slots

With physical streaming replication with hot_standby_feedback is on, when replication is lagging or a replica server is stale (e.g. down), the oldest transaction that the replication slot needs the database to retain can be “stuck”, holding back the xmin horizon.

A replication slot is a data structure that keeps the PostgreSQL server from discarding information that is still needed by a standby server to catch up with the primary. If replication is delayed or the standby server is down, the replication slot will prevent VACUUM from deleting old rows.

Solution

You can find all replication slots and their xmin value with this query:

SELECT slot_name, database, xmin
FROM pg_replication_slots
WHERE slot_type = 'physical'
ORDER BY age(xmin) DESC;

If the replication slot is no longer used, you can remove it by running the following command:

SELECT pg_drop_replication_slot('<slot_name>');

Lagging or stale logical replication slots

With logical replication slots, replication can also get stale when DDL changes (database migrations) don’t get applied to a replica server (subscriber). When the subscriber was unable to replicate data due to a schema mismatch, replication will error and get stale. This causes the xmin horizon of the system catalogs of the primary (publisher) to be held back until replication resumes.

Solution

You can find all replication slots and their catalog_xmin value with this query:

SELECT slot_name, database, catalog_xmin
FROM pg_replication_slots
WHERE slot_type = 'logical'
ORDER BY age(catalog_xmin) DESC;

You may also want to check logs on both the publisher and the subscriber for any error messages regarding logical replication, such as schema differences or too low wal_sender_timeout/wal_receiver_timeout settings, causing the workers to quit after a while without making progress.

If the replication slot is no longer used, remove it it by running the following command:

SELECT pg_drop_replication_slot('<slot_name>');

See more

PostgreSQL Monitoring