May 28, 2024
Summary: In this tutorial, you will learn how to check replication slots blocking VACUUM in PostgreSQL.
Table of Contents
VACUUM blocked by xmin horizon
A MVCC snapshot in a transaction stores information about transaction ID visibility at a particular moment in time. A snapshot has a component called xmin
, it represents the lowest transaction ID that was still active. All transaction IDs less than xmin
are either committed and visible, or rolled back and dead. VACUUM is potentially blocked by the xmin
horizon.
The xmin
horizon tells you up to which point the vacuum process can clean up dead rows. When this value is behind and not advancing, VACUUMs will be blocked and will not be able to clean up dead rows.
When VACUUM is blocked and dead rows can’t be cleaned, it can result in table bloat and slow queries.
The xmin
horizon is the oldest xmin
across all databases, summarized as a server-wide metric. Note that while some types of xmin horizon delays affect all databases on a server, such as the xmin
for each physical replication slots, in other cases not all databases of a server will be impacted, as PostgreSQL can still clean up dead rows of tables on unrelated databases. For example, if the xmin
horizon is held back by a long running transaction on database1, PostgreSQL can still clean up dead rows of tables in database2, as the transaction of database1 will never be able to see database2.
Lagging or stale physical replication slots
With physical streaming replication with hot_standby_feedback
is on, when replication is lagging or a replica server is stale (e.g. down), the oldest transaction that the replication slot needs the database to retain can be “stuck”, holding back the xmin
horizon.
A replication slot is a data structure that keeps the PostgreSQL server from discarding information that is still needed by a standby server to catch up with the primary. If replication is delayed or the standby server is down, the replication slot will prevent VACUUM
from deleting old rows.
Solution
You can find all replication slots and their xmin
value with this query:
SELECT slot_name, database, xmin
FROM pg_replication_slots
WHERE slot_type = 'physical'
ORDER BY age(xmin) DESC;
If the replication slot is no longer used, you can remove it by running the following command:
SELECT pg_drop_replication_slot('<slot_name>');
Lagging or stale logical replication slots
With logical replication slots, replication can also get stale when DDL changes (database migrations) don’t get applied to a replica server (subscriber). When the subscriber was unable to replicate data due to a schema mismatch, replication will error and get stale. This causes the xmin
horizon of the system catalogs of the primary (publisher) to be held back until replication resumes.
Solution
You can find all replication slots and their catalog_xmin
value with this query:
SELECT slot_name, database, catalog_xmin
FROM pg_replication_slots
WHERE slot_type = 'logical'
ORDER BY age(catalog_xmin) DESC;
You may also want to check logs on both the publisher and the subscriber for any error messages regarding logical replication, such as schema differences or too low wal_sender_timeout
/wal_receiver_timeout
settings, causing the workers to quit after a while without making progress.
If the replication slot is no longer used, remove it it by running the following command:
SELECT pg_drop_replication_slot('<slot_name>');