August 2, 2024
Summary: in this tutorial, you will learn how to check recovery conflicts in PostgreSQL.
Table of Contents
What is a recovery conflict?
A recovery conflict occurs whenever the recovery process cannot apply WAL information from the primary server to the standby, because the change would disrupt query processing there. These conflicts cannot happen with queries on the primary server, but they happen on the streaming replication standby server because the primary server has limited knowledge about what is going on on the standby.
There are several kinds of recovery conflicts:
Snapshot recovery conflicts
This is the most frequent recovery conflict.
Snapshot conflicts can occur if VACUUM
processes a table and removes dead tuples. This removal is replayed on the standby. Now a query on the standby may have started before VACUUM
on the primary (it has an older snapshot), so it can still see the tuples that should be removed. This constitutes a snapshot conflict.
Lock recovery conflicts
The queries on a standby server take an ACCESS SHARE
lock on the tables they are reading. So any ACCESS EXCLUSIVE
lock on the primary (which conflicts with ACCESS SHARE
) must be replayed on the standby to keep incompatible operations on the table from happening. PostgreSQL takes such a lock for operations that conflict with SELECT
, for example DROP TABLE
, TRUNCATE
and many ALTER TABLE
statements. If the standby should replay such a lock on a table that a query uses, we have a lock conflict.
Buffer pin recovery conflicts
One way to reduce the need for VACUUM
is to use HOT updates. Then any query on the primary that accesses a page with dead heap-only tuples and can get an exclusive lock on it will prune the HOT chains. PostgreSQL always holds such page locks for a short time, so there is no conflict with processing on the primary. There are other causes for page locks, but this is perhaps the most frequent one.
When the standby server should replay such an exclusive page lock and a query is using the page (“has the page pinned” in PostgreSQL jargon), you get a buffer pin recovery conflict. Pages can be pinned for a while, for example during a sequential scan of a table on the outer side of a nested loop join.
HOT chain pruning can of course also lead to snapshot recovery conflicts.
Rare kinds of recovery conflicts
The following types of conflict are rare and will not bother you:
- Deadlock recovery conflicts: A query on the standby blocks while using the shared buffer that is needed to replay WAL from the primary. PostgreSQL will cancel such a query immediately.
- Tablespace recovery conflicts: A tablespace is in
temp_tablespaces
on the standby server, and a query has temporary files there. When aDROP TABLESPACE
occurs from the primary, we get a conflict. PostgreSQL cancels all queries on the standby in that case. - Database recovery conflicts: Replication of
DROP DATABASE
causes a conflict if the standby has active sessions on the database. PostgreSQL terminates all connections to the database on the standby in that case.
How does the standby server resolve recovery conflicts?
The parameter max_standby_streaming_delay determines what happens when WAL replay encounters a recovery conflict (there is a similar parameter max_standby_archive_delay
that does the same thing for archive recovery). PostgreSQL suspends replay of the WAL information for at most max_standby_streaming_delay
milliseconds. If the conflicting query is still running after that time, PostgreSQL cancels it with an error message like:
ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.
The detail message shows that this was from a snapshot recovery conflict.
max_standby_streaming_delay
has a default value of 30 seconds, so queries on the standby get a “grace time” of half a minute to finish before they get canceled if they cause a recovery conflict. This is a middle ground between the extreme settings of 0 (PostgreSQL cancels queries immediately, no delay for replay) and the special value -1 (PostgreSQL never cancels queries, arbitrarily long replay delay).
Query the pg_stat_database view
The statistics view pg_stat_database_conflicts contains a detailed account of all recovery conflicts that happened since the last statistics reset. You have to look at that view on the standby server, not the primary, because that is where recovery conflicts occur.
Note that this view does not show all recovery conflicts that occurred, it only shows the ones that led to a canceled query on the standby. We can check recovery conflicts using the following query:
SELECT conflicts FROM pg_stat_database
WHERE datname = current_database();
Once you’ve found many recovery conflicts, you can locate the queries caused recovery conflicts to optimize them.