May 12, 2024
Summary: In this tutorial, you will learn how to check transaction id exhaustion in PostgreSQL.
Table of Contents
Transaction ID Wraparound
Most people initially aware of the issue consider the transaction id wraparound itself to be the problem that they’re monitoring for, but it’s technically the exhaustion of the transaction ids that’s the real issue. PostgreSQL is technically capable of handling things just fine if the transaction id value wraps around. However, if the wraparound point is reached, and the transaction ids are close to being used up, that is why wraparound itself is such a cause for concern.
Check Transaction ID Exhaustion
The following query would provide very simple data points to trend/alert on.
WITH max_age AS (
SELECT 2000000000 as max_old_xid
, setting AS autovacuum_freeze_max_age
FROM pg_catalog.pg_settings
WHERE name = 'autovacuum_freeze_max_age' )
, per_database_stats AS (
SELECT datname
, m.max_old_xid::int
, m.autovacuum_freeze_max_age::int
, age(d.datfrozenxid) AS oldest_current_xid
FROM pg_catalog.pg_database d
JOIN max_age m ON (true)
WHERE d.datallowconn )
SELECT max(oldest_current_xid) AS oldest_current_xid
, max(ROUND(100*(oldest_current_xid/max_old_xid::float))) AS percent_towards_wraparound
, max(ROUND(100*(oldest_current_xid/autovacuum_freeze_max_age::float))) AS percent_towards_emergency_autovac
FROM per_database_stats;
The percent_towards_wraparound metric is the one that is really critical that an alert be set up for. Since it is using the age() function to determine the transaction id value, it is taking into account if they are actually at the point of exhaustion to see if the wraparound is a real concern. If exhaustion is ever reached, the database will be forced to shut down and could cause an indeterminate amount of downtime in order to be fixed. This query has a little bit of a buffer in it since the upper boundary it checks (2 billion exactly) is less than the actual max integer value that causes the exhaustion. But it’s close enough that this alert hitting 100% should be acted upon immediately.
The percent_towards_emergency_autovac metric is the additional value we recommend monitoring for, especially for systems that have never had this monitored for before (see notes on Recent Freezing Benefits below about when this alert priority can be lowered or removed). This watches for the database’s highest transaction id value reaching autovacuum_freeze_max_age. This is a user-tunable value that has a default value of 200 million and when any table’s highest transaction id value reaches it, a higher priority autovacuum kicks in on that table. You’ll recognize this special vacuum session because in pg_stat_activity it will be labelled (to prevent wraparound). It is higher priority in the sense that it will run even if autovacuum is disabled and if that vacuum is manually cancelled, it will almost immediately restart again. It also takes some different internal, low-level locks, so it could cause slightly higher contention on those tables depending on how they’re being used during the emergency vacuum. If you do run into contention/locking issues and they can be narrowed down to the emergency vacuum being the cause, it’s perfectly safe to cancel it to allow your other transactions to finish. Just be aware that it will keep restarting until either that wraparound vacuum is able to be completed successfully or a manual vacuum is run.
For databases with a high transaction rate, it could be beneficial to increase autovacuum_freeze_max_age to avoid that emergency vacuum period coming around quite so often. The main concern with increasing this is that it can increase the storage requirements in the pg_xact and pg_commit_ts folders in the data directory. Again, please read the Routine Vacuuming documentation linked above for what these storage requirements are when you adjust this setting. I’ve often set this value to 1 billion without much issue, but only when I’m sure wraparound is being monitored for and the disk space is available.
Fix Transaction ID Exhaustion
The easiest (but not necessarily the quickest) way to get the highest transaction id age value back down is to force a vacuum on the entire database cluster. And the best way to do this cluster-wide vacuum is the vacuumdb binary utility that comes with PostgreSQL.
vacuumdb --all --freeze --jobs=2 --echo --analyze
The --all option ensures all databases are vacuumed since the transaction id is a global value. The --freeze option ensures a more aggressive vacuum is run to ensure as many tuples as possible are frozen in that table (see Routine Vacuuming for details on what freezing is). --jobs=2 allows multiple vacuums to be run in parallel. This should be set as high as your system can handle to speed things up, but be careful setting it too high since it causes additional IO and faster WAL generation (increased disk usage). –echo just provides some minimal feedback so you can see some progression. --analyze ensures that statistics are updated. If time is of concern to get this vacuum run finished, this can be left off and run as a separate step later using the --analyze-only option.
Recent Freezing Benefits
Another benefit of the --freeze option I’ll mention here can be a huge reduction in IO & WAL generation during future vacuum operations. PostgreSQL 9.6 introduced a feature that allows vacuum to be able to skip over a page if all tuples inside of it are marked as frozen. And PostgreSQL 11 improved on this even more for indexes. So if you have a lot of old tables that don’t get writes anymore, this makes it so that when they do need to be vacuumed for any reason, it’s a much, much less expensive operation. It also makes the percent_towards_emergency_autovac alert less of a concern since it won’t be quite as much of an unexpected activity spike. So once you have things tuned well, you could consider this alert a low priority warning or maybe even remove it and just worry about monitoring for wraparound itself.