January 26, 2024
Summary: in this tutorial, you will learn how to monitor WAL files in PostgreSQL.
Table of Contents
What Are WAL Files Anyway?
Changes that happen to the data managed by the PostgreSQL server are serialized into replayable deltas and written out as an ordered sequence of files called the “Write Ahead Log”, or WAL. This architecture facilitates popular features of PostgreSQL like Point-In-Time-Recovery (PITR) and streaming replication (and therefore read replicas, hot standby, failover and high availability).
PostgreSQL comes with pg_basebackup to help you with creating a full consistent backup of your PostgreSQL instance. If you want to be able to do point in time recovery (PITR) you need to archive the WAL segments which can later be used to roll forward from a base backup.
WAL files are stored in pg_wal
directory. Typically these are 16 MB files with a 24-character filename made of hex numbers (0-9, A-F).
New WAL files keep getting created in the course of operation of the server, and the old ones are effectively deleted (they’re actually renamed and reused, as this incurs slightly less disk I/O than deleting and creating).
Why Should I Monitor Them?
Unfortunately, it can be difficult to predict the number of WAL files needed (and therefore the disk space needed for them) for the normal operation of the server. Configuration settings related to checkpoint (timeout, completion targets), WAL files (min and max wal file sizes, compression) and archiving (timeout) will also influence the number of WAL files lying in pg_wal. On top of all this, features like WAL archiving and replication slots can cause the retention of WAL files.
The biggest risk of leaving WAL files unmonitored is that PostgreSQL can, and will happily, retain them indefinitely if need be. They will grow unchecked and can end up consuming all available disk space. And if the PostgreSQL server dies due to lack of disk space, remember that it needs to do a checkpoint on restart which may require even more disk space.
It also does not help that the creation of WAL files cannot be prevented in any situation, as it is needed for the ACID guarantees provided by PostgreSQL transactions.
In short, it is well worth your while to setup something that will monitor the count of WAL files for each PostgreSQL server. Ideally, the number of files at any given time should lie between an upper and lower limit, with predictable variations arising from maintenance tasks and batch workloads.
What Should I Look For?
Spikes in WAL file count are typically caused by VACUUM-like maintenance tasks that create large volume of changes, or temporary tables and objects in a short period of time. These should come slowly back down to normal levels. These typically result in a lot of disk I/O and CPU activity, causing application queries to run slower until things are back to normal.
Increases in the count that refuse to come back down have to be dealt with quickly. These can be because of:
- Archival failures: If the archive script fails for a certain WAL file, PostgreSQL will retain it and keep retrying until it succeeds. In the meantime, new WAL files will keep getting created. Ensure that your WAL archival processes are not broken, and can keep up with the WAL creation rate.
- Replication failures: When using streaming replication with replication slots, and a standby goes down, PostgreSQL will retain the WAL files needed by the standby so that the standby can resume from the point where it left off. If the standby goes offline for extended periods of time, or if someone forgot to delete the replication slot on the primary, the WAL files can be retained indefinitely. Ensure all your standbys and replication slots are active, and that your standbys can keep up the the changes happening at the primary.
Check number of WAL files
Simply monitor the count of files in the pg_wal directory, and send the values to your existing monitoring systems should help you keep track of the WAL file count. You can execute the following query, to get the number of WAL files:
SELECT COUNT(*) FROM pg_ls_waldir();
that does the job of getting the count of WAL files. pg_ls_waldir
returns the name, size, and last modified time (mtime) of each file in the write ahead log (WAL) directory. By default only superusers and members of the pg_monitor
role can use this function. Access may be granted to others using GRANT
.
If you have a system that can monitor the values returned by SQL queries, then this can fit into it.
Check number of unarchived files
Whenever a WAL file is ready for archiving, you will see the corresponding “.ready” file. After a WAL file successfully archived, you will get the corresponding “.done” file. You can execute the following query, to get the number of unarchived files:
SELECT count(*) AS count FROM pg_ls_dir('pg_wal/archive_status')
WHERE pg_ls_dir ~ E'^[0-9A-F]{24}\.ready$';
that does the job of getting the count of unarchived WAL files. Note that you’ll need superuser privileges or explicit GRANTs to execute the pg_ls_dir
function.
When you see many “.ready” files, this can either mean your archive command is currently failing, or load on the system is so high that archiving can not keep up.