October 28, 2024
Summary: in this tutorial, you will learn how to troubleshoot WAL accumulation in the pg_wal directory in PostgreSQL.
Background
Among the most frequent concerns for PostgreSQL users are: "Why is my pg_wal growing?", "pg_wal is filling up the filesystem", "pg_wal does not release the space", and so on.
In PostgreSQL, the pg_wal directory plays a crucial role in managing the Write-Ahead Logging (WAL) mechanism, which ensures the durability of transactions and crash recovery. WAL files are also required for replication purposes (if any).
Sooner or later, administrators may encounter situations where the pg_wal directory accumulates files, gradually consuming disk space and potentially leading to a full filesystem and database crashes.
This tutorial covers some common reasons why the pg_wal directory grows, how to troubleshoot them, and strategies to mitigate the problem.
High transaction rate spikes
Problem: High transaction rates or spikes. WAL segments are generated as PostgreSQL records transactions before the data files are written. If the transaction rate exceeds the system's capacity to archive or remove these segments, they accumulate in the pg_wal directory, leading to disk space exhaustion.
Although the archiving speed does not need to be high in absolute terms, it is expected to keep pace with the average WAL generation rate in the pg_wal directory. If archiving falls significantly behind WAL segment creation for too long, pg_wal will accumulate files until they are archived. If pg_wal does not have enough space to absorb an uncommon or unplanned load, it can run out of space.
Troubleshooting: Identify the WAL segment creation speed in pg_wal and compare it with the archiving speed.
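One way to measure the generation speed is to sample pg_current_wal_lsn() twice and compute the byte difference between the two positions (which is what pg_wal_lsn_diff() does server-side), then compare it with the archiving throughput reported in pg_stat_archiver. A minimal offline sketch of that LSN arithmetic, assuming two already-sampled LSN strings and the sampling interval:

```python
# Sketch: estimate the WAL generation rate from two sampled LSNs.
# An LSN such as '16/B374D848' is a 64-bit position written as two
# hexadecimal halves: high 32 bits / low 32 bits.

def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN string to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def wal_rate_mb_per_s(lsn_start: str, lsn_end: str, seconds: float) -> float:
    """WAL bytes generated between the two samples, as MB/s."""
    diff = lsn_to_bytes(lsn_end) - lsn_to_bytes(lsn_start)
    return diff / seconds / (1024 * 1024)

# Example: two samples of pg_current_wal_lsn() taken 60 s apart
# (the LSN values are made up for illustration).
rate = wal_rate_mb_per_s("16/B0000000", "16/C0000000", 60)
print(f"{rate:.1f} MB/s generated")  # 256 MiB over 60 s -> 4.3 MB/s
```

If this rate is consistently above what the archiver can sustain, pg_wal growth is simply the backlog of segments waiting to be archived.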
Solution: If the archiving speed is not fast enough, improve the archiving I/O subsystem (or the network latency when archiving to S3 or other cloud services). Otherwise, assess how long the workload spikes last and add extra space for pg_wal.
Inefficient checkpoint
Problem: Checkpoints in PostgreSQL are crucial for flushing modified data from memory to disk and recycling obsolete WAL segments. However, inefficient checkpointing strategies, such as too-infrequent or too-aggressive checkpoints, can contribute to the accumulation of WAL files. Infrequent checkpoints result in prolonged retention of WAL segments, while overly aggressive checkpoints may lead to excessive disk I/O and WAL generation.
Troubleshooting: Assess the checkpoint and WAL parametrization against the database workload (checkpoint_timeout, checkpoint_completion_target, min_wal_size, max_wal_size, wal_keep_size/wal_keep_segments, bgwriter_lru_maxpages, bgwriter_delay, etc.).
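Beyond the parameters themselves, the checkpoint counters in pg_stat_bgwriter (moved to pg_stat_checkpointer in PostgreSQL 17) show whether checkpoints fire on schedule (timed) or are forced because max_wal_size was reached (requested). A hypothetical helper to interpret two sampled counter values; the 10% threshold is a rule of thumb, not a PostgreSQL-defined limit:

```python
# Sketch: interpret checkpoint counters sampled from pg_stat_bgwriter
# (pg_stat_checkpointer in PostgreSQL 17+). A high share of requested
# checkpoints usually means max_wal_size is too small for the workload.

def requested_checkpoint_share(checkpoints_timed: int,
                               checkpoints_req: int) -> float:
    """Fraction of checkpoints forced by WAL volume rather than timeout."""
    total = checkpoints_timed + checkpoints_req
    return checkpoints_req / total if total else 0.0

def checkpoint_hint(checkpoints_timed: int, checkpoints_req: int,
                    threshold: float = 0.1) -> str:
    share = requested_checkpoint_share(checkpoints_timed, checkpoints_req)
    if share > threshold:
        return (f"{share:.0%} of checkpoints were requested: "
                "consider raising max_wal_size")
    return f"{share:.0%} requested checkpoints: checkpointing looks healthy"

print(checkpoint_hint(900, 300))  # 25% requested -> raise max_wal_size
print(checkpoint_hint(990, 10))   # 1% requested -> healthy
```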
Solution: Find a proper trade-off between checkpoint frequency and bgwriter efficiency.
Replication lag
Scope: This only applies when streaming replication is configured.
Problem: Delays in applying changes on standby servers can exacerbate the accumulation of WAL files on the primary server. A standby server might fall behind its primary due to network issues/slowness, high load, or hardware resource constraints, in which case the primary server retains WAL segments (via replication slots or the wal_keep settings) until they are successfully replayed on the standby. This delay can strain disk space availability on the primary server.
These causes, combined with a misconfigured wal_keep_size (wal_keep_segments before PostgreSQL 13), can contribute to space exhaustion.
Abandoned replication slots will hold WAL segments indefinitely.
Troubleshooting: Verify the replication lag between the primary and its standbys. Verify the configuration of wal_keep_segments/wal_keep_size (depending on your PostgreSQL version). Look for abandoned replication slots on the primary server.
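To quantify how much WAL a slot is pinning, pg_replication_slots exposes each slot's restart_lsn, which can be diffed against pg_current_wal_lsn() on the primary. A small offline sketch of that check, assuming the slot rows and current LSN have already been queried (the slot names and LSN values are made up):

```python
# Sketch: flag replication slots that retain too much WAL. Inputs mimic
# rows from pg_replication_slots plus the current LSN on the primary.

def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN string to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def slots_at_risk(current_lsn: str, slots, limit_bytes: int):
    """Return (slot_name, retained_bytes) for slots that are inactive
    or holding more WAL than limit_bytes."""
    risky = []
    for name, restart_lsn, active in slots:
        retained = lsn_to_bytes(current_lsn) - lsn_to_bytes(restart_lsn)
        if not active or retained > limit_bytes:
            risky.append((name, retained))
    return risky

# Hypothetical slots: (slot_name, restart_lsn, active)
slots = [
    ("standby1", "16/C0000000", True),     # keeping pace (128 MiB behind)
    ("old_backup", "10/00000000", False),  # abandoned, pinning ~27 GiB
]
print(slots_at_risk("16/C8000000", slots, limit_bytes=1 << 30))
# -> [('old_backup', 29125246976)]
```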
Solution: Improve the network or I/O performance on the standby server (or remove any other hardware bottleneck). Drop any abandoned replication slots. Adjust the wal_keep_segments/wal_keep_size configuration according to the replication performance and the pg_wal directory capacity on the primary.
Failing WAL archive
Scope: This only applies when the database runs with continuous archiving (archive_mode is on and archive_command is set).
Problem: If the archiver process fails to execute the command in archive_command, WAL segments will remain in pg_wal until the archiver succeeds.
The most common causes are full disks/filesystems (wherever archive_command writes to), missing paths, wrong privileges, timeouts, unreachable destinations, and an incorrect archive_command.
Troubleshooting: Whenever archive_command fails, you will get an error message in the PostgreSQL log.
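The pg_stat_archiver view (last_failed_wal, last_failed_time, failed_count) also records failures, but the log usually explains why. Failed attempts can therefore be spotted by scanning the server log; a minimal sketch of such a filter (the sample messages imitate PostgreSQL's typical wording but are made up for illustration, and the patterns may need adjusting to your log_line_prefix):

```python
# Sketch: pull archiving-related failures out of a PostgreSQL log.
import re

# Matches the typical "archive command failed" / "failed too many times"
# messages; adjust the patterns to your log format.
ARCHIVE_ERRORS = re.compile(
    r"archive command failed|archiving .* failed too many times"
)

def archive_failures(log_lines):
    """Return log lines that report WAL archiving failures."""
    return [line for line in log_lines if ARCHIVE_ERRORS.search(line)]

sample_log = [
    'LOG:  checkpoint complete',
    'LOG:  archive command failed with exit code 1',
    'WARNING:  archiving write-ahead log file "0000000100000001000000A5" '
    'failed too many times, will try again later',
]
for line in archive_failures(sample_log):
    print(line)  # prints the two archiving-failure lines
```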
Solution: Identify the reason for the archiving failure and fix it.
WAL Retention policies
Problem: Misconfigured or inadequate retention policies for WAL archives can also contribute to accumulating files in the pg_wal directory. If archival processes fail to remove obsolete WAL segments promptly, the directory may become bloated with unnecessary files, consuming disk space that could be used for other purposes.
Troubleshooting: Review the aforementioned min_wal_size, max_wal_size, and wal_keep_size/wal_keep_segments parameters. Review the PostgreSQL log for failed archive events.
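Those settings roughly bound the steady-state size of pg_wal: it normally stays close to max_wal_size, plus whatever wal_keep_size and replication slots force it to retain (min_wal_size is only the floor kept for recycling). A back-of-the-envelope sketch of that sizing heuristic, under those assumptions; it is not a limit PostgreSQL enforces, and archiving failures can exceed any such estimate:

```python
# Sketch: rough upper bound for pg_wal under normal operation. WAL
# usually stays below max_wal_size, but wal_keep_size and replication
# slots extend retention; archiving failures can exceed any bound.

def expected_pg_wal_bound_mb(max_wal_size_mb: int,
                             wal_keep_size_mb: int = 0,
                             slot_retained_mb: int = 0) -> int:
    """Rough ceiling for pg_wal size, in MB (a sizing heuristic)."""
    return max_wal_size_mb + wal_keep_size_mb + slot_retained_mb

# Defaults: max_wal_size = 1 GB, wal_keep_size = 0
print(expected_pg_wal_bound_mb(1024))             # -> 1024
# With wal_keep_size = 2 GB and a slot retaining ~512 MB:
print(expected_pg_wal_bound_mb(1024, 2048, 512))  # -> 3584
```

If the filesystem hosting pg_wal cannot hold this estimate comfortably, the retention settings and the disk budget are out of sync.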
Solution: Improve the parametrization and fix the causes of failing archives.
Conclusion
The pg_wal directory in PostgreSQL is critical for ensuring data durability and recovery, but its unchecked growth can lead to disk space constraints and operational challenges. By understanding the common reasons behind the accumulation of files in the pg_wal directory and implementing appropriate strategies such as tuning checkpoint settings, monitoring replication lag, and enforcing efficient retention policies, administrators can effectively manage disk space usage and maintain the stability and performance of their PostgreSQL databases.
Since WAL segments play a central role in PostgreSQL, you should never manually remove them from pg_wal. Doing so can lead to database crashes, failed crash recoveries, broken WAL archives, and incomplete backups.