February 28, 2025
Summary: in this tutorial, you will learn how to troubleshoot a disk full issue on a PostgreSQL server.
Background
If Postgres has run out of disk space, there are a few things to look for. Depending on the server configuration there may be several devices at play, so your data directory, tablespaces, logs, or WAL directory could each be affected.
There are many reasons your disk might be full; here are some of the common ones to look for:
- The `archive_command` is failing and WAL is filling up the disk.
- There are replication slots for a disconnected standby, which results in WAL filling up the disk.
- Large database changes generate so much WAL that it consumes all of the available disk space.
- You literally just ran out of disk space storing data and your normal monitors and alerts didn’t let you know.
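A quick triage sketch for narrowing down which of these applies, assuming `$PGDATA` points at your data directory and psql can still connect:

```sh
# How full is the disk, and how much of it is WAL?
df -h "$PGDATA"
du -sh "$PGDATA/pg_wal"

# Are any replication slots holding WAL for a disconnected standby?
psql -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"
```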
The situation that is most detrimental to the database system as a whole is a full WAL directory. This results in the application being unable to make any more changes to the database system because it can’t record the WAL changes. Then Postgres has no choice but to issue a PANIC and shut down.
As you can see, many of these cases relate to WAL filling the disk. Having the database shut down will land you in a precarious situation, so that’s where we’ll dig in with this tutorial.
What not to do
Never remove WAL
A common knee-jerk reaction to seeing WAL filling up your disk space is to delete these log files. It's an understandable instinct for a system administrator: big log files fill up the disk, so get rid of them, right? But WAL files are not normal system logs. They're integral to getting your Postgres instance up and running, and removing WAL will corrupt your database.
Remember, Postgres itself will remove the extra WAL files as soon as it is operating correctly and it has verified that it will not need those files any more (i.e., it confirms it has archived the files successfully).
If you remove WAL files, your database is guaranteed to be left in an inconsistent, corrupted state. Never remove WAL files!
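If you suspect archiving has stalled (and WAL is therefore piling up), the `pg_stat_archiver` view is the first place to look. A minimal sketch, assuming Postgres is still up enough to accept a psql connection:

```sh
# Check whether the archiver is keeping up; a growing failed_count
# means archive_command is failing and WAL will keep accumulating.
psql -x -c "SELECT archived_count, last_archived_wal, last_archived_time,
                   failed_count, last_failed_wal, last_failed_time
            FROM pg_stat_archiver;"
```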
Don’t immediately overwrite the existing data directory with a restore from a backup
In the best case when you restore from a backup, you have determined that you are okay with data loss, as you are effectively choosing to forgo any database changes since the last backup. Restoring from a backup is a disaster recovery approach that is generally meant to be helpful in a situation where the entire system is inoperable or the data files have been corrupted; i.e., true disasters.
In particular, if you have an `archive_command` that was failing, this means you are giving up all transactions that occurred in the database since the last successful backup. And this is key: if the `archive_command` is failing, and that is what you are using for your base backups as well, there is no guarantee of the age of the last backup. (You are monitoring your backups and `archive_command`, right?)
Backup restoration can be done if needed, but this should not be where you first go when you discover an issue.
Don’t just resize in place
As production is down, you’ll obviously be keen to get this fixed ASAP. While your knee-jerk reaction to running out of disk will be to add more storage, it can often be better to provision a new larger instance and work on restoration there, rather than trying to resize in place. First and most importantly, we want to keep the broken instance in place so we can refer to it later; at this point it may be unclear as to what the true cause of the issue is. Preserving any production instance that has stopped working is the best way to perform post-mortem analysis and have confidence in the integrity of any newly working system.
Note: This is a softer “don’t” than the others; if this is your only option for getting the database up and running, this approach should work fine.
What you should do
Take a filesystem-level backup right now
Ensure that Postgres is stopped and take a backup of the PostgreSQL data directory (including the `pg_wal` directory and any non-default tablespaces) before you do anything, so you can get back to this state if you need to. There are a dozen different things that can go wrong in fixing Postgres and your WAL archive, so preserving as much of the original evidence/state as possible will both protect you in the rebuild process and provide valuable forensic data for determining root causes.
Any backup method will do here; you can use an offline copy, filesystem-level snapshots, rsync to a remote server, a tarball, etc.
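As a concrete example, here is a minimal sketch of an offline tarball backup. It assumes `$PGDATA` points at your data directory, that Postgres runs under the systemd unit name `postgresql`, and that `/mnt/backup` has enough free space; all of these are assumptions to adjust for your environment:

```sh
# Stop Postgres so the copy is consistent (unit name varies by distro/packaging).
sudo systemctl stop postgresql

# Archive the entire data directory, including pg_wal.
sudo tar -czf "/mnt/backup/pgdata-$(date +%F).tar.gz" -C "$PGDATA" .

# Non-default tablespaces live outside the data directory; back up each one too.
# Their locations are symlinked from pg_tblspc.
ls -l "$PGDATA/pg_tblspc"
```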
Create a new instance (or at least a new volume) with sufficient space
We recommend doing restores to a new instance whenever possible. If you are able to use the backup you just created, you can test starting it up after making sure that all of the configuration, paths, etc. are correct on the new instance. If you are installing on a new instance, also make sure that you are using the same Postgres version (including other packages and extensions), and that the locale has the same setting.
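One way to compare versions and key settings between the old and new instances is `pg_controldata`, which reads the cluster's control file directly and works even when the server is down. A sketch, assuming the binary is on your PATH (its location varies by packaging):

```sh
# Show the version and settings recorded in the data directory's control file.
pg_controldata "$PGDATA"

# Compare against the server binaries installed on the new instance.
postgres --version

# Once the new instance is running, locale settings can be checked with:
psql -c "SHOW lc_collate;"
```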
Now that you’ve eliminated the space issues, you should be able to resume database operations.
Fix the underlying issues
Now that the database is back up and running, review the logs for why this failed and fix the underlying issues. Then add or adjust your monitoring so you will be able to detect and prevent this issue in the future. For instance, if this was caused by a failing `archive_command`, you can utilize log analysis tools to notify you if such a thing starts happening again.
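As a starting point for that monitoring, the same statistics used for triage can be polled from a cron job. A hypothetical check script (the threshold and alerting hook are placeholders to replace with your own):

```sh
#!/bin/sh
# Hypothetical check: exit non-zero (for your alerting system) if archiving
# is failing or pg_wal has grown past a size threshold.

FAILED=$(psql -At -c "SELECT failed_count FROM pg_stat_archiver;")
WAL_KB=$(du -sk "$PGDATA/pg_wal" | cut -f1)
LIMIT_KB=$((10 * 1024 * 1024))   # 10 GiB; pick a value that fits your disk

# Note: failed_count is cumulative since the last stats reset; a real check
# should track deltas or compare last_failed_time to last_archived_time.
if [ "$FAILED" -gt 0 ] || [ "$WAL_KB" -gt "$LIMIT_KB" ]; then
    echo "WAL trouble: failed archives=$FAILED, pg_wal=${WAL_KB}kB"
    exit 1
fi
```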
To summarize: if there is one thing you take away from this tutorial, it is do not delete your WAL. If your disk is filling up, Postgres has tools to help you recover quickly and efficiently.