By John Doe December 31, 2024
Summary: In this article, we will learn how to use CTID-based pagination for data cleanups in PostgreSQL.
Introduction
When dealing with very large PostgreSQL tables (we’re talking 15TB+), routine maintenance like archiving very old data can become surprisingly challenging, even with good indexes in place. I recently faced this issue while trying to clean up very old data in a very large, legacy table.
The Problem
My initial approach used standard ID-based pagination. Imagine a query like this:
DELETE FROM large_table
WHERE id IN (
  SELECT id
  FROM large_table
  WHERE created_at < '2023-04-01'
  ORDER BY id
  LIMIT 10000
);
While totally harmless on the surface, this kept timing out (hitting our 10-minute statement_timeout). Why? Because the database needs to:
- Scan the index to find old records
- Sort them by ID
- Perform the deletion
- Maintain indexes during deletion
While the deletion and index maintenance aren’t much of an issue, on a 15TB table the lookup operations become extremely expensive, especially when indexes are involved.
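To see where that time goes, it can help to look at the plan for the inner lookup on its own. A minimal sketch using the example table and columns from above (note that EXPLAIN ANALYZE actually executes the query, so point it at the SELECT rather than the DELETE):
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM large_table
WHERE created_at < '2023-04-01'
ORDER BY id
LIMIT 10000;
-- BUFFERS shows how many 8KB pages were read to satisfy the lookup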
Enter CTID
PostgreSQL maintains a physical location identifier for each row called CTID. It’s a tuple of (page_number, row_number) where:
- page_number is the physical page in the table (each page is typically 8KB)
- row_number is the row’s position within that page
Here’s how you can see it:
SELECT ctid, id, created_at
FROM large_table
LIMIT 5;
-- Results might look like:
-- "(0,1)","1","2023-01-01" -- Page 0, row 1
-- "(0,2)","2","2023-01-01" -- Page 0, row 2
-- "(0,3)","3","2023-01-01" -- Page 0, row 3
-- "(1,1)","4","2023-01-01" -- Page 1, row 1
-- "(1,2)","5","2023-01-01" -- Page 1, row 2
The interesting thing about CTID is that it represents the physical storage order. So I started to wonder whether we could use this to process the table sequentially, page by page:
-- Process first 200,000 physical pages
WITH to_delete AS (
  SELECT ctid
  FROM large_table
  WHERE ctid BETWEEN '(0,0)'::tid AND '(200000,0)'::tid
    AND created_at < '2023-04-01'
  FOR UPDATE SKIP LOCKED
)
DELETE FROM large_table
WHERE ctid IN (SELECT ctid FROM to_delete);
Note that when we specify a CTID range like BETWEEN '(0,0)'::tid AND '(200000,0)'::tid, we’re actually selecting all rows from all positions within those pages. Each page can (and usually does) contain multiple rows, making this an efficient way to read chunks of the table sequentially.
And the benefit? These lookups are super quick and don’t put a lot of strain on the system the way the previous query did. The cleanup task can slowly and steadily keep chipping away at the old data.
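One thing worth verifying on your own setup: on PostgreSQL 14 and later, a range condition on ctid can be executed as a dedicated TID Range Scan that jumps straight to the requested pages, while older versions fall back to a filtered sequential scan. A quick sketch against the example table:
EXPLAIN
SELECT ctid, created_at
FROM large_table
WHERE ctid BETWEEN '(0,0)'::tid AND '(200000,0)'::tid;
-- On PostgreSQL 14+ the plan should show "Tid Range Scan on large_table"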
This is not an ideal solution IMO (see the trade-off below), but as a short-term mitigation, it gets the job done.
The complete solution
Here’s how we can structure the full cleanup (Ruby / Rails based example).
Note that we use REPEATABLE READ isolation to handle CTID stability issues: CTIDs can change during updates or VACUUM FULL operations, so we need a consistent snapshot to avoid missing rows:
page_size = 200_000 # Number of physical pages to process per batch
current_page = 0
cutoff_date = '2023-04-01'
deleted_count = 0

ApplicationRecord.transaction(isolation: :repeatable_read) do # Ensure consistent CTID snapshots
  ApplicationRecord.with_statement_timeout(TIMEOUT) do
    loop do
      # Delete old rows within the current range of physical pages
      delete_sql = <<~SQL
        WITH to_delete AS (
          SELECT ctid
          FROM large_table
          WHERE ctid BETWEEN '(#{current_page * page_size},0)'::tid
                         AND '(#{(current_page + 1) * page_size},0)'::tid
            AND created_at < '#{cutoff_date}'
          FOR UPDATE OF large_table SKIP LOCKED
        )
        DELETE FROM large_table
        WHERE ctid IN (SELECT ctid FROM to_delete)
        RETURNING id;
      SQL

      result = ActiveRecord::Base.connection.exec_query(delete_sql)
      deleted_count += result.rows.size
      current_page += 1

      # Check whether any rows remain at or beyond the next page range;
      # checking only the next range could exit early if that range happens to be empty
      check_sql = <<~SQL
        SELECT EXISTS (
          SELECT 1
          FROM large_table
          WHERE ctid >= '(#{current_page * page_size},0)'::tid
        );
      SQL

      has_more_rows = ActiveRecord::Base.connection.exec_query(check_sql).rows[0][0]
      break unless has_more_rows
    end
  end
end
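A note on choosing page_size: 200,000 pages at the default 8KB block size is roughly 1.6GB of table data per batch, so tune it to whatever your statement_timeout comfortably allows. To get a feel for how many pages the table has in total (and therefore how many iterations to expect), you can ask PostgreSQL directly; a small sketch, keeping in mind that relpages is only an estimate refreshed by ANALYZE, while the size-based calculation reflects the current on-disk size:
-- Planner's estimated page count (refreshed by ANALYZE / autovacuum)
SELECT relpages FROM pg_class WHERE relname = 'large_table';

-- Exact page count derived from the table's current on-disk size
SELECT pg_relation_size('large_table') / current_setting('block_size')::int AS total_pages;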
The trade-off
This approach is overall slower than index-based deletion would be (if it worked). We’re doing a sequential scan of the entire table, which means we’re reading every page, even pages that contain no old records, which is exactly what the query planner and an index lookup would have saved us from.
However, it’s reliable: each individual batch is super quick and doesn’t put much strain on the system. Instead of timing out on large deletions, we now:
- Process the table in predictable chunks based on physical storage
- Avoid expensive index operations
- Maintain a consistent load on the database
- Eventually complete the cleanup
For this specific use case of cleaning up old data, I am totally happy to trade speed for predictability and reliability. A slow-but-successful cleanup is better than a fast-but-always-failing one. Lastly, because this is a periodic job, it’s also OK if the page distribution isn’t fully up to date. We have the option to run an ANALYZE right before the job as well.
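For reference, that pre-run step is just a plain ANALYZE, and the cumulative statistics views give a rough idea of how much earlier runs have already removed. A small sketch (these statistics are approximate and updated asynchronously):
-- Refresh planner statistics, including the page counts used above
ANALYZE large_table;

-- Approximate live vs. dead tuples left behind by recent deletes
SELECT n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'large_table';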
What about sequential scans?
You might wonder if simply forcing PostgreSQL to use sequential scans instead of index scans would solve the timeout issues:
SET enable_indexscan = off;
SET enable_seqscan = on;

-- PostgreSQL has no DELETE ... LIMIT, so batching still needs a subquery
DELETE FROM large_table
WHERE id IN (
  SELECT id
  FROM large_table
  WHERE created_at < '2023-04-01'
  LIMIT 10000
);
Sadly not. The database would still have to sequentially scan the rows, much like our program does, but with no way to resume from where the previous batch left off, so each batch starts scanning from the beginning again. Depending on the number of rows and their sizes, a single statement could still hit the timeout.