January 6, 2025
Summary: in this tutorial, you will learn what the “Rows Removed by Filter” line in EXPLAIN output means.
Example
We’re using a users table with 20,000+ rows that has a name_code column populated with random strings.
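A minimal sketch of a compatible table definition (the column types and lengths are our assumptions; the original schema isn’t shown):

```sql
-- Hypothetical definition matching the sample rows shown below
CREATE TABLE users (
    id         serial PRIMARY KEY,   -- integer sequence starting from 1
    first_name text,
    last_name  text,
    name_code  varchar(10)           -- mostly-unique random code, no unique constraint
);
```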
-- Sample rows, note name_code "BA492"
SELECT
id,
first_name,
last_name,
name_code
FROM users
ORDER BY id DESC
LIMIT 3;
id | first_name | last_name | name_code
-------+------------+-----------+-----------
20210 | Brooks | Aufderhar | BA492
20209 | Nereida | Goodwin | NG8292
20208 | Dillon | Rodriguez | DR9151
-- total rows, and all rows with name_code "BA492"
SELECT
COUNT(*) AS total_count,
COUNT(*) FILTER (WHERE "users"."name_code" = 'BA492') AS with_code
FROM users;
total_count | with_code
-------------+-----------
20210 | 1
-- Why is Rows Removed by Filter "20208" and not "20209"?
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF) SELECT * FROM "users" WHERE
"users"."name_code" = 'BA492' LIMIT 1;
QUERY PLAN
------------------------------------------------------------------
 Limit (actual time=13.334..13.338 rows=1 loops=1)
   Buffers: shared hit=684
   ->  Seq Scan on users (actual time=13.328..13.329 rows=1 loops=1)
         Filter: ((name_code)::text = 'BA492'::text)
         Rows Removed by Filter: 20208
         Buffers: shared hit=684
 Planning Time: 0.322 ms
 Execution Time: 13.645 ms
Digging Into “Rows Removed by Filter”
Let’s discuss the table, the query details, and the other relevant circumstances.
- We queried a users table with 20210 total rows. The table uses an integer sequence starting from 1.
- New rows are not being inserted; we’re working with a static set of rows. VACUUM has not run for this table.
- We used the same WHERE clause in each query, filtering on the name_code string column, which holds a mostly-unique (but not guaranteed or enforced with a constraint) “code” value for each row.
- We set a LIMIT of 1 on the queries, but no ORDER BY and no OFFSET. In practice, a LIMIT would often be accompanied by an ORDER BY, but that wasn’t the case here.
- A Sequential Scan was used to access the row data because there was no index. In a production system, an index on the name_code column would be a good idea, but in this tutorial we’re aiming to understand the behavior without that index.
- All data was accessed from shared buffers, confirmed by adding BUFFERS to EXPLAIN (ANALYZE).
What was happening?
We might have expected “Rows Removed by Filter” to be one less than the total row count, but there are several reasons why that wasn’t the case.
With the LIMIT 1 in the query, as soon as PostgreSQL found any match, there was an “early return.” There was no need to access additional rows or pages.
When supplying name_code values from earlier-inserted rows, in earlier pages, we’d see smaller values for “Rows Removed by Filter.” For name_code values “late” in the insertion order, many more rows were removed before finding a match.
We could confirm this in the EXPLAIN (ANALYZE, BUFFERS) output by observing greater numbers of buffers accessed for those later matches.
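To illustrate, a sketch of the comparison (BA492 came from the most recently inserted row; the early-row code is a placeholder, and exact buffer counts will vary):

```sql
-- A "late" match: most of the table is scanned and filtered first
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
SELECT * FROM users WHERE name_code = 'BA492' LIMIT 1;

-- An "early" match: expect a small "Rows Removed by Filter"
-- and far fewer shared buffer hits
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
SELECT * FROM users WHERE name_code = '<code-from-an-early-row>' LIMIT 1;
```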
Performance details
As stated, when prepending EXPLAIN (ANALYZE, BUFFERS) to the query, we saw fewer buffers accessed for these “early matches.” With fewer buffers accessed, there was less latency and lower execution times. However, performance wasn’t the goal of this exercise. If it were, it would make more sense to add an index covering the name_code column.
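If performance were the goal, a minimal sketch of that index (the name idx_users_name_code is our own choice):

```sql
-- A B-tree index lets PostgreSQL locate matching name_code values directly,
-- instead of scanning and filtering thousands of rows
CREATE INDEX idx_users_name_code ON users (name_code);

-- Re-running the query should now show an Index Scan,
-- with no "Rows Removed by Filter" line
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
SELECT * FROM users WHERE name_code = 'BA492' LIMIT 1;
```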
Additional information about “Rows removed by filter”
“Rows Removed by Filter” is actually a per-loop average, rounded to the nearest integer.
What does this mean? When a plan node runs more than one loop (e.g. loops=2 or greater), “Rows Removed by Filter” shows the average number of rows removed per loop.
For example, if one loop removed 10 rows, and another removed 30 rows, we’d expect to see a value of 20 as the average of the two.
When there’s one loop (loops=1), the figure is the actual number of rows removed.
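One way to see loops greater than one is to force the scan to run repeatedly. A sketch using LATERAL (assuming the same users table; both loops happen to remove the same number of rows here, so the average equals the per-loop figure):

```sql
-- The subquery runs once per row of t, so the Seq Scan on users
-- reports loops=2, and "Rows Removed by Filter" is the per-loop average
EXPLAIN (ANALYZE, COSTS OFF)
SELECT s.*
FROM (VALUES (1), (2)) AS t(n),
LATERAL (SELECT * FROM users WHERE name_code = 'BA492' LIMIT 1) AS s;
```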
Experiments with OFFSET
When using OFFSET without an ORDER BY, the offset value correlated directly with “Rows Removed by Filter.”
To start, when we grabbed the name_code from the first row and queried with LIMIT 1, we didn’t see “Rows Removed by Filter” at all. In the default planner output format, when zero rows are filtered, the line is not displayed.
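That first experiment could be sketched like this (no ORDER BY, so “first row” relies on this static table’s insertion-order scan; the placeholder stands in for the returned value):

```sql
-- Grab the name_code of the first row returned by a bare scan
SELECT name_code FROM users LIMIT 1;

-- Querying for that value, the match is the first row evaluated:
-- zero rows are removed, so no "Rows Removed by Filter" line appears
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
SELECT * FROM users WHERE name_code = '<value-from-above>' LIMIT 1;
```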
When we chose the second row (based on the default ordering), we saw a single shared hit and one row removed.
Next, we tried going to the “middle” of the 20 thousand rows, using an OFFSET of 10000 and supplying the name_code value of the first row at that offset, again without specifying an ordering for the query.
With that name_code, we saw “Rows Removed by Filter: 10000”, which exactly matched the offset.
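The “middle” experiment could be sketched the same way (the placeholder stands in for whatever value the first query returns):

```sql
-- name_code of the row at offset 10000, in default scan order
SELECT name_code FROM users OFFSET 10000 LIMIT 1;

-- Expect "Rows Removed by Filter: 10000", matching the offset
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
SELECT * FROM users WHERE name_code = '<value-from-above>' LIMIT 1;
```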
Other Tidbits
- The PostgreSQL EXPLAIN documentation describes how “Rows Removed by Filter” appears only when ANALYZE is added to EXPLAIN.
- “Rows Removed by Filter” applies to filter conditions like a WHERE clause, but also to conditions on a JOIN node, where it appears as “Rows Removed by Join Filter.”
- The line appears only when at least one scanned row, or “potential join pair” for join nodes, was discarded by the filter condition.
- The ORDER BY clause is commonly used with LIMIT, which can produce more predictable results.
Takeaways
- Without an explicit ordering (ORDER BY), and when using a LIMIT, the “Rows Removed by Filter” figures may be surprising.
- When LIMIT 1 is used, PostgreSQL returns as soon as it finds the first match. Without an ORDER BY, the default ordering is likely the insertion order of the rows.
- When analyzing “Rows Removed by Filter” figures, check whether the plan node ran more than one loop. If so, the figure is an average over all loops, rounded to the nearest integer.
- For performance work, a high proportion of rows filtered out indicates an optimization opportunity. Adding an index can avoid filtering so many rows, reducing storage access and speeding up your query.