July 16, 2024
Summary: in this tutorial, you will learn how to tune subqueries in PostgreSQL.
Table of Contents
Introduction
SQL allows you to use subqueries almost anywhere where you could have a table or column name. All you have to do is surround the query with parentheses, like (SELECT ...)
, and you can use it in arbitrary expressions. This makes SQL a powerful language – and one that can be hard to read. But I don’t want to discuss the beauty or ugliness of SQL. In this tutorial, let’s take a look at how to write subqueries that perform well. Let’s start simple and move on to more surprising and complex topics later.
Correlated and uncorrelated subqueries
In a subquery you can use table columns from the outside, like
SELECT a.col1,
(SELECT b.col2 FROM b WHERE b.x = a.x)
FROM a;
The subquery will be different for each row in “a
”. Such a subquery is usually called a correlated subquery. An uncorrelated subquery is one that does not reference anything from the outside.
Uncorrelated subqueries are simple. If the PostgreSQL optimizer does not “pull it up” (integrate it in the main query tree), the executor will calculate it in a separate step. You can see that as an InitPlan
(initial plan) in the output of EXPLAIN
. Uncorrelated subqueries are almost never a performance problem. In the rest of this article, I will mostly deal with correlated subqueries.
Scalar and tabular subqueries
If you write a subquery in a place in an SQL statement where you would otherwise have to write a single value, it is a scalar subquery. An example for a scalar subquery is the one in the previous section. A different example would be
SELECT a.col1
FROM a
WHERE 1 = (SELECT count(*)
FROM b
WHERE b.x = a.x);
If a scalar subquery returns no result, the resulting value is NULL. If the query returns more than a single row, you will receive a run-time error:
ERROR: more than one row returned by a subquery used as an expression
A tabular subquery appears in a context where it can return more than one value:
- a
FROM
list entry:FROM (SELECT ...) AS alias
- a common table expression (CTE):
WITH q AS (SELECT ...) SELECT ...
- an
IN
orNOT IN
expression:WHERE a.x IN (SELECT ...)
- an
EXISTS
orNOT EXISTS
expression:WHERE NOT EXISTS (SELECT ...)
Scalar subqueries usually are a performance problem
My rule of thumb is: avoid correlated scalar subqueries whenever you can. The reason is that PostgreSQL can only execute a scalar subquery as a nested loop. For example, PostgreSQL will execute the subquery from the first section once for each row in table “a
”. This can be fine if “a
” is a small table (remember, the recommendation here is just a rule of thumb). However, if table “a
” is large, even a fast subquery will make the query execution unpleasantly slow.
Rewriting a scalar subquery in the SELECT list or WHERE clause
If correlated scalar subqueries are bad for performance, how can we avoid them? There is no single, straightforward answer, and you probably won’t be able to rewrite the query to avoid such subqueries in all cases. But usually the solution is to convert the subquery into a join. For our first query, that will look like this:
SELECT a.col1,
b.col2
FROM a
LEFT JOIN b ON b.x = a.x;
That query is semantically equivalent, with the exception that we don’t get a run-time error if a row in “a
” matches more than one row in “b
”. We need an outer join to account for the case where the subquery returns no result.
For our second example, the rewritten query would look like this:
SELECT a.col1
FROM a
JOIN b ON b.x = a.x
GROUP BY a.pkey, a.col1
HAVING count(*) = 1;
Here, a.pkey
is the primary key of “a
”. Grouping by a.col1
would not be sufficient, because two different rows from table “a
” could have the same value for col1
.
The advantage of rewriting the queries as shown above is that PostgreSQL can choose the optimal join strategy and is not restricted to nested loops. If the table “a
” has few rows, that may not make a difference, since a nested loop join may be the most efficient join strategy anyway. But you also won’t lose by rewriting the query in that case. And if “a
” is large, you will be much faster with a hash or a merge join.
Tabular subqueries and performance
While correlated scalar subqueries are usually bad, the case is not so simple with tabular subqueries. Let’s consider the different cases separately.
CTEs and subqueries in FROM
These cases are almost identical, because you can always rewrite a CTE to a subquery in FROM
unless it is a recursive, a MATERIALIZED
or a data modifying CTE. CTEs are never correlated, so they are never problematic. However, a FROM
clause entry can be correlated in a lateral join:
SELECT a.col1, sub.col2
FROM a
CROSS JOIN LATERAL
(SELECT b.col2
FROM b
WHERE b.x = a.x
ORDER BY b.sort
LIMIT 1) AS sub;
Again, PostgreSQL will execute such a subquery in a nested loop, which can perform badly for large table “a
”. Therefore, it’s usually a good idea to rewrite the query to avoid a correlated subquery:
SELECT DISTINCT ON (a.pkey)
a.col1, b.col2
FROM a
JOIN b ON b.x = a.x
ORDER BY a.pkey, b.sort;
The rewritten query will perform better if “a
” has many rows, but it could perform worse if “a
” is small and “b
“ is large, but has an index on (x, sort)
.
Subqueries in EXISTS and NOT EXISTS
This is a special case. So far I have always recommended avoiding correlated subqueries. But with EXISTS
and NOT EXISTS
, the PostgreSQL optimizer is able to transform the clause to a semi-join and anti-join, respectively. That allows PostgreSQL to use all join strategies, not only nested loops.
Consequently, PostgreSQL can process correlated subqueries in EXISTS
and NOT EXISTS
efficiently.
The tricky case of IN and NOT IN
You will maybe expect that these two cases behave alike, but that is not the case. A query using IN
with a subquery can always be rewritten to use EXISTS
. For example, the following statement:
SELECT a.col1
FROM a
WHERE a.foo IN (SELECT b.col2
FROM b
WHERE a.x = b.x);
is equivalent to
SELECT a.col1
FROM a
WHERE EXISTS (SELECT NULL
FROM b
WHERE a.x = b.x
AND a.foo = b.col2);
The PostgreSQL optimizer can do that and will process the subquery in IN
as efficiently as the one in EXISTS
.
However, the case with NOT IN
is quite different. You can rewrite NOT IN
to NOT EXISTS
similar to the above, but that is not a transformation that PostgreSQL can do automatically, because the rewritten statement is semantically different: If the subquery returns at least one NULL value, NOT IN
will never be TRUE. The NOT EXISTS
clause does not exhibit this surprising behavior.
Now people normally don’t care about this property of NOT IN
(and in fact, too few people know about it). Most people would prefer the behavior of NOT EXISTS
anyway. But you have to rewrite the SQL statement yourself and cannot expect PostgreSQL to do it automatically. So my recommendation is that you never use NOT IN
with a subquery and always use NOT EXISTS
instead.
Using correlated subqueries to force nested loop joins
So far, I have told you how to rewrite an SQL statement to avoid forcing the optimizer to use a nested loop. Yet sometimes you need the exact opposite: you want the optimizer to use a nested loop join, because you happen to know that that is the best join strategy. Then you can deliberately rewrite a regular join to a lateral cross join to force a nested loop. For example, this query
SELECT a.col1, b.col2
FROM a
JOIN b ON a.x = b.x;
is semantically equivalent to
SELECT a.col1, sub.col2
FROM a
CROSS JOIN LATERAL
(SELECT b.col2
FROM b
WHERE a.x = b.x) AS sub;
Conclusion
If you want good performance with subqueries, it is often a good idea to follow these guidelines:
- use uncorrelated subqueries as much as you like, as long as they don’t make the statement hard to understand
- avoid correlated subqueries everywhere except in
EXISTS
,NOT EXISTS
andIN
clauses - always rewrite
NOT IN
toNOT EXISTS
Don’t take these rules as iron laws. Sometimes a correlated subquery can actually perform better, and sometimes you can use a correlated subquery to force the optimizer to use a nested loop if you are certain that is the right strategy to use.