PostgreSQL Tutorial: Tuning Subqueries

July 16, 2024

Summary: in this tutorial, you will learn how to tune subqueries in PostgreSQL.

Table of Contents

Introduction

SQL allows you to use subqueries almost anywhere where you could have a table or column name. All you have to do is surround the query with parentheses, like (SELECT ...), and you can use it in arbitrary expressions. This makes SQL a powerful language – and one that can be hard to read. But I don’t want to discuss the beauty or ugliness of SQL. In this tutorial, let’s take a look at how to write subqueries that perform well. Let’s start simple and move on to more surprising and complex topics later.

Correlated and uncorrelated subqueries

In a subquery you can use table columns from the outside, like

SELECT a.col1,
       (SELECT b.col2 FROM b WHERE b.x = a.x)
FROM a;

The subquery will be different for each row in “a”. Such a subquery is usually called a correlated subquery. An uncorrelated subquery is one that does not reference anything from the outside.

Uncorrelated subqueries are simple. If the PostgreSQL optimizer does not “pull it up” (integrate it in the main query tree), the executor will calculate it in a separate step. You can see that as an InitPlan (initial plan) in the output of EXPLAIN. Uncorrelated subqueries are almost never a performance problem. In the rest of this article, I will mostly deal with correlated subqueries.

Scalar and tabular subqueries

If you write a subquery in a place in an SQL statement where you would otherwise have to write a single value, it is a scalar subquery. An example for a scalar subquery is the one in the previous section. A different example would be

SELECT a.col1
FROM a
WHERE 1 = (SELECT count(*)
           FROM b
           WHERE b.x = a.x);

If a scalar subquery returns no result, the resulting value is NULL. If the query returns more than a single row, you will receive a run-time error:

ERROR:  more than one row returned by a subquery used as an expression

A tabular subquery appears in a context where it can return more than one value:

a FROM list entry: FROM (SELECT ...) AS alias
a common table expression (CTE): WITH q AS (SELECT ...) SELECT ...
an IN or NOT IN expression: WHERE a.x IN (SELECT ...)
an EXISTS or NOT EXISTS expression: WHERE NOT EXISTS (SELECT ...)

Scalar subqueries usually are a performance problem

My rule of thumb is: avoid correlated scalar subqueries whenever you can. The reason is that PostgreSQL can only execute a scalar subquery as a nested loop. For example, PostgreSQL will execute the subquery from the first section once for each row in table “a”. This can be fine if “a” is a small table (remember, the recommendation here is just a rule of thumb). However, if table “a” is large, even a fast subquery will make the query execution unpleasantly slow.

Rewriting a scalar subquery in the SELECT list or WHERE clause

If correlated scalar subqueries are bad for performance, how can we avoid them? There is no single, straightforward answer, and you probably won’t be able to rewrite the query to avoid such subqueries in all cases. But usually the solution is to convert the subquery into a join. For our first query, that will look like this:

SELECT a.col1,
       b.col2
FROM a
   LEFT JOIN b ON b.x = a.x;

That query is semantically equivalent, with the exception that we don’t get a run-time error if a row in “a” matches more than one row in “b”. We need an outer join to account for the case where the subquery returns no result.

For our second example, the rewritten query would look like this:

SELECT a.col1
FROM a
   JOIN b ON b.x = a.x
GROUP BY a.pkey, a.col1
HAVING count(*) = 1;

Here, a.pkey is the primary key of “a”. Grouping by a.col1 would not be sufficient, because two different rows from table “a” could have the same value for col1.

The advantage of rewriting the queries as shown above is that PostgreSQL can choose the optimal join strategy and is not restricted to nested loops. If the table “a” has few rows, that may not make a difference, since a nested loop join may be the most efficient join strategy anyway. But you also won’t lose by rewriting the query in that case. And if “a” is large, you will be much faster with a hash or a merge join.

Tabular subqueries and performance

While correlated scalar subqueries are usually bad, the case is not so simple with tabular subqueries. Let’s consider the different cases separately.

CTEs and subqueries in FROM

These cases are almost identical, because you can always rewrite a CTE to a subquery in FROM unless it is a recursive, a MATERIALIZED or a data modifying CTE. CTEs are never correlated, so they are never problematic. However, a FROM clause entry can be correlated in a lateral join:

SELECT a.col1, sub.col2
FROM a
   CROSS JOIN LATERAL
      (SELECT b.col2
       FROM b
       WHERE b.x = a.x
       ORDER BY b.sort
       LIMIT 1) AS sub;

Again, PostgreSQL will execute such a subquery in a nested loop, which can perform badly for large table “a”. Therefore, it’s usually a good idea to rewrite the query to avoid a correlated subquery:

SELECT DISTINCT ON (a.pkey)
       a.col1, b.col2
FROM a
   JOIN b ON b.x = a.x
ORDER BY a.pkey, b.sort;

The rewritten query will perform better if “a” has many rows, but it could perform worse if “a” is small and “b“ is large, but has an index on (x, sort).

Subqueries in EXISTS and NOT EXISTS

This is a special case. So far I have always recommended avoiding correlated subqueries. But with EXISTS and NOT EXISTS, the PostgreSQL optimizer is able to transform the clause to a semi-join and anti-join, respectively. That allows PostgreSQL to use all join strategies, not only nested loops.

Consequently, PostgreSQL can process correlated subqueries in EXISTS and NOT EXISTS efficiently.

The tricky case of IN and NOT IN

You will maybe expect that these two cases behave alike, but that is not the case. A query using IN with a subquery can always be rewritten to use EXISTS. For example, the following statement:

SELECT a.col1
FROM a
WHERE a.foo IN (SELECT b.col2
                 FROM b
                 WHERE a.x = b.x);

is equivalent to

SELECT a.col1
FROM a
WHERE EXISTS (SELECT NULL
              FROM b
              WHERE a.x = b.x
                AND a.foo = b.col2);

The PostgreSQL optimizer can do that and will process the subquery in IN as efficiently as the one in EXISTS.

However, the case with NOT IN is quite different. You can rewrite NOT IN to NOT EXISTS similar to the above, but that is not a transformation that PostgreSQL can do automatically, because the rewritten statement is semantically different: If the subquery returns at least one NULL value, NOT IN will never be TRUE. The NOT EXISTS clause does not exhibit this surprising behavior.

Now people normally don’t care about this property of NOT IN (and in fact, too few people know about it). Most people would prefer the behavior of NOT EXISTS anyway. But you have to rewrite the SQL statement yourself and cannot expect PostgreSQL to do it automatically. So my recommendation is that you never use NOT IN with a subquery and always use NOT EXISTS instead.

Using correlated subqueries to force nested loop joins

So far, I have told you how to rewrite an SQL statement to avoid forcing the optimizer to use a nested loop. Yet sometimes you need the exact opposite: you want the optimizer to use a nested loop join, because you happen to know that that is the best join strategy. Then you can deliberately rewrite a regular join to a lateral cross join to force a nested loop. For example, this query

SELECT a.col1, b.col2
FROM a
   JOIN b ON a.x = b.x;

is semantically equivalent to

SELECT a.col1, sub.col2
FROM a
   CROSS JOIN LATERAL
      (SELECT b.col2
       FROM b
       WHERE a.x = b.x) AS sub;

Conclusion

If you want good performance with subqueries, it is often a good idea to follow these guidelines:

use uncorrelated subqueries as much as you like, as long as they don’t make the statement hard to understand
avoid correlated subqueries everywhere except in EXISTS, NOT EXISTS and IN clauses
always rewrite NOT IN to NOT EXISTS

Don’t take these rules as iron laws. Sometimes a correlated subquery can actually perform better, and sometimes you can use a correlated subquery to force the optimizer to use a nested loop if you are certain that is the right strategy to use.

See more

PostgreSQL Optimization