PostgreSQL Tutorial: Tune optimizer settings

December 15, 2023

Summary: in this tutorial, you will learn how to tune optimizer settings in PostgreSQL.

The PostgreSQL planner has improved a lot over the years. Still, something can go south and users have to convince the planner to do the right thing.

To modify plans, PostgreSQL offers a couple of runtime variables that will have a significant impact on planning. The idea is to give the end user the chance to make certain types of nodes in the plan more expensive than others. What does that mean in practice? Here is a simple plan:

test=# explain SELECT *
 FROM generate_series(1, 100) AS a,
      generate_series(1, 100) AS b
 WHERE a = b;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Hash Join (cost=2.25..4.63 rows=100 width=8)
   Hash Cond: (a.a = b.b)
   -> Function Scan on generate_series a (cost=0.00..1.00 rows=100 width=4)
   -> Hash (cost=1.00..1.00 rows=100 width=4)
         -> Function Scan on generate_series b (cost=0.00..1.00 rows=100 width=4)
(5 rows)

Here, PostgreSQL will scan the functions and perform a hash join. Let’s run the same query in PostgreSQL 11 or older and show you the execution plan:

                                     QUERY PLAN
------------------------------------------------------------------------------------
 Merge Join (cost=119.66..199.66 rows=5000 width=8)
   Merge Cond: (a.a = b.b)
   -> Sort (cost=59.83..62.33 rows=1000 width=4)
         Sort Key: a.a
         -> Function Scan on generate_series a (cost=0.00..10.00 rows=1000 width=4)
   -> Sort (cost=59.83..62.33 rows=1000 width=4)
         Sort Key: b.b
         -> Function Scan on generate_series b (cost=0.00..10.00 rows=1000 width=4)
(8 rows)

Can you see the difference between these two plans? In PostgreSQL 12, the estimate of the set returning function is already correct. In the older version, the optimizer still estimates that a set returning function will always return 1000 rows. In PostgreSQL, there are optimizer support functions that can help estimate the result set. Therefore, the plan in PostgreSQL 12 and beyond is vastly superior to the old plan.

What we can see in the new plan is that a hashjoin is performed, which is, of course, the most efficient way to do things. However, what if we’re smarter then the optimizer? Fortunately, PostgreSQL has the means to overrule the optimizer. You can set variables in a connection that change the default cost estimates. Here’s how it works:

test=# SET enable_hashjoin TO off;
SET
test=# explain SELECT *
  FROM generate_series(1, 100) AS a,
       generate_series(1, 100) AS b
  WHERE a = b;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Merge Join (cost=8.65..10.65 rows=100 width=8)
   Merge Cond: (a.a = b.b)
   -> Sort (cost=4.32..4.57 rows=100 width=4)
         Sort Key: a.a
         -> Function Scan on generate_series a (cost=0.00..1.00 rows=100 width=4)
   -> Sort (cost=4.32..4.57 rows=100 width=4)
         Sort Key: b.b
         -> Function Scan on generate_series b (cost=0.00..1.00 rows=100 width=4)
(8 rows)

PostgreSQL assumes that hashjoin functions are bad and makes them infinitely expensive. Hence, it falls back to a merge join. However, we can turn merge joins off as well:

test=# explain SELECT *
 FROM generate_series(1, 100) AS a,
      generate_series(1, 100) AS b
 WHERE a = b;
                                 QUERY PLAN
----------------------------------------------------------------------------
 Nested Loop (cost=0.01..226.00 rows=100 width=8)
 Join Filter: (a.a = b.b)
 -> Function Scan on generate_series a (cost=0.00..1.00 rows=100 width=4)
 -> Function Scan on generate_series b (cost=0.00..1.00 rows=100 width=4)
(4 rows)

PostgreSQL is slowly running out of options. The following example shows what happens if we turn off nested loops as well:

test=# SET enable_nestloop TO off;
SET
test=# explain SELECT *
   FROM generate_series(1, 100) AS a,
        generate_series(1, 100) AS b
   WHERE a = b;
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Nested Loop (cost=10000000000.00..10000000226.00 rows=100 width=8)
   Join Filter: (a.a = b.b)
   -> Function Scan on generate_series a (cost=0.00..1.00 rows=100 width=4)
   -> Function Scan on generate_series b (cost=0.00..1.00 rows=100 width=4)
 JIT:
   Functions: 10
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(7 rows)

The important thing is that turning off doesn’t really mean off – it just means insanely expensive. If PostgreSQL has no cheaper options, it will fall back to the ones we turned off. Otherwise, there would no longer be any way to execute SQL.

What settings influence the planner? The following switches are available:

# - Planner Method Configuration -
#enable_bitmapscan = on
#enable_hashagg = on
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
#enable_parallel_append = on
#enable_seqscan = on
#enable_sort = on
#enable_incrementalsort = on
#enable_tidscan = on
#enable_partitionwise_join = off
#enable_partitionwise_aggregate = off
#enable_parallel_hash = on
#enable_partition_pruning = on

While these settings can definitely be beneficial, please understand that these tweaks should be handled with care. They should only be used to speed up individual queries and not turn off things globally. Switching off options can turn against you fairly quickly and destroy performance. Therefore, it really makes sense to think twice before changing these parameters.

comments powered by Disqus