PostgreSQL Tutorial: Understand query planning

June 24, 2024

Summary: in this tutorial, you will learn how query planning works in PostgreSQL.

Introduction

SQL is a declarative language: a query specifies what to retrieve, but not how to retrieve it.

Any query can be executed in a number of ways. Each operation in the parse tree has multiple execution options. For example, you can retrieve specific records from a table by reading the whole table and discarding rows you don’t need, or you can use indexes to find the rows that match your query. Data sets are always joined in pairs. Variations in the order of joins result in a multitude of execution options. Then there are various ways to join two sets of rows together. For example, you could go through the rows in the first set one by one and look for matching rows in the other set, or you could sort both sets first, and then merge them together. Different approaches perform better in some cases and worse in others.
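
You can observe this flexibility directly: the enable_* parameters discourage (but do not forbid) specific operations, nudging the planner toward alternatives. A minimal sketch against the pg_class system catalog; the exact plans chosen depend on table sizes and statistics:

EXPLAIN SELECT * FROM pg_class WHERE relname = 'pg_class';

SET enable_seqscan = off;  -- discourage sequential scans for this session
EXPLAIN SELECT * FROM pg_class WHERE relname = 'pg_class';

RESET enable_seqscan;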

The optimal plan may execute faster than a non-optimal one by several orders of magnitude. This is why the planner, which optimizes the parsed query, is one of the most complex elements of the system.

Plan tree

The execution plan, like the parse tree, can be presented as a tree, but its nodes represent physical rather than logical operations on the data.

[Figure: the execution plan displayed as a tree]

If the parameter debug_print_plan is on, the full plan tree is displayed in the server message log. This is highly impractical, as it clutters an already busy log. A more convenient option is the EXPLAIN command:

EXPLAIN
SELECT schemaname, tablename
FROM pg_tables
WHERE tableowner = 'postgres'
ORDER BY tablename;
                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Sort  (cost=21.03..21.04 rows=1 width=128)
   Sort Key: c.relname
   ->  Nested Loop Left Join  (cost=0.00..21.02 rows=1 width=128)
         Join Filter: (n.oid = c.relnamespace)
         ->  Seq Scan on pg_class c  (cost=0.00..19.93 rows=1 width=72)
               Filter: ((relkind = ANY ('{r,p}'::"char"[])) AND (pg_get_userbyid(relowner) = 'postgres'::name))
         ->  Seq Scan on pg_namespace n  (cost=0.00..1.04 rows=4 width=68)
(7 rows)

The image shows the main nodes of the tree. The same nodes are marked with arrows in the EXPLAIN output.

The Seq Scan node represents the table read operation, while the Nested Loop node represents the join operation. There are two interesting points to take note of here:

  • One of the original tables is gone from the plan tree: the planner determined that it is not needed to answer the query and removed it. (The pg_tables view is defined over three catalogs; here the join to pg_tablespace was eliminated because none of its columns are used.)
  • Next to each node there is an estimated number of rows to process and an estimated processing cost.

To find the optimal plan, PostgreSQL uses a cost-based query optimizer. The optimizer goes through the available execution plans and estimates the resources each one requires, such as I/O operations and CPU cycles. This estimate, converted into arbitrary units, is known as the plan cost. The plan with the lowest resulting cost is selected for execution.

The trouble is that the number of possible plans grows exponentially with the number of joins, and sifting through all of them one by one is infeasible even for relatively simple queries. Dynamic programming and heuristics are therefore used to limit the scope of the search. This makes it possible to solve the problem precisely for a greater number of tables within a reasonable time, but the selected plan is not guaranteed to be truly optimal, because the planner relies on simplified mathematical models and may work from imprecise input data.

Ordering joins

A query can be structured in specific ways to significantly reduce the search scope (at the risk of missing the optimal plan):

  • Common table expressions are usually optimized separately from the main query. Since version 12, this can be forced with the MATERIALIZED clause, as shown in the sketch after this list.
  • Queries from non-SQL functions are optimized separately from the main query. (SQL functions can be inlined into the main query in some cases.)
  • The join_collapse_limit parameter together with explicit JOIN clauses, as well as the from_collapse_limit parameter together with sub-queries may define the order of some joins, depending on the query syntax.
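
To illustrate the first point, here is a minimal sketch that pins a common table expression as a separate optimization unit (reusing the pg_tables query from earlier):

WITH owned AS MATERIALIZED (
    SELECT schemaname, tablename
    FROM pg_tables
    WHERE tableowner = 'postgres'
)
SELECT count(*) FROM owned;

Without the MATERIALIZED keyword, versions 12 and later are free to inline the CTE into the main query and optimize both together.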

The last point may need an explanation. The query below lists several tables in a FROM clause with no explicit joins:

SELECT ...
FROM a, b, c, d, e
WHERE ...

This is the parse tree for this query:

[Figure: the parse tree for the query]

In this query, the planner will consider all possible join orders.

In the next example, some joins are explicitly defined by the JOIN clause:

SELECT ...
FROM a, b JOIN c ON ..., d, e
WHERE ...

The parse tree reflects this:

[Figure: the parse tree with a JOINEXPR node for B JOIN C]

The planner collapses the join tree, effectively transforming it into the tree from the previous example. The algorithm recursively traverses the tree and replaces each JOINEXPR node with a flat list of its components.

This “flattening out” will only occur, however, if the resulting flat list contains no more than join_collapse_limit elements (8 by default). In the example above, the flattened list would contain five elements, so if join_collapse_limit is set to 4 or less, the JOINEXPR node will not be collapsed. For the planner this means two things:

  • Table B must be joined to table C (or vice versa: the join order within the pair is not restricted).
  • Tables A, D, E, and the join of B to C may be joined in any order.

If join_collapse_limit is set to 1, any explicit JOIN order will be preserved.
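
This is easy to try out; a sketch against the system catalogs (with the default limit of 8, the planner would be free to reorder these joins):

SET join_collapse_limit = 1;  -- preserve the join order exactly as written
EXPLAIN
SELECT c.relname, a.attname, n.nspname
FROM pg_class c
     JOIN pg_attribute a ON a.attrelid = c.oid
     JOIN pg_namespace n ON n.oid = c.relnamespace;

RESET join_collapse_limit;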

Note that the operation FULL OUTER JOIN is never collapsed regardless of join_collapse_limit.

The parameter from_collapse_limit (also 8 by default) limits the flattening of sub-queries in a similar manner. Sub-queries don’t appear to have much in common with joins, but when it comes down to the parse tree level, the similarity is apparent.

Example:

SELECT ...
FROM a, (SELECT ... FROM b, c WHERE ...) bc, d, e
WHERE ...

And here’s the tree:

[Figure: the parse tree with a nested FROMEXPR node for the sub-query]

The only difference here is that the JOINEXPR node is replaced with FROMEXPR (hence the FROM in the parameter name).

Whenever the resulting flattened tree ends up with too many same-level nodes (tables or join results), planning time may skyrocket because each node requires separate optimization. If the parameter geqo is on (it is by default), PostgreSQL will switch to genetic search whenever the number of same-level nodes reaches geqo_threshold (12 by default).

Genetic search is much faster than the dynamic programming approach, but it does not guarantee that the best possible plan will be found. This algorithm has a number of adjustable options, but that’s a topic for another article.
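
The relevant settings can be inspected and adjusted per session, for example:

SHOW geqo;                -- on by default
SHOW geqo_threshold;      -- 12 by default
SET geqo_threshold = 15;  -- keep exhaustive search for somewhat larger join lists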

Selecting the best plan

The definition of the best plan varies depending on the intended use. When a complete output is required (for example, to generate a report), the plan must optimize the retrieval of all rows that match the query. On the other hand, if you only want the first several matching rows (to display on the screen, for example), the optimal plan might be completely different.

PostgreSQL addresses this by calculating two cost components. They are displayed in the query plan output after the word “cost”:

 Sort  (cost=21.03..21.04 rows=1 width=128)

The first component, startup cost, is the cost to prepare for the execution of the node; the second component, total cost, represents the total node execution cost.

When selecting a plan, the planner first checks if a cursor is in use (a cursor can be set up with the DECLARE command or explicitly declared in PL/pgSQL). If not, the planner assumes that the full output is required and selects the plan with the least total cost.

Otherwise, if a cursor is used, the planner selects a plan that optimally retrieves a fraction of all matching rows equal to cursor_tuple_fraction (0.1 by default). More specifically, it selects the plan with the lowest value of

startup cost + cursor_tuple_fraction × (total cost − startup cost).
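
For example, the same query may be planned differently when it runs through a cursor. A minimal sketch (cursors must be declared inside a transaction block):

BEGIN;
DECLARE c CURSOR FOR
    SELECT schemaname, tablename
    FROM pg_tables
    ORDER BY tablename;
FETCH 10 FROM c;  -- only the first rows are actually needed
CLOSE c;
COMMIT;

Setting cursor_tuple_fraction to 1.0 makes the planner optimize cursor queries for full output as well.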

Cost calculation process

To estimate a plan cost, each of its nodes has to be estimated individually. A node cost depends on the node type (reading from a table costs much less than sorting it) and the amount of data processed (in general, the more data, the higher the cost). While the node type is known right away, assessing the amount of data requires estimating the node's cardinality (the number of input rows) and selectivity (the fraction of rows left over for output). To do that, we need data statistics: table sizes and data distribution across columns.

Therefore, optimization depends on accurate statistics, which are gathered and kept up-to-date by the autoanalyze process.
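
Statistics can be refreshed manually and inspected through the system catalogs. A minimal sketch, assuming a user table named my_table (a hypothetical name):

ANALYZE my_table;  -- refresh statistics now (autoanalyze normally does this in the background)

SELECT reltuples, relpages  -- estimated row count and size in pages
FROM pg_class
WHERE relname = 'my_table';

SELECT attname, n_distinct, most_common_vals  -- per-column distribution data
FROM pg_stats
WHERE tablename = 'my_table';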

If the cardinality of each plan node is estimated accurately, the total cost calculated will usually match the actual cost. Common planner deviations are usually the result of incorrect estimation of cardinality and selectivity. These errors are caused by inaccurate, outdated or unavailable statistics data, and, to a lesser extent, inherently imperfect models the planner is based on.

Cardinality estimation

Cardinality estimation is performed recursively. Node cardinality is calculated using two values:

  • Cardinality of the node’s child nodes, or the number of input rows.
  • Selectivity of the node, or the ratio of output rows to input rows.

Cardinality is the product of these two values.
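
For example, if a node receives 1,000 rows from its child and the condition applied at the node has an estimated selectivity of 0.1, the node's cardinality estimate is 1000 × 0.1 = 100 rows.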

Selectivity is a number between 0 and 1. Somewhat counterintuitively, values closer to zero are called high selectivity and values closer to one low selectivity: a highly selective condition filters out a larger fraction of rows, while a low-selectivity condition lets most rows through.

Leaf nodes with data access methods are processed first. This is where statistics such as table sizes come in.

Selectivity of conditions applied to a table depends on the condition types. In its simplest form selectivity can be a constant value, but the planner tries to use all available information to produce the most accurate estimate. Selectivity estimations for the simplest conditions serve as the basis, and complex conditions built with Boolean operations can be further calculated using the following straightforward formulas:

sel(x AND y) = sel(x) · sel(y)

sel(x OR y) = 1 − (1 − sel(x)) (1 − sel(y)) = sel(x) + sel(y) − sel(x) · sel(y).

In these formulas, x and y are considered independent. If they correlate, the formulas are still used, but the estimate will be less accurate.
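
For example, with sel(x) = 0.1 and sel(y) = 0.2, the AND condition gets a selectivity of 0.1 × 0.2 = 0.02, and the OR condition gets 0.1 + 0.2 − 0.02 = 0.28.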

For a cardinality estimate of joins, two values are calculated: the cardinality of the Cartesian product (the product of cardinalities of two data sets) and the selectivity of the join conditions, which in turn depends on the condition types.
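
For example, joining a 1,000-row set to a 100-row set gives a Cartesian product of 100,000 rows; with a join condition selectivity estimated at 0.01, the resulting cardinality estimate is 100000 × 0.01 = 1000 rows.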

Cardinality of other node types, such as sorting or aggregation nodes, is calculated similarly.

Note that a cardinality calculation mistake in a lower node will propagate upward, resulting in inaccurate cost estimation and, ultimately, a sub-optimal plan. This is made worse by the fact that the planner only has statistical data on tables, not on join results.

Cost estimation

The cost estimation process is also recursive. The cost of a sub-tree comprises the costs of its child nodes plus the cost of the parent node itself.

Node cost calculation is based on a mathematical model of the operation it performs. The cardinality, which has already been calculated, serves as the input. The process calculates both the startup cost and the total cost.

Some operations don’t require any preparation and can start executing immediately. For these operations, the startup cost will be zero.

Other operations may have prerequisites. For example, a sorting node will usually require all of the data from its child node to begin the operation. These nodes have a non-zero startup cost. This cost has to be paid, even if the next node (or the client) only needs a single row of the output.
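
The difference is easy to observe (a sketch; the exact numbers depend on the installation):

EXPLAIN SELECT relname FROM pg_class;
-- the Seq Scan node can emit rows immediately: its startup cost is 0.00

EXPLAIN SELECT relname FROM pg_class ORDER BY relpages;
-- the Sort node must consume its whole input first: its startup cost is non-zero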

The cost is the planner's best estimate. Any planning mistakes affect how well the cost correlates with the actual execution time. The primary purpose of cost assessment is to let the planner compare different execution plans for the same query under the same conditions. In any other circumstances, comparing plans by cost (worse still, comparing different queries) is pointless and wrong. For example, consider a cost that was underestimated because the statistics were inaccurate. Update the statistics, and the cost may change, but the estimate will become more accurate, and the plan will ultimately improve.

See more

PostgreSQL Optimization