August 15, 2024
Summary: in this tutorial, you will learn how to tune selectivity estimation with expression index.
Table of Contents
Introduction
Most people know that Postgres allows the creation of indexes on expressions. This is helpful if you need index lookups of expressions used in where clauses.
However, there is another benefit to expression indexes, and that is optimizer statistics. Not only do expression indexes allow rapid lookups of matching expressions, but they also provide optimizer statistics, which improve row estimates and hence query plans.
Improve row estimates with expression index
Here is an example:
CREATE TABLE test (x INTEGER);
INSERT INTO test SELECT x FROM generate_series(1, 100) AS t(x);
ANALYZE test;
SELECT COUNT(*) FROM test WHERE x % 2 = 1;
count
-------
50
EXPLAIN SELECT * FROM test WHERE x % 2 = 1;
QUERY PLAN
----------------------------------------------------
Seq Scan on test (cost=0.00..2.50 rows=1 width=4)
Filter: ((x % 2) = 1)
The optimizer doesn’t know the selectivity of the modulus operator, so it initially assumes only one row is returned. Once an expression index is created and analyze statistics generated, the optimizer knows exactly how many rows will be returned:
CREATE INDEX i_test ON test ((x % 2));
ANALYZE test;
EXPLAIN SELECT * FROM test WHERE x % 2 = 1;
QUERY PLAN
-----------------------------------------------------
Seq Scan on test (cost=0.00..2.50 rows=50 width=4)
Filter: ((x % 2) = 1)
Interestingly, the optimizer used expression index statistics, even though the expression index itself was not used. In the example above, the modulus operator is not selective enough to make the index useful, but expression statistics would be useful for more complex queries, e.g. with joins. This method can also be used to create statistics on functions.
It is also possible to create an expression index that generates cross-columns statistics. For example, this expression index would supply accurate statistics for state/city combinations, but queries would need to use the exact concatenation construction:
CREATE INDEX i_customer_state_city ON customer ((state || '|' || city));
It would be nice if there was a way to create expression statistics without the overhead of creating and maintaining indexes.
Improve query plan with expression index
As shown above, the statistics generated on expression indexes can be used to produce more accurate row counts and potentially better plans. Next, let’s see how more accurate row counts can change the query plan.
First, the setup:
CREATE TABLE test1 AS
SELECT * FROM generate_series(1, 100000) AS f(x);
CREATE TABLE test2 AS
SELECT * FROM generate_series(1, 2) AS f(x);
ANALYZE test1;
ANALYZE test2;
Then a join query using modulus with no expression index:
EXPLAIN SELECT test1.x
FROM test1, test2
WHERE test1.x % 2 = 1 AND test1.x = test2.x;
QUERY PLAN
-----------------------------------------------------------------
Nested Loop (cost=0.00..1959.02 rows=1 width=4)
Join Filter: (test1.x = test2.x)
-> Seq Scan on test1 (cost=0.00..1943.00 rows=500 width=4)
Filter: ((x % 2) = 1)
-> Materialize (cost=0.00..1.03 rows=2 width=4)
-> Seq Scan on test2 (cost=0.00..1.02 rows=2 width=4)
A nested loop join is used, which is suboptimal because the row count for test1 is one hundred times too small.
With proper statistics on the modulus operation on test1.x, a more efficient hash join is used:
CREATE INDEX i_test1 ON test1((x % 2));
ANALYZE test1;
ANALYZE test2;
EXPLAIN SELECT test1.x
FROM test1, test2
WHERE test1.x % 2 = 1 AND test1.x = test2.x;
QUERY PLAN
------------------------------------------------------------------
Hash Join (cost=1.04..2132.29 rows=1 width=4)
Hash Cond: (test1.x = test2.x)
-> Seq Scan on test1 (cost=0.00..1943.00 rows=50197 width=4)
Filter: ((x % 2) = 1)
-> Hash (cost=1.02..1.02 rows=2 width=4)
-> Seq Scan on test2 (cost=0.00..1.02 rows=2 width=4)
Notice the test1 row count is now much more accurate, and that analyzing the base table also analyzes the expression index. The total cost is now slightly higher (2132.29 vs. 1959.02), but that is not because the hash join is more expensive. Rather, it is because the nested loop misestimated how many rows it would need to process because it didn’t know the selectivity of the modulus operation.
One thing the example illustrated is how much the optimizer “loves” hash joins. If test2 has three or more rows, or if test1 has ten times more rows and parallelism is enabled, a hash join is used even without expression index statistics. Hash joins are very robust despite misestimation so they are favored by the optimizer. The takeaway is that the creation of expression indexes for statistical purposes is recommended only if testing shows they actually improve query plans, i.e. improving explain row counts alone has little benefit.