August 14, 2024
Summary: in this tutorial, you will learn “good” and “bad” OR
s and what you can do to avoid the latter.
Table of Contents
A little sample schema
We’ll use this simple setup for demonstration:
CREATE TABLE a(id integer NOT NULL, a_val text NOT NULL);
INSERT INTO a
SELECT i, md5(i::text)
FROM generate_series(1, 100000) i;
CREATE TABLE b(id integer NOT NULL, b_val text NOT NULL);
INSERT INTO b
SELECT i, md5(i::text)
FROM generate_series(1, 100000) i;
ALTER TABLE a ADD PRIMARY KEY (id);
ALTER TABLE b ADD PRIMARY KEY (id);
ALTER TABLE b ADD FOREIGN KEY (id) REFERENCES a;
VACUUM (ANALYZE) a;
VACUUM (ANALYZE) b;
Suppose that we want to run queries with equality and LIKE
conditions on the text
columns, so we need some indexes:
CREATE INDEX a_val_idx ON a(a_val text_pattern_ops);
CREATE INDEX b_val_idx ON b(b_val text_pattern_ops);
Have a look at the documentation if you don’t understand text_pattern_ops
.
The “good” OR
An OR
is fine in most parts of an SQL query: if it is not used to filter out rows from your query result, it will have no negative effect on query performance.
So if your OR
appears in a CASE
expression in the SELECT
list, don’t worry.
Unfortunately you usually find the OR
where it hurts: in the WHERE
clause.
The “bad” OR
Now for an example of an OR
in a WHERE
clause that is still pretty nice:
EXPLAIN (COSTS off)
SELECT id FROM a
WHERE id = 42
OR a_val = 'value 42';
QUERY PLAN
-----------------------------------------------------------
Bitmap Heap Scan on a
Recheck Cond: ((id = 42) OR (a_val = 'value 42'::text))
-> BitmapOr
-> Bitmap Index Scan on a_pkey
Index Cond: (id = 42)
-> Bitmap Index Scan on a_val_idx
Index Cond: (a_val = 'value 42'::text)
(7 rows)
PostgreSQL can actually use an index scan for the query, because it can combine the bitmaps for both indexes with a “bitmap OR”.
Note, however, that a bitmap index scan is more expensive than a normal index scan, since it has to build the bitmap. Moreover, it uses much more RAM; each of these bitmaps can use up to work_mem
memory.
A multi-column index on (id, a_val)
won’t help at all with this query, so there is no cheaper way to execute it.
IN is better than OR
Now for a more stupid variant of the above query:
EXPLAIN (COSTS off)
SELECT id FROM a
WHERE id = 42
OR id = 4711;
QUERY PLAN
--------------------------------------------
Bitmap Heap Scan on a
Recheck Cond: ((id = 42) OR (id = 4711))
-> BitmapOr
-> Bitmap Index Scan on a_pkey
Index Cond: (id = 42)
-> Bitmap Index Scan on a_pkey
Index Cond: (id = 4711)
(7 rows)
Again, a bitmap index scan is used. But there is a simple method to rewrite that query without the pesky OR
:
EXPLAIN (COSTS off)
SELECT id FROM a
WHERE id IN (42, 4711);
QUERY PLAN
---------------------------------------------------
Index Only Scan using a_pkey on a
Index Cond: (id = ANY ('{42,4711}'::integer[]))
(2 rows)
You see? As soon as you get rid of the OR
, an efficient index scan can be used!
You might say that this is good for equality conditions, but what about the following query:
SELECT id FROM a
WHERE a_val LIKE 'something%'
OR a_val LIKE 'other%';
To improve that query, observe that the PostgreSQL optimizer rewrote the IN
in the previous query to = ANY
.
This is a case of the standard SQL “quantified comparison predicate”: ANY
is true if the comparison is TRUE
for any of the values on the right-hand side (the standard only defines this for subqueries on the right-hand side, but PostgreSQL extends the syntax to arrays).
Now LIKE
is a comparison operator as well, so we can write:
EXPLAIN (COSTS off)
SELECT id FROM a
WHERE a_val LIKE ANY (ARRAY['something%', 'other%']);
QUERY PLAN
----------------------------------------------------------
Seq Scan on a
Filter: (a_val ~~ ANY ('{something%,other%}'::text[]))
(2 rows)
Unfortunately, the index cannot be used here.
The “ugly” OR
Things become really bad if OR
combines conditions from different tables:
EXPLAIN (COSTS off)
SELECT id, a.a_val, b.b_val
FROM a JOIN b USING (id)
WHERE a.id = 42
OR b.id = 42;
QUERY PLAN
---------------------------------------------
Merge Join
Merge Cond: (a.id = b.id)
Join Filter: ((a.id = 42) OR (b.id = 42))
-> Index Scan using a_pkey on a
-> Index Scan using b_pkey on b
(5 rows)
Here we have to compute the complete join between the two tables and afterwards filter out all rows matching the condition. In our example, that would mean computing 100000 rows only to throw away the 99999 that do not match the condition.
Avoiding the ugly OR
Fortunately, there is an equivalent query that is longer to write, but much cheaper to execute:
EXPLAIN (COSTS off)
SELECT id, a.a_val, b.b_val
FROM a JOIN b USING (id)
WHERE a.id = 42
UNION
SELECT id, a.a_val, b.b_val
FROM a JOIN b USING (id)
WHERE b.id = 42;
QUERY PLAN
----------------------------------------------------------
Unique
-> Sort
Sort Key: a.id, a.a_val, b.b_val
-> Append
-> Nested Loop
-> Index Scan using a_pkey on a
Index Cond: (id = 42)
-> Index Scan using b_pkey on b
Index Cond: (id = 42)
-> Nested Loop
-> Index Scan using a_pkey on a a_1
Index Cond: (id = 42)
-> Index Scan using b_pkey on b b_1
Index Cond: (id = 42)
(14 rows)
Both parts of the query can make use of efficient index scans and return one row, and since the rows happen to be identical, UNION
will reduce them to one row.
If you can be certain that both branches of the query will return distinct sets, it is better to use UNION ALL
instead of UNION
, because that doesn’t have to do the extra processing to remove duplicates.
When using this trick, you should be aware that rewriting a query in that fashion does not always result in an equivalent query: if the original query can return identical rows, these would be removed by the UNION
. In our case, we don’t have to worry, because the primary keys were included in the query result.