January 30, 2024
Summary: in this tutorial, you will learn how to check duplicate indexes in PostgreSQL.
Table of Contents
Why should I get rid of duplicate indexes?
Everybody knows that a database index is a good thing because it can speed up SQL queries. But this does not come for free.
The disadvantages of indexes are:
- Indexes use up space. It is not unusual for database indexes to use as much storage space as the data themselves. And the kind of reliable, fast storage you want for a database is not necessarily cheap. The space used up by indexes also increases the size and duration of physical backups.
- Indexes slow down data modification. Whenever you
INSERT
into orDELETE
from a table, all indexes have to be modified, in addition to the table itself (the “heap”). And it is much more expensive to modify the complicated data structure of an index than the heap itself, which has its name precisely because it is basically an unordered “pile” of data (and as everybody knows, maintaining order is more work than having a mess). Modifying an indexed table can easily be an order of magnitude more expensive than modifying an unindexed table. - Indexes prevent HOT updates. Because of the architecture of PostgreSQL, every
UPDATE
causes a new row version (“tuple”) to be written, and that causes a new entry in every index on the table. This behavior has been dubbed “write amplification” and has drawn a lot of fire. This undesirable effect can be avoided if a) the new tuple fits into the same table block as the old one and b) no indexed column is modified. Then PostgreSQL creates the new tuple as a “Heap Only Tuple” (hence HOT), which is much more efficient and also reduces the workVACUUM
has to do.
Find the exact duplicate indexes
Finds multiple indexes that have the same set of columns, same opclass, expression and predicate – which make them equivalent. Usually it’s safe to drop one of them, but it gives no guarantees.
SELECT ni.nspname || '.' || ct.relname AS "table",
ci.relname AS "dup index",
pg_get_indexdef(i.indexrelid) AS "dup index definition",
i.indkey AS "dup index attributes",
cii.relname AS "encompassing index",
pg_get_indexdef(ii.indexrelid) AS "encompassing index definition",
ii.indkey AS "enc index attributes"
FROM pg_index i
JOIN pg_class ct ON i.indrelid=ct.oid
JOIN pg_class ci ON i.indexrelid=ci.oid
JOIN pg_namespace ni ON ci.relnamespace=ni.oid
JOIN pg_index ii ON ii.indrelid=i.indrelid AND
ii.indexrelid != i.indexrelid AND
(array_to_string(ii.indkey, ' ') || ' ') like (array_to_string(i.indkey, ' ') || ' %') AND
(array_to_string(ii.indcollation, ' ') || ' ') like (array_to_string(i.indcollation, ' ') || ' %') AND
(array_to_string(ii.indclass, ' ') || ' ') like (array_to_string(i.indclass, ' ') || ' %') AND
(array_to_string(ii.indoption, ' ') || ' ') like (array_to_string(i.indoption, ' ') || ' %') AND
NOT (ii.indkey::integer[] @> ARRAY[0]) AND -- Remove if you want expression indexes (you probably don't)
NOT (i.indkey::integer[] @> ARRAY[0]) AND -- Remove if you want expression indexes (you probably don't)
i.indpred IS NULL AND -- Remove if you want indexes with predicates
ii.indpred IS NULL AND -- Remove if you want indexes with predicates
CASE WHEN i.indisunique THEN ii.indisunique AND
array_to_string(ii.indkey, ' ') = array_to_string(i.indkey, ' ') ELSE true END
JOIN pg_class ctii ON ii.indrelid=ctii.oid
JOIN pg_class cii ON ii.indexrelid=cii.oid
WHERE ct.relname NOT LIKE 'pg_%' AND
NOT i.indisprimary
ORDER BY 1, 2, 3;
So, you probably noticed that we still get some false positives, yes? That’s because an index can have all the same columns but still be different. For example, it could use varchar_pattern_ops, GiST, or be a partial index. However, we want to see those because often they are functionally duplicates of other indexes even though they are not exactly the same. For example, you probably don’t need both an index on ( status WHERE cancelled is null )
and on ( status )
.
Find the partial matched indexes
What about indexes which contain all of the columns of another index, plus some more? Like if you have one index on (id, name)
you probably don’t need another index on just (id)
. Well, here’s a query to find partial matches.
-- check for containment
-- i.e. index A contains index B
-- and both share the same first column
-- but they are NOT identical
WITH index_cols_ord as (
SELECT attrelid, attnum, attname
FROM pg_attribute
JOIN pg_index ON indexrelid = attrelid
WHERE indkey[0] > 0
ORDER BY attrelid, attnum
),
index_col_list AS (
SELECT attrelid,
array_agg(attname) as cols
FROM index_cols_ord
GROUP BY attrelid
),
dup_natts AS (
SELECT indrelid, indexrelid
FROM pg_index as ind
WHERE EXISTS ( SELECT 1
FROM pg_index as ind2
WHERE ind.indrelid = ind2.indrelid
AND ( ind.indkey @> ind2.indkey
OR ind.indkey <@ ind2.indkey )
AND ind.indkey[0] = ind2.indkey[0]
AND ind.indkey <> ind2.indkey
AND ind.indexrelid <> ind2.indexrelid
) )
SELECT userdex.schemaname as schema_name,
userdex.relname as table_name,
userdex.indexrelname as index_name,
array_to_string(cols, ', ') as index_cols,
indexdef,
idx_scan as index_scans
FROM pg_stat_user_indexes as userdex
JOIN index_col_list ON index_col_list.attrelid = userdex.indexrelid
JOIN dup_natts ON userdex.indexrelid = dup_natts.indexrelid
JOIN pg_indexes ON userdex.schemaname = pg_indexes.schemaname
AND userdex.indexrelname = pg_indexes.indexname
ORDER BY userdex.schemaname, userdex.relname, cols, userdex.indexrelname;
The above query looks for indexes where one index contains all of the same columns as a second index, plus some more, and they both share the same first column. While a lot of these indexes might not actually be duplicates, a lot of them will be.
Obviously, you could come up with other variations on this, for example searching for all multicolumn indexes with the same columns in a different order, or indexes with the same first two columns but others different. To create your own variations, the key is to edit the filter criteria contained in this clause:
WHERE EXISTS ( SELECT 1
FROM pg_index as ind2
WHERE ind.indrelid = ind2.indrelid
AND ( ind.indkey @> ind2.indkey
OR ind.indkey <@ ind2.indkey )
AND ind.indkey[0] = ind2.indkey[0]
AND ind.indkey <> ind2.indkey
AND ind.indexrelid <> ind2.indexrelid
)
… and change it to figure out the factors which give you the most real duplicates without missing anything.