August 20, 2024
Summary: in this tutorial, you will learn the negative effects and the costs of indexes in PostgreSQL.
Introduction
Indexes are often considered a panacea for SQL performance tuning, and PostgreSQL supports different types of indexes catering to different use cases. However, the urge to create more and more indexes causes severe damage in many systems. Often, removing indexes is what we should do first, before considering any new ones, for the benefit of the entire system. Surprised? Understanding the consequences and overhead of indexes helps us make informed decisions and can save the system from many problems.
At a very basic level, we should remember that indexes are not free. Their benefits come at a cost in performance and resource consumption. The following is a list of ten problems/overheads that excessive indexing can cause. This article is about PostgreSQL, but most of these problems apply to other database systems as well.
10 Ways Excessive Indexes Can Hurt PostgreSQL Performance
Indexes penalize transactions
We might see an improvement in the performance of a SELECT statement after adding an index, but we should not forget that this gain comes at a cost to transactions on the same table. Conceptually, every DML on a table needs to update all of the table's indexes. Even though there are many optimizations that reduce this write amplification, it remains a considerable overhead.
For example, assume there are five indexes on a table: every INSERT into the table will also insert an index record into each of those five indexes, so logically, five index pages will be updated as well. Effectively, the overhead is 5x.
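A quick way to observe this on a scratch system is to time the same bulk load with and without indexes. The table and index names below are hypothetical, and the exact numbers will vary; this is only a sketch:

```sql
-- Hypothetical test table for measuring index write overhead.
CREATE TABLE t_demo (id int, col1 int, col2 int, col3 int, col4 int, col5 int);

\timing on
-- Baseline: load one million rows with no indexes.
INSERT INTO t_demo SELECT g, g, g, g, g, g FROM generate_series(1, 1000000) g;

-- Add five indexes, then repeat the same load.
CREATE INDEX t_demo_col1 ON t_demo (col1);
CREATE INDEX t_demo_col2 ON t_demo (col2);
CREATE INDEX t_demo_col3 ON t_demo (col3);
CREATE INDEX t_demo_col4 ON t_demo (col4);
CREATE INDEX t_demo_col5 ON t_demo (col5);

TRUNCATE t_demo;
INSERT INTO t_demo SELECT g, g, g, g, g, g FROM generate_series(1, 1000000) g;
\timing off
```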
Memory usage
Index pages must be in memory regardless of whether any query uses them, because transactions need to update them. Effectively, this leaves less memory for caching the pages of the table itself. The more indexes, the more memory is required for effective caching. If we don't increase the available memory, this starts hurting the performance of the entire system.
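The pg_buffercache contrib extension can show how shared_buffers is actually split between tables and indexes. A rough sketch (assuming the default 8kB block size):

```sql
-- Requires: CREATE EXTENSION pg_buffercache;
SELECT c.relname,
       c.relkind,                                -- 'r' = table, 'i' = index
       count(*)                        AS buffers,
       pg_size_pretty(count(*) * 8192) AS buffered  -- assumes 8kB blocks
FROM   pg_buffercache b
JOIN   pg_class c
       ON b.relfilenode = pg_relation_filenode(c.oid)
      AND b.reldatabase = (SELECT oid FROM pg_database
                           WHERE datname = current_database())
GROUP  BY c.relname, c.relkind
ORDER  BY count(*) DESC
LIMIT  20;
```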
Random writes: Updating indexes is more costly
Unlike an INSERT into a table, where new rows typically go to the same few pages at the end of the heap, an index entry must be placed in the specific page determined by its key value. Indexes such as B-Tree indexes are therefore known to cause more random writes.
Indexes need more cache than tables
Due to random writes and reads, indexes need more of their pages present in the cache. The cache requirement for indexes is generally much higher than for the associated tables.
WAL generation
In addition to the WAL records for table updates, there will also be WAL records for indexes; these are needed for crash recovery and replication. If you are using a wait event analysis tool/script such as pg_gather, the overhead of WAL generation will be clearly visible. The actual impact depends on the index type.
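On a test system, one way to quantify the extra WAL is to compare pg_current_wal_lsn() before and after the same workload, with and without the indexes. A sketch using psql (the table name is the hypothetical one from earlier):

```sql
-- Capture the current WAL position into a psql variable.
SELECT pg_current_wal_lsn() AS before_lsn \gset

-- Run the workload to be measured.
INSERT INTO t_demo SELECT g, g, g, g, g, g FROM generate_series(1, 100000) g;

-- How much WAL did it generate?
SELECT pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), :'before_lsn')
       ) AS wal_generated;
-- Drop the indexes and repeat to compare WAL volumes.
```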
This is a synthetic test case, but if WAL-related wait events appear among the top wait events, it is a matter of concern for a transactional system, and we should take every step to address it.
More and more I/O
It is not only WAL records that are generated; more pages get dirtied as well. As index pages are dirtied, they must be written back to the data files, leading to even more I/O, which shows up as the "DataFileWrite" wait event in a wait event analysis.
Another side effect is that indexes increase the total active dataset size. By "active dataset," I mean the tables and indexes that are frequently queried and used. As the active dataset grows, the cache becomes less and less efficient. A less effective cache results in more data file reads, so read I/O increases. This is in addition to the read I/O required to bring the additional index pages from storage for specific queries.
A pg_gather report from another system, one running mainly SELECT queries, shows the same problem: as the active dataset grows, PostgreSQL has no choice but to bring pages in from storage. A significant percentage of "DataFileRead" wait events sustained over a long duration indicates that the active dataset is much bigger than what can be cached.
Impact on VACUUM/AUTOVACUUM
The overhead is not limited to inserting and updating index pages, as discussed in the previous points; there is also maintenance overhead, because indexes need cleanup of old tuple references as well.
I have seen cases where autovacuum workers on a single table run for a very long time because of the size of the table and, most importantly, the excessive number of indexes on it. In fact, it is very common for users to see an autovacuum worker apparently "stuck" for hours without showing any progress. This happens because index cleanup is the most opaque stage of autovacuum: views like pg_stat_progress_vacuum reveal nothing beyond the phase being reported as "vacuuming indexes".
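The little visibility we do get comes from pg_stat_progress_vacuum; while the worker is in the index phase, the heap counters stall and essentially only index_vacuum_count advances:

```sql
-- Monitor running (auto)vacuum workers. During the "vacuuming indexes"
-- phase, the heap_blks_* counters stall and only index_vacuum_count moves.
SELECT pid,
       relid::regclass AS table_name,
       phase,
       heap_blks_scanned,
       heap_blks_vacuumed,
       index_vacuum_count
FROM   pg_stat_progress_vacuum;
```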
Indexes can get bloated and become less efficient over time. Periodic index maintenance (REINDEX) might be needed in many systems.
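The pgstattuple contrib extension gives a rough indication of B-tree bloat, and REINDEX CONCURRENTLY (available since PostgreSQL 12) rebuilds an index without blocking writes. The index name below is the hypothetical one from earlier:

```sql
-- Requires: CREATE EXTENSION pgstattuple;
-- A low avg_leaf_density or high leaf_fragmentation suggests bloat.
SELECT avg_leaf_density, leaf_fragmentation
FROM   pgstatindex('t_demo_col1');

-- Rebuild the index without blocking concurrent writes (PostgreSQL 12+).
REINDEX INDEX CONCURRENTLY t_demo_col1;
```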
Tunnel vision while tuning
Tunnel vision is the loss of peripheral vision. A user may concentrate on one particular SQL statement in an attempt to "tune" it and decide to create indexes for it. By creating an index to tune one query, we shift more system resources toward that query; it may gain performance, but at the expense of other statements.
As we keep creating more indexes to tune other queries, the resources keep shifting toward other queries in turn. This leads to a situation where the effort to tune every query penalizes every other query; ultimately everyone is hurt, and there are only losers in this war. Anyone tuning a system should consider how every part of it can coexist (maximizing business value), rather than chasing the absolute maximum performance of one particular query.
Greater storage requirement
Almost every day, I see cases where indexes take more storage than tables.
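This is easy to check with the built-in size functions; for example, listing the tables whose indexes consume the most space:

```sql
-- Compare each table's size with the combined size of its indexes.
SELECT relname,
       pg_size_pretty(pg_table_size(relid))   AS table_size,
       pg_size_pretty(pg_indexes_size(relid)) AS index_size
FROM   pg_stat_user_tables
ORDER  BY pg_indexes_size(relid) DESC
LIMIT  20;
```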
This may sound trivial to those with money to spend on storage, but we should remember that it has a cascading effect: the total database size grows to a multiple of the actual data. Backups therefore take more time, storage, and network resources, and the backup itself can put more load on the host machine. Restoring a backup and recovering from it also take longer. Bigger databases affect many things, including the time needed to build standby instances.
Indexes are more prone to corruption
I am not just talking about the rare index-related bugs, like the silent index corruption in PostgreSQL 14 or index corruption due to glibc collation changes, which keep popping up now and then and still affect many environments today. The simple truth is that as we increase the number of indexes, we increase the probability of index corruption.
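The amcheck contrib extension can verify B-tree index structure without taking heavy locks; a sketch against one of the hypothetical indexes from earlier:

```sql
-- Requires: CREATE EXTENSION amcheck;
-- bt_index_check() verifies B-tree invariants under an AccessShareLock;
-- bt_index_parent_check() is more thorough but takes stronger locks.
SELECT bt_index_check('t_demo_col1'::regclass);
```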
What should we do?
A set of critical questions should accompany every new index consideration: Is this index essential, or is it only there to speed up one query at the cost of yet another index? Is there a way to rewrite the query to get better performance? Is it OK to forgo the small gain and live without the index?
Existing indexes also require a critical review from time to time. All unused indexes (those with idx_scan equal to zero in pg_stat_user_indexes) should be considered for dropping. Scripts like the one from pgexperts can help with further analysis.
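A simple starting point is to list the never-used indexes together with the space they occupy. Keep in mind that these statistics are per node (an index unused on the primary may still serve queries on a standby), and that indexes backing PRIMARY KEY or UNIQUE constraints must stay:

```sql
-- Indexes never scanned since the statistics were last reset,
-- largest first. Review carefully before dropping any of them.
SELECT schemaname,
       relname,
       indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM   pg_stat_user_indexes
WHERE  idx_scan = 0
ORDER  BY pg_relation_size(indexrelid) DESC;
```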
PostgreSQL 16 adds one more column to pg_stat_user_indexes / pg_stat_all_indexes, named last_idx_scan, which tells us when the index was last used (as a timestamp). This helps us take a well-informed look at all the indexes in the system.
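On PostgreSQL 16 and later, that makes questions like "which indexes have not been touched recently?" straightforward:

```sql
-- PostgreSQL 16+: a NULL last_idx_scan means the index has never been
-- scanned since the statistics were last reset.
SELECT indexrelname, idx_scan, last_idx_scan
FROM   pg_stat_user_indexes
ORDER  BY last_idx_scan ASC NULLS FIRST;
```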
Summary
In simple words: indexes are not cheap. There is a cost, and the cost can be manifold. Indexes are not always good, and sequential scans are not always bad, either. My humble advice is not to start by trying to improve individual queries, because that is a slippery slope. A top-down approach to tuning yields better results: start with the host machine, the operating system, PostgreSQL parameters, the schema, and so on. An objective cost-benefit analysis is important before creating any index.