January 7, 2025
Summary: in this tutorial, you will learn how to optimize SELECT SQL queries by improving filtering.
Table of Contents
Sometimes, we need to evaluate and improve the WHERE
part of the query, subtle changes to the filtering statement can have a massive impact on query performance.
Avoiding functions in filtering
How
Applying a function to a column in the filtering phase slows down the performance. The database needs to apply the function to the dataset before filtering. Let’s take a simple example of filtering on a timestamp field:
SELECT count(*)
FROM orders
WHERE CAST(order_timestamp AS DATE) >= '2024-02-01';
The above query on a 100,000,000 row dataset runs in 01 min 53 sec
because it needs to change the data type of the order_timestamp
column from timestamp to date before applying the filter. But, that’s not necessary! Note that, if you give it the above query and table metadata, it can be rewritten to:
SELECT count(*)
FROM orders
WHERE order_timestamp >= '2024-02-01 00:00:00';
The rewritten query uses the native timestamp field without casting. The result of such small change is that the query now runs in 20 sec
, nearly 6 times faster than the original.
Warning
Not all functions can be avoided, since some might be needed to retrieve parts of the column value (think about substring
examples) or to reshape it. Nevertheless, every time you are about to add a function in a filter, think about alternative ways to use the native data type operators.
Golden rule
When applying a filter to a column, try to reshape the filter format rather than the column format.
The above is a perfect example: moving the filter format from the date 2024-02-01
to the timestamp 2024-02-01 00:00:00
allowed us to use the native timestamp data format and operators.
Pro Tip
If applying the function is a must, you can try the following two options:
- Create an index on the expression, available in PostgreSQL and MySQL
- Use database triggers to populate an additional column with the transformation already in place
Improving subqueries
How
Subqueries are commonly used in filters to retrieve the set of values to be applied as filters. A common example is when needing to retrieve the list of users having recent activity.
SELECT *
FROM users
WHERE id IN (
SELECT DISTINCT user_id
FROM sessions
WHERE session_date = '2024-02-01');
The above query retrieves the distinct list of users from the SESSIONS
table and then applies the filter on the USERS
table. However, there are several, more performant, ways of achieving the same results. An example is using EXISTS
:
SELECT *
FROM users
WHERE EXISTS (
SELECT user_id
FROM sessions
WHERE user_id = id and session_date = '2024-02-01'
);
EXISTS
is faster since it doesn’t need to retrieve the list of distinct users from the SESSION
table, but it just verifies the presence of at least one row in the table for a specific user. The use case above went from a performance of 02 min 08 sec
to 18 sec
by just changing the subquery section.
Warning
Subtle changes in the subquery could provide different results in corner cases.
Golden rule
When needing to use subqueries, take time to learn and understand what are the options and what they allow you to achieve. Several times there’s more than one option and some functionalities will provide better response times.
Paginating results
How
When needing to display a long list of rows, it’s useful to paginate the results retrieved from the database. Both PostgreSQL and MySQL offer the functionality to LIMIT
the output to retrieve only a certain amount of rows and to OFFSET
the result set by retrieving only rows in a specific range based on ordering. Using LIMIT
and OFFSET
is a good way to minimize the data sent to clients to only the one needed to be displayed.
Warning
The drawback of using LIMIT
and OFFSET
is that you’ll need a query for each “page” to be sent and executed to the database. This could be inconvenient if the overall number of rows is not far from the page size. As example, if you’re showing results in pages of 10
rows, but there are on average 15
rows to display, it might be better to retrieve the entire dataset at once.
Golden rule
If the size of the resultset is an order of magnitude larger than the page size, using paging can be an effective way to ensure better performance, since only the visible dataset will be retrieved from the database.
Pro tip
The LIMIT
and OFFSET
clauses are the default pagination method in most databases. However, more efficient paging implementations can be achieved by storing the starting and ending offsets of the current page on the client side and pushing the filtering for the following page in the WHERE
clause of the SQL statement. A couple of examples of this implementation are available in the Paginated Select presentation.
Moving filters from HAVING to WHERE clause
How
When running a query the filters defined in the WHERE
and HAVING
clauses are applied at different times:
- The
WHERE
filters are applied on the raw dataset, before any data transformation defined in theSELECT
statement is applied - The
HAVING
filters are applied post aggregation, therefore after all the rows have been retrieved, transformed and rolled up.
Therefore moving filters in the WHERE
section should be a priority since it allows you to work on a smaller dataset. An example is if we try to get the list of dates for which we have a session with a tracked user_id
(the user_id
field is not null)
SELECT session_date, count(user_id) nr_sessions
FROM sessions
GROUP BY session_date
HAVING count(user_id) > 0;
We can rewrite the above query by pushing the filter into the WHERE
clause with:
SELECT session_date, count(user_id) nr_sessions
FROM sessions
WHERE user_id is not null
GROUP BY session_date
The query performance, on a 100.000.000
rows dataset went from 21 sec
to 18 sec
with just a change in the filters statement.
Warning
Moving filters from the HAVING
to the WHERE
clause is possible only if we can determine a similar condition that applies at row level instead of the one applying at aggregate level.
Golden rule
Always try to filter in the WHERE
clause since the filtering applies before the data is transformed/aggregated by the SELECT
statement.