April 1, 2024
Summary: Caching is used to speed up applications, particularly for database reads. In this tutorial, you will learn how to build caching with Redis into a simple PostgreSQL application.
Overview
With any sufficiently complex application, performance becomes a primary concern for optimization. One of the key metrics for measuring performance of any software is the speed of reading and writing from a database.
Most applications repeatedly store (write) and retrieve (read) some kind of data. However, in most cases the number of reads far exceeds the number of writes. To make reads more efficient, we use a caching layer. Caching is the act of writing to a location (for instance, a block of memory) specifically designed for quick retrieval of common requests.
For example, imagine a customer creating a profile on an online store. The customer fills out their name, phone number, and address, which we store to our database. Because these pieces of information are required at multiple points in the checkout process, it’s worth storing them in a cache the first time we retrieve them from the database. This speeds up application processing time for all following retrievals. The difference in retrieval speed for any single request might be mere milliseconds, but when we develop applications for the cloud, for millions of users simultaneously, those milliseconds add up.
Caching cloud applications introduces some complexities but also offers opportunities for optimization. Instead of using a block of memory (and the unbounded, chaotic nature that entails), we can use two databases! We can use one database as a data store, and one as a cache. Further, we can choose a data store that is optimized for things like concurrency control, and a cache that is optimized for speedy reads and writes, while still taking advantage of everything that PostgreSQL offers us.
Getting ready
Download the dellstore2-normal-1.0.tar.gz file from the PostgreSQL dellstore2 sample database.
Unpack it - for instance, at the terminal you can use:
tar -xf dellstore2-normal-1.0.tar.gz
Navigate to the dellstore2-normal-1.0 folder on your terminal:
cd dellstore2-normal-1.0
Next, in a psql session, create a new database, dellstore:
CREATE DATABASE dellstore;
Then change to the correct database:
\c dellstore
When you change to the dellstore database, the prompt should change to dellstore=>.
Import the data by typing:
\i dellstore2-normal-1.0.sql
Then check what objects have been created in the database with:
\d
The output should look like this:
List of relations
Schema | Name | Type | Owner
--------+--------------------------+----------+----------
public | categories | table | postgres
public | categories_category_seq | sequence | postgres
public | cust_hist | table | postgres
public | customers | table | postgres
public | customers_customerid_seq | sequence | postgres
public | inventory | table | postgres
public | orderlines | table | postgres
public | orders | table | postgres
public | orders_orderid_seq | sequence | postgres
public | products | table | postgres
public | products_prod_id_seq | sequence | postgres
public | reorder | table | postgres
(12 rows)
Exit psql by typing \q, and leave the dellstore2-normal-1.0 directory.
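As an optional sanity check, you can query the new database from Python. Here's a minimal sketch, assuming the psycopg (version 3) package is installed and a local PostgreSQL server accepts the dbname=dellstore connection string with your default credentials:
import psycopg

# Count the rows imported into the orders table - adjust the connection
# string (user, password, host) to match your own PostgreSQL setup.
with psycopg.connect('dbname=dellstore') as conn:
    with conn.cursor() as cur:
        cur.execute('SELECT COUNT(*) FROM orders;')
        print(cur.fetchone()[0])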
Create a simple application
We’ll create a read_count function that performs a COUNT of records in the database for a given customer id.
This is intentionally a slow operation, as COUNT scans the whole orders table, record by record. In most production use cases, you wouldn’t do this too often.
In real life, a better example of a slow query would be calculating the checkout price of a basket on an ecommerce site. This needs a lot of database access, which takes time, and we don’t want to recalculate everything just because the customer hit “refresh”.
We can add the read_count method to ask for how many orders there are in our database (you can see a brief introduction to how to use psycopg3 to make SQL queries in its documentation, at Basic module usage):
@app.get("/count")
def read_count(customerid):
conn = connect_to_postgres()
try:
cursor = conn.cursor()
cursor.execute('SELECT COUNT(*) FROM orders WHERE customerid = %s;', (customerid,))
count = cursor.fetchone()[0]
finally:
conn.close()
return {'count': count}
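The connect_to_postgres() helper isn't shown here; a minimal sketch, assuming psycopg (version 3) and the local dellstore database created earlier, might look like this (adjust the connection string for your own setup):
import psycopg

def connect_to_postgres():
    # Open a new connection to the local dellstore database.
    # In a real deployment you'd read host, user and password from
    # configuration rather than hard-coding a connection string.
    return psycopg.connect('dbname=dellstore')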
Let’s test this first working version of the application.
Why do we want caching?
When you run the application as shown above, the code executes the COUNT every time the read_count() method is called – that is, every time a GET request is made. We don’t store the result anywhere, so we have to perform this expensive, slow operation every time.
Using Redis as a cache solves this: we run one Redis instance that all our backends can access. This lets us store the results of read_count() outside our code and our PostgreSQL database. If we’re running a modern cloud-based application, other copies of our backend can access the results of read_count(), and we don’t need to run the expensive function as often. Because Redis keeps its data in memory rather than reading and writing it to disk, we can achieve very low latencies.
Connect to Redis and cache the GET method
Now let’s add caching for the read_count function.
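As with PostgreSQL, we’ll rely on a small connect_to_redis() helper. It isn’t defined in this section; a minimal sketch, assuming the redis Python package and a Redis server running on localhost, might be:
import redis

def connect_to_redis():
    # Connect to a local Redis server; decode_responses=True makes GET
    # return Python strings instead of raw bytes.
    return redis.Redis(host='localhost', port=6379, decode_responses=True)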
Finally, modify the read_count() function to add values to Redis when called. We do this by:

- Connecting to Redis using our connect_to_redis() function.
- Creating a variable to store our cache key in, orders:count. It’s a good idea to name your cache keys something meaningful.
- Looking in Redis for any value stored under the orders:count key.
  - If a value is found in Redis, we return the value and exit the function.
  - If not, we connect to PostgreSQL, run the SELECT COUNT(*) FROM orders statement, add that value to the Redis cache, and return the value before exiting.
The read_count() function should now look like:
@app.get("/count")
def read_count(customerid):
cache_key = 'orders:count:{0}'.format(customerid)
redis_conn = connect_to_redis()
# Is it already in the cache? If so, just use it
count = redis_conn.get(cache_key)
if count:
logging.debug(f'Found {cache_key}, value is {count}')
return {'count': count}
pg_conn = connect_to_postgres()
try:
cursor = pg_conn.cursor()
cursor.execute('SELECT COUNT(*) FROM orders WHERE customerid = %s;', (customerid,))
count = cursor.fetchone()[0]
finally:
pg_conn.close()
# Remember to add it to the cache
logging.debug(f'Caching {cache_key}, value is {count}')
redis_conn.set(cache_key, count)
return {'count': count}
Let’s test the changes to our application.
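One quick way to see the cache working is to hit the endpoint twice and compare timings. A rough sketch, assuming the app is running locally on port 8000 (for example via uvicorn) and the requests library is installed:
import time
import requests

def timed_count(customerid):
    # Issue a GET /count request and report roughly how long it took.
    start = time.perf_counter()
    result = requests.get('http://localhost:8000/count', params={'customerid': customerid}).json()
    print(result, f'({time.perf_counter() - start:.3f}s)')

timed_count(9999)  # cache miss: PostgreSQL runs the COUNT
timed_count(9999)  # cache hit: served straight from Redis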
Add a POST method to the application
Next, let’s add a POST function to the application, so we can add new orders to our PostgreSQL database.
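The handler takes an Item, a Pydantic model that isn’t shown in this section. A minimal sketch, with field names inferred from the INSERT statement below, might look like this:
from pydantic import BaseModel

class Item(BaseModel):
    # Field names mirror the columns used in the INSERT statement;
    # totalamount is calculated as netamount + tax, so it isn't needed here.
    orderdate: str
    customerid: int
    netamount: float
    tax: float
With a model along those lines in place, the POST handler looks like: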
@app.post("/add/")
def post_add(item: Item):
"""Add a new order."""
order = item.dict()
logging.debug(f'Adding order {order}')
pg_conn = connect_to_postgres()
try:
cursor = pg_conn.cursor()
cursor.execute(
'INSERT INTO orders'
' (orderdate, customerid, netamount, tax, totalamount)'
' VALUES (%s, %s, %s, %s, %s);',
(item.orderdate, item.customerid, item.netamount, item.tax, item.netamount + item.tax),
)
pg_conn.commit()
logging.debug('Added new order')
finally:
pg_conn.close()
return {'updated': {'customer': item.customerid}}
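We can exercise the new endpoint with a request along these lines - a sketch, assuming the app is running locally on port 8000 and the requests library is installed (the values match the second row of the output below):
import requests

# Add a new order for customer 9999; the server computes totalamount.
response = requests.post(
    'http://localhost:8000/add/',
    json={'orderdate': '2023-03-10', 'customerid': 9999, 'netamount': 23.00, 'tax': 2.00},
)
print(response.json())  # {'updated': {'customer': 9999}}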
Then connect to the database and make a query:
select * from orders where customerid = 9999;
which should respond as follows:
orderid | orderdate | customerid | netamount | tax | totalamount
---------+------------+------------+-----------+------+-------------
7918 | 2004-08-28 | 9999 | 51.19 | 4.22 | 55.41
12006 | 2023-03-10 | 9999 | 23.00 | 2.00 | 25.00
(2 rows)
Here we have a problem: our cache is out of date. We only update the cache when we call read_count()
for the first time. When we update the database from another method, we aren’t updating the cache to reflect the change.
Dealing with out of date caches
Invalidating caches is a common issue in software development. Our cache is only as useful as it is accurate. So what do we need to do to ensure our cache stays accurate?
Well, for starters, we need to ensure that our count
cache is invalidated whenever we call our POST method.
Invalidating the cache
We can solve the problem of our cache getting outdated by “throwing away” the cached value when we add a new order. To do that we need to add the following lines to the start of the post_add
function, just before the call to connect_to_postgres()
(the comments aren’t strictly necessary, but they make it clearer what we’re doing):
# Invalidate the cache entry whether we succeed in doing the update or not
# - we don't expect the update to fail, and it shouldn't hurt to clear the
# cache a bit too often
cache_key = 'orders:count:{0}'.format(item.customerid)
redis_conn = connect_to_redis()
# The Redis command is `del`, but that's special in Python
redis_conn.delete(cache_key)
The actual Redis command is called del, but that’s a reserved word in Python, so it can’t be used as a method name; the Redis library chooses delete as a good alternative.
The whole function now looks like:
@app.post("/add/")
def post_add(item: Item):
"""Add a new order."""
order = item.dict()
logging.debug(f'Adding order {order}')
# Invalidate the cache entry whether we succeed in doing the update or not
# - we don't expect the update to fail, and it shouldn't hurt to clear the
# cache a bit too often
cache_key = 'orders:count:{0}'.format(item.customerid)
redis_conn = connect_to_redis()
# The Redis command is `del`, but that's special in Python
redis_conn.delete(cache_key)
pg_conn = connect_to_postgres()
try:
cursor = pg_conn.cursor()
cursor.execute(
'INSERT INTO orders'
' (orderdate, customerid, netamount, tax, totalamount)'
' VALUES (%s, %s, %s, %s, %s);',
(item.orderdate, item.customerid, item.netamount, item.tax, item.netamount + item.tax),
)
pg_conn.commit()
logging.debug('Added new order')
finally:
pg_conn.close()
return {'updated': {'customer': item.customerid}}
Specifying a TTL (“time to live”)
Next, let’s set a Time to Live, or TTL.
This is a common practice when working with caches: it specifies how long a cached value should stay valid. As we add more functions and more microservices to our application, other applications might modify a database entry in the background. A TTL ensures that we invalidate the cache on a regular basis, reducing the chances that our cache drifts out of sync with the underlying data.
Let’s declare a variable for the cache timeout in seconds, and use it by altering the call to the Redis set method at the end of our read_count function. We just need to add the ex parameter to specify the timeout:
CACHE_TIMEOUT_SECONDS = 5
redis_conn.set(cache_key, count, ex=CACHE_TIMEOUT_SECONDS)
Now, if we POST a new order, the next GET recomputes and re-caches the count, and subsequent GET requests only find a cached value if they arrive within the CACHE_TIMEOUT_SECONDS window.
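If you want to watch the expiry happen, you can ask Redis how long a key has left to live. A small sketch, reusing the connect_to_redis() helper assumed earlier (and assuming a GET for customer 9999 was made within the last few seconds, so the key exists):
import time

redis_conn = connect_to_redis()
cache_key = 'orders:count:9999'

# ttl() returns the remaining lifetime in seconds, -1 if the key has no
# expiry set, and -2 if the key doesn't exist (for example, after it expires).
print(redis_conn.ttl(cache_key))
time.sleep(CACHE_TIMEOUT_SECONDS + 1)
print(redis_conn.ttl(cache_key))  # should now be -2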