PostgreSQL Tutorial: Controlling resource consumption using Linux cgroup2

November 1, 2024

Summary: in this tutorial, you will learn how to control resource consumption on a PostgreSQL server using Linux cgroup2.

Table of Contents

Introduction

Multi-tenancy/co-hosting is always challenging. Running multiple PG instances can help reduce the internal contention points (scalability issues) within PostgreSQL. However, the load caused by one tenant can affect the other tenants, an effect generally referred to as the “Noisy Neighbor” problem. Luckily, Linux allows users to control the resources consumed by each program using cgroups (Control Groups). cgroup2 came as a replacement for cgroup version 1, addressing almost all of the architectural limitations of version 1.

We should be able to use cgroup2 reliably if the Linux kernel version is 5.2.0 or later. More practically, if you are running a Linux distribution released in 2022 or later, your host machine will most probably be ready for cgroup2.
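A quick, read-only way to check both at once is to look at the kernel release and the filesystem type mounted at /sys/fs/cgroup: on cgroup2 systems the type is cgroup2fs, while the legacy v1 layout mounts a tmpfs there.

```shell
# Kernel release (cgroup2 is reliable on 5.x and later kernels)
uname -r

# Filesystem type of the cgroup mount point:
# "cgroup2fs" means cgroup2; "tmpfs" indicates the legacy v1 layout
stat -fc %T /sys/fs/cgroup/
```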

An easy way to check whether Linux is using cgroup version 1 or 2 is to count the number of cgroup mounts.

$ grep -c cgroup /proc/mounts
1

If the count is one, we have cgroup2. cgroup2 uses a single, unified hierarchy, whereas cgroup version 1 mounts each controller separately, so multiple mounts indicate that version 1 is in effect.

If the kernel version is new but cgroup1 is still in effect, you may have to use the boot parameter “systemd.unified_cgroup_hierarchy=1”. On Red Hat/OEL systems, we can add this parameter by executing the following:

sudo grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=1"

Basically, this adds the parameter to the kernel command line as a bootloader option, like:

$ cat /etc/default/grub
...
GRUB_CMDLINE_LINUX="xxxxxx systemd.unified_cgroup_hierarchy=1"
...

This change requires a restart of the machine.

After restarting, you may verify:

$ sudo mount -l | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel,nsdelegate)

Please make sure that it is mentioned as “cgroup2”.

Now we shall inspect this virtual filesystem for a better understanding.

$ ls -l /sys/fs/cgroup/
total 0
-r--r--r--.   1 root root 0 May 27 02:10 cgroup.controllers
-rw-r--r--.   1 root root 0 May 27 02:10 cgroup.max.depth
-rw-r--r--.   1 root root 0 May 27 02:10 cgroup.max.descendants
-rw-r--r--.   1 root root 0 May 27 02:10 cgroup.procs
-r--r--r--.   1 root root 0 May 27 02:10 cgroup.stat
-rw-r--r--.   1 root root 0 May 27 02:10 cgroup.subtree_control
-rw-r--r--.   1 root root 0 May 27 02:10 cgroup.threads
-rw-r--r--.   1 root root 0 May 27 02:10 cpu.pressure
-r--r--r--.   1 root root 0 May 27 02:10 cpuset.cpus.effective
-r--r--r--.   1 root root 0 May 27 02:10 cpuset.mems.effective
-r--r--r--.   1 root root 0 May 27 02:10 cpu.stat
drwxr-xr-x.   2 root root 0 May 27 02:10 init.scope
-rw-r--r--.   1 root root 0 May 27 02:10 io.pressure
-r--r--r--.   1 root root 0 May 27 02:10 io.stat
drwxr-xr-x.   2 root root 0 May 27 02:10 machine.slice
-r--r--r--.   1 root root 0 May 27 02:10 memory.numa_stat
-rw-r--r--.   1 root root 0 May 27 02:10 memory.pressure
-r--r--r--.   1 root root 0 May 27 02:10 memory.stat
-r--r--r--.   1 root root 0 May 27 02:10 misc.capacity
drwxr-xr-x. 107 root root 0 May 27 02:10 system.slice
drwxr-xr-x.   3 root root 0 May 27 02:16 user.slice

This is the root control group. All slices come under it. “system.slice” and “user.slice” appear as directories because they are the next levels in the hierarchy.

We can check which cgroup controllers are available on the machine as follows:

$ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc
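A controller listed in cgroup.controllers is available, but a child group can only use controllers that the parent has delegated through cgroup.subtree_control. The following read-only check shows both files; the write at the end is only a sketch (it requires root, and systemd normally manages this delegation itself):

```shell
# Controllers available at the root of the hierarchy:
cat /sys/fs/cgroup/cgroup.controllers

# Controllers delegated to the next level down:
cat /sys/fs/cgroup/cgroup.subtree_control

# To delegate cpu and memory to child groups manually (requires root):
# echo "+cpu +memory" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
```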

Putting cgroup2 into practice

Creating a slice

Creating a separate slice for the PostgreSQL instances is a good idea when there are multiple instances. This will allow us to control the overall consumption of resources from a higher level. Let’s assume that we want to restrict all PostgreSQL services from exceeding 25% of the machine’s CPU. The first step is to create a slice:

sudo systemctl edit --force postgres.slice

For the demonstration, I am adding the following unit configuration:

[Unit]
Description=PostgreSQL Slice
Before=slices.target

[Slice]
MemoryAccounting=true
MemoryMax=2048M
CPUAccounting=true
CPUQuota=25%
TasksMax=4096

Save and quit the editor, and then reload.

sudo systemctl daemon-reload

At any time, we can check the status of the slice with sudo systemctl status postgres.slice.
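Once the slice is active (it appears in the cgroup filesystem after the first service in it starts), its limits can be read directly from /sys/fs/cgroup. The cpu.max file holds “&lt;quota&gt; &lt;period&gt;” in microseconds, so CPUQuota=25% should show up as 25000 100000. A small sketch, assuming the postgres.slice created above:

```shell
# Raw quota and period, in microseconds:
cat /sys/fs/cgroup/postgres.slice/cpu.max

# Convert the raw values to a percentage ("max" means unlimited):
awk '{ if ($1 == "max") print "unlimited"; else printf "%.0f%%\n", $1 * 100 / $2 }' \
    /sys/fs/cgroup/postgres.slice/cpu.max
```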

Modifying PostgreSQL service

We can now use the slice we created in the PostgreSQL service, for which we need to edit the service unit:

$ sudo systemctl edit --full postgresql-16

Add a specification about the slice, like Slice=postgres.slice, under the [Service] section of the unit file.

...
[Service]
Type=notify

User=postgres
Group=postgres
Slice=postgres.slice
...

Save and exit the editor. This change requires a restart of the PostgreSQL service.
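As an alternative to editing the full unit, a drop-in override keeps the change out of the vendor unit file: systemd merges any *.conf files under /etc/systemd/system/&lt;unit&gt;.d/ over the unit definition. A sketch, assuming the same postgresql-16 unit name:

```shell
# Create a drop-in directory and an override containing only the Slice= line
sudo mkdir -p /etc/systemd/system/postgresql-16.service.d
sudo tee /etc/systemd/system/postgresql-16.service.d/slice.conf > /dev/null <<'EOF'
[Service]
Slice=postgres.slice
EOF

# Reload units and restart the service so it moves into the slice
sudo systemctl daemon-reload
sudo systemctl restart postgresql-16
```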

On restarting the PostgreSQL service, PostgreSQL will start running under the new slice.

$ systemd-cgls | grep post
├─postgres.slice
│ └─postgresql-16.service
│   ├─3760 /usr/pgsql-16/bin/postgres -D /var/lib/pgsql/16/data/
│   ├─3761 postgres: logger
│   ├─3762 postgres: checkpointer
│   ├─3763 postgres: background writer
│   ├─3765 postgres: walwriter
│   ├─3766 postgres: autovacuum launcher
│   └─3767 postgres: logical replication launcher
│ 	└─3770 grep --color=auto post

The same will be visible in the service status.

$ sudo systemctl status postgresql-16
  postgresql-16.service - PostgreSQL 16 database server
   Loaded: loaded (/etc/systemd/system/postgresql-16.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2024-05-27 12:54:26 EDT; 7s ago
     Docs: https://www.postgresql.org/docs/16/static/
  Process: 5957 ExecStartPre=/usr/pgsql-16/bin/postgresql-16-check-db-dir ${PGDATA} (code=exited, status=0/SUCCESS)
 Main PID: 5962 (postgres)
    Tasks: 7 (limit: 29176)
   Memory: 18.1M
   CGroup: /postgres.slice/postgresql-16.service
           ├─5962 /usr/pgsql-16/bin/postgres -D /var/lib/pgsql/16/data/
           ├─5963 postgres: logger
           ├─5964 postgres: checkpointer
           ├─5965 postgres: background writer
           ├─5967 postgres: walwriter
           ├─5968 postgres: autovacuum launcher
           └─5969 postgres: logical replication launcher

May 27 12:54:25 localhost.localdomain systemd[1]: postgresql-16.service: Succeeded.
May 27 12:54:25 localhost.localdomain systemd[1]: Stopped PostgreSQL 16 database server.
May 27 12:54:25 localhost.localdomain systemd[1]: Starting PostgreSQL 16 database server...
May 27 12:54:26 localhost.localdomain postgres[5962]: 2024-05-27 12:54:26.153 EDT [5962] LOG:  redirecting log output to logging collector process
May 27 12:54:26 localhost.localdomain postgres[5962]: 2024-05-27 12:54:26.153 EDT [5962] HINT:  Future log output will appear in directory "log"
May 27 12:54:26 localhost.localdomain systemd[1]: Started PostgreSQL 16 database server.

Verification

I tried to create a heavy load on the system by running a benchmark suite with many sessions in parallel on a single-CPU machine. Irrespective of what I tried, Linux restricted PostgreSQL from exceeding the limits specified by the slice.
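A load like this can be reproduced with pgbench, PostgreSQL's bundled benchmark tool. A sketch; the database name “bench”, the scale factor, and the client counts are arbitrary choices for illustration, not the exact values used in the test above:

```shell
# Create and initialize a benchmark database
# (each scale unit is 100,000 rows in pgbench_accounts)
createdb bench
pgbench -i -s 50 bench

# Drive load: 16 concurrent clients, 4 worker threads, for 5 minutes
pgbench -c 16 -j 4 -T 300 bench
```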

[Image: top output showing the PostgreSQL processes capped at about 25% CPU]

If we add up the CPU utilization of all the PostgreSQL processes, the total comes to about 24.9%, just under the 25% quota! (To make the counting easier, I used a single-CPU-core machine.)

The same workload without any cgroup restrictions can bring the server to 100% utilization (0% idle).

[Image: top output showing 100% CPU utilization (0% idle) without cgroup restrictions]

* cgroup slice restrictions reduced the throughput, which is expected.

We can have multiple services in a slice, which will be the next level in the hierarchy. systemd-cgtop can show us the slice-wise and individual service-wise utilization.

[Image: systemd-cgtop output showing slice-wise and service-wise utilization]

Super cool, isn’t it? The quick demo concludes here.

Service level control

cgroup2 is very versatile, and many more options exist. For example, you may not want to create a separate slice for the PostgreSQL services, as demonstrated above, especially when there is only one PostgreSQL instance on the host machine. By default, PostgreSQL, like all other services, will be part of “system.slice”. The easier method in this case is to specify the cgroup restrictions at the service level rather than at the slice level.

For example:

sudo systemctl edit --full postgresql-16

And specify the resource control configuration directly in the service unit, under the [Service] section.

...
[Service]
User=postgres
Group=postgres

CPUAccounting=true
CPUQuota=25%
...

* Changes will take effect on the next restart of the service.
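When a restart is not convenient, systemctl set-property applies resource-control settings to a running unit immediately; with --runtime the change is discarded at the next reboot, while without it systemd also persists the setting. A sketch using the same 25% quota:

```shell
# Apply immediately to the running service; discarded on reboot:
sudo systemctl set-property --runtime postgresql-16.service CPUQuota=25%

# Apply immediately and persist across reboots:
sudo systemctl set-property postgresql-16.service CPUQuota=25%

# Inspect the effective value:
systemctl show postgresql-16.service -p CPUQuotaPerSecUSec
```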

Summary

Control groups are widely and silently used these days by other programs like Docker and Kubernetes. They are one of the well-proven methods of restricting resource consumption on a machine. The new cgroup2 makes them much simpler to use.

Clear control of the resource usage on a host machine opens up many possibilities. Some that come to my mind are:

  1. Better multi-tenant environments. We can prevent the “Noisy Neighbor” effect in a multi-tenant environment by preventing tenants from competing for the same set of resources.
  2. Co-hosted application server + database server on the same machine. The vast majority of applications are CPU-intensive, while DB servers remain memory- and I/O-intensive. So there are cases where putting them together on the same machine makes sense, especially for small and simple applications. A big advantage of co-hosting the application and the database is that they can communicate over local sockets rather than TCP/IP; in practice, we see many cases in which the network is the silent performance destroyer (see: How To Measure the Network Impact on PostgreSQL Performance). Yet another advantage is that we don’t have to expose the database service (port) to the network.
  3. Protect the system from abuse, denial-of-service situations, and especially unwanted failovers. When the system becomes overloaded, it may become unresponsive for all the programs running on the machine, not just the database. Such situations often result in unwanted failovers triggered by HA frameworks. Good control of resource usage can prevent this from happening.

See more

PostgreSQL Administration