PostgreSQL Tutorial: Partition Pruning

January 23, 2025

Summary: In this tutorial, we will have a look at partition pruning in PostgreSQL.

Table of Contents

Preparing the data set

When you are looking for free data sets “The home of the U.S. Government’s open data” is a great source. It provides free data sets for research, development or just data you can play with for whatever reason. In this tutorial, we will use the Traffic violations data set. It contains about 2 million of rows and is a good starting point for diving into PostgreSQL partitioning.

The initial import goes into a table which only contains text columns and from there we can load our partitions later on. The structure of the table is:

create table traffic_violations
( seqid text
, date_of_stop text
, time_of_stop text
, agency text
, subagency text
, description text
, location text
, latitude text
, longitude text
, accident text
, belts text
, personal_injury text
, property_damage text
, fatal text
, commercial_license text
, hazmat text
, commercial_vehicle text
, alcohol text
, work_zone text
, search_conducted text
, search_disposition text
, search_outcome text
, search_reason text
, search_reason_for_stop text
, search_type text
, search_arrest_reason text
, state text
, vehicletype text
, year text
, make text
, model text
, color text
, violation_type text
, charge text
, article text
, contributed_to_accident text
, race text
, gender text
, driver_city text
, driver_state text
, dl_state text
, arrest_type text
, geolocation text
);

For loading the table the easiest way is to use PostgreSQL’s COPY command:

COPY traffic_violations FROM '/home/postgres/Traffic_Violations.csv'
  WITH ( format csv, header true, null 'null', delimiter ',');

So we have around 1.5 millions of rows with real data. As this table contains only text columns we will create a range partitioned table with data types that fit the data.

Range partitioning

Usually range partitioning is used to partition a table by days, months or years although you can partition by other data types as well. Time/date based range partitioning is probably the most common use case, so we will partition our traffic violation data by year, and then sub-partition a table partition by month.

Lets assume our most common queries against the data set restrict the data for one or more years. Using partitioned tables we can range partition on the “date_of_stop” column and break down the large data set into smaller chunks per year. How would that work in PostgreSQL?

Currently we have this min and max values for the “date_of_stop” column:

SELECT min(to_date(date_of_stop, 'MM/DD/YYYY')),
       max(to_date(date_of_stop, 'MM/DD/YYYY'))
  FROM traffic_violations;
    min     |    max
------------+------------
 2012-01-01 | 2025-01-16
(1 row)

If we want to partition that by year we need at least 14 partitions (2012 to 2025) and maybe already add a partition for 2026 so we are on the safe side when a more recent data set brings data for next year. The first step is to create a partitioned table:

create table traffic_violations_p
( seqid text
, date_of_stop date not null
, time_of_stop time
, agency text
, subagency text
, description text 
, location text
, latitude numeric
, longitude numeric
, accident text
, belts boolean
, personal_injury boolean 
, property_damage boolean
, fatal boolean
, commercial_license boolean
, hazmat boolean
, commercial_vehicle boolean
, alcohol boolean
, work_zone boolean
, state text
, vehicletype text 
, year smallint
, make text
, model text
, color text
, violation_type text 
, charge text
, article text
, contributed_to_accident boolean
, race text
, gender text
, driver_city text
, driver_state text
, dl_state text
, arrest_type text
, geolocation point)
partition by range (date_of_stop);

Now we have a partitioned table with no partitions. We need partitions and this is how you can add partitions to that partitioned table:

create table traffic_violations_p_2020
partition of traffic_violations_p
for values from ('2020-01-01') to ('2021-01-01');

create table traffic_violations_p_2021
partition of traffic_violations_p
for values from ('2021-01-01') to ('2022-01-01');

create table traffic_violations_p_2022
partition of traffic_violations_p
for values from ('2022-01-01') to ('2023-01-01');

create table traffic_violations_p_2023
partition of traffic_violations_p
for values from ('2023-01-01') to ('2024-01-01');

create table traffic_violations_p_2024
partition of traffic_violations_p
for values from ('2024-01-01') to ('2025-01-01');

create table traffic_violations_p_2025
partition of traffic_violations_p
for values from ('2025-01-01') to ('2026-01-01');

We now have a partition for each year from 2020 to 2025. As our partitioned table setup is now complete we can load the data:

insert into traffic_violations_p
select seqid
     , to_date(date_of_stop,'MM/DD/YYYY')
     , time_of_stop::time
     , agency
     , subagency
     , description
     , location
     , latitude::numeric
     , longitude::numeric
     , accident
     , belts::boolean
     , personal_injury::boolean
     , property_damage::boolean
     , fatal::boolean
     , commercial_license::boolean
     , hazmat::boolean
     , commercial_vehicle::boolean
     , alcohol::boolean
     , work_zone::boolean
     , state
     , vehicletype
     , case year
         when '' then null
         else year::smallint
       end
     , make
     , model
     , color
     , violation_type
     , charge
     , article
     , contributed_to_accident::boolean
     , race
     , gender
     , driver_city
     , driver_state
     , dl_state
     , arrest_type
     , geolocation::point
  from traffic_violations
  WHERE to_date(date_of_stop, 'MM/DD/YYYY') >= '2020-01-01'::date;

Partition pruning at planning time

Partition pruning is a query optimization technique that improves performance for accessing partitioned tables, which can selectively remove certain partitions under the partitioned table. In our case it removes partitions when it is known that the partition(s) can not contain data we are asking for.

Lets come back to our range partitioned table. We partitioned the table by the “time_of_stop” column:

select * from pg_partition_tree('traffic_violations_p');
           relid           |     parentrelid      | isleaf | level
---------------------------+----------------------+--------+-------
 traffic_violations_p      |                      | f      |     0
 traffic_violations_p_2020 | traffic_violations_p | t      |     1
 traffic_violations_p_2021 | traffic_violations_p | t      |     1
 traffic_violations_p_2022 | traffic_violations_p | t      |     1
 traffic_violations_p_2023 | traffic_violations_p | t      |     1
 traffic_violations_p_2024 | traffic_violations_p | t      |     1
 traffic_violations_p_2025 | traffic_violations_p | t      |     1
(7 rows)

Each partition contains data from one year. If we ask for data from 2023 PostgreSQL should only read that partition and just ignore the others.

EXPLAIN
select count(*) from traffic_violations_p
  where date_of_stop = date('02-FEB-2023');
                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4201.74..4201.75 rows=1 width=8)
   ->  Seq Scan on traffic_violations_p_2023 traffic_violations_p  (cost=0.00..4200.91 rows=329 width=0)
         Filter: (date_of_stop = '2023-02-02'::date)
(3 rows)

Indeed that is happening and only the traffic_violations_p_2023 is considered. All other partitions will just be ignored and that of course is a performance improvement. This is the simple case and it is partition pruning at planning time. Because we have a literal in the WHERE clause PostgreSQL can already decide at planning time which partitions it needs to read and which can be skipped.

Partition pruning at execution time

Consider this example:

select count(*) 
  from traffic_violations_p 
 where date_of_stop = (select to_date('01.01.2024','DD.MM.YYYY'));

In this case PostgreSQL can not know what will be the actual values coming back from the sub-select at planning time. If you are on PostgreSQL 10 there is not much you can do as partition pruning at execution time made it into PostgreSQL 11. Since PostgreSQL 15 is used here, it should work:

EXPLAIN         
select count(*) 
  from traffic_violations_p 
 where date_of_stop = (select to_date('01.01.2024','DD.MM.YYYY'));
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=19883.95..19883.96 rows=1 width=8)
   InitPlan 1 (returns $0)
     ->  Result  (cost=0.00..0.01 rows=1 width=4)
   ->  Gather  (cost=19883.72..19883.93 rows=2 width=8)
         Workers Planned: 2
         Params Evaluated: $0
         ->  Partial Aggregate  (cost=18883.72..18883.73 rows=1 width=8)
               ->  Parallel Append  (cost=0.00..18882.54 rows=471 width=0)
                     ->  Parallel Seq Scan on traffic_violations_p_2024 traffic_violations_p_5  (cost=0.00..4270.07 rows=91 width=0)
                           Filter: (date_of_stop = $0)
                     ->  Parallel Seq Scan on traffic_violations_p_2020 traffic_violations_p_1  (cost=0.00..3869.67 rows=82 width=0)
                           Filter: (date_of_stop = $0)
                     ->  Parallel Seq Scan on traffic_violations_p_2023 traffic_violations_p_4  (cost=0.00..3697.55 rows=79 width=0)
                           Filter: (date_of_stop = $0)
                     ->  Parallel Seq Scan on traffic_violations_p_2022 traffic_violations_p_3  (cost=0.00..3471.79 rows=74 width=0)
                           Filter: (date_of_stop = $0)
                     ->  Parallel Seq Scan on traffic_violations_p_2021 traffic_violations_p_2  (cost=0.00..3416.81 rows=73 width=0)
                           Filter: (date_of_stop = $0)
                     ->  Parallel Seq Scan on traffic_violations_p_2025 traffic_violations_p_6  (cost=0.00..154.30 rows=102 width=0)
                           Filter: (date_of_stop = $0)
(20 rows)

The execution plan shows that all partitions will be scanned and that is no surprise. When you want to see partition pruning at execution time you actually have to execute the statement, so EXPLAIN(analyze):

EXPLAIN (analyze)
select count(*) 
  from traffic_violations_p 
 where date_of_stop = (select to_date('01.01.2024','DD.MM.YYYY'));
                                                                                   QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=19883.95..19883.96 rows=1 width=8) (actual time=15.084..15.952 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Result  (cost=0.00..0.01 rows=1 width=4) (actual time=2.373..2.374 rows=1 loops=1)
   ->  Gather  (cost=19883.72..19883.93 rows=2 width=8) (actual time=14.928..15.945 rows=3 loops=1)
         Workers Planned: 2
         Params Evaluated: $0
         Workers Launched: 2
         ->  Partial Aggregate  (cost=18883.72..18883.73 rows=1 width=8) (actual time=7.632..7.634 rows=1 loops=3)
               ->  Parallel Append  (cost=0.00..18882.54 rows=471 width=0) (actual time=4.240..7.625 rows=77 loops=3)
                     ->  Parallel Seq Scan on traffic_violations_p_2024 traffic_violations_p_5  (cost=0.00..4270.07 rows=91 width=0) (actual time=4.227..7.604 rows=77 loops=3)
                           Filter: (date_of_stop = $0)
                           Rows Removed by Filter: 26680
                     ->  Parallel Seq Scan on traffic_violations_p_2020 traffic_violations_p_1  (cost=0.00..3869.67 rows=82 width=0) (never executed)
                           Filter: (date_of_stop = $0)
                     ->  Parallel Seq Scan on traffic_violations_p_2023 traffic_violations_p_4  (cost=0.00..3697.55 rows=79 width=0) (never executed)
                           Filter: (date_of_stop = $0)
                     ->  Parallel Seq Scan on traffic_violations_p_2022 traffic_violations_p_3  (cost=0.00..3471.79 rows=74 width=0) (never executed)
                           Filter: (date_of_stop = $0)
                     ->  Parallel Seq Scan on traffic_violations_p_2021 traffic_violations_p_2  (cost=0.00..3416.81 rows=73 width=0) (never executed)
                           Filter: (date_of_stop = $0)
                     ->  Parallel Seq Scan on traffic_violations_p_2025 traffic_violations_p_6  (cost=0.00..154.30 rows=102 width=0) (never executed)
                           Filter: (date_of_stop = $0)
 Planning Time: 0.268 ms
 Execution Time: 16.022 ms
(24 rows)