PostgreSQL 教程: 分区剪枝

一月 23, 2025

摘要:在本教程中,我们将了解 PostgreSQL 中的分区剪枝。

目录

准备数据集

当您正在寻找开放的数据集时,“美国政府开放数据之家” 是一个很好的来源。它为研究、开发或出于任何原因都可以使用的数据,提供了公开的数据集。在本教程中,我们将使用交通违规数据集。它包含了大约 200 万行,是深入研究 PostgreSQL 分区的一个良好起点。

初始的导入会进入到一个仅包含文本列的表,我们稍后可以从那里加载到我们的分区。该表的结构为:

create table traffic_violations
( seqid text
, date_of_stop text
, time_of_stop text
, agency text
, subagency text
, description text
, location text
, latitude text
, longitude text
, accident text
, belts text
, personal_injury text
, property_damage text
, fatal text
, commercial_license text
, hazmat text
, commercial_vehicle text
, alcohol text
, work_zone text
, search_conducted text
, search_disposition text
, search_outcome text
, search_reason text
, search_reason_for_stop text
, search_type text
, search_arrest_reason text
, state text
, vehicletype text
, year text
, make text
, model text
, color text
, violation_type text
, charge text
, article text
, contributed_to_accident text
, race text
, gender text
, driver_city text
, driver_state text
, dl_state text
, arrest_type text
, geolocation text
);

要加载表,最简单的方法是使用 PostgreSQL 的 COPY 命令:

COPY traffic_violations FROM '/home/postgres/Traffic_Violations.csv'
  WITH ( format csv, header true, null 'null', delimiter ',');

所以我们有了大约 150 万行真实数据。由于此表仅包含文本列,因此我们将创建一个范围分区表,采用适合数据的数据类型。

范围分区

通常,范围分区用于按天、月或年对表进行分区,尽管您也可以按其他数据类型进行分区。基于时间/日期的范围分区可能是最常见的用法,因此我们将按年份对交通违规数据进行分区,然后对某个表分区按月份进行子分区。

假设针对数据集的最常见查询,会将数据限制为一年或几年。使用分区表,我们可以在 “date_of_stop” 列上进行范围分区,并在每年将大型数据集分解为较小的块。这在 PostgreSQL 中要如何做到?

目前,我们有 “date_of_stop” 列的最小值和最大值:

SELECT min(to_date(date_of_stop, 'MM/DD/YYYY')),
       max(to_date(date_of_stop, 'MM/DD/YYYY'))
  FROM traffic_violations;
    min     |    max
------------+------------
 2012-01-01 | 2025-01-16
(1 row)

如果我们想按年份进行分区,我们至少需要 14 个分区(2012 年到 2025 年),并且最好也添加一个 2026 年的分区,因此当更新的数据集带来下一年的数据时,我们可以处于安全状态。第一步是创建一个分区表:

create table traffic_violations_p
( seqid text
, date_of_stop date not null
, time_of_stop time
, agency text
, subagency text
, description text 
, location text
, latitude numeric
, longitude numeric
, accident text
, belts boolean
, personal_injury boolean 
, property_damage boolean
, fatal boolean
, commercial_license boolean
, hazmat boolean
, commercial_vehicle boolean
, alcohol boolean
, work_zone boolean
, state text
, vehicletype text 
, year smallint
, make text
, model text
, color text
, violation_type text 
, charge text
, article text
, contributed_to_accident boolean
, race text
, gender text
, driver_city text
, driver_state text
, dl_state text
, arrest_type text
, geolocation point)
partition by range (date_of_stop);

现在我们有了一个没有分区的分区表。我们需要分区,下面是向该分区表添加分区的方法:

create table traffic_violations_p_2020
partition of traffic_violations_p
for values from ('2020-01-01') to ('2021-01-01');

create table traffic_violations_p_2021
partition of traffic_violations_p
for values from ('2021-01-01') to ('2022-01-01');

create table traffic_violations_p_2022
partition of traffic_violations_p
for values from ('2022-01-01') to ('2023-01-01');

create table traffic_violations_p_2023
partition of traffic_violations_p
for values from ('2023-01-01') to ('2024-01-01');

create table traffic_violations_p_2024
partition of traffic_violations_p
for values from ('2024-01-01') to ('2025-01-01');

create table traffic_violations_p_2025
partition of traffic_violations_p
for values from ('2025-01-01') to ('2026-01-01');

现在,从 2020 年到 2025 年的每一年,我们都有了一个分区。由于我们的分区表设置现在已经完成,我们可以加载数据了:

insert into traffic_violations_p
select seqid
     , to_date(date_of_stop,'MM/DD/YYYY')
     , time_of_stop::time
     , agency
     , subagency
     , description
     , location
     , latitude::numeric
     , longitude::numeric
     , accident
     , belts::boolean
     , personal_injury::boolean
     , property_damage::boolean
     , fatal::boolean
     , commercial_license::boolean
     , hazmat::boolean
     , commercial_vehicle::boolean
     , alcohol::boolean
     , work_zone::boolean
     , state
     , vehicletype
     , case year
         when '' then null
         else year::smallint
       end
     , make
     , model
     , color
     , violation_type
     , charge
     , article
     , contributed_to_accident::boolean
     , race
     , gender
     , driver_city
     , driver_state
     , dl_state
     , arrest_type
     , geolocation::point
  from traffic_violations
  WHERE to_date(date_of_stop, 'MM/DD/YYYY') >= '2020-01-01'::date;

规划时的分区剪枝

分区剪枝是一种提升分区表访问性能的查询优化技术,可以选择性地去除分区表下面的某些分区。在我们的例子中,当知道分区没有包含我们请求的数据时,会删除这些分区。

让我们回到我们的范围分区表。我们按 “time_of_stop” 列对表进行了分区:

select * from pg_partition_tree('traffic_violations_p');
           relid           |     parentrelid      | isleaf | level
---------------------------+----------------------+--------+-------
 traffic_violations_p      |                      | f      |     0
 traffic_violations_p_2020 | traffic_violations_p | t      |     1
 traffic_violations_p_2021 | traffic_violations_p | t      |     1
 traffic_violations_p_2022 | traffic_violations_p | t      |     1
 traffic_violations_p_2023 | traffic_violations_p | t      |     1
 traffic_violations_p_2024 | traffic_violations_p | t      |     1
 traffic_violations_p_2025 | traffic_violations_p | t      |     1
(7 rows)

每个分区都包含一年的数据。如果我们请求 2023 年的数据,PostgreSQL 应该只读取该分区,而忽略其他分区。

EXPLAIN
select count(*) from traffic_violations_p
  where date_of_stop = date('02-FEB-2023');
                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4201.74..4201.75 rows=1 width=8)
   ->  Seq Scan on traffic_violations_p_2023 traffic_violations_p  (cost=0.00..4200.91 rows=329 width=0)
         Filter: (date_of_stop = '2023-02-02'::date)
(3 rows)

确实,分区剪枝发生了,而且只访问了 traffic_violations_p_2023。所有其他分区都会被忽略,这当然是一种性能改进。这是简单的情况,它在规划时进行了分区修剪。因为我们在 WHERE 子句中有一个常量,所以 PostgreSQL 已经可以在规划时决定,需要读取哪些分区以及可以跳过哪些分区。

执行时的分区剪枝

请考虑下面这个示例:

select count(*) 
  from traffic_violations_p 
 where date_of_stop = (select to_date('01.01.2024','DD.MM.YYYY'));

在这种情况下,PostgreSQL 在规划时无法知道从 SELECT 子查询返回的实际值是什么。如果您使用的是 PostgreSQL 10,则无能为力,因为执行时的分区剪枝是在 PostgreSQL 11 才支持的。由于这里使用了 PostgreSQL 15,它应该可以工作:

EXPLAIN         
select count(*) 
  from traffic_violations_p 
 where date_of_stop = (select to_date('01.01.2024','DD.MM.YYYY'));
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=19883.95..19883.96 rows=1 width=8)
   InitPlan 1 (returns $0)
     ->  Result  (cost=0.00..0.01 rows=1 width=4)
   ->  Gather  (cost=19883.72..19883.93 rows=2 width=8)
         Workers Planned: 2
         Params Evaluated: $0
         ->  Partial Aggregate  (cost=18883.72..18883.73 rows=1 width=8)
               ->  Parallel Append  (cost=0.00..18882.54 rows=471 width=0)
                     ->  Parallel Seq Scan on traffic_violations_p_2024 traffic_violations_p_5  (cost=0.00..4270.07 rows=91 width=0)
                           Filter: (date_of_stop = $0)
                     ->  Parallel Seq Scan on traffic_violations_p_2020 traffic_violations_p_1  (cost=0.00..3869.67 rows=82 width=0)
                           Filter: (date_of_stop = $0)
                     ->  Parallel Seq Scan on traffic_violations_p_2023 traffic_violations_p_4  (cost=0.00..3697.55 rows=79 width=0)
                           Filter: (date_of_stop = $0)
                     ->  Parallel Seq Scan on traffic_violations_p_2022 traffic_violations_p_3  (cost=0.00..3471.79 rows=74 width=0)
                           Filter: (date_of_stop = $0)
                     ->  Parallel Seq Scan on traffic_violations_p_2021 traffic_violations_p_2  (cost=0.00..3416.81 rows=73 width=0)
                           Filter: (date_of_stop = $0)
                     ->  Parallel Seq Scan on traffic_violations_p_2025 traffic_violations_p_6  (cost=0.00..154.30 rows=102 width=0)
                           Filter: (date_of_stop = $0)
(20 rows)

执行计划显示出将会扫描所有分区,这并不奇怪。当你想要看到执行时的分区剪枝时,你实际上必须执行该语句,所以要用 EXPLAIN(analyze)

EXPLAIN (analyze)
select count(*) 
  from traffic_violations_p 
 where date_of_stop = (select to_date('01.01.2024','DD.MM.YYYY'));
                                                                                   QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=19883.95..19883.96 rows=1 width=8) (actual time=15.084..15.952 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Result  (cost=0.00..0.01 rows=1 width=4) (actual time=2.373..2.374 rows=1 loops=1)
   ->  Gather  (cost=19883.72..19883.93 rows=2 width=8) (actual time=14.928..15.945 rows=3 loops=1)
         Workers Planned: 2
         Params Evaluated: $0
         Workers Launched: 2
         ->  Partial Aggregate  (cost=18883.72..18883.73 rows=1 width=8) (actual time=7.632..7.634 rows=1 loops=3)
               ->  Parallel Append  (cost=0.00..18882.54 rows=471 width=0) (actual time=4.240..7.625 rows=77 loops=3)
                     ->  Parallel Seq Scan on traffic_violations_p_2024 traffic_violations_p_5  (cost=0.00..4270.07 rows=91 width=0) (actual time=4.227..7.604 rows=77 loops=3)
                           Filter: (date_of_stop = $0)
                           Rows Removed by Filter: 26680
                     ->  Parallel Seq Scan on traffic_violations_p_2020 traffic_violations_p_1  (cost=0.00..3869.67 rows=82 width=0) (never executed)
                           Filter: (date_of_stop = $0)
                     ->  Parallel Seq Scan on traffic_violations_p_2023 traffic_violations_p_4  (cost=0.00..3697.55 rows=79 width=0) (never executed)
                           Filter: (date_of_stop = $0)
                     ->  Parallel Seq Scan on traffic_violations_p_2022 traffic_violations_p_3  (cost=0.00..3471.79 rows=74 width=0) (never executed)
                           Filter: (date_of_stop = $0)
                     ->  Parallel Seq Scan on traffic_violations_p_2021 traffic_violations_p_2  (cost=0.00..3416.81 rows=73 width=0) (never executed)
                           Filter: (date_of_stop = $0)
                     ->  Parallel Seq Scan on traffic_violations_p_2025 traffic_violations_p_6  (cost=0.00..154.30 rows=102 width=0) (never executed)
                           Filter: (date_of_stop = $0)
 Planning Time: 0.268 ms
 Execution Time: 16.022 ms
(24 rows)

这里出现了关键字 “(never executed)”,这是在执行时进行分区剪枝。

了解更多

PostgreSQL 优化