一月 23, 2025
摘要:在本教程中,我们将了解 PostgreSQL 中的分区剪枝。
目录
准备数据集
当您正在寻找开放的数据集时,“美国政府开放数据之家” 是一个很好的来源。它为研究、开发或出于任何原因都可以使用的数据,提供了公开的数据集。在本教程中,我们将使用交通违规数据集。它包含了大约 200 万行,是深入研究 PostgreSQL 分区的一个良好起点。
初始的导入会进入到一个仅包含文本列的表,我们稍后可以从那里加载到我们的分区。该表的结构为:
create table traffic_violations
( seqid text
, date_of_stop text
, time_of_stop text
, agency text
, subagency text
, description text
, location text
, latitude text
, longitude text
, accident text
, belts text
, personal_injury text
, property_damage text
, fatal text
, commercial_license text
, hazmat text
, commercial_vehicle text
, alcohol text
, work_zone text
, search_conducted text
, search_disposition text
, search_outcome text
, search_reason text
, search_reason_for_stop text
, search_type text
, search_arrest_reason text
, state text
, vehicletype text
, year text
, make text
, model text
, color text
, violation_type text
, charge text
, article text
, contributed_to_accident text
, race text
, gender text
, driver_city text
, driver_state text
, dl_state text
, arrest_type text
, geolocation text
);
要加载表,最简单的方法是使用 PostgreSQL 的 COPY 命令:
COPY traffic_violations FROM '/home/postgres/Traffic_Violations.csv'
WITH ( format csv, header true, null 'null', delimiter ',');
所以我们有了大约 150 万行真实数据。由于此表仅包含文本列,因此我们将创建一个范围分区表,采用适合数据的数据类型。
范围分区
通常,范围分区用于按天、月或年对表进行分区,尽管您也可以按其他数据类型进行分区。基于时间/日期的范围分区可能是最常见的用法,因此我们将按年份对交通违规数据进行分区,然后对某个表分区按月份进行子分区。
假设针对数据集的最常见查询,会将数据限制为一年或几年。使用分区表,我们可以在 “date_of_stop” 列上进行范围分区,并在每年将大型数据集分解为较小的块。这在 PostgreSQL 中要如何做到?
目前,我们有 “date_of_stop” 列的最小值和最大值:
SELECT min(to_date(date_of_stop, 'MM/DD/YYYY')),
max(to_date(date_of_stop, 'MM/DD/YYYY'))
FROM traffic_violations;
min | max
------------+------------
2012-01-01 | 2025-01-16
(1 row)
如果我们想按年份进行分区,我们至少需要 14 个分区(2012 年到 2025 年),并且最好也添加一个 2026 年的分区,因此当更新的数据集带来下一年的数据时,我们可以处于安全状态。第一步是创建一个分区表:
create table traffic_violations_p
( seqid text
, date_of_stop date not null
, time_of_stop time
, agency text
, subagency text
, description text
, location text
, latitude numeric
, longitude numeric
, accident text
, belts boolean
, personal_injury boolean
, property_damage boolean
, fatal boolean
, commercial_license boolean
, hazmat boolean
, commercial_vehicle boolean
, alcohol boolean
, work_zone boolean
, state text
, vehicletype text
, year smallint
, make text
, model text
, color text
, violation_type text
, charge text
, article text
, contributed_to_accident boolean
, race text
, gender text
, driver_city text
, driver_state text
, dl_state text
, arrest_type text
, geolocation point)
partition by range (date_of_stop);
现在我们有了一个没有分区的分区表。我们需要分区,下面是向该分区表添加分区的方法:
create table traffic_violations_p_2020
partition of traffic_violations_p
for values from ('2020-01-01') to ('2021-01-01');
create table traffic_violations_p_2021
partition of traffic_violations_p
for values from ('2021-01-01') to ('2022-01-01');
create table traffic_violations_p_2022
partition of traffic_violations_p
for values from ('2022-01-01') to ('2023-01-01');
create table traffic_violations_p_2023
partition of traffic_violations_p
for values from ('2023-01-01') to ('2024-01-01');
create table traffic_violations_p_2024
partition of traffic_violations_p
for values from ('2024-01-01') to ('2025-01-01');
create table traffic_violations_p_2025
partition of traffic_violations_p
for values from ('2025-01-01') to ('2026-01-01');
现在,从 2020 年到 2025 年的每一年,我们都有了一个分区。由于我们的分区表设置现在已经完成,我们可以加载数据了:
insert into traffic_violations_p
select seqid
, to_date(date_of_stop,'MM/DD/YYYY')
, time_of_stop::time
, agency
, subagency
, description
, location
, latitude::numeric
, longitude::numeric
, accident
, belts::boolean
, personal_injury::boolean
, property_damage::boolean
, fatal::boolean
, commercial_license::boolean
, hazmat::boolean
, commercial_vehicle::boolean
, alcohol::boolean
, work_zone::boolean
, state
, vehicletype
, case year
when '' then null
else year::smallint
end
, make
, model
, color
, violation_type
, charge
, article
, contributed_to_accident::boolean
, race
, gender
, driver_city
, driver_state
, dl_state
, arrest_type
, geolocation::point
from traffic_violations
WHERE to_date(date_of_stop, 'MM/DD/YYYY') >= '2020-01-01'::date;
规划时的分区剪枝
分区剪枝是一种提升分区表访问性能的查询优化技术,可以选择性地去除分区表下面的某些分区。在我们的例子中,当知道分区没有包含我们请求的数据时,会删除这些分区。
让我们回到我们的范围分区表。我们按 “time_of_stop” 列对表进行了分区:
select * from pg_partition_tree('traffic_violations_p');
relid | parentrelid | isleaf | level
---------------------------+----------------------+--------+-------
traffic_violations_p | | f | 0
traffic_violations_p_2020 | traffic_violations_p | t | 1
traffic_violations_p_2021 | traffic_violations_p | t | 1
traffic_violations_p_2022 | traffic_violations_p | t | 1
traffic_violations_p_2023 | traffic_violations_p | t | 1
traffic_violations_p_2024 | traffic_violations_p | t | 1
traffic_violations_p_2025 | traffic_violations_p | t | 1
(7 rows)
每个分区都包含一年的数据。如果我们请求 2023 年的数据,PostgreSQL 应该只读取该分区,而忽略其他分区。
EXPLAIN
select count(*) from traffic_violations_p
where date_of_stop = date('02-FEB-2023');
QUERY PLAN
---------------------------------------------------------------------------------------------------------
Aggregate (cost=4201.74..4201.75 rows=1 width=8)
-> Seq Scan on traffic_violations_p_2023 traffic_violations_p (cost=0.00..4200.91 rows=329 width=0)
Filter: (date_of_stop = '2023-02-02'::date)
(3 rows)
确实,分区剪枝发生了,而且只访问了 traffic_violations_p_2023。所有其他分区都会被忽略,这当然是一种性能改进。这是简单的情况,它在规划时进行了分区修剪。因为我们在 WHERE 子句中有一个常量,所以 PostgreSQL 已经可以在规划时决定,需要读取哪些分区以及可以跳过哪些分区。
执行时的分区剪枝
请考虑下面这个示例:
select count(*)
from traffic_violations_p
where date_of_stop = (select to_date('01.01.2024','DD.MM.YYYY'));
在这种情况下,PostgreSQL 在规划时无法知道从 SELECT 子查询返回的实际值是什么。如果您使用的是 PostgreSQL 10,则无能为力,因为执行时的分区剪枝是在 PostgreSQL 11 才支持的。由于这里使用了 PostgreSQL 15,它应该可以工作:
EXPLAIN
select count(*)
from traffic_violations_p
where date_of_stop = (select to_date('01.01.2024','DD.MM.YYYY'));
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=19883.95..19883.96 rows=1 width=8)
InitPlan 1 (returns $0)
-> Result (cost=0.00..0.01 rows=1 width=4)
-> Gather (cost=19883.72..19883.93 rows=2 width=8)
Workers Planned: 2
Params Evaluated: $0
-> Partial Aggregate (cost=18883.72..18883.73 rows=1 width=8)
-> Parallel Append (cost=0.00..18882.54 rows=471 width=0)
-> Parallel Seq Scan on traffic_violations_p_2024 traffic_violations_p_5 (cost=0.00..4270.07 rows=91 width=0)
Filter: (date_of_stop = $0)
-> Parallel Seq Scan on traffic_violations_p_2020 traffic_violations_p_1 (cost=0.00..3869.67 rows=82 width=0)
Filter: (date_of_stop = $0)
-> Parallel Seq Scan on traffic_violations_p_2023 traffic_violations_p_4 (cost=0.00..3697.55 rows=79 width=0)
Filter: (date_of_stop = $0)
-> Parallel Seq Scan on traffic_violations_p_2022 traffic_violations_p_3 (cost=0.00..3471.79 rows=74 width=0)
Filter: (date_of_stop = $0)
-> Parallel Seq Scan on traffic_violations_p_2021 traffic_violations_p_2 (cost=0.00..3416.81 rows=73 width=0)
Filter: (date_of_stop = $0)
-> Parallel Seq Scan on traffic_violations_p_2025 traffic_violations_p_6 (cost=0.00..154.30 rows=102 width=0)
Filter: (date_of_stop = $0)
(20 rows)
执行计划显示出将会扫描所有分区,这并不奇怪。当你想要看到执行时的分区剪枝时,你实际上必须执行该语句,所以要用 EXPLAIN(analyze)
:
EXPLAIN (analyze)
select count(*)
from traffic_violations_p
where date_of_stop = (select to_date('01.01.2024','DD.MM.YYYY'));
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=19883.95..19883.96 rows=1 width=8) (actual time=15.084..15.952 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Result (cost=0.00..0.01 rows=1 width=4) (actual time=2.373..2.374 rows=1 loops=1)
-> Gather (cost=19883.72..19883.93 rows=2 width=8) (actual time=14.928..15.945 rows=3 loops=1)
Workers Planned: 2
Params Evaluated: $0
Workers Launched: 2
-> Partial Aggregate (cost=18883.72..18883.73 rows=1 width=8) (actual time=7.632..7.634 rows=1 loops=3)
-> Parallel Append (cost=0.00..18882.54 rows=471 width=0) (actual time=4.240..7.625 rows=77 loops=3)
-> Parallel Seq Scan on traffic_violations_p_2024 traffic_violations_p_5 (cost=0.00..4270.07 rows=91 width=0) (actual time=4.227..7.604 rows=77 loops=3)
Filter: (date_of_stop = $0)
Rows Removed by Filter: 26680
-> Parallel Seq Scan on traffic_violations_p_2020 traffic_violations_p_1 (cost=0.00..3869.67 rows=82 width=0) (never executed)
Filter: (date_of_stop = $0)
-> Parallel Seq Scan on traffic_violations_p_2023 traffic_violations_p_4 (cost=0.00..3697.55 rows=79 width=0) (never executed)
Filter: (date_of_stop = $0)
-> Parallel Seq Scan on traffic_violations_p_2022 traffic_violations_p_3 (cost=0.00..3471.79 rows=74 width=0) (never executed)
Filter: (date_of_stop = $0)
-> Parallel Seq Scan on traffic_violations_p_2021 traffic_violations_p_2 (cost=0.00..3416.81 rows=73 width=0) (never executed)
Filter: (date_of_stop = $0)
-> Parallel Seq Scan on traffic_violations_p_2025 traffic_violations_p_6 (cost=0.00..154.30 rows=102 width=0) (never executed)
Filter: (date_of_stop = $0)
Planning Time: 0.268 ms
Execution Time: 16.022 ms
(24 rows)
这里出现了关键字 “(never executed)”,这是在执行时进行分区剪枝。