一月 20, 2025
摘要:在本教程中,您将学习如何在 PostgreSQL 中使用子分区。
目录
准备数据集
当您正在寻找开放的数据集时,“美国政府开放数据之家” 是一个很好的来源。它为研究、开发或出于任何原因都可以使用的数据,提供了公开的数据集。在本教程中,我们将使用交通违规数据集。它包含了大约 200 万行,是深入研究 PostgreSQL 分区的一个良好起点。
初始的导入会进入到一个仅包含文本列的表,我们稍后可以从那里加载到我们的分区。该表的结构为:
create table traffic_violations
( seqid text
, date_of_stop text
, time_of_stop text
, agency text
, subagency text
, description text
, location text
, latitude text
, longitude text
, accident text
, belts text
, personal_injury text
, property_damage text
, fatal text
, commercial_license text
, hazmat text
, commercial_vehicle text
, alcohol text
, work_zone text
, search_conducted text
, search_disposition text
, search_outcome text
, search_reason text
, search_reason_for_stop text
, search_type text
, search_arrest_reason text
, state text
, vehicletype text
, year text
, make text
, model text
, color text
, violation_type text
, charge text
, article text
, contributed_to_accident text
, race text
, gender text
, driver_city text
, driver_state text
, dl_state text
, arrest_type text
, geolocation text
);
要加载表,最简单的方法是使用 PostgreSQL 的 COPY 命令:
COPY traffic_violations FROM '/home/postgres/Traffic_Violations.csv'
WITH ( format csv, header true, null 'null', delimiter ',');
所以我们有了大约 200 万行真实数据。由于此表仅包含文本列,因此我们将创建一个范围分区表,采用适合数据的数据类型。
范围分区
通常,范围分区用于按天、月或年对表进行分区,尽管您也可以按其他数据类型进行分区。基于时间/日期的范围分区可能是最常见的用法,因此我们将按年份对交通违规数据进行分区,然后对某个表分区按月份进行子分区。
假设针对数据集的最常见查询,会将数据限制为一年或几年。使用分区表,我们可以在 “date_of_stop” 列上进行范围分区,并在每年将大型数据集分解为较小的块。这在 PostgreSQL 中要如何做到?
目前,我们有 “date_of_stop” 列的最小值和最大值:
SELECT min(to_date(date_of_stop, 'MM/DD/YYYY')),
max(to_date(date_of_stop, 'MM/DD/YYYY'))
FROM traffic_violations;
min | max
------------+------------
2012-01-01 | 2025-01-16
(1 row)
如果我们想按年份进行分区,我们至少需要 14 个分区(2012 年到 2025 年),并且最好也添加一个 2026 年的分区,因此当更新的数据集带来下一年的数据时,我们可以处于安全状态。第一步是创建一个分区表:
create table traffic_violations_p
( seqid text
, date_of_stop date not null
, time_of_stop time
, agency text
, subagency text
, description text
, location text
, latitude numeric
, longitude numeric
, accident text
, belts boolean
, personal_injury boolean
, property_damage boolean
, fatal boolean
, commercial_license boolean
, hazmat boolean
, commercial_vehicle boolean
, alcohol boolean
, work_zone boolean
, state text
, vehicletype text
, year smallint
, make text
, model text
, color text
, violation_type text
, charge text
, article text
, contributed_to_accident boolean
, race text
, gender text
, driver_city text
, driver_state text
, dl_state text
, arrest_type text
, geolocation point)
partition by range (date_of_stop);
现在我们有了一个没有分区的分区表。我们需要分区,下面是向该分区表添加分区的方法:
create table traffic_violations_p_2020
partition of traffic_violations_p
for values from ('2020-01-01') to ('2021-01-01');
create table traffic_violations_p_2021
partition of traffic_violations_p
for values from ('2021-01-01') to ('2022-01-01');
create table traffic_violations_p_2022
partition of traffic_violations_p
for values from ('2022-01-01') to ('2023-01-01');
create table traffic_violations_p_2023
partition of traffic_violations_p
for values from ('2023-01-01') to ('2024-01-01');
create table traffic_violations_p_2024
partition of traffic_violations_p
for values from ('2024-01-01') to ('2025-01-01');
create table traffic_violations_p_2025
partition of traffic_violations_p
for values from ('2025-01-01') to ('2026-01-01');
现在,从 2020 年到 2025 年的每一年,我们都有了一个分区。由于我们的分区表设置现在已经完成,我们可以加载数据了:
insert into traffic_violations_p
select seqid
, to_date(date_of_stop,'MM/DD/YYYY')
, time_of_stop::time
, agency
, subagency
, description
, location
, latitude::numeric
, longitude::numeric
, accident
, belts::boolean
, personal_injury::boolean
, property_damage::boolean
, fatal::boolean
, commercial_license::boolean
, hazmat::boolean
, commercial_vehicle::boolean
, alcohol::boolean
, work_zone::boolean
, state
, vehicletype
, case year
when '' then null
else year::smallint
end
, make
, model
, color
, violation_type
, charge
, article
, contributed_to_accident::boolean
, race
, gender
, driver_city
, driver_state
, dl_state
, arrest_type
, geolocation::point
from traffic_violations
WHERE to_date(date_of_stop, 'MM/DD/YYYY') >= '2020-01-01'::date;
演示
假设您预测 2026 年交通违规行为会呈指数级增长,因为道路上的汽车会越来越多,而当汽车越多时,交通违规行为就越多。为此,您不仅要按年进行分区,还要按月进行分区。换句话说:为 2026 年添加一个新分区,再对该分区按月进行子分区。首先,您需要一个 2026 年的新分区,该分区本身也要被分区:
create table traffic_violations_p_2026
partition of traffic_violations_p
for values from ('2026-01-01') to ('2027-01-01') partition by range(date_of_stop);
现在,我们可以将分区添加到刚刚创建的分区后的表分区中:
create table traffic_violations_p_2026_jan
partition of traffic_violations_p_2026
for values from ('2026-01-01') to ('2026-02-01');
create table traffic_violations_p_2026_feb
partition of traffic_violations_p_2026
for values from ('2026-02-01') to ('2026-03-01');
create table traffic_violations_p_2026_mar
partition of traffic_violations_p_2026
for values from ('2026-03-01') to ('2026-04-01');
create table traffic_violations_p_2026_apr
partition of traffic_violations_p_2026
for values from ('2026-04-01') to ('2026-05-01');
create table traffic_violations_p_2026_may
partition of traffic_violations_p_2026
for values from ('2026-05-01') to ('2026-06-01');
create table traffic_violations_p_2026_jun
partition of traffic_violations_p_2026
for values from ('2026-06-01') to ('2026-07-01');
create table traffic_violations_p_2026_jul
partition of traffic_violations_p_2026
for values from ('2026-07-01') to ('2026-08-01');
create table traffic_violations_p_2026_aug
partition of traffic_violations_p_2026
for values from ('2026-08-01') to ('2026-09-01');
create table traffic_violations_p_2026_sep
partition of traffic_violations_p_2026
for values from ('2026-09-01') to ('2026-10-01');
create table traffic_violations_p_2026_oct
partition of traffic_violations_p_2026
for values from ('2026-10-01') to ('2026-11-01');
create table traffic_violations_p_2026_nov
partition of traffic_violations_p_2026
for values from ('2026-11-01') to ('2026-12-01');
create table traffic_violations_p_2026_dec
partition of traffic_violations_p_2026
for values from ('2026-12-01') to ('2027-01-01');
当我们查看 psql 的输出,显示分区表时,没有太大变化,只有关键字 “PARTITIONED” 出现在我们 2026 年的新分区旁边:
postgres=# \d+ traffic_violations_p
Partitioned table "public.traffic_violations_p"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
-------------------------+------------------------+-----------+----------+---------+----------+-------------+--------------+-------------
seqid | text | | | | extended | | |
date_of_stop | date | | not null | | plain | | |
time_of_stop | time without time zone | | | | plain | | |
agency | text | | | | extended | | |
subagency | text | | | | extended | | |
description | text | | | | extended | | |
location | text | | | | extended | | |
latitude | numeric | | | | main | | |
longitude | numeric | | | | main | | |
accident | text | | | | extended | | |
belts | boolean | | | | plain | | |
personal_injury | boolean | | | | plain | | |
property_damage | boolean | | | | plain | | |
fatal | boolean | | | | plain | | |
commercial_license | boolean | | | | plain | | |
hazmat | boolean | | | | plain | | |
commercial_vehicle | boolean | | | | plain | | |
alcohol | boolean | | | | plain | | |
work_zone | boolean | | | | plain | | |
state | text | | | | extended | | |
vehicletype | text | | | | extended | | |
year | smallint | | | | plain | | |
make | text | | | | extended | | |
model | text | | | | extended | | |
color | text | | | | extended | | |
violation_type | text | | | | extended | | |
charge | text | | | | extended | | |
article | text | | | | extended | | |
contributed_to_accident | boolean | | | | plain | | |
race | text | | | | extended | | |
gender | text | | | | extended | | |
driver_city | text | | | | extended | | |
driver_state | text | | | | extended | | |
dl_state | text | | | | extended | | |
arrest_type | text | | | | extended | | |
geolocation | point | | | | plain | | |
Partition key: RANGE (date_of_stop)
Partitions: traffic_violations_p_2020 FOR VALUES FROM ('2020-01-01') TO ('2021-01-01'),
traffic_violations_p_2021 FOR VALUES FROM ('2021-01-01') TO ('2022-01-01'),
traffic_violations_p_2022 FOR VALUES FROM ('2022-01-01') TO ('2023-01-01'),
traffic_violations_p_2023 FOR VALUES FROM ('2023-01-01') TO ('2024-01-01'),
traffic_violations_p_2024 FOR VALUES FROM ('2024-01-01') TO ('2025-01-01'),
traffic_violations_p_2025 FOR VALUES FROM ('2025-01-01') TO ('2026-01-01'),
traffic_violations_p_2026 FOR VALUES FROM ('2026-01-01') TO ('2027-01-01'), PARTITIONED
在这里,PostgreSQL 12 中的新函数变得非常方便:
select * from pg_partition_tree('traffic_violations_p');
relid | parentrelid | isleaf | level
-------------------------------+---------------------------+--------+-------
traffic_violations_p | | f | 0
traffic_violations_p_2020 | traffic_violations_p | t | 1
traffic_violations_p_2021 | traffic_violations_p | t | 1
traffic_violations_p_2022 | traffic_violations_p | t | 1
traffic_violations_p_2023 | traffic_violations_p | t | 1
traffic_violations_p_2024 | traffic_violations_p | t | 1
traffic_violations_p_2025 | traffic_violations_p | t | 1
traffic_violations_p_2026 | traffic_violations_p | f | 1
traffic_violations_p_2026_jan | traffic_violations_p_2026 | t | 2
traffic_violations_p_2026_feb | traffic_violations_p_2026 | t | 2
traffic_violations_p_2026_mar | traffic_violations_p_2026 | t | 2
traffic_violations_p_2026_apr | traffic_violations_p_2026 | t | 2
traffic_violations_p_2026_may | traffic_violations_p_2026 | t | 2
traffic_violations_p_2026_jun | traffic_violations_p_2026 | t | 2
traffic_violations_p_2026_jul | traffic_violations_p_2026 | t | 2
traffic_violations_p_2026_aug | traffic_violations_p_2026 | t | 2
traffic_violations_p_2026_sep | traffic_violations_p_2026 | t | 2
traffic_violations_p_2026_oct | traffic_violations_p_2026 | t | 2
traffic_violations_p_2026_nov | traffic_violations_p_2026 | t | 2
traffic_violations_p_2026_dec | traffic_violations_p_2026 | t | 2
(20 rows)
为了验证数据是否正确路由到子分区,让我们添加一些 2026 年的数据:
insert into traffic_violations_p (date_of_stop)
select * from generate_series ( date('2026-01-01')
, date('2026-12-30')
, interval '1 day' );
如果我们正确地进行了分区,我们应该会在新分区中看到数据:
select count(*) from traffic_violations_p_2026_jan;
count
-------
31
(1 row)
select count(*) from traffic_violations_p_2026_dec;
count
-------
30
(1 row)
select count(*) from traffic_violations_p_2026_feb;
count
-------
28
(1 row)
当然,您可以更进一步,按天或周进一步对每月的分区进行子分区。您还可以进行列表分区,然后按范围对列表分区进行子分区。或者按范围分区,然后按列表进行子分区。这就是子分区了。