PostgreSQL 教程: 子分区

一月 20, 2025

摘要:在本教程中,您将学习如何在 PostgreSQL 中使用子分区。

目录

准备数据集

当您正在寻找开放的数据集时,“美国政府开放数据之家” 是一个很好的来源。它为研究、开发或出于任何原因都可以使用的数据,提供了公开的数据集。在本教程中,我们将使用交通违规数据集。它包含了大约 200 万行,是深入研究 PostgreSQL 分区的一个良好起点。

初始的导入会进入到一个仅包含文本列的表,我们稍后可以从那里加载到我们的分区。该表的结构为:

create table traffic_violations
( seqid text
, date_of_stop text
, time_of_stop text
, agency text
, subagency text
, description text
, location text
, latitude text
, longitude text
, accident text
, belts text
, personal_injury text
, property_damage text
, fatal text
, commercial_license text
, hazmat text
, commercial_vehicle text
, alcohol text
, work_zone text
, search_conducted text
, search_disposition text
, search_outcome text
, search_reason text
, search_reason_for_stop text
, search_type text
, search_arrest_reason text
, state text
, vehicletype text
, year text
, make text
, model text
, color text
, violation_type text
, charge text
, article text
, contributed_to_accident text
, race text
, gender text
, driver_city text
, driver_state text
, dl_state text
, arrest_type text
, geolocation text
);

要加载表,最简单的方法是使用 PostgreSQL 的 COPY 命令:

COPY traffic_violations FROM '/home/postgres/Traffic_Violations.csv'
  WITH ( format csv, header true, null 'null', delimiter ',');

所以我们有了大约 200 万行真实数据。由于此表仅包含文本列,因此我们将创建一个范围分区表,采用适合数据的数据类型。

范围分区

通常,范围分区用于按天、月或年对表进行分区,尽管您也可以按其他数据类型进行分区。基于时间/日期的范围分区可能是最常见的用法,因此我们将按年份对交通违规数据进行分区,然后对某个表分区按月份进行子分区。

假设针对数据集的最常见查询,会将数据限制为一年或几年。使用分区表,我们可以在 “date_of_stop” 列上进行范围分区,并在每年将大型数据集分解为较小的块。这在 PostgreSQL 中要如何做到?

目前,我们有 “date_of_stop” 列的最小值和最大值:

SELECT min(to_date(date_of_stop, 'MM/DD/YYYY')),
       max(to_date(date_of_stop, 'MM/DD/YYYY'))
  FROM traffic_violations;
    min     |    max
------------+------------
 2012-01-01 | 2025-01-16
(1 row)

如果我们想按年份进行分区,我们至少需要 14 个分区(2012 年到 2025 年),并且最好也添加一个 2026 年的分区,因此当更新的数据集带来下一年的数据时,我们可以处于安全状态。第一步是创建一个分区表:

create table traffic_violations_p
( seqid text
, date_of_stop date not null
, time_of_stop time
, agency text
, subagency text
, description text 
, location text
, latitude numeric
, longitude numeric
, accident text
, belts boolean
, personal_injury boolean 
, property_damage boolean
, fatal boolean
, commercial_license boolean
, hazmat boolean
, commercial_vehicle boolean
, alcohol boolean
, work_zone boolean
, state text
, vehicletype text 
, year smallint
, make text
, model text
, color text
, violation_type text 
, charge text
, article text
, contributed_to_accident boolean
, race text
, gender text
, driver_city text
, driver_state text
, dl_state text
, arrest_type text
, geolocation point)
partition by range (date_of_stop);

现在我们有了一个没有分区的分区表。我们需要分区,下面是向该分区表添加分区的方法:

create table traffic_violations_p_2020
partition of traffic_violations_p
for values from ('2020-01-01') to ('2021-01-01');

create table traffic_violations_p_2021
partition of traffic_violations_p
for values from ('2021-01-01') to ('2022-01-01');

create table traffic_violations_p_2022
partition of traffic_violations_p
for values from ('2022-01-01') to ('2023-01-01');

create table traffic_violations_p_2023
partition of traffic_violations_p
for values from ('2023-01-01') to ('2024-01-01');

create table traffic_violations_p_2024
partition of traffic_violations_p
for values from ('2024-01-01') to ('2025-01-01');

create table traffic_violations_p_2025
partition of traffic_violations_p
for values from ('2025-01-01') to ('2026-01-01');

现在,从 2020 年到 2025 年的每一年,我们都有了一个分区。由于我们的分区表设置现在已经完成,我们可以加载数据了:

insert into traffic_violations_p
select seqid
     , to_date(date_of_stop,'MM/DD/YYYY')
     , time_of_stop::time
     , agency
     , subagency
     , description
     , location
     , latitude::numeric
     , longitude::numeric
     , accident
     , belts::boolean
     , personal_injury::boolean
     , property_damage::boolean
     , fatal::boolean
     , commercial_license::boolean
     , hazmat::boolean
     , commercial_vehicle::boolean
     , alcohol::boolean
     , work_zone::boolean
     , state
     , vehicletype
     , case year
         when '' then null
         else year::smallint
       end
     , make
     , model
     , color
     , violation_type
     , charge
     , article
     , contributed_to_accident::boolean
     , race
     , gender
     , driver_city
     , driver_state
     , dl_state
     , arrest_type
     , geolocation::point
  from traffic_violations
  WHERE to_date(date_of_stop, 'MM/DD/YYYY') >= '2020-01-01'::date;

演示

假设您预测 2026 年交通违规行为会呈指数级增长,因为道路上的汽车会越来越多,而当汽车越多时,交通违规行为就越多。为此,您不仅要按年进行分区,还要按月进行分区。换句话说:为 2026 年添加一个新分区,再对该分区按月进行子分区。首先,您需要一个 2026 年的新分区,该分区本身也要被分区:

create table traffic_violations_p_2026
partition of traffic_violations_p
for values from ('2026-01-01') to ('2027-01-01') partition by range(date_of_stop);

现在,我们可以将分区添加到刚刚创建的分区后的表分区中:

create table traffic_violations_p_2026_jan
partition of traffic_violations_p_2026
for values from ('2026-01-01') to ('2026-02-01');
 
create table traffic_violations_p_2026_feb
partition of traffic_violations_p_2026
for values from ('2026-02-01') to ('2026-03-01');
 
create table traffic_violations_p_2026_mar
partition of traffic_violations_p_2026
for values from ('2026-03-01') to ('2026-04-01');
 
create table traffic_violations_p_2026_apr
partition of traffic_violations_p_2026
for values from ('2026-04-01') to ('2026-05-01');
 
create table traffic_violations_p_2026_may
partition of traffic_violations_p_2026
for values from ('2026-05-01') to ('2026-06-01');
 
create table traffic_violations_p_2026_jun
partition of traffic_violations_p_2026
for values from ('2026-06-01') to ('2026-07-01');
 
create table traffic_violations_p_2026_jul
partition of traffic_violations_p_2026
for values from ('2026-07-01') to ('2026-08-01');
 
create table traffic_violations_p_2026_aug
partition of traffic_violations_p_2026
for values from ('2026-08-01') to ('2026-09-01');
 
create table traffic_violations_p_2026_sep
partition of traffic_violations_p_2026
for values from ('2026-09-01') to ('2026-10-01');
 
create table traffic_violations_p_2026_oct
partition of traffic_violations_p_2026
for values from ('2026-10-01') to ('2026-11-01');
 
create table traffic_violations_p_2026_nov
partition of traffic_violations_p_2026
for values from ('2026-11-01') to ('2026-12-01');
 
create table traffic_violations_p_2026_dec
partition of traffic_violations_p_2026
for values from ('2026-12-01') to ('2027-01-01');

当我们查看 psql 的输出,显示分区表时,没有太大变化,只有关键字 “PARTITIONED” 出现在我们 2026 年的新分区旁边:

postgres=# \d+ traffic_violations_p
                                             Partitioned table "public.traffic_violations_p"
         Column          |          Type          | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
-------------------------+------------------------+-----------+----------+---------+----------+-------------+--------------+-------------
 seqid                   | text                   |           |          |         | extended |             |              |
 date_of_stop            | date                   |           | not null |         | plain    |             |              |
 time_of_stop            | time without time zone |           |          |         | plain    |             |              |
 agency                  | text                   |           |          |         | extended |             |              |
 subagency               | text                   |           |          |         | extended |             |              |
 description             | text                   |           |          |         | extended |             |              |
 location                | text                   |           |          |         | extended |             |              |
 latitude                | numeric                |           |          |         | main     |             |              |
 longitude               | numeric                |           |          |         | main     |             |              |
 accident                | text                   |           |          |         | extended |             |              |
 belts                   | boolean                |           |          |         | plain    |             |              |
 personal_injury         | boolean                |           |          |         | plain    |             |              |
 property_damage         | boolean                |           |          |         | plain    |             |              |
 fatal                   | boolean                |           |          |         | plain    |             |              |
 commercial_license      | boolean                |           |          |         | plain    |             |              |
 hazmat                  | boolean                |           |          |         | plain    |             |              |
 commercial_vehicle      | boolean                |           |          |         | plain    |             |              |
 alcohol                 | boolean                |           |          |         | plain    |             |              |
 work_zone               | boolean                |           |          |         | plain    |             |              |
 state                   | text                   |           |          |         | extended |             |              |
 vehicletype             | text                   |           |          |         | extended |             |              |
 year                    | smallint               |           |          |         | plain    |             |              |
 make                    | text                   |           |          |         | extended |             |              |
 model                   | text                   |           |          |         | extended |             |              |
 color                   | text                   |           |          |         | extended |             |              |
 violation_type          | text                   |           |          |         | extended |             |              |
 charge                  | text                   |           |          |         | extended |             |              |
 article                 | text                   |           |          |         | extended |             |              |
 contributed_to_accident | boolean                |           |          |         | plain    |             |              |
 race                    | text                   |           |          |         | extended |             |              |
 gender                  | text                   |           |          |         | extended |             |              |
 driver_city             | text                   |           |          |         | extended |             |              |
 driver_state            | text                   |           |          |         | extended |             |              |
 dl_state                | text                   |           |          |         | extended |             |              |
 arrest_type             | text                   |           |          |         | extended |             |              |
 geolocation             | point                  |           |          |         | plain    |             |              |
Partition key: RANGE (date_of_stop)
Partitions: traffic_violations_p_2020 FOR VALUES FROM ('2020-01-01') TO ('2021-01-01'),
            traffic_violations_p_2021 FOR VALUES FROM ('2021-01-01') TO ('2022-01-01'),
            traffic_violations_p_2022 FOR VALUES FROM ('2022-01-01') TO ('2023-01-01'),
            traffic_violations_p_2023 FOR VALUES FROM ('2023-01-01') TO ('2024-01-01'),
            traffic_violations_p_2024 FOR VALUES FROM ('2024-01-01') TO ('2025-01-01'),
            traffic_violations_p_2025 FOR VALUES FROM ('2025-01-01') TO ('2026-01-01'),
            traffic_violations_p_2026 FOR VALUES FROM ('2026-01-01') TO ('2027-01-01'), PARTITIONED

在这里,PostgreSQL 12 中的新函数变得非常方便:

select * from pg_partition_tree('traffic_violations_p');
             relid             |        parentrelid        | isleaf | level
-------------------------------+---------------------------+--------+-------
 traffic_violations_p          |                           | f      |     0
 traffic_violations_p_2020     | traffic_violations_p      | t      |     1
 traffic_violations_p_2021     | traffic_violations_p      | t      |     1
 traffic_violations_p_2022     | traffic_violations_p      | t      |     1
 traffic_violations_p_2023     | traffic_violations_p      | t      |     1
 traffic_violations_p_2024     | traffic_violations_p      | t      |     1
 traffic_violations_p_2025     | traffic_violations_p      | t      |     1
 traffic_violations_p_2026     | traffic_violations_p      | f      |     1
 traffic_violations_p_2026_jan | traffic_violations_p_2026 | t      |     2
 traffic_violations_p_2026_feb | traffic_violations_p_2026 | t      |     2
 traffic_violations_p_2026_mar | traffic_violations_p_2026 | t      |     2
 traffic_violations_p_2026_apr | traffic_violations_p_2026 | t      |     2
 traffic_violations_p_2026_may | traffic_violations_p_2026 | t      |     2
 traffic_violations_p_2026_jun | traffic_violations_p_2026 | t      |     2
 traffic_violations_p_2026_jul | traffic_violations_p_2026 | t      |     2
 traffic_violations_p_2026_aug | traffic_violations_p_2026 | t      |     2
 traffic_violations_p_2026_sep | traffic_violations_p_2026 | t      |     2
 traffic_violations_p_2026_oct | traffic_violations_p_2026 | t      |     2
 traffic_violations_p_2026_nov | traffic_violations_p_2026 | t      |     2
 traffic_violations_p_2026_dec | traffic_violations_p_2026 | t      |     2
(20 rows)

为了验证数据是否正确路由到子分区,让我们添加一些 2026 年的数据:

insert into traffic_violations_p (date_of_stop)
       select * from generate_series ( date('2026-01-01')
                                     , date('2026-12-30')
                                     , interval '1 day' );

如果我们正确地进行了分区,我们应该会在新分区中看到数据:

select count(*) from traffic_violations_p_2026_jan;
 count
-------
    31
(1 row)
 
select count(*) from traffic_violations_p_2026_dec;
 count
-------
    30
(1 row)
 
select count(*) from traffic_violations_p_2026_feb;
 count
-------
    28
(1 row)

当然,您可以更进一步,按天或周进一步对每月的分区进行子分区。您还可以进行列表分区,然后按范围对列表分区进行子分区。或者按范围分区,然后按列表进行子分区。这就是子分区了。

了解更多

PostgreSQL 教程