PostgreSQL 教程: 处理事务 ID 回卷的故障

五月 13, 2024

摘要:在本教程中,您将学习如何处理 PostgreSQL 中事务 ID 回卷的故障。

目录

死行和膨胀

受 PostgreSQL 中的 UPDATE 或 DELETE 等操作影响的任何行的旧版本,都会在内部进行标记,以让查询不会返回该行。因此,这些行被称为“死行”,死行占用的空间称为“膨胀”。经常执行大量 UPDATE/DELETE 操作的数据库程序,很快就会变得非常大,并需要进行定期的维护工作。

清理的类型

通常,VACUUM 是一种定期性的数据库清理动作,会在 PostgreSQL 中执行。它有两种不同的形式;其中一种是普通的 VACUUM(没有 FULL),它只是为其他新行腾出空间。普通 VACUUM 不会获取排他锁,使其能够与表的常规读写同时运行。这种类型的 VACUUM 操作,将让额外的空间在同一表内可重复使用,而不是将其返回给操作系统。使用 VACUUM FULL 形式时,操作系统会接收多余的空间,而且还会将表的全部内容重写到一个没有额外空间的新磁盘文件上。该类型的清理必须在每个表上使用 ACCESS EXCLUSIVE 锁进行处理,这使得它会慢一些。

运作方式 VACUUM VACUUM FULL
方法 释放死行以供重用 重写没有死行的表
Access Exclusive Lock
空闲空间可用于 在同一个表内 操作系统

事务回卷

现在,让我们来了解一下事务和事务 ID 回卷。在 PostgreSQL 的数据库中,更新的每一行都会从事务控制系统获得一个事务 ID。这些 ID 规定了会向其他活动事务显示哪些行。

事务 ID 回卷的问题在多版本并发控制(MVCC)。根据 ID 区分两个事务的能力对于 MVCC 至关重要。PostgreSQL 中的事务 ID 就是 32 位的整数。因此,只有大约四十亿(2^32)个可能的事务 ID。

40 亿可能看起来相当多且很难用光,但是不得不承认,对于写入密集型工作负载非常高的数据库系统,40 亿个事务可以在几周内用完。

因此,如果存在 2,000,000,000 个未清理的事务,PostgreSQL 将阻止进行写入操作,并将数据库切换到只读模式。

以下任何一个或多个条件,都可能导致事务 ID 回卷的问题。

  1. Autovacuum 设置为关闭。
  2. 长时间运行的事务。
  3. 繁重的 DML 操作,迫使取消自动清理工作进程。
  4. 许多会话或连接长时间保持锁定。

让我们在测试机器上来模拟下事务 ID 回卷的问题。在 PostgreSQL 日志文件中会出现以下行,这意味着 PostgreSQL 已停止接受 DML 语句,并已切换到了只读模式。

2022-09-16 08:43:38.265 [74098] WARNING: database "postgres" must be vacuumed within 3000000 transactions (10184)
2022-09-16 08:43:38.265 HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.
    You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
2022-09-16 08:43:48.640 [5764] ERROR: database is not accepting commands to avoid wraparound data loss in database "postgres" (10182)
2022-09-16 08:43:48.640 [5764] HINT: Stop the postmaster and vacuum that database in single-user mode.
    You might also need to commit or roll back old prepared transactions, or drop stale replication slots.

然后,让我们建立与问题数据库 “postgres” 的连接,并运行一些 SELECT 和 CREATE 命令来检查下。

postgres=# select datname from pg_database;
 datname
----------
 postgres
 templatel
 template0
(3 rows)

postgres=# create table test (id numeric);
ERROR: database is not accepting commands to avoid wraparound data loss in database "postgres" (10182)
HINT: Stop the postmaster and vacuum that database in single-user mode.
You might also need to commit or roll back old prepared transactions, or drop stale replication slots.

postgres=# select usename from pg_user;
 usename
----------
 postgres
(1 row)

由于事务 ID 回卷,数据库 “postgres” 目前已确认处于只读模式。

修复事务回卷

现在,让我们来看看解决此问题的方法。我们必须首先让数据库停机,以单用户模式连接,然后在每个数据库上执行 VACUUM FULL。但是,我们必须首先完成下面列出的步骤。

  1. 首先,我们需要使用下面的查询,找到发生事务回卷的表的名称。

    SELECT c.relnamespace::regnamespace as schema_name, c.relname as table_name,
           greatest(age(c.relfrozenxid), age(t.relfrozenxid)) as age,
           2^31 - 1000000 - greatest(age(c.relfrozenxid), age(t.relfrozenxid)) as remaining
      FROM pg_class c LEFT JOIN pg_class t ON c.reltoastrelid = t.oid
      WHERE c.relkind IN ('r', 'm') ORDER BY 4;
    
  2. 如果您在上述查询的输出中发现属于 “pg_temp_” 模式的任何表,则删除表是解决问题的唯一选择,因为 PostgreSQL 不允许您对其他会话中生成的临时表进行 VACUUM。

  3. 另一种情况是,在当前最旧的活动事务开始后建立的任何事务 ID,PostgreSQL 是无法冻结的。这是由 MVCC 的运作方式决定的。事务有时可能会变得太旧,以至于 VACUUM 无法清理它们,来解除全部 20 亿个事务 ID 回卷的限制,这会导致系统不再接受新的 DML。

    2022-09-16 05:49:07.514 [64931] WARNING: oldest xmin is far in the past (11833)
    2022-09-16 05:49:07.514 [64931] HINT: Close open transactions soon to avoid wraparound problems.
    You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
    
  4. 您可能已经注意到,日志输出中的提示都告诉您提交或回滚以前创建的预备事务。因此,我们必须使用下面的查询,来检查孤立或过时的预备事务。

    SELECT age(transaction), * FROM pg_prepared_xacts;
    

    使用上述查询输出的 gid,回滚您找到的任何预备事务。回滚查询如下。

    ROLLBACK PREPARED gid;
    
  5. 完成上述过程后,我们必须使数据库停机,以单用户模式连接,并对每个数据库执行 VACUUM FULL 操作。

    $ pg_ctl stop
    waiting for server to shut down...... done
    server stopped
    
  6. 现在,您需要以单用户模式启动数据库,如下所示。

    $ postgres --single postgres
    2022-09-16 12:06:45.628 [84585] WARNING: database with OID 14728 must be vacuumed within 3000000 transactions (10185)
    2022-09-16 12:06:45.628 [84585] HINT: to avoid a database shutdown, execute a database-wide VACUUM in that database.
        You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
    
    PostgreSQL stand-alone backend 14.0
    backend>
    
  7. 成功进入单用户模式后,我们需要对每个数据库执行 VACUUM FULL。

    backend> VACUUM FULL;
    2022-09-16 12:09:02.711 [84585] WARNING: database "postgres" must be vacuumed within 3000000 transactions (10184)
    2022-09-16 12:09:02.711 [84585] HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.
        You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
    2022-09-16 12:09:03.860 [84585] WARNING: database "postgres" must be vacuumed within 2999999 transactions (10184)
    2022-09-16 12:09:03.860 [84585] HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.
        You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
    
    2022-09-16 12:09:08.052 [84585] WARNING: database "postgres" must be vacuumed within 3000000 transactions (10184)
    2022-09-16 12:09:08.052 [84585] HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.
        You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
    2022-09-16 12:09:08.058 [84585] WARNING: database "postgres" must be vacuumed within 3000000 transactions (10184)
    2022-09-16 12:09:08.058 [84585] HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.
        You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
    2022-09-16 12:09:08.065 [84585] WARNING: database "template1" must be vacuumed within 3000000 transactions (10184)
    2022-09-16 12:09:08.065 [84585] HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.
        You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
    
  8. 现在,您可以看到 PostgreSQL 也要求对 template1 数据库执行相同的 VACUUM FULL。让我们对 template1 重复同样的操作。

    $ postgres --single template1
    2022-09-16 12:14:14.897 [84775] WARNING: database with OID 1 must be vacuumed within 2999860 transactions (10185)
    2022-09-16 12:14:14.897 [84775] HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.
        You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
    
    PostgreSQL stand-alone backend 14.0
    backend> VACUUM FULL;
    2022-09-16 12:14:24.965 [84775] WARNING: database "template1" must be vacuumed within 2999860 transactions (10184)
    2022-09-16 12:14:24.965 [84775] HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.
        You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
    2022-09-16 12:14:25.026 [84775] WARNING: database "template1" must be vacuumed within 2999859 transactions (10184)
    2022-09-16 12:14:25.026 [84775] HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.
        You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
    
  9. 现在它要求在 template0 上执行相同的操作。

    2022-09-16 12:14:25.696 [84775] WARNING: database "template1" must be vacuumed within 2999795 transactions (10184)
    2022-09-16 12:14:25.696 [84775] HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.
        You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
    2022-09-16 12:14:25.703 [84775] WARNING: database "template0" must be vacuumed within 2999794 transactions (10184)
    2022-09-16 12:14:25.703 [84775] HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.
        You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
    
  10. 让我们对 template0 重复同样的操作。

    $ postgres --single template0
    2022-09-16 12:16:18.247 [84863] WARNING: database with OID 14727 must be vacuumed within 2999794 transactions (10185)
    2022-09-16 12:16:18.247 [84863] HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.
        You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
    
    PostgreSQL stand-alone backend 14.0
    backend> VACUUM FULL;
    2022-09-16 12:16:36.346 [84863] WARNING: database "template0" must be vacuumed within 2999794 transactions (10184)
    2022-09-16 12:16:36.346 [84863] HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.
        You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
    
    2022-09-16 12:16:37.036 [84863] WARNING: database "template0" must be vacuumed within 2999730 transactions (10184)
    2022-09-16 12:16:37.036 [84863] HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.
        You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
    2022-09-16 12:16:37.042 [84863] WARNING: database "template0" must be vacuumed within 2999729 transactions (10184)
    2022-09-16 12:16:37.042 [84863] HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.
        You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
    

    您会注意到,PostgreSQL 不再请求对任何其他数据库执行 VACUUM FULL,因为该实例只有三个数据库(postgres、template1 和 template0),并且我们已经成功地在每个数据库上完成了 VACUUM FULL。

    之后,我们可以按如下方式启动 PostgreSQL。

    $ pg_ctl start
    waiting for server to start....
    2022-09-16 12:19:11.594 +08 [84987] LOG: redirecting log output to logging collector process
    2022-09-16 12:19:11.594 +08 [84987] HINT: Allure log output will appear in directory 'log'.
    done
    server started
    

    现在,让我们执行一些 DML 语句,来确认问题是否已经解决。

    select usename from pg_user;
     usename
    ----------
     postgres
    (1 row)
    
    create table test (id numeric);
    insert into test values (1);
    
    select * from test;
     id 
    ----
      1
    (1 row)
    

    如您所见,该问题现已修复,PostgreSQL 能够像问题发生前一样有效地执行 DML 操作了。

避免事务回卷

在 PostgreSQL 中,可以使用各种重要方法来防止事务 ID 回卷。首先,您必须跟踪数据库最早的事务 ID 的年龄。这可以使用 pg_class 系统表来实现。您可以通过跟踪其年龄,来确定事务 ID 是否接近回卷限制。

应定期执行维护操作,例如清理和数据库分析,以防止事务 ID 回卷。VACUUM 清理有助于恢复被过时或已删除的元组占用的空间,从而最大限度地减少膨胀,并防止事务 ID 达到回卷阈值。分析会更新优化器统计信息,以确保查询计划正确无误。

此外,在 PostgreSQL 配置文件中,请考虑对 autovacuum 参数做一些调优。autovacuum 进程由这些设置控制,它们有助于控制事务 ID。

另一种预防方法是配置好监控系统,以便在事务 ID 的年龄达到特定限制时发送通知。这使您可以快速响应,并防止系统接近回卷限制。

结论

尽管通过设置系统的定期维护和分析来防止发生事务 ID 回卷是理想的选择,但随着业务的增长和发展,可能会出现意外事件或新的需求。上面的例子说明了,如何在发生事务 ID 回卷的情况下清理 PostgreSQL 系统。希望您现在不用再害怕事务 ID 回卷了,因为您知道您能够有效地修复它。