pg_ash: PostgreSQL 活动会话历史报告

John Doe 三月 9, 2026

pg_ash 可以用于生成 Postgres 的活动会话历史报告,轻量级等待事件采样,无数据膨胀。

image

纯 SQL + PL/pgSQL 实现,可在任何 Postgres 14 及以上版本运行,包括各类云托管服务。无需 C 扩展、无需配置 shared_preload_libraries、无需云厂商审批、无需重启数据库。只需执行 \i 导入即可使用。

为什么需要它

Postgres 没有内置会话历史。一小时前出现的性能慢查询,事后没有任何数据可查。pg_ash 每秒采样一次 pg_stat_activity,并以紧凑格式存储结果,可直接用普通 SQL 查询。

与同类工具对比

pg_ash

  • 安装:纯 SQL 脚本
  • 托管实例:支持 RDS、Cloud SQL、Supabase 等
  • 采样频率:1 秒(pg_cron)
  • 数据可见性:数据库内部
  • 存储:磁盘,约 30MiB / 天
  • 历史查询:支持,持久化
  • 实现:纯 SQL
  • 维护开销:无
  • 依赖:pg_cron 1.5+

pg_wait_sampling

  • 安装:需 shared_preload_libraries
  • 托管实例:仅 Cloud SQL(2026 初)
  • 采样频率:10 毫秒(进程内)
  • 数据可见性:数据库内部
  • 存储:仅内存
  • 历史查询:环形缓冲区,重启丢失
  • 实现:C 扩展,非纯 SQL
  • 维护开销:无
  • 依赖:shared_preload_libraries,需重启

pgsentinel

  • 安装:需 shared_preload_libraries(打包 / 编译)
  • 托管实例:暂不支持
  • 采样频率:10 毫秒(进程内)
  • 数据可见性:数据库内部
  • 存储:仅内存
  • 历史查询:环形缓冲区,重启丢失
  • 实现:C 扩展,非纯 SQL
  • 维护开销:无
  • 依赖:shared_preload_libraries,需重启

外部采样工具

  • 安装:独立基础设施
  • 托管实例:支持但需额外配置
  • 采样频率:15-60 秒
  • 数据可见性:仅外部
  • 存储:外部存储
  • 历史查询:取决于配置
  • 实现:非纯 SQL
  • 维护开销:高
  • 依赖:代理 + 存储

使用示例

查看状态

select * from ash.status();
           metric           |             value
----------------------------+-------------------------------
 version                    | 1.2
 current_slot               | 0
 sample_interval            | 00:00:01
 rotation_period            | 1 day
 include_bg_workers         | false
 samples_in_current_slot    | 56
 last_sample_ts             | 2026-02-16 08:39:03+00
 wait_event_map_count       | 11
 wait_event_map_utilization | 0.03%
 query_map_count            | 8
 pg_cron_available          | no

最近性能问题概览

-- morning coffee: what happened overnight?
select * from ash.activity_summary('5 minutes');
        metric        |                                            value
----------------------+---------------------------------------------------------------------------------------------
 time_range           | 00:05:00
 total_samples        | 56
 avg_active_backends  | 6.6
 peak_active_backends | 10
 peak_time            | 2026-02-16 08:38:16+00
 databases_active     | 1
 top_wait_1           | Client:ClientRead (46.77%)
 top_wait_2           | Timeout:PgSleep (11.83%)
 top_wait_3           | Lock:transactionid (9.68%)
 top_query_1          | -2835399305386018931 — COMMIT (29.73%)
 top_query_2          | 3365820675399133794 — UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE b (23.24%)
 top_query_3          | -4378765880691287891 — UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE t (11.35%)
-- top wait events (default: top 10 + Other)
select * from ash.top_waits('5 minutes');
     wait_event     | samples |  pct  |                  bar
--------------------+---------+-------+---------------------------------------
 Client:ClientRead  |     174 | 46.77 | ██████████████████████████████ 46.77%
 Timeout:PgSleep    |      44 | 11.83 | ████████ 11.83%
 Lock:transactionid |      36 |  9.68 | ██████ 9.68%
 CPU*               |      35 |  9.41 | ██████ 9.41%
 LWLock:WALWrite    |      31 |  8.33 | █████ 8.33%
 IdleTx             |      26 |  6.99 | ████ 6.99%
 IO:WalSync         |      19 |  5.11 | ███ 5.11%
 Lock:tuple         |       5 |  1.34 | █ 1.34%
 LWLock:LockManager |       2 |  0.54 | █ 0.54%
-- top queries with text from pg_stat_statements
select * from ash.top_queries_with_text('5 minutes', 5);
       query_id       | samples |  pct  | calls  | total_exec_time_ms | mean_exec_time_ms |                             query_text
----------------------+---------+-------+--------+--------------------+-------------------+---------------------------------------------------------------------
 -2835399305386018931 |     110 | 29.73 | 283202 |            1234.56 |              0.00 | commit
  3365820675399133794 |      86 | 23.24 | 283195 |          518349.35 |              1.83 | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2
  5457019535816659310 |      44 | 11.89 |     11 |          195225.25 |          17747.75 | select pg_sleep($1)
 -4378765880691287891 |      42 | 11.35 | 283195 |          113278.00 |              0.40 | UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2
-- breakdown by wait event type
select * from ash.top_by_type('5 minutes');
 wait_event_type | samples |  pct  |                       bar
-----------------+---------+-------+-------------------------------------------------
 Client          |     174 | 46.77 | ████████████████████████████████████████ 46.77%
 Timeout         |      44 | 11.83 | ██████████ 11.83%
 Lock            |      41 | 11.02 | █████████ 11.02%
 CPU*            |      35 |  9.41 | ████████ 9.41%
 LWLock          |      33 |  8.87 | ████████ 8.87%
 IdleTx          |      26 |  6.99 | ██████ 6.99%
 IO              |      19 |  5.11 | ████ 5.11%

分析特定查询

-- what is query 3365820675399133794 waiting on?
select * from ash.query_waits(3365820675399133794, '5 minutes');
     wait_event     | samples |  pct  |                       bar
--------------------+---------+-------+-------------------------------------------------
 Client:ClientRead  |      32 | 54.24 | ████████████████████████████████████████ 54.24%
 Lock:transactionid |      12 | 20.34 | ███████████████ 20.34%
 LWLock:WALWrite    |       6 | 10.17 | ████████ 10.17%
 CPU*               |       4 |  6.78 | █████ 6.78%
 IO:WalSync         |       3 |  5.08 | ████ 5.08%
 IdleTx             |       2 |  3.39 | ██ 3.39%
-- same, but during a specific time window
select * from ash.query_waits_at(3365820675399133794, '2026-02-16 08:38', '2026-02-16 08:40');

跟踪某个等待事件

-- which queries are stuck on Lock:transactionid?
select * from ash.event_queries('Lock:transactionid', '1 hour');

-- or by wait type (matches all events of that type)
select * from ash.event_queries('IO', '1 hour');

查看原始采样

-- see the last 20 decoded samples with query text
select * from ash.samples('10 minutes', 20);
      sample_time       | database_name | active_backends |     wait_event     |       query_id       |                          query_text
------------------------+---------------+-----------------+--------------------+----------------------+--------------------------------------------------------------
 2026-02-16 11:18:51+00 | postgres      |               7 | CPU*               | -2835399305386018931 | END
 2026-02-16 11:18:51+00 | postgres      |               7 | CPU*               |  3365820675399133794 | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE ...
 2026-02-16 11:18:49+00 | postgres      |               5 | Client:ClientRead  |  9144568883098003499 | SELECT abalance FROM pgbench_accounts WHERE aid = $1
 2026-02-16 11:18:49+00 | postgres      |               5 | IO:WalSync         | -2835399305386018931 | END
 2026-02-16 11:18:49+00 | postgres      |               3 | Lock:transactionid | -2835399305386018931 | END
 2026-02-16 11:18:49+00 | postgres      |               5 | LWLock:WALWrite    | -2835399305386018931 | END
-- raw samples during an incident
select * from ash.samples_at('2026-02-14 03:00', '2026-02-14 03:05', 50);

定位故障

使用带绝对时间戳的_at后缀函数,可聚焦分析指定的时间窗口:

-- what happened between 3:00 and 3:10 am?
select * from ash.top_waits_at('2026-02-14 03:00', '2026-02-14 03:10');

-- which queries were running during the incident?
select * from ash.top_queries_at('2026-02-14 03:00', '2026-02-14 03:10');

-- minute-by-minute timeline of the incident
select * from ash.wait_timeline_at(
    '2026-02-14 03:00',
    '2026-02-14 03:10',
    '1 minute'
);

依赖要求

  • Postgres 14 及以上版本(要求 pg_stat_activity 视图中包含 query_id 字段)
  • pg_cron 1.5 及以上版本(用于支持分钟级以下的调度)
  • pg_stat_statements(可选组件,启用后可在 top_queries_with_text()event_queries() 函数中显示查询文本、calls(调用次数)、total_exec_time_ms(总执行时间,毫秒)、mean_exec_time_ms(平均执行时间,毫秒);不安装该组件时,其他所有函数仍可正常使用)

关于 query_id 的说明:默认配置 compute_query_id = auto,仅在 pg_stat_statements 被加入 shared_preload_libraries 时,才会填充 query_id 字段。如果发现 pg_stat_activity 中的 query_id 字段值为 NULL,请执行以下设置:

alter system set compute_query_id = 'on';
-- requires reload: select pg_reload_conf();

参考

pg_ash:https://github.com/NikolayS/pg_ash