pgcopydb: Resuming Operations (snaphots)

April 4, 2024

Summary: This section of documentation introduces you the resuming operations design in the pgcopydb program.

Table of Contents

Introduction

An important aspect of pgcopydb design is detailed in the documentation section Notes about concurrency and has to do with using many concurrent worker processes to implement parallelism.

Even when using multiple worker processes, it is important that pgcopydb operations are consistent. It is essential to guarantee that the same source schema and data set is used by every single worker process throughout the operations.

Consistency with multiple Postgres sessions is achieved thanks to Postgres ability to export and import snapshots. As per Postgres docs about Snapshot Synchronization Functions:

From the PostgreSQL documentation

PostgreSQL allows database sessions to synchronize their snapshots. A snapshot determines which data is visible to the transaction that is using the snapshot. Synchronized snapshots are necessary when two or more sessions need to see identical content in the database. If two sessions just start their transactions independently, there is always a possibility that some third transaction commits between the executions of the two START TRANSACTION commands, so that one session sees the effects of that transaction and the other does not.

To solve this problem, PostgreSQL allows a transaction to export the snapshot it is using. As long as the exporting transaction remains open, other transactions can import its snapshot, and thereby be guaranteed that they see exactly the same view of the database that the first transaction sees. But note that any database changes made by any one of these transactions remain invisible to the other transactions, as is usual for changes made by uncommitted transactions. So the transactions are synchronized with respect to pre-existing data, but act normally for changes they make themselves.

Snapshots are exported with the pg_export_snapshot function, shown in Table 9.94, and imported with the SET TRANSACTION command.

Using these Postgres APIs allows pgcopydb to implement consistent operations even when using multiple worker processes.

Bypassing consistency issues

If you can ensure that no writes happen on the source database for the whole duration of the pgcopydb operations, which means no schema change (DDL) and no data change (DML), then consistency issues can’t happen: that’s because the database is static for our context, probably within a maintenance window setup where the applications are disconnected from the source database service.

Note that pgcopydb offers the --not-consistent option that allows bypassing all the complexity of sharing a snapshot throughout the operations. In particular, resuming operations after a crash or even implementing multi-steps operations is made easier when bypassing consistency aspects altogether.

When you are able to work within a maintenance window where the database is isolated from any application traffic, consider using --not-consistent.

Consistency and concurrency: Postgres snapshots

As seen above, Postgres offers different APIs to export and import a snapshot:

Function pg_export_snapshot() exports the current snapshot.
SQL command SET TRANSACTION SNAPSHOT imports the given snapshot.
Replication protocol command CREATE_REPLICATION_SLOT allows exporting its snapshot.

Exporting a Postgres snapshot can be done either at the create replication slot time, or from a non-replication connection using the SQL function pg_export_snapshot(). This is an either/or situation, Postgres does not allow mixing these two approaches.

Also remember that a single snapshot must be used throughout pgcopydb operations, both the initial COPY of the schema and data and also the Change Data Capture aspects in order to achieve consistency (no data loss, no duplicates in the data change stream).

To be able to implement multiple worker processes in pgcopydb and have a consistent view of the same database (schema, data) in every single process, pgcopydb needs to first export a single common snapshot and then have every worker process import that snapshot when connecting to the source database.

When implementing Change-Data-Capture thanks to the --follow option, then it is also required that there is no gap between the initial snapshot and the first change received, and also that no changes are sent that were part of the initial copy. Postgres knows how to offer that guarantee via its snapshot exporting facility in the CREATE_REPLICATION_SLOT replication command.

As a result, the pgcopydb snapshot --follow command is required by the Postgres API to also create the pgcopydb replication slot, and exports the replication slot snapshot.

The pgcopydb snapshot command, when the --follow option is not used, simply connects to the Postgres source database using the normal query protocol and run the command select pg_export_snapshot() to grab a snapshot that can be shared by all the worker processes.

Moreover the Postgres API for exporting a snapshot has the following limitation:

From the PostgreSQL documentation

The snapshot is available for import only until the end of the transaction that exported it.

This means that the pgcopydb snapshot command must be kept running for the whole pgcopydb initial copy operations. The replication client only uses the replication slot to ensure consistency, so when only the follow worker processes are running, holding on to the snapshot is not required anymore.

Resumability of pgcopydb operations

The ability to resume operations when using pgcopydb faces three quite different contexts. Depending on the context and when the previous operation has been interrupted, then running the same pgcopydb command again with the --resume option might just work, or might error out because the situation does not allow a consistent resuming of the operation that was interrupted.

Bypassing consistency issues

When using the --resume --not-consistent options then there is no restriction around snapsht re-use when trying to resume interrupted operations.

Consistent copy of the data

When using pgcopydb clone --resume the snapshot used in the previous attempts is going to be re-used. For Postgres to be able to import that snapshot again, the transaction that exported the snapshot must be still running on the source database system.

Single pgcopydb command

When using pgcopydb clone the snapshot holding process is part of that single process tree, and any interruption of this command (signal, C-c, crash) also terminates the snapshot holding sub-process and the snapshot is then lost.
Separate pgcopydb snapshot command

That’s why the pgcopydb snapshot command can be used separately. Then the main pgcopydb clone command re-uses the snapshot automatically and can be left holding the snapshot even in case of an interruption of the pgcopydb clone command.
External snapshot

It is also possible to use another command or software to export and maintain the snapshot that pgcopydb uses and then use pgcopydb clone --snapshot ... to share the snapshot with pgcopydb.

Consistent copy of the data with CDC

When using Change Data Capture with the --follow option resuming operations consistently requires the following situation:

The initial COPY of the data needs to still have access to the exported snapshot.

Even when the snapshot has been exported with a replication protocol command, Postgres still requires the session to be maintained opened here.
The logical replication on the client side is not concerned with the snapshot operations, that’s done server-side when creating the replication slot; from there on all the client has to do is consume from the replication slot.

Snapshot and catalogs (cache invalidation)

The source catalog table setup registers information about the current pgcopydb command. The information is checked at start-up in order to avoid re-using data in a different context.

The information registered is the following, and also contains the snapshot information. In case of a mismatch, consider using --resume --not-consistent when that’s relevant to your operations.

Here’s how to inspect the current setup information that pgcopydb maintains in its local catalog cache:

$ sqlite3 /tmp/pgcopydb/schema/source.db
sqlite> .mode line
sqlite> select * from setup;
                      id = 1
           source_pg_uri = postgres:///pagila
           target_pg_uri = postgres:///plop
                snapshot = 00000003-00000048-1
split_tables_larger_than = 0
                 filters = {"type":"SOURCE_FILTER_TYPE_NONE"}
                  plugin =
               slot_name =

The source and target connection strings only contain the Postgres servers hostname, port, database name and connecting role name. In particular, authentication credentials are not stored in the catalogs.