Streaming Replication Protocol

To initiate streaming replication, the frontend sends the replication parameter in the startup message. This tells the backend to go into walsender mode, wherein a small set of replication commands can be issued instead of SQL statements. Only the simple query protocol can be used in walsender mode. The commands accepted in walsender mode are:

IDENTIFY_SYSTEM

Requests the server to identify itself. Server replies with a result set of a single row, containing three fields:

systemid

The unique system identifier identifying the cluster. This can be used to check that the base backup used to initialize the standby came from the same cluster.

timeline

Current TimelineID. Also useful to check that the standby is consistent with the master.

xlogpos

Current xlog flush location. Useful to get a known location in the transaction log where streaming can start.
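The checks described for these three fields can be sketched in Python roughly as follows. The function name, and the idea of receiving the row as a tuple of text values, are assumptions made for this illustration and are not part of the protocol.

```python
def check_identify_system(row, local_systemid, local_timeline):
    """Validate an IDENTIFY_SYSTEM row and return its xlogpos field.

    row is assumed to be (systemid, timeline, xlogpos), each given as text,
    which is how the simple query protocol returns column values.
    """
    systemid, timeline, xlogpos = row
    # A different system identifier means the base backup came from another cluster.
    if str(systemid) != str(local_systemid):
        raise ValueError("server belongs to a different cluster")
    # A different timeline means the standby is not consistent with the master.
    if int(timeline) != int(local_timeline):
        raise ValueError("server is on a different timeline")
    return xlogpos
```

The returned position is a known location in the transaction log from which streaming could be requested.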

START_REPLICATION XXX/XXX

Instructs the server to start streaming WAL, starting at WAL position XXX/XXX. The server can reply with an error, e.g. if the requested section of WAL has already been recycled. On success, the server responds with a CopyBothResponse message, and then starts to stream WAL to the frontend. WAL will continue to be streamed until the connection is broken; no further commands will be accepted.
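A minimal Python sketch of building this command from a textual WAL position is shown below. It assumes that both halves of the XXX/XXX notation are hexadecimal numbers; the helper names are hypothetical.

```python
def parse_wal_position(text):
    """Split 'XXX/XXX' into two integers (assumption: both halves are hex)."""
    high, sep, low = text.strip().partition("/")
    if not sep or not high or not low:
        raise ValueError(f"not a WAL position: {text!r}")
    return int(high, 16), int(low, 16)


def format_wal_position(high, low):
    """Render two integers back into the XXX/XXX notation."""
    return f"{high:X}/{low:X}"


def start_replication_command(position_text):
    """Build the command string, normalizing the position on the way."""
    high, low = parse_wal_position(position_text)
    return f"START_REPLICATION {format_wal_position(high, low)}"
```

Parsing and re-formatting the position before sending it means that malformed input is rejected on the client side instead of producing a server error.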

WAL data is sent as a series of CopyData messages. (This allows other information to be intermixed; in particular the server can send an ErrorResponse message if it encounters a failure after beginning to stream.) The payload in each CopyData message follows this format:

XLogData (B)

Byte1('w')

Identifies the message as WAL data.

Byte8

The starting point of the WAL data in this message, given in XLogRecPtr format.

Byte8

The current end of WAL on the server, given in XLogRecPtr format.

Byte8

The server's system clock at the time of transmission, given in TimestampTz format.

Byte n

A section of the WAL data stream.

A single WAL record is never split across two CopyData messages. When a WAL record crosses a WAL page boundary, and is therefore already split using continuation records, it can be split at the page boundary. In other words, the first main WAL record and its continuation records can be sent in different CopyData messages.

Note that all fields within the WAL data and the above-described header will be in the sending server's native format. Endianness, and the format for the timestamp, are unpredictable unless the receiver has verified that the sender's system identifier matches its own pg_control contents.
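The XLogData layout above could be decoded along the following lines. This is a sketch under stated assumptions, not a definitive parser: it assumes that XLogRecPtr consists of two unsigned 32-bit halves, that TimestampTz is a signed 64-bit integer, and that the caller already knows the sender's byte order (passed as a Python struct prefix such as "<" or ">"). All names are hypothetical.

```python
import struct
from collections import namedtuple

XLogData = namedtuple("XLogData", "start end send_time wal")


def parse_xlogdata(payload, byte_order="="):
    """Decode the payload of one CopyData message carrying WAL data.

    byte_order is a struct prefix: "=" for the local machine's order,
    "<" for little-endian, ">" for big-endian. It must match the sender.
    """
    # Byte1 tag, two 8-byte positions (assumed 2 x uint32 each), 8-byte timestamp.
    header = struct.Struct(byte_order + "cIIIIq")
    if len(payload) < header.size:
        raise ValueError("payload is shorter than the XLogData header")
    tag, start_hi, start_lo, end_hi, end_lo, send_time = header.unpack_from(payload)
    if tag != b"w":
        raise ValueError(f"not an XLogData message: {tag!r}")
    # Everything after the 25-byte header is the WAL data itself.
    return XLogData((start_hi, start_lo), (end_hi, end_lo), send_time,
                    payload[header.size:])
```

Taking the byte order as a parameter reflects the note above: the header is in the sender's native format, so the receiver cannot hard-code one layout.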

If the WAL sender process is terminated normally (during postmaster shutdown), it will send a CommandComplete message before exiting. This might not happen during an abnormal shutdown, of course.

In addition to XLogData, the following message formats are exchanged, also in the payload of a CopyData message. The primary keepalive message is sent by the server; the receiving process can send the other two as replies back to the sender at any time:

Primary keepalive message (B)

Byte1('k')

Identifies the message as a sender keepalive.

Byte8

The current end of WAL on the server, given in XLogRecPtr format.

Byte8

The server's system clock at the time of transmission, given in TimestampTz format.

Standby status update (F)

Byte1('r')

Identifies the message as a receiver status update.

Byte8

The location of the last WAL byte + 1 received and written to disk in the standby, in XLogRecPtr format.

Byte8

The location of the last WAL byte + 1 flushed to disk in the standby, in XLogRecPtr format.

Byte8

The location of the last WAL byte + 1 applied in the standby, in XLogRecPtr format.

Byte8

The standby's system clock at the time of transmission, given in TimestampTz format.

Hot Standby feedback message (F)

Byte1('h')

Identifies the message as a Hot Standby feedback message.

Byte8

The standby's system clock at the time of transmission, given in TimestampTz format.

Byte4

The standby's current xmin. This may be 0, if the standby is sending notification that Hot Standby feedback will no longer be sent on this connection. Later non-zero messages may reinitiate the feedback mechanism.

Byte4

The standby's current epoch.
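The three message layouts above can be packed and unpacked with a few lines of Python. This is only a sketch: it assumes that XLogRecPtr is two unsigned 32-bit halves, that TimestampTz is a signed 64-bit integer, and that both sides use the byte order given by the struct prefix argument. The function names are hypothetical.

```python
import struct


def parse_keepalive(payload, byte_order="="):
    """Decode a primary keepalive message into (end_of_wal, send_time)."""
    tag, end_hi, end_lo, send_time = struct.unpack(byte_order + "cIIq", payload)
    if tag != b"k":
        raise ValueError(f"not a keepalive message: {tag!r}")
    return (end_hi, end_lo), send_time


def build_status_update(written, flushed, applied, send_time, byte_order="="):
    """Encode a standby status update; each position is a (high, low) pair."""
    return struct.pack(byte_order + "cIIIIIIq", b"r",
                       *written, *flushed, *applied, send_time)


def build_hs_feedback(send_time, xmin, epoch, byte_order="="):
    """Encode a Hot Standby feedback message; xmin 0 announces that
    feedback will no longer be sent on this connection."""
    return struct.pack(byte_order + "cqII", b"h", send_time, xmin, epoch)
```

For example, build_hs_feedback(now, 0, 0) would produce the notification that feedback is being turned off, where now is a timestamp value supplied by the caller.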

BASE_BACKUP [ LABEL 'label' ] [ PROGRESS ] [ FAST ] [ WAL ] [ NOWAIT ]

Instructs the server to start streaming a base backup. The system will automatically be put in backup mode before the backup is started, and taken out of it when the backup is complete. The following options are accepted:

LABEL 'label'

Sets the label of the backup. If none is specified, a backup label of "base backup" will be used. The quoting rules for the label are the same as a standard SQL string with standard_conforming_strings turned on.

PROGRESS

Requests the information required to generate a progress report. The server sends back an approximate size in the header of each tablespace, which can be used to calculate how much of the stream has been completed. The size is calculated by enumerating all file sizes once before the transfer starts, which can hurt performance; in particular, it may take longer before the first data is streamed. Because the database files can change during the backup, the size is only approximate and may either grow or shrink between the time of approximation and the sending of the actual files.

FAST

Request a fast checkpoint.

WAL

Include the necessary WAL segments in the backup. This will include all the files between start and stop backup in the pg_xlog directory of the base directory tar file.

NOWAIT

By default, the backup will wait until the last required xlog segment has been archived, or emit a warning if log archiving is not enabled. Specifying NOWAIT disables both the waiting and the warning, leaving the client responsible for ensuring the required log is available.
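Assembling the command from these options might look like the following Python sketch. The function names are hypothetical; the options are emitted in the order given in the synopsis, and the label is quoted as a standard SQL string, where an embedded single quote is doubled and backslashes are taken literally.

```python
def quote_label(label):
    """Quote a label as a standard SQL string literal."""
    return "'" + label.replace("'", "''") + "'"


def base_backup_command(label=None, progress=False, fast=False,
                        wal=False, nowait=False):
    """Build a BASE_BACKUP command string from the requested options."""
    parts = ["BASE_BACKUP"]
    if label is not None:
        parts += ["LABEL", quote_label(label)]
    # Flag options follow in the order used by the command synopsis.
    for enabled, keyword in ((progress, "PROGRESS"), (fast, "FAST"),
                             (wal, "WAL"), (nowait, "NOWAIT")):
        if enabled:
            parts.append(keyword)
    return " ".join(parts)
```

Omitting the label leaves the choice of the default label to the server.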

When the backup is started, the server will first send two ordinary result sets, followed by one or more CopyResponse results.

The first ordinary result set contains the starting position of the backup, given in XLogRecPtr format as a single column in a single row.

The second ordinary result set has one row for each tablespace. The fields in this row are:

spcoid

The OID of the tablespace, or NULL if it's the base directory.

spclocation

The full path of the tablespace directory, or NULL if it's the base directory.

size

The approximate size of the tablespace, if progress report has been requested; otherwise it's NULL.

After the second regular result set, one or more CopyResponse results will be sent, one for PGDATA and one for each additional tablespace other than pg_default and pg_global. The data in the CopyResponse results will be a tar format (following the "ustar interchange format" specified in the POSIX 1003.1-2008 standard) dump of the tablespace contents, except that the two trailing blocks of zeroes specified in the standard are omitted. After the tar data is complete, a final ordinary result set will be sent.
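Because the two trailing zero blocks are omitted, a client that wants a strictly standard archive can append them itself. The Python sketch below does that and then lists the members with the standard tarfile module. It assumes the complete tar data of one tablespace has been collected into a single bytes object and that every member is padded to a whole 512-byte block; the function names are hypothetical.

```python
import io
import tarfile

TAR_BLOCK = 512  # block size used by the ustar format


def complete_tar_stream(data):
    """Append the two end-of-archive zero blocks that the server leaves out."""
    if len(data) % TAR_BLOCK:
        raise ValueError("tar data is not a whole number of 512-byte blocks")
    return data + b"\0" * (2 * TAR_BLOCK)


def list_members(data):
    """Return the member names found in one tablespace's tar data."""
    archive_bytes = io.BytesIO(complete_tar_stream(data))
    with tarfile.open(fileobj=archive_bytes) as archive:
        return archive.getnames()
```

For large backups the data would normally be written to disk as it arrives rather than held in memory; the same two zero blocks can then be appended once the last CopyData message of the tablespace has been received.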

The tar archive for the data directory and each tablespace will contain all files in the directories, regardless of whether they are PostgreSQL files or other files added to the same directory. The only excluded files are:

  • postmaster.pid

  • postmaster.opts

  • pg_xlog, including subdirectories. If the backup is run with WAL files included, a synthesized version of pg_xlog will be included, but it will only contain the files necessary for the backup to work, not the rest of the contents.

Owner, group and file mode are set if the underlying file system on the server supports it.

Once all tablespaces have been sent, a final regular result set will be sent. This result set contains the end position of the backup, given in XLogRecPtr format as a single column in a single row.