Testing Online Recovery

Pgpool-II allows to recover a downed node by technique called "Online Recovery". This copies data from the primary node to a standby node so that it sync with the primary. This may take long time and database may be updated during the process. That's no problem because in the streaming configuration, the standby will receive WAL log and applies it to catch up the primary. To test online recovery, let's start with previous cluster, where node 0 is in down state.

    $ pcp_recovery_node -p 11001 -n 0
    Password: 
    pcp_recovery_node -- Command Successful

    $ psql -p 11000 -c "show pool_nodes" test
    node_id | hostname | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | last_status_change
    ---------+----------+-------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
    0       | /tmp     | 11002 | up     | 0.500000  | standby | 0          | false             | 0                 | 2019-01-31 12:06:48
    1       | /tmp     | 11003 | up     | 0.500000  | primary | 0          | true              | 0                 | 2019-01-31 12:05:20
    (2 rows)
   

pcp_recovery_node is one of control commands coming with Pgpool-II installation. The argument -p is to specify the port number assigned to the command, which is 11001 set by pgpool_setup . The argument -n is to specify the node id to be recovered. After executing the command, node 0 returned to "up" status.

The script executed by pcp_recovery_node is specified as "recovery_1st_stage_command" in pgpool.conf . Here is the file installed by pgpool_setup .

    #! /bin/sh
    psql=/usr/local/pgsql/bin/psql
    DATADIR_BASE=/home/t-ishii/tmp/Tutorial
    PGSUPERUSER=t-ishii

    main_db_cluster=$1
    recovery_node_host_name=$2
    DEST_CLUSTER=$3
    PORT=$4
    recovery_node=$5

    pg_rewind_failed="true"

    log=$DATADIR_BASE/log/recovery.log
    echo >> $log
    date >> $log
    if [ $pg_rewind_failed = "true" ];then

    $psql -p $PORT -c "SELECT pg_start_backup('Streaming Replication', true)" postgres

    echo "source: $main_db_cluster dest: $DEST_CLUSTER" >> $log

    rsync -C -a -c --delete --exclude postgresql.conf --exclude postmaster.pid \
    --exclude postmaster.opts --exclude pg_log \
    --exclude recovery.conf --exclude recovery.done \
    --exclude pg_xlog \
    $main_db_cluster/ $DEST_CLUSTER/

    rm -fr $DEST_CLUSTER/pg_xlog 
    mkdir $DEST_CLUSTER/pg_xlog
    chmod 700 $DEST_CLUSTER/pg_xlog
    rm $DEST_CLUSTER/recovery.done
    fi
    cat > $DEST_CLUSTER/recovery.conf $lt;$lt;REOF
    standby_mode          = 'on'
    primary_conninfo      = 'port=$PORT user=$PGSUPERUSER'
    recovery_target_timeline='latest'
    restore_command = 'cp $DATADIR_BASE/archivedir/%f "%p" 2> /dev/null'
    REOF

    if [ $pg_rewind_failed = "true" ];then
    $psql -p $PORT -c "SELECT pg_stop_backup()" postgres
    fi

    if [ $pg_rewind_failed = "false" ];then
    cp /tmp/postgresql.conf $DEST_CLUSTER/
    fi