Failover in the PostgreSQL Operator Overview

Failover in the PostgreSQL Operator

There are a number of potential events that could cause a primary database instance or cluster to become unavailable during the course of normal operations, including:

  • A database storage (disk) failure or any other hardware failure
  • The network on which the database resides becomes unreachable
  • The host operating system becomes unstable and crashes
  • A key database file becomes corrupted
  • Total loss of data center

There may also be downtime events that are due to the normal case of operations, such as performing a minor upgrade, security patching of operating system, hardware upgrade, or other maintenance.

To enable rapid recovery from the unavailability of the primary PostgreSQL instance within a PostgreSQL cluster, the PostgreSQL Operator supports both Manual and Automated failover within a single Kubernetes cluster.

PostgreSQL Cluster Architecture

The failover from a primary PostgreSQL instances to a replica PostgreSQL instance within a PostgreSQL cluster.

Manual Failover

Manual failover is performed by PostgreSQL Operator API actions involving a query and then a target being specified to pick the fail-over replica target.

Automatic Failover

Automatic failover is performed by the PostgreSQL Operator by evaluating the readiness of a primary. Automated failover can be globally specified for all clusters or specific clusters. If desired, users can configure the PostgreSQL Operator to replace a failed primary PostgreSQL instance with a new PostgreSQL replica.

The PostgreSQL Operator automatic failover logic includes:

  • deletion of the failed primary Deployment
  • pick the best replica to become the new primary
  • label change of the targeted Replica to match the primary Service
  • execute the PostgreSQL promote command on the targeted replica