Patroni makes it extremely simple to deploy Multi-Node Citus clusters.
TL;DR
There are only a few simple rules you need to follow:
- The Citus database extension to PostgreSQL must be available on all nodes. The absolute minimum supported Citus version is 10.0, but to take full advantage of transparent switchovers and restarts of workers we recommend using at least Citus 11.2.
- The cluster name (scope) must be the same for all Citus nodes!
- Superuser credentials must be the same on the coordinator and all worker nodes, and pg_hba.conf should allow superuser access between all nodes.
- REST API access should be allowed from worker nodes to the coordinator. E.g., credentials should be the same and, if configured, client certificates from worker nodes must be accepted by the coordinator.
- Add the following section to the patroni.yaml:
citus:
  group: X  # 0 for coordinator and 1, 2, 3, etc for workers
  database: citus  # must be the same on all nodes
After that you just need to start Patroni and it will handle the rest:
- Patroni will set bootstrap.dcs.synchronous_mode to quorum if it is not explicitly set to any other value.
- The citus extension will be automatically added to shared_preload_libraries.
- If max_prepared_transactions isn’t explicitly set in the global dynamic configuration, Patroni will automatically set it to 2*max_connections.
- The citus.local_hostname GUC value will be adjusted from localhost to the value that Patroni uses to connect to the local PostgreSQL instance. The value sometimes has to differ from localhost because PostgreSQL might not be listening on it.
- The citus.database will be automatically created, followed by CREATE EXTENSION citus.
- Current superuser credentials will be added to the pg_dist_authinfo table to allow cross-node communication. Don’t forget to update them if you later decide to change the superuser username/password/sslcert/sslkey!
- The coordinator primary node will automatically discover worker primary nodes and add them to the pg_dist_node table using the citus_add_node() function.
- Patroni will also maintain pg_dist_node in case a failover/switchover occurs on the coordinator or worker clusters.
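The two parameter defaults from the list above can be sketched in a few lines of Python. This is only an illustration of the described rules, not Patroni's actual code: the function name apply_citus_defaults is hypothetical, putting citus first in shared_preload_libraries is an assumption, and the fallback of 100 for max_connections is simply the PostgreSQL default.

```python
def apply_citus_defaults(parameters: dict) -> dict:
    """Return a copy of the PostgreSQL parameters with Citus-related defaults applied.

    Hypothetical helper: it mirrors the rules described in the text,
    it is not taken from the Patroni source code.
    """
    result = dict(parameters)

    # Make sure "citus" is listed in shared_preload_libraries.
    libs = [
        lib.strip()
        for lib in str(result.get("shared_preload_libraries", "")).split(",")
        if lib.strip()
    ]
    if "citus" not in libs:
        libs.insert(0, "citus")  # assumption: citus is placed first
    result["shared_preload_libraries"] = ",".join(libs)

    # Only fill in max_prepared_transactions when it is not set explicitly.
    if "max_prepared_transactions" not in result:
        max_connections = int(result.get("max_connections", 100))
        result["max_prepared_transactions"] = 2 * max_connections

    return result


if __name__ == "__main__":
    print(apply_citus_defaults({"max_connections": 50}))
```

An explicitly configured max_prepared_transactions is left untouched, which matches the "isn’t explicitly set" condition above.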
patronictl
Coordinator and worker clusters are physically different PostgreSQL/Patroni clusters that are just logically grouped together using the Citus database extension to PostgreSQL. Therefore in most cases it is not possible to manage them as a single entity.
This results in two major differences in patronictl behaviour when patroni.yaml has the citus section, compared with the usual behaviour:
- The list and topology commands by default output all members of the Citus formation (coordinators and workers). The new column Group indicates which Citus group they belong to.
- A new option named --group is introduced for all patronictl commands. For some commands the default value for the group might be taken from patroni.yaml. For example, patronictl pause will by default enable maintenance mode for the group that is set in the citus section, whereas for patronictl switchover or patronictl remove the group must be explicitly specified.
An example of patronictl list output for the Citus cluster:
postgres@coord1:~$ patronictl list demo
+ Citus cluster: demo ----------+----------------+---------+----+-----------+
| Group | Member | Host | Role | State | TL | Lag in MB |
+-------+---------+-------------+----------------+---------+----+-----------+
| 0 | coord1 | 172.27.0.10 | Replica | running | 1 | 0 |
| 0 | coord2 | 172.27.0.6 | Quorum Standby | running | 1 | 0 |
| 0 | coord3 | 172.27.0.4 | Leader | running | 1 | |
| 1 | work1-1 | 172.27.0.8 | Quorum Standby | running | 1 | 0 |
| 1 | work1-2 | 172.27.0.2 | Leader | running | 1 | |
| 2 | work2-1 | 172.27.0.5 | Quorum Standby | running | 1 | 0 |
| 2 | work2-2 | 172.27.0.7 | Leader | running | 1 | |
+-------+---------+-------------+----------------+---------+----+-----------+
If we add the --group option, the output will change to:
postgres@coord1:~$ patronictl list demo --group 0
+ Citus cluster: demo (group: 0, 7179854923829112860) -+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+--------+-------------+----------------+---------+----+-----------+
| coord1 | 172.27.0.10 | Replica | running | 1 | 0 |
| coord2 | 172.27.0.6 | Quorum Standby | running | 1 | 0 |
| coord3 | 172.27.0.4 | Leader | running | 1 | |
+--------+-------------+----------------+---------+----+-----------+
postgres@coord1:~$ patronictl list demo --group 1
+ Citus cluster: demo (group: 1, 7179854923881963547) -+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+---------+------------+----------------+---------+----+-----------+
| work1-1 | 172.27.0.8 | Quorum Standby | running | 1 | 0 |
| work1-2 | 172.27.0.2 | Leader | running | 1 | |
+---------+------------+----------------+---------+----+-----------+
Citus worker switchover
When a switchover is orchestrated for a Citus worker node, Citus offers the
opportunity to make the switchover close to transparent for an application.
Because the application connects to the coordinator, which in turn connects to
the worker nodes, it is possible with Citus to pause the SQL traffic on
the coordinator for the shards hosted on a worker node. The switchover then
happens while the traffic is held on the coordinator, and the traffic resumes as
soon as a new primary worker node is ready to accept read-write queries.
An example of patronictl switchover on the worker cluster:
postgres@coord1:~$ patronictl switchover demo
+ Citus cluster: demo ----------+----------------+---------+----+-----------+
| Group | Member | Host | Role | State | TL | Lag in MB |
+-------+---------+-------------+----------------+---------+----+-----------+
| 0 | coord1 | 172.27.0.10 | Replica | running | 1 | 0 |
| 0 | coord2 | 172.27.0.6 | Quorum Standby | running | 1 | 0 |
| 0 | coord3 | 172.27.0.4 | Leader | running | 1 | |
| 1 | work1-1 | 172.27.0.8 | Leader | running | 1 | |
| 1 | work1-2 | 172.27.0.2 | Quorum Standby | running | 1 | 0 |
| 2 | work2-1 | 172.27.0.5 | Quorum Standby | running | 1 | 0 |
| 2 | work2-2 | 172.27.0.7 | Leader | running | 1 | |
+-------+---------+-------------+----------------+---------+----+-----------+
Citus group: 2
Primary [work2-2]:
Candidate ['work2-1'] []:
When should the switchover take place (e.g. 2024-08-26T08:02 ) [now]:
Current cluster topology
+ Citus cluster: demo (group: 2, 7179854924063375386) -+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+---------+------------+----------------+---------+----+-----------+
| work2-1 | 172.27.0.5 | Quorum Standby | running | 1 | 0 |
| work2-2 | 172.27.0.7 | Leader | running | 1 | |
+---------+------------+----------------+---------+----+-----------+
Are you sure you want to switchover cluster demo, demoting current primary work2-2? [y/N]: y
2024-08-26 07:02:40.33003 Successfully switched over to "work2-1"
+ Citus cluster: demo (group: 2, 7179854924063375386) ------+
| Member | Host | Role | State | TL | Lag in MB |
+---------+------------+---------+---------+----+-----------+
| work2-1 | 172.27.0.5 | Leader | running | 1 | |
| work2-2 | 172.27.0.7 | Replica | stopped | | unknown |
+---------+------------+---------+---------+----+-----------+
postgres@coord1:~$ patronictl list demo
+ Citus cluster: demo ----------+----------------+---------+----+-----------+
| Group | Member | Host | Role | State | TL | Lag in MB |
+-------+---------+-------------+----------------+---------+----+-----------+
| 0 | coord1 | 172.27.0.10 | Replica | running | 1 | 0 |
| 0 | coord2 | 172.27.0.6 | Quorum Standby | running | 1 | 0 |
| 0 | coord3 | 172.27.0.4 | Leader | running | 1 | |
| 1 | work1-1 | 172.27.0.8 | Leader | running | 1 | |
| 1 | work1-2 | 172.27.0.2 | Quorum Standby | running | 1 | 0 |
| 2 | work2-1 | 172.27.0.5 | Leader | running | 2 | |
| 2 | work2-2 | 172.27.0.7 | Quorum Standby | running | 2 | 0 |
+-------+---------+-------------+----------------+---------+----+-----------+
And this is how it looks on the coordinator side:
# The worker primary notifies the coordinator that it is going to execute "pg_ctl stop".
2024-08-26 07:02:38,636 DEBUG: query(BEGIN, ())
2024-08-26 07:02:38,636 DEBUG: query(SELECT pg_catalog.citus_update_node(%s, %s, %s, true, %s), (3, '172.19.0.7-demoted', 5432, 10000))
# From this moment all application traffic on the coordinator to the worker group 2 is paused.
# The old worker primary is assigned as a secondary.
2024-08-26 07:02:40,084 DEBUG: query(SELECT pg_catalog.citus_update_node(%s, %s, %s, true, %s), (7, '172.19.0.7', 5432, 10000))
# The future worker primary notifies the coordinator that it acquired the leader lock in DCS and is about to run "pg_ctl promote".
2024-08-26 07:02:40,085 DEBUG: query(SELECT pg_catalog.citus_update_node(%s, %s, %s, true, %s), (3, '172.19.0.5', 5432, 10000))
# The new worker primary just finished promote and notifies the coordinator that it is ready to accept read-write traffic.
2024-08-26 07:02:41,485 DEBUG: query(COMMIT, ())
# From this moment the application traffic on the coordinator to the worker group 2 is unblocked.
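The sequence in the log above can be summarised as one transaction on the coordinator. The following Python sketch only rebuilds the list of statements and parameters seen in that log; it does not connect to a database and is not Patroni's implementation. The function name, the argument names and the default values (5432 and 10000, copied from the log) are assumptions made for illustration.

```python
from typing import List, Tuple

# Same statement text as in the coordinator log above.
UPDATE_NODE = "SELECT pg_catalog.citus_update_node(%s, %s, %s, true, %s)"


def switchover_statements(
    primary_nodeid: int,
    secondary_nodeid: int,
    old_host: str,
    new_host: str,
    port: int = 5432,
    cooldown: int = 10000,
) -> List[Tuple[str, tuple]]:
    """Return (sql, params) pairs in the order they appear in the log."""
    return [
        # Traffic to the worker group is paused from here on.
        ("BEGIN", ()),
        (UPDATE_NODE, (primary_nodeid, old_host + "-demoted", port, cooldown)),
        # The old primary is registered under the secondary's node id.
        (UPDATE_NODE, (secondary_nodeid, old_host, port, cooldown)),
        # The promoted node takes over the primary's node id.
        (UPDATE_NODE, (primary_nodeid, new_host, port, cooldown)),
        # Traffic is unblocked once the transaction commits.
        ("COMMIT", ()),
    ]


if __name__ == "__main__":
    for sql, params in switchover_statements(3, 7, "172.19.0.7", "172.19.0.5"):
        print(sql, params)
```

Keeping everything between BEGIN and COMMIT is what holds the application traffic on the coordinator while the worker is switched over.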
Secondary nodes
Starting from Patroni v4.0.0, Citus secondary nodes without the noloadbalance tag are also registered in pg_dist_node.
However, to use secondary nodes for read-only queries, applications need to change the citus.use_secondary_nodes GUC.
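One way an application might switch the GUC is to pass it in the connection options, so that it is in effect from the start of the session. The sketch below just builds a libpq-style connection string; the function name and the host and database values are placeholders, and the value always is an assumption based on the Citus configuration reference, so please check it against the Citus version you run.

```python
def read_only_dsn(host: str, dbname: str, port: int = 5432) -> str:
    """Build a libpq connection string that asks Citus to route reads to secondaries.

    Hypothetical helper; the GUC value "always" is an assumption, see the text above.
    """
    options = "-c citus.use_secondary_nodes=always"
    return f"host={host} port={port} dbname={dbname} options='{options}'"


if __name__ == "__main__":
    # Placeholder host name; use the address of a coordinator node here.
    print(read_only_dsn("coordinator.example", "citus"))
```

The resulting string can be handed to any libpq-based client, for example psql or a PostgreSQL driver that accepts libpq connection strings.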
Peek into DCS
The Citus cluster (coordinator and workers) is stored in DCS as a fleet of Patroni clusters logically grouped together:
/service/batman/ # scope=batman
/service/batman/0/ # citus.group=0, coordinator
/service/batman/0/initialize
/service/batman/0/leader
/service/batman/0/members/
/service/batman/0/members/m1
/service/batman/0/members/m2
/service/batman/1/ # citus.group=1, worker
/service/batman/1/initialize
/service/batman/1/leader
/service/batman/1/members/
/service/batman/1/members/m3
/service/batman/1/members/m4
...
Such an approach was chosen because for most DCS it becomes possible to fetch the entire Citus cluster with a single recursive read request. Only Citus coordinator nodes are reading the whole tree, because they have to discover worker nodes. Worker nodes are reading only the subtree for their own group and in some cases they could read the subtree of the coordinator group.
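The layout above can be expressed as a small key-building helper. This is a minimal sketch of the path scheme only, assuming the /service namespace from the listing; the function names are hypothetical and no DCS client is involved.

```python
def scope_prefix(scope: str, namespace: str = "/service") -> str:
    """Prefix of the whole Citus formation; this is what a coordinator reads recursively."""
    return f"{namespace.rstrip('/')}/{scope}/"


def group_prefix(scope: str, group: int, namespace: str = "/service") -> str:
    """Prefix of a single Citus group; this is the subtree a worker reads."""
    return f"{scope_prefix(scope, namespace)}{group}/"


def leader_key(scope: str, group: int, namespace: str = "/service") -> str:
    return group_prefix(scope, group, namespace) + "leader"


def member_key(scope: str, group: int, member: str, namespace: str = "/service") -> str:
    return f"{group_prefix(scope, group, namespace)}members/{member}"


if __name__ == "__main__":
    print(scope_prefix("batman"))
    print(leader_key("batman", 0))
    print(member_key("batman", 1, "m3"))
```

Because every group lives under the scope prefix, one recursive read of that prefix returns the coordinator and all worker groups at once.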
Citus on Kubernetes
Since Kubernetes doesn’t support hierarchical structures, we had to include the Citus group in the names of all K8s objects Patroni creates:
batman-0-leader # the leader config map for the coordinator
batman-0-config # the config map holding initialize, config, and history "keys"
...
batman-1-leader # the leader config map for worker group 1
batman-1-config
...
I.e., the naming pattern is: ${scope}-${citus.group}-${type}.
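As a quick illustration, the naming pattern can be written as a one-line Python function. The function name is hypothetical; the object types used below are the ones shown in the listing above.

```python
def k8s_object_name(scope: str, group: int, object_type: str) -> str:
    """Compose an object name following ${scope}-${citus.group}-${type}."""
    return f"{scope}-{group}-{object_type}"


if __name__ == "__main__":
    print(k8s_object_name("batman", 0, "leader"))
    print(k8s_object_name("batman", 1, "config"))
```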
All Kubernetes objects are discovered by Patroni using the label selector, therefore all Pods running Patroni with Citus, as well as Endpoints/ConfigMaps, must have similar labels, and Patroni must be configured to use them using Kubernetes settings or environment variables.
A couple of examples of Patroni configuration using Pods environment variables:
- for the coordinator cluster
apiVersion: v1
kind: Pod
metadata:
  labels:
    application: patroni
    citus-group: "0"
    citus-type: coordinator
    cluster-name: citusdemo
  name: citusdemo-0-0
  namespace: default
spec:
  containers:
  - env:
    - name: PATRONI_SCOPE
      value: citusdemo
    - name: PATRONI_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: PATRONI_KUBERNETES_POD_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    - name: PATRONI_KUBERNETES_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: PATRONI_KUBERNETES_LABELS
      value: '{application: patroni}'
    - name: PATRONI_CITUS_DATABASE
      value: citus
    - name: PATRONI_CITUS_GROUP
      value: "0"
- for the worker cluster from the group 2
apiVersion: v1
kind: Pod
metadata:
  labels:
    application: patroni
    citus-group: "2"
    citus-type: worker
    cluster-name: citusdemo
  name: citusdemo-2-0
  namespace: default
spec:
  containers:
  - env:
    - name: PATRONI_SCOPE
      value: citusdemo
    - name: PATRONI_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: PATRONI_KUBERNETES_POD_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    - name: PATRONI_KUBERNETES_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: PATRONI_KUBERNETES_LABELS
      value: '{application: patroni}'
    - name: PATRONI_CITUS_DATABASE
      value: citus
    - name: PATRONI_CITUS_GROUP
      value: "2"
As you may have noticed, both examples have the citus-group label set. This label
allows Patroni to identify an object as belonging to a certain Citus group. In
addition to that, there is also the PATRONI_CITUS_GROUP environment variable,
which has the same value as the citus-group label. When Patroni creates
new Kubernetes objects (ConfigMaps or Endpoints), it automatically puts the
citus-group: ${env.PATRONI_CITUS_GROUP} label on them:
apiVersion: v1
kind: ConfigMap
metadata:
  name: citusdemo-0-leader  # Is generated as ${env.PATRONI_SCOPE}-${env.PATRONI_CITUS_GROUP}-leader
  labels:
    application: patroni  # Is set from the ${env.PATRONI_KUBERNETES_LABELS}
    cluster-name: citusdemo  # Is automatically set from the ${env.PATRONI_SCOPE}
    citus-group: '0'  # Is automatically set from the ${env.PATRONI_CITUS_GROUP}
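The label set in the ConfigMap above can be reproduced with a short Python sketch. It is an illustration under stated assumptions rather than Patroni's code: the function names are hypothetical, the label keys cluster-name and citus-group are the ones shown in the example, and the selector string uses the ordinary Kubernetes equality-based syntax; the exact selector Patroni sends is not stated in this document.

```python
def object_labels(scope: str, group: int, kubernetes_labels: dict) -> dict:
    """Combine the configured labels with the scope and Citus group labels."""
    labels = dict(kubernetes_labels)   # from PATRONI_KUBERNETES_LABELS
    labels["cluster-name"] = scope     # from PATRONI_SCOPE
    labels["citus-group"] = str(group) # from PATRONI_CITUS_GROUP
    return labels


def label_selector(labels: dict) -> str:
    """Render labels as an equality-based Kubernetes label selector."""
    return ",".join(f"{key}={value}" for key, value in sorted(labels.items()))


if __name__ == "__main__":
    labels = object_labels("citusdemo", 0, {"application": "patroni"})
    print(labels)
    print(label_selector(labels))
```

Note that the group is stored as a string, because Kubernetes label values are always strings; this is why the YAML examples quote it.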
You can find a complete example of Patroni deployment on Kubernetes with Citus support in the kubernetes folder of the Patroni repository.
There are two important files for you:
- Dockerfile.citus
- citus_k8s.yaml
Citus upgrades and PostgreSQL major upgrades
First, please read about upgrading the Citus version in the documentation.
There is one minor change in the process. When executing the upgrade, you have to
use patronictl restart instead of systemctl restart to restart PostgreSQL.
The PostgreSQL major upgrade with Citus is a bit more complex. You will have to combine techniques used in the Citus documentation about major upgrades and the Patroni documentation about PostgreSQL major upgrades. Please keep in mind that a Citus cluster consists of many Patroni clusters (coordinator and workers) and they all have to be upgraded independently.