Step 1 - Make sure the replication configuration is setup correctly.
Step 2 – Install repmgr on primary and standby servers.
Make sure we install the same version of repmgr which is of our postgreSQL version. Since I am running postgreSQL version 13, I will be installing repmgr-13.
Note - We have to install it as a root user.
On Primary Server -
#yum -y install repmgr13*
On Standby Server -
#yum -y install repmgr13*
Step 3 – Configure postgresql.conf on Master and Standby nodes.
Since we have already configured replication, most of the parameters are already set. But just check on the below parameters and set whichever is missing.
shared_preload_libraries = 'repmgr'
max_wal_senders = 5
max_replication_slots = 10
wal_level = replica
hot_standby = on
archive_mode = on
archive_command = 'cp %p /home/postgres/arch_dir/% - We can have the different archive locations.
Listen_address=’*’
Step 4 – Create a replication manager user as repmgr.
Psql$ create user repmgr;
Psql$ create database repmgr owner repmgr;
Psql$alter user repmgr with superuser;
Step 5 – Create repmgr.conf on Primary node.Now we have to create repmgr.conf on Primary node. We can create it on any location but I have created under data directory only. My data directory is -
Create repmgr.conf with below parameters -
[postgres@Master master]$ cat repmgr.conf
cluster='failovertesting'
node_id=1
node_name=Master
conninfo='user=repmgr password=welcome host=192.168.204.133 dbname=repmgr port=5448 connect_timeout=2'
data_directory='/home/postgres/master/'
failover=automatic
promote_command='/usr/pgsql-13/bin/repmgr standby promote -f /home/postgres/master/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-13/bin/repmgr standby follow -f /home/postgres/master/repmgr.conf --log-to-file --upstream-node-id=%n'
Parameter understanding –
failover='automatic'
The failover parameter is one of the mandatory parameters for the repmgr daemon. This parameter tells the daemon if it should initiate an automatic failover when a failover situation is detected. It can have either of two values: “manual” or “automatic”. We will set this to automatic in each node.
promote_command
promote_command= '/usr/pgsql-13/bin/repmgr standby promote -f /home/postgres/master/repmgr.conf --log-to-file'
This is another mandatory parameter for the repmgr daemon. This parameter tells the repmgr daemon what command it should run to promote a standby. The value of this parameter will be typically the “repmgr standby promote” command, or the path to a shell script that calls the command. For our use case, we set this to the following in each node:
follow_command
follow_command= '/usr/pgsql-13/bin/repmgr standby follow -f /home/postgres/master/repmgr.conf --log-to-file --upstream-node-id=%n'
This is the third mandatory parameter for the repmgr daemon. This parameter tells a standby node to follow the new primary. The repmgr daemon replaces the %n placeholder with the node ID of the new primary at run time.
priority
The priority parameter adds weight to a node’s eligibility to become a primary. Setting this parameter to a higher value gives a node greater eligibility to become the primary node. Also, setting this value to zero for a node will ensure the node is never promoted as primary.
Suppose we have two standbys: replica1 and replica2 and we want to promote replica2 as the new primary when master node goes offline, and replica2 to follow replica1 as its new primary. We set the parameter to the following values in the two standby nodes:
Node Name Parameter Setting
replica1 priority=60
replica2 priority=40
monitor_interval_secs
This parameter tells the repmgr daemon how often (default is seconds) it should check the availability of the upstream node. In our case, there is only one upstream node i.e the master node. The default value is 2 seconds, but we will explicitly set this anyhow in each node like below
monitor_interval_secs=2
connection_check_type
This parameter dictates the protocol repmgr daemon will use to reach out to the upstream node. This parameter can take three values:
ping: repmgr uses the PQPing() method.
connection: repmgr tries to create a new connection to the upstream node
query: repmgr tries to run a SQL query on the upstream node using the existing connection
Again, we will set this parameter to the default value of ping in each node:
connection_check_type='ping'
reconnect_attempts and reconnect_interval
When the primary becomes unavailable, the repmgr daemon in the standby nodes will try to reconnect to the primary for reconnect_attempts times. The default value for this parameter is 6. Between each reconnect attempt, it will wait for reconnect_interval seconds, which has a default value of 10.
For demonstration purposes, we will use a short interval and fewer reconnect attempts. We set this parameter in every node:
reconnect_attempts=4
reconnect_interval=8
primary_visibility_consensus
When the primary becomes unavailable in a multi-node cluster, the standbys can consult each other to build a quorum about a failover. This is done by asking each standby about the time it last saw the primary. If a node’s last communication was very recent and later than the time the local node saw the primary, the local node assumes the primary is still available, and does not go ahead with a failover decision.
To enable this consensus model, the primary_visibility_consensus parameter needs to be set to “true” in each node – including the witness:
primary_visibility_consensus=true
standby_disconnect_on_failover
When the standby_disconnect_on_failover parameter is set to “true” in a standby node, the repmgr daemon will ensure its WAL receiver is disconnected from the primary and not receiving any WAL segments. It will also wait for the WAL receivers of other standby nodes to stop before making a failover decision. This parameter should be set to the same value in each node. We are setting this to “true”.
standby_disconnect_on_failover=true
Setting this parameter to true means every standby node has stopped receiving data from the primary as the failover happens. The process will have a delay of 5 seconds plus the time it takes the WAL receiver to stop before a failover decision is made. By default, the repmgr daemon will wait for 30 seconds to confirm all sibling nodes have stopped receiving WAL segments before the failover happens.
Step 6 – Register Primary database cluster in the repmgr configuration
Since we have created the repmgr.conf file with the minimum required parameters. Now we have to register our primary database cluster in repmgr configuration.
[postgres@Master master]$ /usr/pgsql-13/bin/repmgr -f /home/postgres/master/repmgr.conf primary register
We can see primary node is registered successfully using below command –
[postgres@Master master]$ /usr/pgsql-13/bin/repmgr -f /home/postgres/master/repmgr.conf cluster show
Step 7 – Create repmgr.conf file on standby node.
We can create it in any location but I have created in the data directory only. My data directory is –
[postgres@Replica1 replica1]$ cat repmgr.conf
cluster='failovertesting'
node_id=2
node_name=Replica1
conninfo='host=192.168.204.134 user=repmgr password=welcome dbname=repmgr port=5448 connect_timeout=2'
data_directory='/home/postgres/replica1/'
failover=automatic
promote_command='/usr/pgsql-13/bin/repmgr standby promote -f /home/postgres/replica1/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-13/bin/repmgr standby follow -f /home/postgres/replica1/repmgr.conf --log-to-file --upstream-node-id=%n'
Step 8 – Register Standby Database ClusterNow we have to register our standby database cluster in the repmgr configuration.
[postgres@Replica1 replica1]$ /usr/pgsql-13/bin/repmgr -f /home/postgres/replica1/repmgr.conf standby register
Getting below error –
Then I had to force register it like below –
[postgres@Replica1 replica1]$ /usr/pgsql-13/bin/repmgr -f /home/postgres/replica1/repmgr.conf standby register –force
But when I was showing cluster status on primary it was giving below warning –
[postgres@Master ~]$ /usr/pgsql-13/bin/repmgr -f /home/postgres/master/repmgr.conf cluster show
So I had to run below command on standby database.
[postgres@Replica1 ~]$ /usr/pgsql-13/bin/repmgr -f /home/postgres/replica1/repmgr.conf standby follow
After than when I checked the cluster status on primary it is running without any error –
I checked both the nodes and it is running fine –
/usr/pgsql-13/bin/repmgr -f /home/postgres/master/repmgr.conf node check
$/usr/pgsql-13/bin/repmgr –f /home/postgres/master/repmgr.conf cluster crosscheck
Step 9 – Start repmgrd process on master and standby
[postgres@Master ~]$ /usr/pgsql-13/bin/repmgrd -f /home/postgres/master/repmgr.conf
[postgres@Replica1 ~]$ /usr/pgsql-13/bin/repmgrd -f /home/postgres/replica1/repmgr.conf
We can see the series of events using cluster events commands –
When it starts monitoring then we will see the below details –
Repmgr Command Options –
====================The End=====