Selasa, 23 Juli 2024

POSTGRES-promote ha

 

PostgreSQL pg_rewind

https://www.youtube.com/watch?v=FRmEEaCoyXs
pg_rewind makes it possible to efficiently bring an old primary in sync with a new primary without having to perform a full base backup. This works by looking in the Write Ahead Log to see which pages have been modified, and only copying across those pages.
In this example, we have a primary (running on port 5530) and a standby subscribing to it (on port 5531):
 # SELECT * FROM pg_stat_replication;
 -[ RECORD 1 ]----+------------------------------
 pid              | 11609
 usesysid         | 16384
 usename          | rep_user
 application_name | standby1
 client_addr      | 127.0.0.1
 client_hostname  | 
 client_port      | 38434
 backend_start    | 2015-03-29 00:11:55.243319+00
 backend_xmin     | 
 state            | streaming
 sent_location    | 0/C81BB40
 write_location   | 0/C81BB40
 flush_location   | 0/C81BB40
 replay_location  | 0/C81BB40
 sync_priority    | 0
 sync_state       | async
Now we'll promote the standby:
 $ pg_ctl promote -D standby1
 server promoting
And we'll make some changes on this instance:
 $ psql -p 5531 postgres
 
 # CREATE TABLE x (content text);
 CREATE TABLE
 
 # INSERT INTO x SELECT 'test' FROM generate_series(1,1000);
 INSERT 0 1000
Now we'll stop old primary and use pg_rewind to re-synchronise it:
 $ pg_ctl stop -D primary
 waiting for server to shut down.... done
 server stopped
 
 $ pg_rewind -D primary --source-server='host=localhost port=5531' -P
 connected to remote server
 The servers diverged at WAL position 0/C81BB40 on timeline 1.
 Rewinding from last common checkpoint at 0/2000060 on timeline 1
 reading source file list
 reading target file list
 reading WAL in target
 Need to copy 274 MB (total source directory size is 290 MB)
 281142/281142 kB (100%) copied
 creating backup label and updating control file
 Done!
And we'll make some changes to get it to subscribe to the new primary:
 $ cd primary
 
 $ mv recovery.{done,conf}
 
 $ vi recovery.conf   # edited to set host info to point to port 5531 in this case
 
 $ vi postgresql.conf # as our example instances are running on the same server, we'll just change the port so it doesn't conflict
Then start the new standby (old primary):
 $ pg_ctl start -D primary
Let's see if it's successfully caught up:
 $ psql -p 5531 postgres  # connect to the new primary
 # SELECT * FROM pg_stat_replication;
 -[ RECORD 1 ]----+------------------------------
 pid              | 11837
 usesysid         | 16384
 usename          | rep_user
 application_name | standby1
 client_addr      | 127.0.0.1
 client_hostname  | 
 client_port      | 49148
 backend_start    | 2015-03-29 00:22:39.047657+00
 backend_xmin     | 
 state            | streaming
 sent_location    | 0/C8559B0
 write_location   | 0/C8559B0
 flush_location   | 0/C8559B0
 replay_location  | 0/C855978
 sync_priority    | 0
 sync_state       | async
And see if the test data from the new primary is on the new standby:
 $ psql -p 5530 postgres  # connect to the new standby
 
 # SELECT COUNT(*) FROM x;
  count 
 -------
   1000
 (1 row)
All synchronised.

PostgreSQL 9.5 – ADD PG_REWIND, FOR RE-SYNCHRONIZING A MASTER SERVER AFTER FAILBACK
d pg_rewind, for re-synchronizing a master server after failback.
 
Earlier versions of this tool were available (and still are) on github
So, we have a situation, where we have master and slave, some kind of replication, and we have to failover. This is trivial – just promote slave to standalone, and we're good. But we don't have slave then. Normal procedure is to setup slave on old master from scratch, using full data dir sync. But with non-trivial datasizes, it can take long time.
pg_rewind should solve the problem by just working on xlogs.
Let's see if it will really work that way. For added difficulty I will try to make it so that old master still did *some* work after the failover – this work should be “rolled back" after it will be set to slave from newly elected master.
To test it we will need two PostgreSQL's, but first I'll start with just master:
=$ initdb --data-checksums -D /var/tmp/master
=$ perl -pi -e 's/^#port\s*=.*/port = 5435/' /var/tmp/master/postgresql.conf
=$ pg_ctl -D /var/tmp/master/ -l /var/tmp/master.log start
With this in place, I have master running, on port 5435. It's very under-configured, but I'm mostly fine with it, just need some small changes so:
=$ psql -p 5435 -d postgres
$ alter system set archive_command = 'pigz -c %p > /var/tmp/archive/%f.gz';
$ alter system set archive_mode = on;
$ alter system set wal_level = hot_standby;
$ alter system set max_wal_senders = 10;
$ alter system set min_wal_size = '32MB';
$ alter system set max_wal_size = '32MB';
$ alter system set hot_standby = on;
$ alter system set wal_log_hints = on;
$ create user replication with replication;
I set the wal size so small, so that it will be simple to show how the situation works when all necessary wal files are no longer in pg_xlog/ directory.
With this in place, I just need to allow replication connections:
=$ echo "local replication replication trust" >> /var/tmp/master/pg_hba.conf
And of course, restart Pg to make all the setting come live:
=$ pg_ctl -D /var/tmp/master -m fast restart
Now, I need some test data to work on, and show that rewinding history works. So, I'll make a simple table:
$ create table test as select i, repeat('depesz'::text, 100) as payload, now() as last_updated from generate_series(1, 1000000) i;
$ create index what_updated on test (last_updated);
$ alter table test add primary key (i);
This generates this, very simple, table:
$ \d test
                 Table "public.test"
    Column    |           Type           | Modifiers
--------------+--------------------------+-----------
 i            | integer                  | not null
 payload      | text                     |
 last_updated | timestamp with time zone |
Indexes:
    "test_pkey" PRIMARY KEY, btree (i)
    "what_updated" btree (last_updated)
Where id is integer 1..1000000, payload is just long string to make the table and updates heavier, and last_updated is timestamptz that we will update.
To update the fields I will use:
$ update test
    set last_updated = clock_timestamp(), payload = repeat(clock_timestamp()::text, 100)
    where i = ( select ( (random() * 100000 )::int8 * 2 + current_setting('port')::int8 ) % 1000000 + 1 );
UPDATE 1
The where clause is complicated, but I wanted to update semi-random row, but make it so that the same script ran on slave will update different rows. I will put slave on port 5436, so master will always update even rows (with i being divisible by 2), and slave will be updating odd rows.
I repeat such update 10 times in a transaction, and as soon as one transaction is done, I start another.
After a minute or so, I can see:
$ select i, last_updated from test order by last_updated desc limit 10;
   i    |         last_updated
--------+-------------------------------
  45232 | 2015-04-04 13:43:50.272215+02
  57178 | 2015-04-04 13:43:50.272022+02
 136892 | 2015-04-04 13:43:50.271823+02
 116754 | 2015-04-04 13:43:50.271641+02
 177292 | 2015-04-04 13:43:50.271449+02
 131486 | 2015-04-04 13:43:50.271247+02
 140360 | 2015-04-04 13:43:50.270976+02
 105724 | 2015-04-04 13:43:50.270737+02
 130434 | 2015-04-04 13:43:50.270493+02
 127542 | 2015-04-04 13:43:50.270251+02
(10 rows)
Now, with this in place, I can setup slave (the updater process still goes on):
=$ pg_basebackup -D /var/tmp/slave -R -c fast -p 5435 -U replication
To have slave working, I will need to change port, and make sure it knows how to recover xlogs from archive, so:
=$ perl -pi -e 's/port = 5435/port = 5436/' /var/tmp/slave/postgresql.conf
=$ echo "restore_command = 'gzip -dc /var/tmp/archive/%f.gz > %p'" >> /var/tmp/slave/recovery.conf
With this done, I start slave:
=$ pg_ctl -D /var/tmp/slave/ -l /var/tmp/slave.log start
and wait for slave to catchup with replication. For this I used this query:
with x as (select pg_current_xlog_location() as cx)
select
    now()::timestamptz(0),
    pg_xlog_location_diff( cx, sent_location ) as sent_lag,
    pg_xlog_location_diff( cx, write_location ) as write_lag,
    pg_xlog_location_diff( cx, flush_location ) as flush_lag,
    pg_xlog_location_diff( cx, replay_location ) as replay_lag
from
    pg_stat_replication,
    x;
Repeating it every 2 seconds, I got:
=$ while true; do psql -p 5435 -d postgres -f show_lag.sql ; sleep 2; done | awk 'NR==1 || /[0-9]$/'
          now           | sent_lag | write_lag | flush_lag | replay_lag
 2015-04-04 13:45:52+02 |        0 |    188032 |    188032 |     219808
 2015-04-04 13:45:54+02 |        0 |         0 |     82840 |      82840
 2015-04-04 13:45:56+02 |        0 |         0 |    123192 |     123192
 2015-04-04 13:45:58+02 |        0 |         0 |     86664 |      86664
 2015-04-04 13:46:00+02 |        0 |         0 |     96008 |      96008
 2015-04-04 13:46:02+02 |        0 |         0 |         0 |      65976
 2015-04-04 13:46:04+02 |        0 |         0 |         0 |      79056
 2015-04-04 13:46:06+02 |   101672 |    101672 |    101712 |     223456
 2015-04-04 13:46:08+02 |        0 |         0 |     76320 |      76320
I see that replication is working steadily (the sent/write lags happen every so often, but are immediately cleared). So we're good.
Now. for the actual pg_rewind test. I will promote slave to standalone/master:
=$ pg_ctl -D /var/tmp/slave/ promote
After it got promoted (the update script on old master is still running), we are starting to see differences:
=$ psql -p 5435 -d postgres -c 'select i, last_updated from test order by 2 desc limit 3'
   i    |         last_updated
--------+-------------------------------
 162358 | 2015-04-04 13:46:39.827506+02
  14854 | 2015-04-04 13:46:39.827326+02
  81126 | 2015-04-04 13:46:39.827058+02
(3 rows)
 
 
=$ psql -p 5436 -d postgres -c 'select i, last_updated from test order by 2 desc limit 3'
   i    |         last_updated
--------+-------------------------------
 204796 | 2015-04-04 13:46:27.894727+02
  96450 | 2015-04-04 13:46:27.894492+02
  21950 | 2015-04-04 13:46:27.894219+02
(3 rows)
Old master is still updating, but new master stopped updated when the promotion happened. So now, I'll stop the updating script, and rerun it on new master…
Now, we're getting newer values from 5436, but also please note that newest even rows are older than on old master:
=$ psql -p 5435 -d postgres -c 'select i, last_updated from test order by 2 desc limit 3'
   i    |         last_updated
--------+-------------------------------
 106572 | 2015-04-04 13:47:21.519875+02
  16160 | 2015-04-04 13:47:21.519372+02
  13140 | 2015-04-04 13:47:21.51886+02
(3 rows)
 
 
=$ psql -p 5436 -d postgres -c 'select i, last_updated from test order by 2 desc limit 3'
   i    |         last_updated
--------+-------------------------------
 18331 | 2015-04-04 13:48:02.355589+02
  8521 | 2015-04-04 13:48:02.355299+02
 18015 | 2015-04-04 13:48:02.354986+02
(3 rows)
 
 
=$ psql -p 5436 -d postgres -c 'select i, last_updated from test where 0 = i % 2 order by 2 desc limit 3'
   i    |         last_updated
--------+-------------------------------
 204796 | 2015-04-04 13:46:27.894727+02
  96450 | 2015-04-04 13:46:27.894492+02
  21950 | 2015-04-04 13:46:27.894219+02
(3 rows)
And now it's time to bring old master to become slave.
First, I need to stop old master:
=$ pg_ctl -D /var/tmp/master/ -m fast stop
Now I run pg_rewind:
=$ pg_rewind -D /var/tmp/master --source-server="port=5436 user=pgdba dbname=postgres"
The servers diverged at WAL position 0/9BC9E268 on timeline 1.
could not open file "/var/tmp/master/pg_xlog/00000001000000000000009B": No such file or directory
 
could not find previous WAL record at 0/9BC9E268
Failure, exiting
This is OK. I purposely made master keep as little xlogs as possible, so now I need to copy some back. Quick inspection shows:
=$ ls -l /var/tmp/master/pg_xlog/
total 32780
-rw------- 1 pgdba pgdba      307 Apr  4 13:44 000000010000000000000077.000090D8.backup
-rw------- 1 pgdba pgdba 16777216 Apr  4 13:51 0000000100000000000000B5
-rw------- 1 pgdba pgdba 16777216 Apr  4 13:51 0000000100000000000000B6
-rw------- 1 pgdba pgdba       42 Apr  4 13:49 00000002.history
drwx------ 2 pgdba pgdba     4096 Apr  4 13:51 archive_status/
So I'll just copy all xlogs from 9B to B5:
=$ cp /var/tmp/archive/00000001000000000000009[BCDEF].gz /var/tmp/master/pg_xlog/
=$ cp /var/tmp/archive/0000000100000000000000A?.gz /var/tmp/master/pg_xlog/
=$ cp /var/tmp/archive/0000000100000000000000B[01234].gz /var/tmp/master/pg_xlog/
=$ gzip -d /var/tmp/master/pg_xlog/*.gz
With this in place, I rerun pg_rewind:
=$ pg_rewind -D /var/tmp/master --source-server="port=5436 user=pgdba dbname=postgres"
The servers diverged at WAL position 0/9BC9E268 on timeline 1.
Rewinding from last common checkpoint at 0/9B8B20D0 on timeline 1
Done!
This took ~ 1 second. Now – after running pg_rewind, state of /var/tmp/master is like if I had run pg_basebackup from “slave" (current master), including bad port (5436), and no sensible recovery.conf. Let's fix that:
=$ perl -pi -e 's/5436/5435/' /var/tmp/master/postgresql.conf
=$ cat /var/tmp/master/recovery.done  | sed 's/5435/5436/' > /var/tmp/master/recovery.conf
After this was done, I should be able to start new slave:
=$ pg_ctl -D /var/tmp/master -l /var/tmp/master.log start
After couple of seconds I can see that old master, current slave, is replicating – first from wal archive:
=$ ps uxfww | grep -E '( [-]D |[0]00000)'
pgdba      603  0.0  0.0 180328 13416 pts/4    S    13:59   0:00 /home/pgdba/work/bin/postgres -D /var/tmp/master
pgdba      609 15.3  0.8 180416 135984 ?       Ss   13:59   0:01  \_ postgres: startup process   waiting for 0000000200000000000000F9
pgdba      900  0.0  0.0   4440   656 ?        S    13:59   0:00  |   \_ sh -c gzip -dc /var/tmp/archive/0000000200000000000000F9.gz > pg_xlog/RECOVERYXLOG
pgdba      901  7.0  0.0   4728   616 ?        D    13:59   0:00  |       \_ gzip -dc /var/tmp/archive/0000000200000000000000F9.gz
pgdba    24111  0.0  0.0 180324 13468 pts/4    S    13:44   0:00 /home/pgdba/work/bin/postgres -D /var/tmp/slave
pgdba    25106  0.0  0.0  35300   972 ?        Ss   13:46   0:00  \_ postgres: archiver process   last was 0000000200000001000000CB
Couple more minutes, and streaming kicks-in:
=$ ps uxfww | grep -E '( [-]D |[0]00000|[s]tream)'
pgdba      603  0.0  0.0 180328 13416 pts/4    S    13:59   0:00 /home/pgdba/work/bin/postgres -D /var/tmp/master
pgdba      609  4.2  0.8 180416 136396 ?       Ss   13:59   0:08  \_ postgres: startup process   recovering 0000000200000001000000E9
pgdba     2000  1.2  0.0 191396  2208 ?        Ss   14:00   0:01  \_ postgres: wal receiver process   streaming 1/E9EFAF08
pgdba    24111  0.0  0.0 180324 13460 pts/4    S    13:44   0:00 /home/pgdba/work/bin/postgres -D /var/tmp/slave
pgdba    25106  0.0  0.0  35300   972 ?        Ss   13:46   0:00  \_ postgres: archiver process   last was 0000000200000001000000E8
pgdba     2001  0.7  0.0 180984  2168 ?        Ss   14:00   0:00  \_ postgres: wal sender process replication [local] streaming 1/E9EFAF08
Of course, on both server, the “newest" rows will be odd – updated on slave:
=$ psql -p 5435 -d postgres -c 'select i, last_updated from test order by 2 desc limit 3'
   i    |         last_updated
--------+-------------------------------
  42785 | 2015-04-04 14:09:14.065943+02
 190257 | 2015-04-04 14:09:14.065771+02
 145031 | 2015-04-04 14:09:14.065536+02
(3 rows)
 
=$ psql -p 5436 -d postgres -c 'select i, last_updated from test order by 2 desc limit 3'
   i    |         last_updated
--------+-------------------------------
 192215 | 2015-04-04 14:09:14.520619+02
  64991 | 2015-04-04 14:09:14.52043+02
  69937 | 2015-04-04 14:09:14.520249+02
(3 rows)
but the interesting thing is – what will be the newest even rows now?
Before pg_rewind old master has newer values than new master. And now:
=$ psql -p 5435 -d postgres -c 'select i, last_updated from test where 0 = i % 2 order by 2 desc limit 3'
   i    |         last_updated
--------+-------------------------------
 204796 | 2015-04-04 13:46:27.894727+02
  96450 | 2015-04-04 13:46:27.894492+02
  21950 | 2015-04-04 13:46:27.894219+02
(3 rows)
 
=$ psql -p 5436 -d postgres -c 'select i, last_updated from test where 0 = i % 2 order by 2 desc limit 3'
   i    |         last_updated
--------+-------------------------------
 204796 | 2015-04-04 13:46:27.894727+02
  96450 | 2015-04-04 13:46:27.894492+02
  21950 | 2015-04-04 13:46:27.894219+02
(3 rows)
Same thing. So, true to its name, pg_rewind removed changes from old master that didn't make it to new master. And that's just great. Thanks to all involved, pg_rewind will clearly be very useful.

Tidak ada komentar:

Posting Komentar