Tuesday, 24 November 2015

Backups, Restores and PITR nowadays

It's been almost a year since 9.4 was released. It included many nice features, but the one that changed my day-to-day administration was replication slots.

At OmniTI we use something called OmniPITR a lot. OmniPITR could be called a WAL management suite: it can ship WALs, replay them, clean them up when they are no longer needed, and take backups. Usually I first set up streaming replication to a replica and then set up WAL shipping replication on top of it. The reason is simple: I want to have all the WALs around in case I lose the slave, and I never really liked the wal_keep_segments approach.

Backups are usually taken from a slave in order to reduce load on the master.

Briefly, OmniPITR does this by sending pg_start_backup/pg_stop_backup to the master, putting a pause-removal file on the slave so that WALs are kept, and making two tar.gz files, one for the base backup and one for the WALs.

Usually I have a backup server keeping all the WAL files and the backups, which means a second (direct or indirect) WAL shipping destination.
All this is nice and it works really well, but with replication slots it could be simpler and more efficient.

Recently, I had to set up a backup for a 1 master, 2 slave setup on 9.4.

The requirements of the backup would be the following:

  • backup will be taken from the slave.
  • backup will be incremental allowing PITR
  • backup will be stored remotely on another machine.
  • minimum possible wal shipping targets.
  • replicas should be identical and adding a new replica should be trivial.
  • backup should be initiated from the server that keeps backups and not from a slave.
  • backups should be tested by restoration every week.

Pretty straightforward stuff really.

I set up archiving to the backup server with a simple rsync command (which runs over SSH):
archive_command = 'rsync -a %p postgres@<Backup server IP>:/data/walarchive/%f'

I created a backup user that will authenticate through .pgpass:
touch ~/.pgpass ; chmod 0600 ~/.pgpass

and added the following entries:

<master ip>:5432:template1:backup:pgbackup
<slave1 ip>:5432:template1:backup:pgbackup
<slave2 ip>:5432:template1:backup:pgbackup
<slave-x ip>:5432:template1:backup:pgbackup

I allowed my backup user in pg_hba.conf (on all db servers)
and then I simply used pg_basebackup like this:

rm -rf /data/base/* && pg_basebackup -D /data/base/ -h <slave ip> -U backup -Ft -z && mv /data/base/base.tar.gz /data/backups/basebackup_`date +"%Y-%m-%d"`.tar.gz

I would like pg_basebackup to be able to customize the backup name and to use replication slots (coming in 9.5), but none of this is really a problem when it comes to backups.
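The one-liner above can be wrapped in a small script so that a failing step aborts the run instead of moving a half-written archive into place. This is only a sketch: the function names and the BASE_DIR / BACKUP_DIR variables are illustrative assumptions, and only the pg_basebackup options already shown above are used.

```shell
#!/bin/sh
# Sketch of a base backup wrapper. Paths mirror the ones used in this
# post; function and variable names are illustrative assumptions.
set -eu

BASE_DIR="${BASE_DIR:-/data/base}"
BACKUP_DIR="${BACKUP_DIR:-/data/backups}"

# Print the dated archive path for a given day (YYYY-MM-DD).
backup_name() {
    printf '%s/basebackup_%s.tar.gz\n' "$BACKUP_DIR" "$1"
}

# Take a compressed tar-format base backup from the given host and
# move it to its dated name. The password comes from ~/.pgpass.
take_backup() {
    rm -rf "${BASE_DIR:?}"/*
    pg_basebackup -D "$BASE_DIR" -h "$1" -U backup -Ft -z
    mv "$BASE_DIR/base.tar.gz" "$(backup_name "$(date +%Y-%m-%d)")"
}

# Only run when a host is passed on the command line.
if [ "$#" -ge 1 ]; then
    take_backup "$1"
fi
```

Because of set -e, the mv never runs if pg_basebackup fails, which is the same guarantee the && chain gives.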

I added a recovery.conf that looks like this:
restore_command = '<PATH/TO/pg_standby> -t /data/backups/failover.now -l /data/walarchive %f %p %r'
trigger_file = '/data/backups/failover.now'
recovery_end_command = 'rm /data/backups/failover.now'
recovery_target = 'immediate'

The parameter “recovery_target” specifies that recovery should end as soon as a consistent state is reached, i.e. as early as possible. When restoring from an online backup, this means the point where taking the backup ended.

NOTE that a recovery.conf file will already exist if the backup was taken from a slave. Always remember to edit it and replace its entries with the ones above.

Before you start the database, remove all logs from pg_log and verify that:

archive_command = '/bin/true'
synchronous_standby_names = ''

At this point all you have to do is start the database and monitor the log.
The log file of a freshly restored database should look like this :

postgres@backup:/data/backups/pg_log$ tail -f postgresql-2015-11-13_020746.log
2015-11-13 02:07:46 EET [] [2344]: [1-1] user=,db=,e=00000 LOG:  database system was interrupted; last known up at 2015-11-13 02:06:20 EET
2015-11-13 02:08:10 EET [] [2344]: [2-1] user=,db=,e=00000 LOG:  starting point-in-time recovery to earliest consistent point
2015-11-13 02:08:10 EET [] [2344]: [3-1] user=,db=,e=00000 LOG:  restored log file "00000002.history" from archive
2015-11-13 02:08:10 EET [] [2344]: [4-1] user=,db=,e=00000 LOG:  restored log file "000000020000000200000054" from archive
2015-11-13 02:08:10 EET [] [2344]: [5-1] user=,db=,e=00000 LOG:  redo starts at 2/54000028
2015-11-13 02:08:10 EET [] [2344]: [6-1] user=,db=,e=00000 LOG:  consistent recovery state reached at 2/540000F0
2015-11-13 02:08:10 EET [] [2344]: [7-1] user=,db=,e=00000 LOG:  recovery stopping after reaching consistency
2015-11-13 02:08:10 EET [] [2344]: [8-1] user=,db=,e=00000 LOG:  recovery has paused
2015-11-13 02:08:10 EET [] [2344]: [9-1] user=,db=,e=00000 HINT:  Execute pg_xlog_replay_resume() to continue.
2015-11-13 02:08:10 EET [] [2342]: [3-1] user=,db=,e=00000 LOG:  database system is ready to accept read only connections

At this point, run:

psql -c "select pg_xlog_replay_resume()" template1

You should see this in the log file:

2015-11-13 02:10:08 EET [] [2344]: [13-1] user=,db=,e=00000 LOG:  archive recovery complete
2015-11-13 02:10:09 EET [] [2344]: [14-1] user=,db=,e=00000 LOG:  MultiXact member wraparound protections are now enabled
2015-11-13 02:10:09 EET [] [2342]: [4-1] user=,db=,e=00000 LOG:  database system is ready to accept connections
2015-11-13 02:10:09 EET [] [2394]: [1-1] user=,db=,e=00000 LOG:  autovacuum launcher started

Now, assume that you want to perform PITR.

Follow exactly the same restore procedure as previously described, but this time the recovery.conf should look like this:

restore_command = '<PATH/TO/pg_standby> -t /data/backups/failover.now -l /data/walarchive %f %p %r'
#recovery_target_time = '2015-11-13 00:09:00'

# or
#recovery_target_xid = '1966'

trigger_file = '/data/backups/failover.now'
#recovery_target_inclusive = 'true'
recovery_end_command = 'rm /data/backups/failover.now'

recovery_target_time: This parameter specifies the time stamp up to which recovery will proceed.

recovery_target_xid: This parameter specifies the transaction ID up to which recovery will proceed. Keep in mind that while transaction IDs are assigned sequentially at transaction start, transactions can complete in a different numeric order. The transactions that will be recovered are those that committed before (and optionally including) the specified one. The precise stopping point is also influenced by recovery_target_inclusive.

recovery_target_inclusive: Specifies whether to stop just after the specified recovery target (true), or just before the recovery target (false). Applies when either recovery_target_time or recovery_target_xid is specified. This setting controls whether transactions having exactly the target commit time or ID, respectively, will be included in the recovery. Default is true.

The rest of this procedure is identical to what was previously described.
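If you restore to a point in time often, the PITR recovery.conf above can be generated instead of edited by hand. A minimal sketch, assuming a hypothetical helper called write_pitr_conf and an optional PG_STANDBY variable holding the path to pg_standby; the settings themselves are the ones listed above:

```shell
#!/bin/sh
# Sketch: write a PITR recovery.conf for a chosen target time.
# write_pitr_conf and PG_STANDBY are illustrative names.
set -eu

# $1 = recovery target time, $2 = file to write
write_pitr_conf() {
    cat > "$2" <<EOF
restore_command = '${PG_STANDBY:-pg_standby} -t /data/backups/failover.now -l /data/walarchive %f %p %r'
recovery_target_time = '$1'
recovery_target_inclusive = 'true'
trigger_file = '/data/backups/failover.now'
recovery_end_command = 'rm /data/backups/failover.now'
EOF
}

# Only run when both arguments are given on the command line.
if [ "$#" -eq 2 ]; then
    write_pitr_conf "$1" "$2"
fi
```

For example, calling it with '2015-11-13 00:09:00' and the path of recovery.conf inside the restored data directory produces the time-based variant of the file shown above.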

Automate the restore procedure on the backup server, set some maintenance crontab entries that will delete old backups and WALs, and you have yourself a very simple but efficient backup strategy.
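The maintenance part could look roughly like the sketch below. The helper name prune_older_than and the retention periods are my assumptions, not something from the setup above. Whatever numbers you pick, keep WAL files at least as far back as the oldest base backup you intend to restore, otherwise that backup cannot be recovered.

```shell
#!/bin/sh
# Sketch of a retention helper meant to be called from cron.
# prune_older_than is an illustrative name.
set -eu

# $1 = directory, $2 = file name pattern, $3 = age in days.
# Deletes matching regular files directly under the directory whose
# age in whole days is greater than the given number.
prune_older_than() {
    find "$1" -maxdepth 1 -type f -name "$2" -mtime +"$3" -delete
}

# Only run when all three arguments are given on the command line.
if [ "$#" -eq 3 ]; then
    prune_older_than "$1" "$2" "$3"
fi
```

Saved as, say, prune.sh (a hypothetical name), it could be called from crontab once for /data/backups with the pattern 'basebackup_*.tar.gz' and once for /data/walarchive with a longer retention period. The pattern matters for /data/backups, since other files may live in that directory.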

Remember that testing your backups is just as important as taking them!

Thanks for reading

Friday, 30 October 2015

PostgreSQL 9.4 streaming replication over SSL with Replication Slots

I never really liked the concept of proactively keeping WAL files around in case the replica goes down. Picking a value for wal_keep_segments was always tricky, and most of the time we ended up setting it for the worst-case scenario, which means consuming space for no good reason under normal circumstances.
That's why I really like replication slots.

Replication slots are, and I am quoting Robert Haas:

"a crash-safe data structure which can be created on either a master or a standby to prevent premature removal of write-ahead log segments needed by a standby, as well as (with hot_standby_feedback=on) pruning of tuples whose removal would cause replication conflicts. Slots have some advantages over existing techniques, as explained in the documentation."

There are a lot of posts about how to set up replication slots, and to be honest it's pretty easy. In this post I will combine them with replication over SSL, and I will also show a useful trick for pg_basebackup (props to my friend Denish Patel).

The setup used in this post is the following:

Two Debian VMs with a minimal install and PostgreSQL 9.4.5 installed on both.
We will call them:

master01, with IP <master01 ip>, will act as Master with $PGDATA at /data/.
slave01, with IP <slave01 ip>, will act as Slave with $PGDATA at /data/.

On Master:
We set the SSL and replication settings to:

ssl = on
ssl_cert_file = '/data/ssl/server.crt'
ssl_key_file = '/data/ssl/server.key'
wal_level = hot_standby
max_wal_senders = 5
wal_keep_segments = 0
max_replication_slots = 5
hot_standby = on  

hot_standby is ignored on the master, but since we plan to take a pg_basebackup, the setting will be carried over and used on the slave.

Before we restart the Master, we have to create the certificate files :

mkdir $PGDATA/ssl ; chown postgres /data/ssl
cd /data/ssl/
openssl req -new -text -out server.req
openssl rsa -in privkey.pem -out server.key
rm privkey.pem
openssl req -x509 -in server.req -text -key server.key -out server.crt
chmod og-rwx server.key

More info about the certificates can be found in the PostgreSQL documentation.

Now we need a proper replication user, a replication slot and a replication entry in pg_hba.conf.
For the user:
psql -c "create role repuser with login replication password 'password';" postgres
For the slot:
psql -c "SELECT * FROM pg_create_physical_replication_slot('slave01_slot');" postgres
and pg_hba.conf should look like this:

# TYPE  DATABASE        USER            ADDRESS                 METHOD
local   all             all                                     trust
host    all             all               trust
host    all             vasilis          md5
hostssl replication     repuser          md5
host    all             all                  reject

Two things need attention in pg_hba.conf:
1. hostssl for the replication user
2. keep your "reject everyone else" as your last entry.
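The second point is easy to get wrong when entries are appended over time, so it can be worth checking mechanically. A small sketch, assuming a hypothetical helper named last_rule_is_reject that simply looks at the final field of the last rule in the file:

```shell
#!/bin/sh
# Sketch: verify that the last rule in a pg_hba.conf is a "reject".
# last_rule_is_reject is an illustrative name.
set -eu

# Succeeds if the last line that is neither blank nor a comment
# ends with the method "reject".
last_rule_is_reject() {
    awk '
        /^[ \t]*(#|$)/ { next }      # skip comments and blank lines
        { method = $NF }             # remember the last field of each rule
        END {
            if (method == "reject") exit 0
            exit 1
        }
    ' "$1"
}

# Only run when a file is passed on the command line.
if [ "$#" -eq 1 ]; then
    last_rule_is_reject "$1" && echo "ok: last rule is reject"
fi
```

This only checks ordering. It says nothing about whether the rules above the reject line are the ones you actually want.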

With all this set on the Master, it's time to restart it and start working on the slave.

On slave:

Connect to the replication slot BEFORE you initiate pg_basebackup. This way the Master will keep all the WAL files needed for the duration of the backup. You can connect to the replication slot with pg_receivexlog, e.g.:

> pg_receivexlog -D . -S slave01_slot -v -h <master01 ip> -U repuser -W
pg_receivexlog: starting log streaming at 0/3E000000 (timeline 1)
pg_receivexlog: finished segment at 0/3F000000 (timeline 1)
^Cpg_receivexlog: received interrupt signal, exiting
pg_receivexlog: received interrupt signal, exiting

Notice that I pressed ^C to stop pg_receivexlog. This way the WAL files will remain on the Master until my slave is up.

Now run pg_basebackup:
pg_basebackup -h <master01 ip> -D $PGDATA -P -X stream -U repuser

When pg_basebackup is complete, add a recovery.conf with the following entries :

standby_mode = 'on'
primary_conninfo = 'host=<master01 ip> port=5432 user=repuser password=password sslmode=require'
trigger_file = '/data/failover.trigger'
recovery_target_timeline = 'latest'
primary_slot_name = 'slave01_slot'

and start the slave.
You should see:
database system is ready to accept read only connections

Now, in order to monitor your replication, there are two useful SQL statements.
I ran the first one while my slave was down, so as you can see it shows how much WAL data has to be transferred to my slave.

postgres=# SELECT slot_name, database, active, pg_xlog_location_diff(pg_current_xlog_insert_location(), restart_lsn) AS retained_bytes FROM pg_replication_slots;
  slot_name   | database | active | retained_bytes
 slave01_slot |          | f      |      201326688
(1 row)

And after I started the slave:

postgres=# SELECT slot_name, database, active, pg_xlog_location_diff(pg_current_xlog_insert_location(), restart_lsn) AS retained_bytes FROM pg_replication_slots;
  slot_name   | database | active | retained_bytes
 slave01_slot |          | t      |              0
(1 row)

The second monitoring statement is:

postgres=#  SELECT pg_xlog_location_diff(pg_current_xlog_insert_location(), flush_location) AS lag_bytes, pid, application_name
 FROM pg_stat_replication;

lag_bytes | pid | application_name
         0 | 855 | walreceiver

This will show you how many bytes your slave is behind.
(Both statements are to be run on the Master unless cascading replicas are in place.)
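To get a feeling for what these byte counts mean, it helps to know that an LSN such as 0/3F000000 is a 64-bit position written as two hexadecimal halves, the high 32 bits before the slash and the low 32 bits after it. The difference between two positions is plain subtraction. The sketch below redoes that arithmetic in the shell; the function names are illustrative, and it is only meant for sanity-checking numbers, not as a replacement for pg_xlog_location_diff.

```shell
#!/bin/sh
# Sketch: subtract two LSNs by hand. Function names are illustrative.
set -eu

# Convert an LSN such as 0/3F000000 to a byte position.
lsn_to_bytes() {
    hi=$(printf '%d' "0x${1%%/*}")   # part before the slash
    lo=$(printf '%d' "0x${1##*/}")   # part after the slash
    echo $(( hi * 4294967296 + lo ))
}

# Bytes between two LSNs (first minus second).
lsn_diff() {
    echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}

# The two positions printed by pg_receivexlog earlier in this post.
lsn_diff 0/3F000000 0/3E000000
```

This prints 16777216, which is exactly one 16MB WAL segment, matching the single segment that pg_receivexlog reported as finished.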

Thanks for reading,
Vasilis Ventirozos

Thursday, 7 May 2015

Keeping snapshots of shared buffers and prewarming accordingly.

One major problem that I face during a failover is that the contents of shared buffers on the slave are invalid or irrelevant. pg_prewarm, which came with 9.4, is a good solution, but usually you would use it to prewarm a whole table, or you would manually get the pages from pg_buffercache and prewarm the pages you had on the master prior to the failover. Not everyone can do this, especially during a failover (I reckon only a few can), so I wrote an extension called 'pgcozy' (comments about the name should be directed to @DenishPatel, hehe).
This extension basically utilizes 2 well known extensions, pg_buffercache and pg_prewarm.
pg_buffercache has helped me more times than I can count. It shows the contents (pages) of the shared buffers and their popularity.
pg_prewarm is a new (9.4) extension that can be used at the block or relation level. In our case we will use it at the block level, and we will put these 2 extensions to work together.

pgcozy stores, on demand (or scheduled via a crontab entry), a snapshot of the contents of shared buffers in jsonb, and because the snapshot is replicated, any slave will know the contents of the master's shared buffers.
It consists of 3 functions:
  1. initialization function
  2. snapshot function
  3. warm function
But first let's start with the installation.

Unzip pgcozy--1.0.0.zip and run make ; make install

Then, from the db that you want: create extension pgcozy;
It requires pg_buffercache and pg_prewarm to be preinstalled, so if you get an error about these, please install them first.

After installation you need to initialize a schema and a table in which it will store its contents. For now these are hardcoded: the schema name is pgcozy and the table is called snapshots.

test=# select pgcozy_init();
NOTICE:  pg_buffercache and pg_prewarm exists continuing...
NOTICE:  schema "pgcozy" does not exist, skipping
CONTEXT:  SQL statement "drop schema IF EXISTS pgcozy cascade"
PL/pgSQL function pgcozy_init() line 6 at SQL statement
NOTICE:  type "cozy_type" does not exist, skipping
CONTEXT:  SQL statement "drop type if exists cozy_type"
PL/pgSQL function pgcozy_init() line 10 at SQL statement
NOTICE:  Everything is done, check pgcozy schema for more details.

The above statement will initialize the schema and the table, from here you should be good to start taking snapshots:

test=# select pgcozy_snapshot (0);
NOTICE:  Getting a new snapshot of all contents of pg_buffercache...
NOTICE:  Snapshot Taken..

(1 row)

test=# select pgcozy_snapshot (1);
NOTICE:  Getting a new snapshot...

(1 row)

test=# select * from pgcozy.snapshots;
 id | snapshot_date | snapshot
  1 | 2015-05-04    |
  2 | 2015-05-04    |
(2 rows)

Because this test db is fresh, the snapshots have no contents, but notice the argument of pgcozy_snapshot: you can choose a value from 0 to 5.
0 means all contents and 1-5 refers to the page popularity reported by pg_buffercache.
This means that you can select which pages you want to keep in your snapshot.

So let's initialize the db with some contents. I will use pgbench for this:

vasilis@cozy1:~$ ./pgbench -i -d test
NOTICE:  table "pgbench_history" does not exist, skipping
NOTICE:  table "pgbench_tellers" does not exist, skipping
NOTICE:  table "pgbench_accounts" does not exist, skipping
NOTICE:  table "pgbench_branches" does not exist, skipping
creating tables...
100000 of 100000 tuples (100%) done (elapsed 0.13 s, remaining 0.00 s).
set primary keys...
vasilis@cozy1:~$ ./pgbench test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 10
number of transactions actually processed: 10/10
latency average: 0.000 ms
tps = 576.335658 (including connections establishing)
tps = 699.154024 (excluding connections establishing)

I didn't do much here. I just initialized pgbench with default values and ran pgbench... again, with default values.
Let's get a new snapshot now:

test=# select pgcozy_snapshot (0);
NOTICE:  Getting a new snapshot of all contents of pg_buffercache...
NOTICE:  Snapshot Taken..

(1 row)

test=# select * from pgcozy.snapshots;
 id | snapshot_date | snapshot
  1 | 2015-05-04    |
  2 | 2015-05-04    |
  3 | 2015-05-04    | [{"block_no": 0, "popularity": 3, "table_name": "pgcozy.snapshots"}, {"block_no": 0, "popularity": 2, "table_name": "pgcozy.snapshots_uniq_idx"}, {"block_no": 1, "popularity": 2, "table_name": "pgcozy.snapshots_uniq_idx"}, {"block_no": 0, "popularity": 2, "table_name": "public.pgbench_branches_pkey"}, {"block_no": 0, "popularity": 1, "table_name": "public.pgbench_tellers_pkey"}, {"block_no": 3, "popularity": 5, "table_name": "public.pgbench_accounts_pkey"}, {"block_no": 0, "popularity": 1, "table_name": "public.pgbench_accounts_pkey"}]
(3 rows)

As you can see, our latest snapshot has contents. As I previously mentioned, I used jsonb to keep my snapshots nice and organized. This is of course being replicated to a slave I have. I don't really care about my slave's buffer cache contents, so I will assume that they are invalid because of the read-only operations that happen there. But something happened and I have to fail over, so on the slave:

test=# select pgcozy_warm(0);
NOTICE:  warming up according to the latest pgcozy snapshot...
NOTICE:  Warming up 'pgcozy.snapshots' block 0 ...
NOTICE:  Warming up 'pgcozy.snapshots_uniq_idx' block 0 ...
NOTICE:  Warming up 'pgcozy.snapshots_uniq_idx' block 1 ...
NOTICE:  Warming up 'public.pgbench_branches_pkey' block 0 ...
NOTICE:  Warming up 'public.pgbench_tellers_pkey' block 0 ...
NOTICE:  Warming up 'public.pgbench_accounts_pkey' block 3 ...
NOTICE:  Warming up 'public.pgbench_accounts_pkey' block 0 ...
NOTICE:  Done Warming up according to the latest snapshot...

pgcozy_warm can be used to prewarm according to the latest snapshot (0) or a specific snapshot id (check pgcozy.snapshots for that id).
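For readers curious what warming a single block looks like without the extension, pg_prewarm accepts a relation, a mode, a fork and a first and last block number. The sketch below only prints such a statement for one snapshot entry. It is not pgcozy's source, and the helper name prewarm_block_sql is an assumption.

```shell
#!/bin/sh
# Sketch: emit a block-level pg_prewarm call for one snapshot entry.
# prewarm_block_sql is an illustrative name, not part of pgcozy.
set -eu

# $1 = relation name, $2 = block number
prewarm_block_sql() {
    printf "SELECT pg_prewarm('%s', 'buffer', 'main', %d, %d);\n" "$1" "$2" "$2"
}

# One of the entries from the snapshot shown above.
prewarm_block_sql public.pgbench_accounts_pkey 3
```

The printed statement can be piped to psql on the promoted slave. Using the same block number for both the first and the last block limits the call to that single page.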

I find this simple set of functions quite useful, because everything passes through shared buffers and it's really easy to put the wrong things in there. This way you can revert the contents of shared buffers to the way they were before any alteration.

Some work still needs to be done to make this more generally useful, but for now it seems to work as it should. Let me know what you think.

Thanks for reading
-- Vasilis

Managing xid wraparound without looking like a (mail) chimp

My colleague Payal came across an outage that happened to mailchimp's mandrill app yesterday, link can be found HERE . Since this was P...