Tuesday, 20 December 2016

Traveling in time(lines) with pg_rewind

pg_rewind was introduced in 9.5; with this feature it is possible to make a server that is no longer the master follow a promoted standby that has a new timeline.
There are several use cases that could benefit from this feature, to name a few:

Testing failovers: It's pretty common for one of our customers to ask for a failover simulation just to make sure that everything works. With pg_rewind this is much easier and much faster, because we don't have to re-sync the whole database, which is sometimes multi-TB in size.

Minor version upgrades: During an upgrade you have to put the new binaries in place and restart the database. If the application is flexible enough, you could upgrade the slave, switch over, upgrade the old master, and pg_rewind it to follow the newly promoted node, possibly minimising the downtime even more.

Requirements.

pg_rewind has a couple of requirements:

1. Data page checksums enabled (initdb -k), or the parameter wal_log_hints enabled.
2. The old master (the one we want to re-sync) must have been shut down cleanly, or you'll get:
target server must be shut down cleanly
Failure, exiting
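
A quick way to check which of the two options you already have in place (the data directory path matches the lab used later in this post; adjust as needed):

psql -c "show wal_log_hints;" postgres
pg_controldata /home/postgres/pgdata | grep -i checksum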

How does it work ?

pg_rewind scans the old master's data directory, finds the data blocks changed during the switchover and then copies only the necessary blocks from the promoted slave. Keep in mind that the configuration files will also be copied from the new master, so check them for misconfigurations before you start it! You also need to have all the WAL files since the last checkpoint of the old master. Changes are identified by comparing the state of the data blocks present in the data directory with the changes logged in the WAL files. Once the differing blocks are identified and copied, the WALs are replayed.
For more info refer to the documentation here
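
pg_rewind also has a dry-run mode (-n / --dry-run) that goes through the whole process without modifying the target data directory, which is handy for verifying the requirements above before doing it for real; for example (connection details match the lab below):

pg_rewind --dry-run --target-pgdata=/home/postgres/pgdata --source-server="host=10.0.0.3 port=5432 user=postgres dbname=postgres"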

How to do it

Recently I posted about how to use Docker to set up a lab environment; the link can be found here. I'm going to use two containers for this blogpost, built from the image created by my Dockerfile. Keep in mind that in my Dockerfile I make sure that initdb is run with the -k option, that all the parameters are properly set for replication, and that archiving is enabled but not doing anything:

psql -c "show archive_command";
 archive_command
-----------------
 /bin/true
(1 row)

Before I set up my slave, I create a wal_archive directory and change the archive_command to simply archive there:
mkdir ~/wal_archive
psql -c "alter system set archive_command to 'cp %p /home/postgres/wal_archive/%f' ;" postgres
ALTER SYSTEM
psql -c "select pg_reload_conf() " postgres
 pg_reload_conf
----------------
 t
(1 row)
psql -c "select pg_switch_xlog() ;" postgres
 pg_switch_xlog
----------------
 0/7000078
(1 row)
psql -c "checkpoint;" postgres
CHECKPOINT

Now, from the slave:

mkdir ~/wal_archive ; ~/mk_replica.sh
waiting for server to shut down.... done
server stopped
29284/29284 kB (100%), 1/1 tablespace
server starting
LOG:  redirecting log output to logging collector process
HINT:  Future log output will appear in directory "pg_log".
-[ RECORD 1 ]----+--------
lag_bytes        | 0
pid              | 106
application_name | a_slave

Next, we promote the slave:
pg_ctl promote ; sleep 5 ; psql -c "select pg_is_in_recovery();" postgres
server promoting
pg_is_in_recovery
-------------------
 f
(1 row)

And now, time to pg_rewind. From the old master:

pg_ctl stop
waiting for server to shut down.... done
server stopped
pg_rewind --target-pgdata=/home/postgres/pgdata --source-server="host=10.0.0.3 port=5432 user=postgres dbname=postgres"

servers diverged at WAL position 0/F000098 on timeline 1
rewinding from last common checkpoint at 0/F000028 on timeline 1
Done!

Put a recovery.conf on the rewound master (now a slave) and start it up; check that it connected to the new master and you should be all good.
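
A minimal recovery.conf for the rewound node could look something like this (the connection string matches the example above; adjust user and host to your setup):

standby_mode = 'on'
primary_conninfo = 'host=10.0.0.3 port=5432 user=postgres'
recovery_target_timeline = 'latest'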

postgres=# SELECT pg_xlog_location_diff(pg_current_xlog_insert_location(), flush_location) AS lag_bytes,pid, application_name FROM pg_stat_replication;
 lag_bytes | pid | application_name
-----------+-----+------------------
         0 | 139 | old_master



Most of the production database clusters that I've come across don't use data checksums, which is understandable because of the performance penalty that checksums have (as an alternative to checksums, in OmniTI we use ZFS on OmniOS A LOT, and we love it).
Still, pg_rewind is a nice feature; maybe it's not for every database, but for those who will use it, it might save a ton of time waiting to resync an old master.


Thanks for reading
-- Vasilis Ventirozos

Wednesday, 14 December 2016

Adding columns with default value fast

I recently came across a discussion in the #postgresql channel about how to add, with minimal locking, a new column that will hold record insert timestamps.

Normally you would do something like this :

monkey=# alter table alter_test add column insert_timestamp timestamp without time zone default now();
ALTER TABLE
Time: 13643.983 ms
monkey=# create index concurrently insert_timestamp_idx on alter_test (insert_timestamp) where insert_timestamp is not null;
CREATE INDEX
Time: 1616.108 ms

This, though, requires an AccessExclusiveLock on the table for the whole duration of the ALTER, and the application has to wait behind it.
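
While such an ALTER runs, everything else that touches the table queues up behind its lock; from another session you can watch this with a query along these lines:

SELECT l.pid, l.mode, l.granted, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.relation = 'alter_test'::regclass;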
What you could do instead is:

monkey=# alter table alter_test add column insert_timestamp timestamp without time zone;
ALTER TABLE
Time: 1.135 ms
monkey=# alter table alter_test alter column insert_timestamp set default now();
ALTER TABLE
Time: 1.238 ms
monkey=# create index concurrently insert_timestamp_idx on alter_test (insert_timestamp) where insert_timestamp is not null;
CREATE INDEX
Time: 1618.788 ms

As you can see, the timings are completely different. In the first case the database has to rewrite every pre-existing row, which isn't really needed since the value written to them is a dummy anyway (it defaults to the transaction time of the ALTER).
The second way also requires an AccessExclusiveLock, but it doesn't have to change any data; it ignores the old rows, so the lock should be almost instant (assuming no blockers exist).
The next step alters the table to add the default value, and from then on new rows will get the default. If you don't want to have NULLs, or you want to add a NOT NULL constraint to the column, you can update the old (NULL) rows in a more progressive and non-intrusive way later on, as sketched below.
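
A sketch of such a progressive backfill, assuming alter_test has an integer primary key named id and that some constant (rather than now()) is an acceptable value for the pre-existing rows:

-- repeat until it reports UPDATE 0, then optionally:
-- ALTER TABLE alter_test ALTER COLUMN insert_timestamp SET NOT NULL;
UPDATE alter_test
SET    insert_timestamp = '2016-12-01 00:00:00'
WHERE  id IN (SELECT id FROM alter_test
              WHERE insert_timestamp IS NULL
              LIMIT 10000);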

Thanks for reading
- Vasilis Ventirozos

Tuesday, 6 December 2016

Custom Postgres installation with Docker


Lately, I wanted to make a new test/lab environment that would be disposable, fast to deploy and easily customisable. VMs are all nice for lab use and they have served me well over the years, but they tend to be slower to deploy, heavy on resources and, let's face it, a bit too much for a throwaway lab.



So I decided to play around with Docker. I had played with Docker in the past but hadn't done anything too serious; I had just taken the official postgres Dockerfile and run it to see how it works. That's how I started my whole Docker exploration: I took the official postgres Dockerfile that can be found here and made an image. I immediately saw that it wasn't exactly what I wanted; I wanted something less generic. The things that I would like are:

  • latest postgres version, compiled from source
  • easily customisable pgdata and prefix
  • custom postgresql.conf settings (mostly having it replication ready)
  • custom shell environment
  • some extensions
  • data page checksums
  • easy replica deployment
So I started creating my own Dockerfile that would fill my needs and at the same time teach me some more about Docker; two birds, one stone. After several hours and some testing I came up with something that works (link at the bottom). It still needs some work, but I plan on maintaining it for my own use, so feel free to use it if you like what I've done there.

Let me explain a bit how it works.
With :
ENV PGBINDIR=/home/postgres/pgsql
ENV PGDATADIR=/home/postgres/pgdata
I can customise the installation and pgdata directories. I install all the packages I need (note that comments on joe will be deleted!) and then add the postgres user to sudoers (this Dockerfile is not meant to be secure; ease of use is what I am looking for in a lab postgres instance).
I always get the latest postgres sources, compile and install them (including extensions, documentation, etc.), set some environment variables, configure postgres (postgresql.conf, pg_hba.conf), put a sample recovery.done there so slaves can get it from pg_basebackup, install some extensions in template1, run initdb with checksums, create a replication user, and finally copy a very simple replica creation script to the docker image.

How to make it work
With Docker installed it's pretty simple:

To create docker's own playground network:
docker network create --subnet=10.0.0.0/16 garden
To build the image, from the directory where you downloaded the Dockerfile:
docker build -t pg_dev .
To run a container with a specific IP in our kindergarten docker network:
docker run --name master --net garden --ip 10.0.0.2 -i -t pg_dev /bin/bash
docker run --name slave1 --net garden --ip 10.0.0.3 -i -t pg_dev /bin/bash

and docker ps should look like this :

CONTAINER ID        IMAGE     COMMAND       CREATED              STATUS              PORTS      NAMES
7a38d4b11769        pg_dev    "/bin/bash"   About a minute ago   Up About a minute   5432/tcp   slave1
2c0266f942ea        pg_dev    "/bin/bash"   About a minute ago   Up About a minute   5432/tcp   master

Keep in mind that I started playing around with Docker just two days ago; there might be a better way to assign static IPs that I'm not aware of (yet).

Now, say that you have put some data in the master and you want to create a replica on slave1; something like this should work:

From the slave container's home dir:
postgres@7a38d4b11769:~$ ./mk_replica.sh
waiting for server to shut down.... done
server stopped
29044/29044 kB (100%), 1/1 tablespace
server starting
LOG:  redirecting log output to logging collector process
HINT:  Future log output will appear in directory "pg_log".

-[ RECORD 1 ]----+--------
lag_bytes        | 0
pid              | 37
application_name | a_slave
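
The mk_replica.sh script itself is not shown in this post (it lives alongside the Dockerfile), but judging from the output above it does roughly the following; this is only a sketch and the replication user name is an assumption:

#!/bin/bash
# rough sketch of a replica (re)build script -- not the actual mk_replica.sh
pg_ctl -D $PGDATADIR stop -m fast                                  # stop whatever is running locally
rm -rf $PGDATADIR/*                                                # wipe the local data directory
pg_basebackup -h 10.0.0.2 -D $PGDATADIR -U repuser -X stream -P    # clone the master; user name assumed
mv $PGDATADIR/recovery.done $PGDATADIR/recovery.conf               # sample recovery file shipped with the base backup
pg_ctl -D $PGDATADIR start
sleep 5
# check replication from the master's point of view
psql -h 10.0.0.2 -x -c "SELECT pg_xlog_location_diff(pg_current_xlog_insert_location(), flush_location) AS lag_bytes, pid, application_name FROM pg_stat_replication;" postgres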


The Dockerfile can be found here and hopefully it can help some people set up a lab fast and easily.


Thanks for reading.
- Vasilis Ventirozos

Thursday, 19 May 2016

Repairing clog corruptions

Yesterday I got a page from a client about a possible corruption issue at one of his clients, so I decided to share how I dealt with it. Before getting into how I approached the problem, I want to say that corruptions don't just happen in postgres. In OmniTI we manage A LOT of databases, with different major versions and on different operating systems, and in my time I haven't seen (many) cases where corruption happened because of a postgres bug (I've seen indexes getting corrupted, but I don't remember ever seeing data being corrupted). What I want to say is that corruptions don't just happen: hardware failures, power outages and disks getting full are the common reasons.
A replica and backups should always be in place, and the server should be properly monitored. Unfortunately this server was not managed by us, so none of the above was in place.

At first I saw entries like these in the logs:

2016-05-18 15:50:06 EDT::@:[10191]: ERROR:  could not access status of transaction 128604706
2016-05-18 15:50:06 EDT::@:[10191]: DETAIL:  Could not open file "pg_clog/007A": No such file or directory.
2016-05-18 15:50:06 EDT::@:[10191]: CONTEXT:  automatic vacuum of table "pe.pg_toast.pg_toast_4409334"
...
much more to be expected...
...

At this point you know that something went wrong. Ideally you want to copy your whole $PGDATA somewhere else and work there, while at the same time you start uncompressing your most recent backup. In my case there was no backup and not enough disk space to copy $PGDATA.
FUN TIMES :)

I started by creating all the missing clog files that I found in the logs:
dd if=/dev/zero of=/var/db/pgdata/pg_clog/0114 bs=256k count=1
dd if=/dev/zero of=/var/db/pgdata/pg_clog/00D1 bs=256k count=1
dd if=/dev/zero of=/var/db/pgdata/pg_clog/0106 bs=256k count=1
...
...
keep creating them until the logs are clean; there can be a lot of them, in my case more than 100....
...
...
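
Since there can be so many of them, instead of creating each one by hand you could pull the missing segment names straight out of the logs, roughly like this (the log location here is an assumption, and never overwrite a clog file that already exists):

grep -ho 'pg_clog/[0-9A-F]*' /var/db/pgdata/pg_log/*.log | sort -u | while read f; do
    [ -f /var/db/pgdata/$f ] || dd if=/dev/zero of=/var/db/pgdata/$f bs=256k count=1
done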


From the logs I also found the affected toast tables:

pg_toast.pg_toast_18454
pg_toast.pg_toast_35597
pg_toast.pg_toast_35607
pg_toast.pg_toast_4409334
pg_toast.pg_toast_4409344
pg_toast.pg_toast_8817516


db=# select relname,relnamespace from pg_class where oid in (18454,35597,35607,4409334,4409344,8817516) order by relnamespace;
   relname   | relnamespace
-------------+--------------
 table_case1 |        16872
 table_case1 |        16874
 table_case2 |        16874
 table_case2 |      4409063
 table_case1 |      4409063
 table_case2 |      8817221
(6 rows)

db=# select oid,nspname from pg_namespace where oid in (16872,16874,16874,4409063,8817221) order by oid;
   oid   |  nspname
---------+------------
   16872 | schema1
   16874 | schema2
 4409063 | schema3
 8817221 | schema4
(4 rows)

With a join I found the schema.table combinations (the query is sketched right after the list):
schema1.table_case1
schema2.table_case1
schema2.table_case2
schema3.table_case2
schema3.table_case1
schema4.table_case2
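
Something along these lines produces the schema-qualified names directly:

SELECT n.nspname || '.' || c.relname AS affected_table
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.oid IN (18454, 35597, 35607, 4409334, 4409344, 8817516)
ORDER BY 1;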

Basically we have an application with multiple schemas and 2 kinds of tables were corrupted across 4 schemas.

For table_case1 (simple case, table not referenced by anyone):
for each schema :

set search_path to schema1;
create table badrows (badid int);

DO $f$
declare
    curid INT := 0;
    vcontent TEXT;
    badid INT;
begin
FOR badid IN SELECT id FROM table_case1 LOOP
    curid = curid + 1;
    if curid % 100000 = 0 then
        raise notice '% rows inspected', curid;
    end if;
    begin
        SELECT t::text
        INTO vcontent
        FROM table_case1 t where id = badid;  -- casting the whole row to text forces every (toasted) column to be read
    exception
        when others then
            insert into badrows values (badid);
raise notice 'data for id % is corrupt', badid;
            continue;
    end;
end loop;
end;
$f$;

(This script was taken from Josh Berkus' blog and modified to fit my needs.)

create table table_case1_clean as select * from table_case1
where id not in (select badid from badrows);

TRUNCATE table_case1;
vacuum full verbose table_case1;
insert into table_case1 select * from table_case1_clean;
vacuum full analyze verbose table_case1;
drop table badrows;


For table_case2 (this one is referenced by 2 other tables),
for each org (schema):

set search_path to schema2;

create table badrows (badid int);

DO $f$
declare
    curid INT := 0;
    vcontent TEXT;
    badid INT;
begin
FOR badid IN SELECT id FROM table_case2 LOOP
    curid = curid + 1;
    if curid % 100000 = 0 then
        raise notice '% rows inspected', curid;
    end if;
    begin
        SELECT t::text
        INTO vcontent
        FROM table_case2 t where id = badid;  -- casting the whole row to text forces every (toasted) column to be read
    exception
        when others then
            insert into badrows values (badid);
            raise notice 'data for id % is corrupt', badid;
            continue;
    end;
end loop;
end;
$f$;


create table table_case2_clean as select * from table_case2
where id not in (select badid from badrows);

alter table reference_table1 drop constraint reference_table1_fkey;
delete from reference_table1 where table_case2_id in (select badid from badrows) ;

alter table reference_table2 drop constraint reference_table2_fkey;
delete from reference_table2 where table_case2_id in (select badid from badrows);

TRUNCATE table_case2;
vacuum full verbose table_case2;
insert into table_case2 select * from table_case2_clean;
vacuum full analyze verbose table_case2;

ALTER TABLE ONLY reference_table1
    ADD CONSTRAINT reference_table1_fkey FOREIGN KEY (table_case2_id) REFERENCES table_case2(id) ON DELETE CASCADE;

ALTER TABLE ONLY reference_table2
    ADD CONSTRAINT reference_table2_fkey FOREIGN KEY (table_case2_id) REFERENCES table_case2(id);

drop table badrows;

(Please ignore or report any typos here; I replaced the real table names while writing this post, so I might have messed up some names.)


What we basically did here was to recreate the table without the corrupted rows.
After this, tables should be corruption free with the minimum possible data loss.
To ensure that you are corruption free you should either pg_dump and restore, or vacuum full everything; a normal vacuum will NOT show corruptions, as you can see below:


pe=# vacuum verbose schema1.table_case1;
INFO:  vacuuming "schema1.table_case1"
INFO:  index "schema1.table_case1_pkey" now contains 12175 row versions in 36 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.15 sec.
INFO:  " table_case1": found 0 removable, 12175 nonremovable row versions in 258 out of 258 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.17 sec.
INFO:  vacuuming "pg_toast.pg_toast_18454"
INFO:  index "pg_toast_18454_index" now contains 51370 row versions in 143 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "pg_toast_18454": found 0 removable, 51370 nonremovable row versions in 12331 out of 12331 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.05s/0.03u sec elapsed 0.09 sec.
VACUUM
pe=#
pe=#
pe=# vacuum full verbose schema1.table_case1;
INFO:  vacuuming "schema1.table_case1"
ERROR:  missing chunk number 0 for toast value 9270408 in pg_toast_18454
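
Another way to force every row (including all toasted values) to be read is to simply dump the database to nowhere; if it completes without errors, you are in reasonably good shape:

pg_dump db > /dev/null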



Rows were lost; in my case that was acceptable. Your case may not be the same as mine, but I hope this provides some guidance in case you get into a similar situation.


Thanks for reading
- Vasilis






Tuesday, 24 November 2015

Backups, Restores and PITR nowadays

It's been almost a year since 9.4 was released; it included many nice features, but the one that changed my day-to-day administration was replication slots.

In OmniTI we use something called OmniPITR a lot. OmniPITR could be called a WAL management suite: it can be used to ship WALs, to replay them, to clean them up when they are no longer needed, and to take backups. Usually the way I use it is to first set up streaming replication to a replica, and on top of the streaming replication set up WAL shipping. The reason is simple: I want to have all the WALs around in case I lose the slave, and I never really liked the wal_keep_segments approach.

Backups are usually taken from a slave in order to reduce load on the master.

Briefly, OmniPITR does this by sending pg_start/stop_backup to the master, putting a pause-removal file on the slave so WALs are kept, and creating two tar.gz files, one for the base backup and one for the WALs.

Usually I have a backup server keeping all the WAL files and the backups, which means a second (direct or indirect) WAL shipping destination.
All this is nice and works really well, but with replication slots it can be simpler and more efficient.

Recently, I had to set up backups for a one-master, two-slave setup on 9.4.

The requirements of the backup would be the following:

  • backup will be taken from the slave.
  • backup will be incremental allowing PITR
  • backup will be stored remotely on another machine.
  • minimum possible wal shipping targets.
  • replicas should be identical and adding a new replica should be trivial.
  • backup should be initiated from the server that keeps backups and not from a slave.
  • backups should be tested by restoration every week.


Pretty straightforward stuff, really.


I set up archiving to the backup server with a simple rsync-over-ssh command:
archive_command = 'rsync -a %p postgres@<Backup server IP>:/data/walarchive/%f'

I created a backup user that will use .pgpass
touch ~/.pgpass ; chmod 0600 ~/.pgpass

added :

<master ip>:5432:template1:backup:pgbackup
<slave1 ip>:5432:template1:backup:pgbackup
<slave2 ip>:5432:template1:backup:pgbackup
<slave-x ip>:5432:template1:backup:pgbackup



I allowed my backup user in pg_hba.conf (on all db servers)
and then simply used pg_basebackup like this:

rm -rf /data/base/* && pg_basebackup -D /data/base/ -h <slave ip> -U backup -Ft -z && mv /data/base/base.tar.gz /data/backups/basebackup_`date +"%Y-%m-%d"`.tar.gz

I would like it if pg_basebackup could customize the backup name and if it could use replication slots (coming in 9.5), but neither is really a problem when it comes to backups.

I added a recovery.conf that looks like this :
 
restore_command = '<PATH/TO/pg_standby> -t /data/backups/failover.now -l /data/walarchive %f %p %r'
trigger_file = '/data/backups/failover.now'
recovery_end_command = 'rm /data/backups/failover.now'
recovery_target = 'immediate'


The parameter “recovery_target” specifies that recovery should end as soon as a consistent state is reached, i.e. as early as possible. When restoring from an online backup, this means the point where taking the backup ended.

NOTE that a recovery.conf file will already exist if the backup was taken from a slave; always remember to edit it and replace its entries with the ones above.

A couple of notes before you start the database: remove all logs from pg_log and verify that:

archive_command ='/bin/true'
synchronous_standby_names = ''

At this point all you have to do is start the database and monitor the log.
The log file of a freshly restored database should look like this:

postgres@backup:/data/backups/pg_log$ tail -f postgresql-2015-11-13_020746.log
2015-11-13 02:07:46 EET [] [2344]: [1-1] user=,db=,e=00000 LOG:  database system was interrupted; last known up at 2015-11-13 02:06:20 EET
2015-11-13 02:08:10 EET [] [2344]: [2-1] user=,db=,e=00000 LOG:  starting point-in-time recovery to earliest consistent point
2015-11-13 02:08:10 EET [] [2344]: [3-1] user=,db=,e=00000 LOG:  restored log file "00000002.history" from archive
2015-11-13 02:08:10 EET [] [2344]: [4-1] user=,db=,e=00000 LOG:  restored log file "000000020000000200000054" from archive
2015-11-13 02:08:10 EET [] [2344]: [5-1] user=,db=,e=00000 LOG:  redo starts at 2/54000028
2015-11-13 02:08:10 EET [] [2344]: [6-1] user=,db=,e=00000 LOG:  consistent recovery state reached at 2/540000F0
2015-11-13 02:08:10 EET [] [2344]: [7-1] user=,db=,e=00000 LOG:  recovery stopping after reaching consistency
2015-11-13 02:08:10 EET [] [2344]: [8-1] user=,db=,e=00000 LOG:  recovery has paused
2015-11-13 02:08:10 EET [] [2344]: [9-1] user=,db=,e=00000 HINT:  Execute pg_xlog_replay_resume() to continue.
2015-11-13 02:08:10 EET [] [2342]: [3-1] user=,db=,e=00000 LOG:  database system is ready to accept read only connections



At this point, run:

psql -c "select pg_xlog_replay_resume()" template1

You should see in the log file:

2015-11-13 02:10:08 EET [] [2344]: [13-1] user=,db=,e=00000 LOG:  archive recovery complete
2015-11-13 02:10:09 EET [] [2344]: [14-1] user=,db=,e=00000 LOG:  MultiXact member wraparound protections are now enabled
2015-11-13 02:10:09 EET [] [2342]: [4-1] user=,db=,e=00000 LOG:  database system is ready to accept connections
2015-11-13 02:10:09 EET [] [2394]: [1-1] user=,db=,e=00000 LOG:  autovacuum launcher started


Now, assuming that you want to perform PITR.

Follow exactly the same restore procedure as previously described but this time the recovery.conf should look like this :

restore_command = '<PATH/TO/pg_standby> -t /data/backups/failover.now -l /data/walarchive %f %p %r'
#recovery_target_time = '2015-11-13 00:09:00'

# or
#recovery_target_xid = '1966'

trigger_file = '/data/backups/failover.now'
#recovery_target_inclusive = 'true'
recovery_end_command = 'rm /data/backups/failover.now'

recovery_target_time: This parameter specifies the time stamp up to which recovery will proceed.

recovery_target_xid: This parameter specifies the transaction ID up to which recovery will proceed. Keep in mind that while transaction IDs are assigned sequentially at transaction start, transactions can complete in a different numeric order. The transactions that will be recovered are those that committed before (and optionally including) the specified one. The precise stopping point is also influenced by recovery_target_inclusive.


recovery_target_inclusive: Specifies whether to stop just after the specified recovery target (true), or just before the recovery target (false). Applies when either recovery_target_time or recovery_target_xid is specified. This setting controls whether transactions having exactly the target commit time or ID, respectively, will be included in the recovery. Default is true.



The rest of the procedure is identical to what was previously described.






Automate the restore procedure on the backup server, set some maintenance crontab entries that delete old backups and WALs, and you have yourself a very simple but efficient backup strategy.
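
A couple of hypothetical crontab entries on the backup server could look like this (retention and schedule are arbitrary examples; note that % has to be escaped in crontab):

# weekly base backup from the slave, Sunday 02:00
0 2 * * 0 rm -rf /data/base/* && pg_basebackup -D /data/base/ -h <slave ip> -U backup -Ft -z && mv /data/base/base.tar.gz /data/backups/basebackup_`date +"\%Y-\%m-\%d"`.tar.gz
# daily cleanup of backups and archived WALs older than 30 days
30 3 * * * find /data/backups -name 'basebackup_*.tar.gz' -mtime +30 -delete
45 3 * * * find /data/walarchive -type f -mtime +30 -delete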

Remember that testing your backups is just as important as taking them!

Thanks for reading
Vasilis

Friday, 30 October 2015

PostgreSQL 9.4 streaming replication over SSL with Replication Slots

I never really liked the concept of proactively keeping WAL files around in case the replica goes down; wal_keep_segments was always a tricky number to set, and most of the time we ended up setting it for the worst-case scenario, which means space consumption for no good reason (under normal circumstances).
That's why I really like replication slots.

Replication slots are, and I am quoting Robert Haas:


"a crash-safe data structure which can be created on either a master or a standby to prevent premature removal of write-ahead log segments needed by a standby, as well as (with hot_standby_feedback=on) pruning of tuples whose removal would cause replication conflicts. Slots have some advantages over existing techniques, as explained in the documentation."

There are a lot of posts about how to set up replication slots, and to be honest it's pretty easy; in this post I will combine them with replication over SSL, and I will also show a useful trick for pg_basebackup (props to my friend Denish Patel).

The setup that will serve this post is the following:

Two Debian VMs with a minimal install and PostgreSQL 9.4.5 installed on both.
We will call them:

master01, with IP 192.168.1.15, acting as master, with $PGDATA at /data/.
slave01, with IP 192.168.1.16, acting as slave, with $PGDATA at /data/.


On the master:
We set the SSL and replication settings to:

ssl = on
ssl_cert_file = '/data/ssl/server.crt'
ssl_key_file = '/data/ssl/server.key'
wal_level = hot_standby
max_wal_senders = 5
wal_keep_segments = 0
max_replication_slots = 5
hot_standby = on  


hot_standby is ignored on the master, but since we plan on taking a pg_basebackup from it, the setting will be used on the slave.


Before we restart the Master, we have to create the certificate files :

mkdir $PGDATA/ssl ; chown postgres /data/ssl
cd /data/ssl/
openssl req -new -text -out server.req
openssl rsa -in privkey.pem -out server.key
rm privkey.pem
openssl req -x509 -in server.req -text -key server.key -out server.crt
chmod og-rwx server.key
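
If you want to sanity-check the certificate you just generated, openssl can print its subject and validity dates:

openssl x509 -in server.crt -noout -subject -dates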


More info about the certificates can be found here.

Now we need a proper replication user, a replication slot and a replication entry in pg_hba.conf.
For the user:
psql -c "create role repuser with login replication password 'password';" postgres
For the slot:
psql -c "SELECT * FROM pg_create_physical_replication_slot('slave01_slot');" postgres
and pg_hba.conf should look like this:

# TYPE  DATABASE        USER            ADDRESS                 METHOD
local   all             all                                     trust
host    all             all             127.0.0.1/32            trust
host    all             vasilis         192.168.1.0/16          md5
hostssl replication     repuser        192.168.1.16/32          md5
host    all             all             0.0.0.0/0               reject


Two things need attention in pg_hba.conf:
1. hostssl for the replication user.
2. Keep your "reject everyone else" rule as the last entry.

With all these set on the master, it's time to restart it and start working on the slave.

On slave:

Connect to the replication slot BEFORE you initiate pg_basebackup. This way the master will keep all the WAL files needed for the duration of the backup. Connecting to the replication slot can be done with pg_receivexlog, e.g.:

> pg_receivexlog -D . -S slave01_slot -v -h 192.168.1.15 -U repuser -W
Password:
pg_receivexlog: starting log streaming at 0/3E000000 (timeline 1)
pg_receivexlog: finished segment at 0/3F000000 (timeline 1)
^Cpg_receivexlog: received interrupt signal, exiting
pg_receivexlog: received interrupt signal, exiting


Notice that I pressed ^C to stop pg_receivexlog; this way the WAL files will remain on the master until my slave is up.

Now run pg_basebackup:
pg_basebackup -h 192.168.1.15 -D $PGDATA -P -X stream -U repuser

When pg_basebackup is complete, add a recovery.conf with the following entries :

standby_mode = 'on'
primary_conninfo = 'host=192.168.1.15 port=5432 user=repuser password=password sslmode=require'
trigger_file = '/data/failover.trigger'
recovery_target_timeline = 'latest'
primary_slot_name = 'slave01_slot'


and start the slave.
you should see:
database system is ready to accept read only connections

Now, in order to monitor your replication there are two useful SQL statements.
I ran the first one while my slave was down, so you can see how much data (WAL) has to be transferred to my slave:

postgres=# SELECT slot_name, database,
active,  pg_xlog_location_diff(pg_current_xlog_insert_location(), restart_lsn) AS retained_bytes FROM pg_replication_slots;
  slot_name   | database | active | retained_bytes
--------------+----------+--------+----------------
 slave01_slot |          | f      |      201326688
(1 row)


and after I started the slave:


postgres=# SELECT slot_name, database, active,pg_xlog_location_diff(pg_current_xlog_insert_location(), restart_lsn) AS retained_bytes FROM pg_replication_slots;
  slot_name   | database | active | retained_bytes
--------------+----------+--------+----------------
 slave01_slot |          | t      |              0
(1 row)

The second monitoring statement is:

postgres=#  SELECT pg_xlog_location_diff(pg_current_xlog_insert_location(), flush_location) AS lag_bytes, pid, application_name
 FROM pg_stat_replication;

lag_bytes | pid | application_name
-----------+-----+------------------
         0 | 855 | walreceiver

This shows how many bytes your slave is behind.
(Both statements are to be run on the master, unless cascading replicas are in place.)




Thanks for reading,
Vasilis Ventirozos

Thursday, 7 May 2015

Keeping snapshots of shared buffers and prewarming accordingly.

One major problem that I face during a failover is that the contents of shared buffers on the slave are invalid or irrelevant. pg_prewarm, which came with 9.4, is a good solution, but usually you would use it to prewarm a table; alternatively, you could manually get the pages from pg_buffercache and prewarm the pages you had on the master prior to the failover. Not everyone can do this, especially during a failover (I reckon only a few can), so I wrote an extension called 'pgcozy' (comments about the name should be directed to @DenishPatel, hehe).
This extension basically utilizes two well-known extensions, pg_buffercache and pg_prewarm.
pg_buffercache has helped me more times than I can count; it shows the contents (pages) of shared buffers and their popularity.
pg_prewarm is a new (9.4) extension that can be used at the block or relation level. In our case we will use it at the block level and put these two extensions to work.

pgcozy stores, on demand (or scheduled via a crontab entry), a snapshot of the contents of shared buffers in jsonb, and because this table is replicated, any slave will know the contents of the master's shared buffers.
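
A scheduled snapshot can be as simple as a crontab entry on the master; the interval and database name here are just examples:

# take a snapshot of everything currently in shared buffers every 15 minutes
*/15 * * * * psql -d test -c "select pgcozy_snapshot(0);"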
 
It consists of 3 functions:
  1. initialization function
  2. snapshot function
  3. warm function
But first let's start with the installation.

unzip pgcozy--1.0.0.zip, and run make ; make install

Then, from the db that you want: create extension pgcozy;
It requires pg_buffercache and pg_prewarm preinstalled, so if you get an error about these, please install them first.

After installation you need to initialize a schema and a table where it will store its contents; for now these are hardcoded, the schema name is pgcozy and the table is called snapshots.

test=# select pgcozy_init();
NOTICE:  pg_buffercache and pg_prewarm exists continuing...
NOTICE:  schema "pgcozy" does not exist, skipping
CONTEXT:  SQL statement "drop schema IF EXISTS pgcozy cascade"
PL/pgSQL function pgcozy_init() line 6 at SQL statement
NOTICE:  type "cozy_type" does not exist, skipping
CONTEXT:  SQL statement "drop type if exists cozy_type"
PL/pgSQL function pgcozy_init() line 10 at SQL statement
NOTICE:  Everything is done, check pgcozy schema for more details.

The above statement initializes the schema and the table; from here you should be good to start taking snapshots:

test=# select pgcozy_snapshot (0);
NOTICE:  Getting a new snapshot of all contents of pg_buffercache...
NOTICE:  Snapshot Taken..
 pgcozy_snapshot
-----------------

(1 row)

test=# select pgcozy_snapshot (1);
NOTICE:  Getting a new snapshot...
 pgcozy_snapshot
-----------------

(1 row)

test=# select * from pgcozy.snapshots;
 id | snapshot_date | snapshot
----+---------------+----------
  1 | 2015-05-04    |
  2 | 2015-05-04    |
(2 rows)
 

Because this test db is fresh it has no contents, but if you look at the argument of pgcozy_snapshot you can see that you can choose 0-5:
0 means all contents, and 1-5 refers to the page popularity in pg_buffercache.
This means that you can select which pages you want to keep in your snapshot.

So let's initialize the db with some contents. I will use pgbench for this:

vasilis@cozy1:~$ ./pgbench -i -d test
NOTICE:  table "pgbench_history" does not exist, skipping
NOTICE:  table "pgbench_tellers" does not exist, skipping
NOTICE:  table "pgbench_accounts" does not exist, skipping
NOTICE:  table "pgbench_branches" does not exist, skipping
creating tables...
100000 of 100000 tuples (100%) done (elapsed 0.13 s, remaining 0.00 s).
vacuum...
set primary keys...
done.
vasilis@cozy1:~$ ./pgbench test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 10
number of transactions actually processed: 10/10
latency average: 0.000 ms
tps = 576.335658 (including connections establishing)
tps = 699.154024 (excluding connections establishing)
 


I didn't do much here, I just initialized pgbench with default values and ran pgbench... again, with default values.
Let's get a new snapshot now:

test=# select pgcozy_snapshot (0);
NOTICE:  Getting a new snapshot of all contents of pg_buffercache...
NOTICE:  Snapshot Taken..
 pgcozy_snapshot
-----------------

(1 row)

test=# select * from pgcozy.snapshots;
 id | snapshot_date |
                                                                             snapshot

----+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------
  1 | 2015-05-04    |
  2 | 2015-05-04    |
  3 | 2015-05-04    | [{"block_no": 0, "popularity": 3, "table_name": "pgcozy.snapshots"}, {"block_no": 0, "popularity": 2, "table_name": "pgcozy.snapshots_uniq_idx"}, {"block_no": 1, "popularity": 2, "table_nam
e": "pgcozy.snapshots_uniq_idx"}, {"block_no": 0, "popularity": 2, "table_name": "public.pgbench_branches_pkey"}, {"block_no": 0, "popularity": 1, "table_name": "public.pgbench_tellers_pkey"}, {"block_no": 3, "p
opularity": 5, "table_name": "public.pgbench_accounts_pkey"}, {"block_no": 0, "popularity": 1, "table_name": "public.pgbench_accounts_pkey"}]
(3 rows)

As you can see, our latest snapshot has contents; as I previously mentioned, I used jsonb to keep my snapshots nice and organized. This of course is replicated to a slave I have. I don't really care about my slave's buffer cache contents, so I will assume they are invalid because of the read-only operations that happen there. But something happened and I have to failover, so on the slave:

test=# select pgcozy_warm(0);
NOTICE:  warming up according to the latest pgcozy snapshot...
NOTICE:  Warming up 'pgcozy.snapshots' block 0 ...
NOTICE:  Warming up 'pgcozy.snapshots_uniq_idx' block 0 ...
NOTICE:  Warming up 'pgcozy.snapshots_uniq_idx' block 1 ...
NOTICE:  Warming up 'public.pgbench_branches_pkey' block 0 ...
NOTICE:  Warming up 'public.pgbench_tellers_pkey' block 0 ...
NOTICE:  Warming up 'public.pgbench_accounts_pkey' block 3 ...
NOTICE:  Warming up 'public.pgbench_accounts_pkey' block 0 ...
NOTICE:  Done Warming up according to the latest snapshot...

pgcozy_warm can be used to prewarm according to the latest snapshot (0) or a specific snapshot id (check pgcozy.snapshots for that id).

I find this simple set of functions quite useful, because everything passes through shared buffers and it's really easy to put the wrong things in there. This way you can revert the contents of shared buffers to the way they were before any alteration.

Some work still needs to be done to make this more generally useful, but for now it seems to work as it should; let me know what you think.


Thanks for reading
-- Vasilis