Thursday, 19 May 2016

Repairing clog corruptions

Yesterday, I got a page from a client about a possible corruption issue at one of his clients, so I decided to share how I dealt with it. Before getting into how I approached the problem, I want to say that corruption doesn't just happen in Postgres. At OmniTI we manage a lot of databases, across different major versions and operating systems, and in my time I haven't seen (many) cases where corruption was caused by a Postgres bug (I've seen indexes getting corrupted, but I don't remember ever seeing data being corrupted). My point is that corruption doesn't appear out of nowhere: hardware failures, power outages, and disks filling up are the common causes.
A replica and backups should always be in place, and the server should be properly monitored. Unfortunately, this server was not managed by us, so none of the above was in place.

At first, I saw entries like these in the logs:

2016-05-18 15:50:06 EDT::@:[10191]: ERROR:  could not access status of transaction 128604706
2016-05-18 15:50:06 EDT::@:[10191]: DETAIL:  Could not open file "pg_clog/007A": No such file or directory.
2016-05-18 15:50:06 EDT::@:[10191]: CONTEXT:  automatic vacuum of table "pe.pg_toast.pg_toast_4409334"
...and many more of the same were to be expected.
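As a side note, the segment name in that error can be derived from the transaction id. With the default build settings (8 kB pages, 2 status bits per transaction, so 4 transactions per byte, and 32 pages per SLRU segment), each 256 kB clog file covers 1,048,576 transactions. A quick sketch:

```python
# Which pg_clog segment holds the status of a given transaction id?
# Assumes default build settings: 8 kB blocks, 4 xacts per byte,
# 32 pages per SLRU segment -> 1,048,576 xacts per 256 kB segment.
XACTS_PER_SEGMENT = 8192 * 4 * 32

def clog_segment(xid: int) -> str:
    return format(xid // XACTS_PER_SEGMENT, "04X")

print(clog_segment(128604706))  # the xid from the log above -> 007A
```

128604706 // 1048576 = 122 = 0x7A, which matches the "pg_clog/007A" the server complained about.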

At this point you know that something went wrong. Ideally you want to copy your whole $PGDATA somewhere else and work there, while at the same time you start uncompressing your most recent backup. In my case, there was no backup and not enough disk space to copy $PGDATA.

I started by creating all the missing clog files that I found in the logs:
dd if=/dev/zero of=/var/db/pgdata/pg_clog/0114 bs=256k count=1
dd if=/dev/zero of=/var/db/pgdata/pg_clog/00D1 bs=256k count=1
dd if=/dev/zero of=/var/db/pgdata/pg_clog/0106 bs=256k count=1
Keep creating them until the logs are clean; there can be a lot of them. In my case there were more than 100...
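With more than 100 missing segments, typing dd commands by hand gets old fast. A small sketch (the log excerpt and paths here just mirror the example above) that scrapes the unique segment names out of the log and prints one dd command per missing file:

```python
import re

# Hypothetical excerpt of the postgres log; real logs repeat these many times.
log = '''
DETAIL:  Could not open file "pg_clog/007A": No such file or directory.
DETAIL:  Could not open file "pg_clog/0114": No such file or directory.
DETAIL:  Could not open file "pg_clog/007A": No such file or directory.
'''

# Each missing segment appears many times; de-duplicate before generating commands.
segments = sorted(set(re.findall(r'pg_clog/([0-9A-F]{4})', log)))
for s in segments:
    print(f"dd if=/dev/zero of=/var/db/pgdata/pg_clog/{s} bs=256k count=1")
```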

From the logs I also found the affected tables:


db=# select relname,relnamespace from pg_class where oid in (18454,35597,35607,4409334,4409344,8817516) order by relnamespace;
   relname   | relnamespace
-------------+--------------
 table_case1 |        16872
 table_case1 |        16874
 table_case2 |        16874
 table_case2 |      4409063
 table_case1 |      4409063
 table_case2 |      8817221
(6 rows)

db=# select oid,nspname from pg_namespace where oid in (16872,16874,4409063,8817221) order by oid;
   oid   | nspname
---------+---------
   16872 | schema1
   16874 | schema2
 4409063 | schema3
 8817221 | schema4
(4 rows)

Joining the two gave me the fully qualified schema.table names.
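(In the database, something like `select n.nspname || '.' || c.relname from pg_class c join pg_namespace n on n.oid = c.relnamespace where c.oid in (...)` does it in one step.) The same join, reproduced from the two result sets above:

```python
# Join the pg_class and pg_namespace result sets shown above by namespace oid.
pg_class = [("table_case1", 16872), ("table_case1", 16874), ("table_case2", 16874),
            ("table_case2", 4409063), ("table_case1", 4409063), ("table_case2", 8817221)]
pg_namespace = {16872: "schema1", 16874: "schema2", 4409063: "schema3", 8817221: "schema4"}

qualified = sorted(f"{pg_namespace[ns]}.{rel}" for rel, ns in pg_class)
print(qualified)  # 2 kinds of tables spread across the 4 schemas
```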

Basically, we have an application with multiple schemas, and two kinds of tables were corrupted across four schemas.

For table_case1 (the simple case, a table not referenced by anyone), for each schema:

set search_path to schema1;
create table badrows (badid int);

DO $f$
DECLARE
    curid INT := 0;
    vcontent TEXT;
    badid INT;
BEGIN
FOR badid IN SELECT id FROM table_case1 LOOP
    curid = curid + 1;
    if curid % 100000 = 0 then
        raise notice '% rows inspected', curid;
    end if;
    begin
        SELECT id  -- select the TOASTed data column(s) of your table here, to force detoasting of each row
        INTO vcontent
        FROM table_case1 where id = badid;
    exception
        when others then
            insert into badrows values (badid);
            raise notice 'data for id % is corrupt', badid;
    end;
END LOOP;
END;
$f$;

(This script was adapted from a blog post by Josh Berkus and modified to fit my needs.)

create table table_case1_clean as select * from table_case1
where id not in (select badid from badrows);

TRUNCATE table_case1;
vacuum full verbose table_case1;
insert into table_case1 select * from table_case1_clean;
vacuum full analyze verbose table_case1;
drop table badrows;
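The pattern above, stripped of SQL, is just "try to read every row, remember the ids that throw, rebuild from the rest". A language-neutral sketch of that salvage loop (reader() here is a stand-in for the SELECT that forces each row to be detoasted):

```python
# Generic salvage loop: probe every row, collect the ids whose read fails.
def salvage(ids, reader):
    bad = []
    for rowid in ids:
        try:
            reader(rowid)          # corrupt rows raise here
        except Exception:
            bad.append(rowid)
    good = [i for i in ids if i not in bad]
    return good, bad

# Pretend id 2 is a corrupted row whose toast chunk is missing.
rows = {1: "ok", 2: None, 3: "ok"}
def reader(rowid):
    if rows[rowid] is None:
        raise IOError("missing chunk number 0 for toast value ...")
    return rows[rowid]

good, bad = salvage([1, 2, 3], reader)  # good=[1, 3], bad=[2]
```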

For table_case2 (this one is referenced by two other tables), for each org (schema):

set search_path to schema2;

create table badrows (badid int);

DO $f$
DECLARE
    curid INT := 0;
    vcontent TEXT;
    badid INT;
BEGIN
FOR badid IN SELECT id FROM table_case2 LOOP
    curid = curid + 1;
    if curid % 100000 = 0 then
        raise notice '% rows inspected', curid;
    end if;
    begin
        SELECT id  -- select the TOASTed data column(s) of your table here, to force detoasting of each row
        INTO vcontent
        FROM table_case2 where id = badid;
    exception
        when others then
            insert into badrows values (badid);
            raise notice 'data for id % is corrupt', badid;
    end;
END LOOP;
END;
$f$;

create table table_case2_clean as select * from table_case2
where id not in (select badid from badrows);

alter table reference_table1 drop constraint reference_table1_fkey;
delete from reference_table1 where table_case2_id in (select badid from badrows) ;

alter table reference_table2 drop constraint reference_table2_fkey;
delete from reference_table2 where table_case2_id in (select badid from badrows);

TRUNCATE table_case2;
vacuum full verbose table_case2;
insert into table_case2 select * from table_case2_clean;
vacuum full analyze verbose table_case2;

ALTER TABLE ONLY reference_table1
    ADD CONSTRAINT reference_table1_fkey FOREIGN KEY (table_case2_id) REFERENCES table_case2(id) ON DELETE CASCADE;

ALTER TABLE ONLY reference_table2
    ADD CONSTRAINT reference_table2_fkey FOREIGN KEY (table_case2_id) REFERENCES table_case2(id);

drop table badrows;

(Please ignore or report any typos here; I replaced the real table names while writing this post, so I may have messed up some of the names.)

What we basically did here was recreate the tables without the corrupted rows.
After this, the tables should be corruption free with the minimum possible data loss.
To ensure that you are corruption free, you should either pg_dump and restore, or vacuum full everything; a normal vacuum will NOT reveal corruption:

pe=# vacuum verbose schema1.table_case1;
INFO:  vacuuming " schema1.table_case1"
INFO:  index " schema1.table_case1_pkey" now contains 12175 row versions in 36 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.15 sec.
INFO:  " table_case1": found 0 removable, 12175 nonremovable row versions in 258 out of 258 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.17 sec.
INFO:  vacuuming "pg_toast.pg_toast_18454"
INFO:  index "pg_toast_18454_index" now contains 51370 row versions in 143 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "pg_toast_18454": found 0 removable, 51370 nonremovable row versions in 12331 out of 12331 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.05s/0.03u sec elapsed 0.09 sec.
pe=# vacuum full verbose  schema1.table_case1;
INFO:  vacuuming " schema1.table_case1"
ERROR:  missing chunk number 0 for toast value 9270408 in pg_toast_18454

Rows were lost; in my case that was acceptable. Your case may not be the same as mine, but I hope this provides some guidance if you ever get into a similar situation.

Thanks for reading
- Vasilis

Tuesday, 24 November 2015

Backups, Restores and PITR nowadays

It's been almost a year since 9.4 was released. It included many nice features, but the one that changed my day-to-day administration was replication slots.

At OmniTI, we use something called OmniPITR a lot. OmniPITR could be called a WAL management suite: it can ship WALs, replay them, clean them up when they are no longer needed, and it can be used for backups. Usually the way I use it is to first set up streaming replication to a replica, and on top of that set up WAL shipping replication. The reason is simple: I want to have all the WALs around in case I lose the slave, and I never really liked the wal_keep_segments approach.

Backups are usually being taken from a slave in order to reduce load from the master.

Briefly, OmniPITR does this by sending pg_start/stop_backup to the master, placing a pause-removal file on the slave so WALs are kept, and creating two tar.gz files: one for the base backup and one for the WALs.

Usually I have a backup server keeping all the WAL files and the backups, which means a second (direct or indirect) WAL shipping destination.
All this is nice and works really well, but with replication slots it can be simpler and more efficient.

Recently, I had to setup a backup for a 1 master 2 slave setup on 9.4.

The requirements of the backup would be the following:

  • backup will be taken from the slave.
  • backup will be incremental allowing PITR
  • backup will be stored remotely on another machine.
  • minimum possible wal shipping targets.
  • replicas should be identical and adding a new replica should be trivial.
  • backup should be initiated from the server that keeps backups and not from a slave.
  • backups should be tested by restoration every week.

Pretty straight forward stuff really.

I set up archiving to the backup server with a simple rsync-over-ssh command:
archive_command = 'rsync -a %p postgres@<Backup server IP>:/data/walarchive/%f'

I created a backup user that will use .pgpass
touch ~/.pgpass ; chmod 0600 ~/.pgpass

and added the following entries:

<master ip>:5432:template1:backup:pgbackup
<slave1 ip>:5432:template1:backup:pgbackup
<slave2 ip>:5432:template1:backup:pgbackup
<slave-x ip>:5432:template1:backup:pgbackup

I allowed my backup user in pg_hba.conf (on all db servers)
and then simply used pg_basebackup like this:

rm -rf /data/base/* && pg_basebackup -D /data/base/ -h <slave ip> -U backup -Ft -z && mv /data/base/base.tar.gz /data/backups/basebackup_`date +"%Y-%m-%d"`.tar.gz
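The renaming step at the end just stamps the archive with the current date, via `date +"%Y-%m-%d"`. The equivalent naming logic, for the curious, in a short sketch:

```python
from datetime import date

# Build the dated backup name the shell pipeline above produces;
# directory and prefix follow the example.
def backup_name(day: date) -> str:
    return f"/data/backups/basebackup_{day:%Y-%m-%d}.tar.gz"

print(backup_name(date(2015, 11, 13)))
# /data/backups/basebackup_2015-11-13.tar.gz
```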

I would like it if pg_basebackup could customize the backup name and if it used replication slots (coming in 9.5), but none of this is really a problem when it comes to backups.

I added a recovery.conf that looks like this :
restore_command = '<PATH/TO/pg_standby> -t /data/backups/ -l /data/walarchive %f %p %r'
trigger_file = '/data/backups/'
recovery_end_command = 'rm /data/backups/'
recovery_target = 'immediate'

The parameter “recovery_target” specifies that recovery should end as soon as a consistent state is reached, i.e. as early as possible. When restoring from an online backup, this means the point where taking the backup ended.

NOTE that a recovery.conf file will already exist if the backup was taken from a slave; always remember to edit it and replace its entries with the ones above.

A few notes: before you start the database,
remove all logs from pg_log and verify that:

archive_command ='/bin/true'
synchronous_standby_names = ''

At this moment all you have to do is to start the database and monitor the log.
The log file of a freshly restored database should look like this :

postgres@backup:/data/backups/pg_log$ tail -f postgresql-2015-11-13_020746.log
2015-11-13 02:07:46 EET [] [2344]: [1-1] user=,db=,e=00000 LOG:  database system was interrupted; last known up at 2015-11-13 02:06:20 EET
2015-11-13 02:08:10 EET [] [2344]: [2-1] user=,db=,e=00000 LOG:  starting point-in-time recovery to earliest consistent point
2015-11-13 02:08:10 EET [] [2344]: [3-1] user=,db=,e=00000 LOG:  restored log file "00000002.history" from archive
2015-11-13 02:08:10 EET [] [2344]: [4-1] user=,db=,e=00000 LOG:  restored log file "000000020000000200000054" from archive
2015-11-13 02:08:10 EET [] [2344]: [5-1] user=,db=,e=00000 LOG:  redo starts at 2/54000028
2015-11-13 02:08:10 EET [] [2344]: [6-1] user=,db=,e=00000 LOG:  consistent recovery state reached at 2/540000F0
2015-11-13 02:08:10 EET [] [2344]: [7-1] user=,db=,e=00000 LOG:  recovery stopping after reaching consistency
2015-11-13 02:08:10 EET [] [2344]: [8-1] user=,db=,e=00000 LOG:  recovery has paused
2015-11-13 02:08:10 EET [] [2344]: [9-1] user=,db=,e=00000 HINT:  Execute pg_xlog_replay_resume() to continue.
2015-11-13 02:08:10 EET [] [2342]: [3-1] user=,db=,e=00000 LOG:  database system is ready to accept read only connections

at this point run :

psql -c "select pg_xlog_replay_resume()" template1

you should see in the log file :

2015-11-13 02:10:08 EET [] [2344]: [13-1] user=,db=,e=00000 LOG:  archive recovery complete
2015-11-13 02:10:09 EET [] [2344]: [14-1] user=,db=,e=00000 LOG:  MultiXact member wraparound protections are now enabled
2015-11-13 02:10:09 EET [] [2342]: [4-1] user=,db=,e=00000 LOG:  database system is ready to accept connections
2015-11-13 02:10:09 EET [] [2394]: [1-1] user=,db=,e=00000 LOG:  autovacuum launcher started

Now, assuming that you want to perform PITR.

Follow exactly the same restore procedure as previously described but this time the recovery.conf should look like this :

restore_command = '<PATH/TO/pg_standby> -t /data/backups/ -l /data/walarchive %f %p %r'
#recovery_target_time = '2015-11-13 00:09:00'

# or
#recovery_target_xid = '1966'

trigger_file = '/data/backups/'
#recovery_target_inclusive = 'true'
recovery_end_command = 'rm /data/backups/'

recovery_target_time: This parameter specifies the time stamp up to which recovery will proceed.

recovery_target_xid: This parameter specifies the transaction ID up to which recovery will proceed. Keep in mind that while transaction IDs are assigned sequentially at transaction start, transactions can complete in a different numeric order. The transactions that will be recovered are those that committed before (and optionally including) the specified one. The precise stopping point is also influenced by recovery_target_inclusive.

recovery_target_inclusive: Specifies whether to stop just after the specified recovery target (true), or just before the recovery target (false). Applies when either recovery_target_time or recovery_target_xid is specified. This setting controls whether transactions having exactly the target commit time or ID, respectively, will be included in the recovery. Default is true.
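To make the inclusive/exclusive distinction concrete, here is a toy model (the commit log and target values are made up) of which transactions end up replayed:

```python
# Toy model of recovery_target_time + recovery_target_inclusive:
# replay commits up to the target; the commit at exactly the target
# time is included only when inclusive is true.
def replayed(commits, target_time, inclusive=True):
    if inclusive:
        return [xid for t, xid in commits if t <= target_time]
    return [xid for t, xid in commits if t < target_time]

commits = [("00:08:59", 1964), ("00:09:00", 1965), ("00:09:01", 1966)]
print(replayed(commits, "00:09:00", inclusive=True))   # [1964, 1965]
print(replayed(commits, "00:09:00", inclusive=False))  # [1964]
```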

The rest of this procedure should be identical as previously described.

Automate the restore procedure on the backup server, set up some maintenance crontab entries that delete old backups and WALs, and you have yourself a very simple but efficient backup strategy.

Remember that testing your backups is just as important as taking them!

Thanks for reading

Friday, 30 October 2015

PostgreSQL 9.4 streaming replication over SSL with Replication Slots

I never really liked the concept of proactively keeping WAL files around in case the replica goes down. Setting wal_keep_segments was always tricky, and most of the time we ended up setting it for the worst-case scenario, which means space consumption for no good reason (under normal circumstances).
That's why I really like replication slots.

Replication slots are, and i am quoting Robert Haas :

"a crash-safe data structure which can be created on either a master or a standby to prevent premature removal of write-ahead log segments needed by a standby, as well as (with hot_standby_feedback=on) pruning of tuples whose removal would cause replication conflicts. Slots have some advantages over existing techniques, as explained in the documentation."

While there are a lot of posts about how to set up replication slots (and, to be honest, it's pretty easy), in this post I will combine them with replication over SSL, and I will also show a useful trick for pg_basebackup (props to my friend Denish Patel).

The setup that will serve this post is the following

2 Debian VMS, with minimal install with PostgreSQL 9.4.5 installed on both.
we will call them :

master01 the ip will be and it will act as Master with $PGDATA at /data/.
slave01 the ip will be and it will act as Slave with $PGDATA at /data/.

on Master:
We set SSL and replication settings to :

ssl = on
ssl_cert_file = '/data/ssl/server.crt'
ssl_key_file = '/data/ssl/server.key'
wal_level = hot_standby
max_wal_senders = 5
wal_keep_segments = 0
max_replication_slots = 5
hot_standby = on  

hot_standby is ignored on the master, but since we plan on taking a pg_basebackup, it will be used on the slave.

Before we restart the Master, we have to create the certificate files :

mkdir $PGDATA/ssl ; chown postgres /data/ssl
cd /data/ssl/
openssl req -new -text -out server.req
openssl rsa -in privkey.pem -out server.key
rm privkey.pem
openssl req -x509 -in server.req -text -key server.key -out server.crt
chmod og-rwx server.key

More info about the certificates can be found here.

Now we need a proper replication user, a replication slot, and a replication entry in pg_hba.conf.
for the user :
psql -c "create role repuser with login replication password 'password';" postgres
for the slot :
psql -c "SELECT * FROM pg_create_physical_replication_slot('slave01_slot');" postgres
and pg_hba.conf should look like this:

# TYPE  DATABASE        USER            ADDRESS                 METHOD
local   all             all                                     trust
host    all             all               trust
host    all             vasilis          md5
hostssl replication     repuser          md5
host    all             all                  reject

Two things need attention in pg_hba.conf:
1. hostssl for the replication user.
2. keep your "reject everyone else" entry last.
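The reason the order matters is that pg_hba.conf is evaluated top-down and the first matching line wins. A much-simplified matcher (exact type matching only; real pg_hba matching is more subtle) makes the point:

```python
# Simplified first-match evaluation of pg_hba-style rules.
rules = [
    ("hostssl", "replication", "repuser", "md5"),
    ("host",    "all",         "all",     "reject"),   # catch-all must stay last
]

def auth_method(conn_type, db, user):
    for r_type, r_db, r_user, method in rules:
        if r_type == conn_type and r_db in ("all", db) and r_user in ("all", user):
            return method
    return "no match"

print(auth_method("hostssl", "replication", "repuser"))  # md5
print(auth_method("host", "somedb", "intruder"))         # reject
```

If the catch-all line came first, every connection (including the replication user) would hit it and be rejected.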

With all these set on the Master, it's time to restart it and start working on the slave.

On slave:

Connect to the replication slot BEFORE you initiate pg_basebackup; this way the Master will keep all the WAL files needed for the duration of the backup. Connecting to the replication slot can be done with pg_receivexlog, e.g.:

> pg_receivexlog -D . -S slave01_slot -v -h -U repuser -W
pg_receivexlog: starting log streaming at 0/3E000000 (timeline 1)
pg_receivexlog: finished segment at 0/3F000000 (timeline 1)
^Cpg_receivexlog: received interrupt signal, exiting
pg_receivexlog: received interrupt signal, exiting

Notice that I pressed ^C to stop pg_receivexlog; this way the WAL files will remain on the Master until my slave is up.

now run pg_basebackup :
pg_basebackup -h -D $PGDATA -P -X stream -U repuser

When pg_basebackup is complete, add a recovery.conf with the following entries :

standby_mode = 'on'
primary_conninfo = 'host= port=5432 user=repuser password=password sslmode=require'
trigger_file = '/data/failover.trigger'
recovery_target_timeline = 'latest'
primary_slot_name = 'slave01_slot'

and start the slave.
you should see:
database system is ready to accept read only connections

Now, in order to monitor your replication there are two useful sql statements :
I ran the first one while my slave was down, so you can see that it shows how much (WAL) data has to be transferred to my slave.

postgres=# SELECT slot_name, database,
active,  pg_xlog_location_diff(pg_current_xlog_insert_location(), restart_lsn) AS retained_bytes FROM pg_replication_slots;
  slot_name   | database | active | retained_bytes
--------------+----------+--------+----------------
 slave01_slot |          | f      |      201326688
(1 row)

and after i started the slave :

postgres=# SELECT slot_name, database, active,pg_xlog_location_diff(pg_current_xlog_insert_location(), restart_lsn) AS retained_bytes FROM pg_replication_slots;
  slot_name   | database | active | retained_bytes
--------------+----------+--------+----------------
 slave01_slot |          | t      |              0
(1 row)

the second monitoring statement is :

postgres=#  SELECT pg_xlog_location_diff(pg_current_xlog_insert_location(), flush_location) AS lag_bytes, pid, application_name
 FROM pg_stat_replication;

 lag_bytes | pid | application_name
-----------+-----+------------------
         0 | 855 | walreceiver

This will show you how many bytes your slave is behind.
(both statements are to be run on Master unless cascading replicas are in place)
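The arithmetic behind pg_xlog_location_diff is simple enough to do by hand: an LSN like 2/54000028 is a 64-bit byte position written as two hex halves. A sketch:

```python
# Convert an xlog location ('high/low' in hex) to an absolute byte position.
def lsn(loc: str) -> int:
    hi, lo = loc.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

# Difference between two positions = bytes of WAL between them.
print(lsn("0/3F000000") - lsn("0/3E000000"))  # 16777216, i.e. one 16 MB segment
```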

Thanks for reading,
Vasilis Ventirozos

Thursday, 7 May 2015

Keeping snapshots of shared buffers and prewarming accordingly.

One major problem I face during a failover is that the contents of shared buffers on the slave are invalid or irrelevant. pg_prewarm, which came with 9.4, is a good solution: usually you would use it to prewarm a table, or you could manually get the pages from pg_buffercache and prewarm the pages you had on the master prior to the failover. Not everyone can do that, especially during a failover (I reckon only a few can), so I wrote an extension called 'pgcozy' (comments about the name should be directed to @DenishPatel hehe).
This extension basically utilizes 2 well known extensions, pg_buffercache and pg_prewarm.
pg_buffercache has helped me more times in the past than I can count; it shows the contents (pages) of shared buffers and their popularity.
pg_prewarm is a new (9.4) extension that can be used on the block or relation level. In our case we will use it in the block level and we will put these 2 extensions to work.

pgcozy stores, on demand (or scheduled via a crontab entry), a snapshot of the contents of shared buffers in jsonb, and because that snapshot is replicated, any slave will know the contents of the master's shared buffers.
It consists of 3 functions:
  1. an initialization function
  2. a snapshot function
  3. a warm function
But first lets start with the installation.

Unzip it, then run: make ; make install

then, from the db that you want: create extension pgcozy;
It requires pg_buffercache and pg_prewarm to be preinstalled, so if you get an error about them, install them first.

After installation you need to initialize the schema and the table where it will store its snapshots. For now these are hardcoded: the schema is named pgcozy and the table is called snapshots.

test=# select pgcozy_init();
NOTICE:  pg_buffercache and pg_prewarm exists continuing...
NOTICE:  schema "pgcozy" does not exist, skipping
CONTEXT:  SQL statement "drop schema IF EXISTS pgcozy cascade"
PL/pgSQL function pgcozy_init() line 6 at SQL statement
NOTICE:  type "cozy_type" does not exist, skipping
CONTEXT:  SQL statement "drop type if exists cozy_type"
PL/pgSQL function pgcozy_init() line 10 at SQL statement
NOTICE:  Everything is done, check pgcozy schema for more details.

The above statement initializes the schema and the table; from here you should be good to start taking snapshots:

test=# select pgcozy_snapshot (0);
NOTICE:  Getting a new snapshot of all contents of pg_buffercache...
NOTICE:  Snapshot Taken..

(1 row)

test=# select pgcozy_snapshot (1);
NOTICE:  Getting a new snapshot...

(1 row)

test=# select * from pgcozy.snapshots;
 id | snapshot_date | snapshot
----+---------------+----------
  1 | 2015-05-04    |
  2 | 2015-05-04    |
(2 rows)

Because this test db is fresh it has no contents, but if you look at the argument of pgcozy_snapshot you can see that you can choose 0-5:
0 means all contents, and 1-5 refers to the page popularity in pg_buffercache.
This means you can select which pages you want to keep in your snapshot.
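Since the snapshot is plain jsonb, the same popularity filtering can also be done client-side; a sketch over a made-up snapshot in the layout pgcozy uses (block_no / popularity / table_name):

```python
import json

# A made-up pgcozy-style snapshot; the layout mirrors what the extension stores.
snapshot = json.loads('''[
  {"block_no": 0, "popularity": 3, "table_name": "pgcozy.snapshots"},
  {"block_no": 0, "popularity": 1, "table_name": "public.pgbench_tellers_pkey"},
  {"block_no": 3, "popularity": 5, "table_name": "public.pgbench_accounts_pkey"}
]''')

# Keep only pages with popularity 3 or higher.
popular = [e for e in snapshot if e["popularity"] >= 3]
print([(e["table_name"], e["block_no"]) for e in popular])
```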

So let's initialize the db with some contents. I will use pgbench for this:

vasilis@cozy1:~$ ./pgbench -i -d test
NOTICE:  table "pgbench_history" does not exist, skipping
NOTICE:  table "pgbench_tellers" does not exist, skipping
NOTICE:  table "pgbench_accounts" does not exist, skipping
NOTICE:  table "pgbench_branches" does not exist, skipping
creating tables...
100000 of 100000 tuples (100%) done (elapsed 0.13 s, remaining 0.00 s).
set primary keys...
vasilis@cozy1:~$ ./pgbench test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 10
number of transactions actually processed: 10/10
latency average: 0.000 ms
tps = 576.335658 (including connections establishing)
tps = 699.154024 (excluding connections establishing)

I didn't do much here; I just initialized pgbench with default values and ran pgbench, again with default values.
Let's get a new snapshot now:

test=# select pgcozy_snapshot (0);
NOTICE:  Getting a new snapshot of all contents of pg_buffercache...
NOTICE:  Snapshot Taken..

(1 row)

test=# select * from pgcozy.snapshots;
 id | snapshot_date | snapshot
----+---------------+----------
  1 | 2015-05-04    |
  2 | 2015-05-04    |
  3 | 2015-05-04    | [{"block_no": 0, "popularity": 3, "table_name": "pgcozy.snapshots"}, {"block_no": 0, "popularity": 2, "table_name": "pgcozy.snapshots_uniq_idx"}, {"block_no": 1, "popularity": 2, "table_name": "pgcozy.snapshots_uniq_idx"}, {"block_no": 0, "popularity": 2, "table_name": "public.pgbench_branches_pkey"}, {"block_no": 0, "popularity": 1, "table_name": "public.pgbench_tellers_pkey"}, {"block_no": 3, "popularity": 5, "table_name": "public.pgbench_accounts_pkey"}, {"block_no": 0, "popularity": 1, "table_name": "public.pgbench_accounts_pkey"}]
(3 rows)

As you can see, our latest snapshot has contents; as I previously mentioned, I used jsonb to keep my snapshots nice and organized. This, of course, is replicated to a slave I have. I don't really care about my slave's buffer cache contents, so I will assume they are invalid because of the read-only operations that happen there. But something happened and I have to failover, so on the slave:

test=# select pgcozy_warm(0);
NOTICE:  warming up according to the latest pgcozy snapshot...
NOTICE:  Warming up 'pgcozy.snapshots' block 0 ...
NOTICE:  Warming up 'pgcozy.snapshots_uniq_idx' block 0 ...
NOTICE:  Warming up 'pgcozy.snapshots_uniq_idx' block 1 ...
NOTICE:  Warming up 'public.pgbench_branches_pkey' block 0 ...
NOTICE:  Warming up 'public.pgbench_tellers_pkey' block 0 ...
NOTICE:  Warming up 'public.pgbench_accounts_pkey' block 3 ...
NOTICE:  Warming up 'public.pgbench_accounts_pkey' block 0 ...
NOTICE:  Done Warming up according to the latest snapshot...

pgcozy_warm can be used to prewarm according to the latest snapshot (0) or a specific snapshot id (check pgcozy.snapshots for the id).

I find this simple set of functions quite useful, because everything passes through shared buffers and it's really easy to put the wrong things in there. This way you can revert the contents of shared buffers to the way they were before any alteration.

Some work still needs to be done to make it more general-purpose, but for now it seems to work as it should; let me know what you think.

Thanks for reading
-- Vasilis

Tuesday, 16 September 2014

Offsite replication problems and how to solve them.

Those of us who use (and abuse) replication on a daily basis know how cool and flexible it is. I've seen a lot of guides on how to set up streaming replication in 5 minutes, and how to set up basic archiving and/or WAL shipping replication, but I haven't seen many guides that combine them, or that implement an offsite setup simulating latency, packet corruption, and, basically, what happens under network degradation.

In this post I will describe a resilient replication setup of 2 nodes and put it to the test. For this post I will use 2 Debian VMs, PostgreSQL 9.4 beta2, OmniPITR 1.3.2 and netem.
Netem can be found on all current (2.6+) distributions and it can emulate variable delay, loss, duplication and re-ordering.

The Basics

Streaming replication is awesome: it's fast, easy to set up, lightweight and near real-time. But how does it perform over the internet?

I set up a simple streaming replica, keeping checkpoint_segments and wal_keep_segments low (10 and 5). Now I want to emulate how it will perform over a slow internet connection.
From lab2 and as root :
# tc qdisc add dev eth0 root tbf rate 20kbit buffer 1600 limit 3000

This will emulate an "almost" network outage limiting eth0 to 20kbit.
Next, hammer lab1 with transactions... a bit later :
FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000000000000000A has already been removed
Makes sense, right? lab2 couldn't keep up with lab1, lab1 recycled all its xlogs, and the replica is now broken. I know this example is a bit extreme; these settings would never be used for an offsite replica. In all fairness, they aren't even suitable for a local replica.
But network outages happen, and especially with geographically distributed databases this WILL happen, because:
"Matter will be damaged in direct proportion to its value."

So, let's configure something that will tolerate such failures, and will do much, much more.

First of all, we want postgres to archive, and we want WAL files to be transferred compressed and over an encrypted channel. For all my WAL management I will use OmniPITR, a WAL management suite written by OmniTI that is simple to use, has almost no dependencies, and makes everything so much better. I will use rsync over ssh to transfer my WALs; below are the archive_command and recovery.conf entries for streaming + WAL shipping replication.
(Please keep in mind that these settings are for the sake of this post; they are not to be used directly on production systems.)

archive_command = '/home/postgres/omnipitr/bin/omnipitr-archive -D /opt/postgres/pgdata/ -dr gzip=postgres@lab2:/opt/postgres/walarchive -l /home/postgres/archive.log -v  "%p"'           

recovery.conf :
standby_mode = 'on'
primary_conninfo = 'user=postgres host=lab1 port=5432'
restore_command = '/home/postgres/omnipitr/bin/omnipitr-restore -l /home/postgres/restore.log -s gzip=/opt/postgres/walarchive -f /home/postgres/ -p /home/postgres/pause.removal -t /tmp -ep hang -pp /opt/postgres/psql/bin/pg_controldata  -sr -r -v %f %p'

Again, with the same WAL settings, I hammered the database with transactions. The replica started lagging because of the 20kbit limitation, but eventually it caught up and everything was ok.

OmniPITR is hands down awesome, and it can do much more than just archive and restore WALs: you can delay your replica, you can take hot backups from the slave that can even be encrypted on creation, you can stream the backup from a slave directly to an offsite backup server with no need for extra disk space on the replica, and more.
(On some occasions directly sending WALS to the slave might create issues, if you get into this situation, remember that you can always archive wals locally to the master and schedule a script to transfer all wals generated in time intervals that serve your needs.)

This all proves that WAL shipping replication on top of streaming is still very usable when it comes to remote/offsite slaves, and you can always switch to only streaming replication or only WAL shipping without problems.
After all, it's good to know that in case of a network outage, your only limitation is disk space for archived WALs.

Feedback and ideas for future posts are always welcome :)

Thanks for reading
- Vasilis

Tuesday, 26 August 2014

PostgreSQL and ElasticSearch

Recently I had to evaluate ElasticSearch for a possible installation. Elasticsearch is basically a search server that provides a distributed full-text search engine over a restful web interface and stores documents as json. It is written in java, it is fast, and it really works out of the box with almost minimal effort. After the installation, it's just a matter of reading the documentation and adding / searching documents. I didn't really experiment much with searching, but the API looks really good.
One interesting question I had to answer was about connectivity with postgres, and how to maintain a table in both datastores; that's what this post is all about.

The first (fast and easy) answer here was rivers: a river creates a jdbc connection with another datastore and, based on a query, can pump data from any database table. It is available for postgres, twitter, mongo, etc. Because it's jdbc it is relatively slow, and elasticsearch will (re)pump the data on every restart, so pay extra attention if you use this, and read the documentation first.

Another way is to use the LISTEN/NOTIFY commands of postgres, which are basically a message queueing system. The idea is to raise a notification on every insert; a daemon grabs it and inserts the record into elasticsearch.

For a single postgres table it would work like this :

create table messages (
 id serial primary key,
 date timestamp without time zone,
 carrier text,
 message text
);

CREATE OR REPLACE FUNCTION table_message_notify() RETURNS trigger AS $$
BEGIN
  PERFORM pg_notify('table_messages_notifier', CAST(NEW.id AS text));
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER object_post_insert_notify AFTER insert ON messages FOR EACH ROW EXECUTE PROCEDURE table_message_notify();

This will simply send a notification on the 'table_messages_notifier' channel after an insert, saying that a new record has been inserted. Now you need something to grab and handle these notifications. I tried various ways, like python requests, but after a while I just did it with the python elasticsearch library, and my life suddenly became much easier :). Here's the python script that I ended up with, which worked pretty well (I managed to replicate about 50M rows with no errors).
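Before writing any consumer, the channel can be sanity-checked with two psql sessions (a sketch; the payload shown is whatever value the id sequence happens to assign):

```sql
-- session 1:
LISTEN table_messages_notifier;

-- session 2:
INSERT INTO messages (date, carrier, message) VALUES (now(), 'test', 'hello');

-- session 1 (after its next command, even a bare ";") prints something like:
-- Asynchronous notification "table_messages_notifier" received from server process ...
```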

NOTE that I've intentionally left garbage in the code just to show alternatives that I personally tried. Also, this was just a proof of concept and not a properly working solution, but it should be enough for someone who knows what he is doing to create a daemon that would actually work, even in production.
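A minimal sketch of such a daemon, assuming the psycopg2 and elasticsearch python packages (the connection details, the "messages" index name and the row_to_doc helper are placeholders of mine, not the original script):

```python
# Minimal LISTEN/NOTIFY consumer sketch (not the original script from the
# post). Assumes the trigger above sends the new row's id as the payload.
# psycopg2 and elasticsearch are third-party packages; connection details,
# the "messages" index name and row_to_doc are placeholders.
import select

def row_to_doc(row):
    # Turn a (id, date, carrier, message) tuple into an elasticsearch document.
    _id, date, carrier, message = row
    return {"date": str(date), "carrier": carrier, "message": message}

def run():
    import psycopg2
    import psycopg2.extensions
    from elasticsearch import Elasticsearch

    conn = psycopg2.connect("dbname=postgres user=postgres")
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    cur = conn.cursor()
    cur.execute("LISTEN table_messages_notifier;")
    es = Elasticsearch()

    while True:
        # Wait until the connection's socket becomes readable, then poll
        # for pending notifications and index each new row.
        if select.select([conn], [], [], 5) == ([], [], []):
            continue
        conn.poll()
        while conn.notifies:
            notify = conn.notifies.pop(0)
            cur.execute(
                "SELECT id, date, carrier, message FROM messages WHERE id = %s",
                (int(notify.payload),))
            row = cur.fetchone()
            if row:
                es.index(index="messages", id=row[0], body=row_to_doc(row))

if __name__ == "__main__":
    run()
```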

Thanks for reading
-- Vasilis

Friday, 22 August 2014

PostgreSQL Bi-Directional Replication

A while ago I wanted to explore the options I had for multi-master replication in postgres. I have previously worked with Oracle MMR, and to be honest I don't like the idea of MMR, mostly because of all the conflicts that you sometimes get; most of the time these conflicts are easy to resolve though, and MMR, or in our case BDR, can be a very nice way to scale up your database writes.
So, while searching, I came across a project called postgresql BDR (bi-directional replication), developed by 2ndQuadrant. It allows users to create a geographically distributed asynchronous multi-master database using Logical Log Streaming Replication, based on the changeset extraction feature introduced in PostgreSQL 9.4.
This post is about showing how to set this up.
For this setup i used 2 debian VMs (debian-7.5.0-amd64-netinst.iso)
the extra packages I installed were:
git, mc, joe, sudo, curl, libreadline-dev, zlib1g-dev, bison, flex
(by the way, comments about joe won't be tolerated !!)

With the vms ready and a user postgres1 created :

git clone git://
cd 2ndquadrant_bdr
git checkout bdr/0.6
./configure --prefix=$HOME/bdr
make install
cd ~/2ndquadrant_bdr/contrib/btree_gist && make && make install && cd ../../contrib/bdr && make && make install

add to user's profile:
export PATH=/home/postgres1/bdr/bin:$PATH

check version :
psql (PostgreSQL) 9.4beta1_bdr0601

sudo mkdir /opt/pgdata1
sudo mkdir /opt/pgdata2
sudo chown postgres1:postgres1 /opt/pgdata1/
sudo chown postgres1:postgres1 /opt/pgdata2/

as postgres1
initdb -D /opt/pgdata1/ -U postgres1 --auth-host=md5 --auth-local=peer

edit /opt/pgdata1/pg_hba.conf and add :
host    replication     postgres1          trust
edit /opt/pgdata1/postgresql.conf and change :
listen_addresses = '*'

wal_level = 'logical'
max_replication_slots = 3
max_wal_senders = 4
shared_preload_libraries = 'bdr'
bdr.connections = 'postgres2'
bdr.postgres2_dsn = 'dbname=postgres host= user=postgres port=5433'
track_commit_timestamp = on
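The second cluster gets a mirrored configuration (a sketch based on the settings above; the connection name, user and ports are assumptions):

```
# /opt/pgdata2/postgresql.conf (second cluster, listening on 5433)
port = 5433
wal_level = 'logical'
max_replication_slots = 3
max_wal_senders = 4
shared_preload_libraries = 'bdr'
bdr.connections = 'postgres1'
bdr.postgres1_dsn = 'dbname=postgres user=postgres port=5432'
track_commit_timestamp = on
```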

I made a copy with bdr_init_copy to a second $PGDATA directory, corrected the connection strings in postgresql.conf so the 2 clusters could connect to each other, and started both postgres servers.

Everything worked, both databases were in read/write mode and i could perform all kinds of operations on both of them.

A few things that have to be noted are:
1. You cannot use normal sequences; you have to use global sequences or you will see all kinds of problems.
2. I noticed the following issues that I'm not sure why they happen:

master1 =# create table test_tr (id int primary key, name text);
master1 =# insert into test_tr values (1,'Vasilis');
master1 =# begin ; update test_tr set name = 'Vasilis2' where id =1;
Before I commit on master1, I do the following on master2:
master2 =# begin ; update test_tr set name = 'Vasilis3' where id =1;
master2 =# commit;
master1 =# commit;
master1 =# select * from test_tr;
id | name

1 | Vasilis3
(1 row)

In my understanding the first transaction would lock the row and the second would wait for the first to commit; with asynchronous multi-master though there is no cross-node row lock, so both commits succeed and the conflict is apparently resolved afterwards with the last commit winning.
#CASE 2.#
master1 =# create table tr_test (name text);
master1 =# insert into tr_test values ('Vasilis');
master1 =# begin;
master1 =# update tr_test set name = 'koko';
master1 =# select name from tr_test ;
(1 row)

master1 =# commit;
master1 =# select name from tr_test ;
(1 row)

master2 =# begin ; update tr_test set name ='llalla';
master2 =# select name from tr_test ;
(1 row)

master2 =# commit;
master2 =# select name from tr_test ;
(1 row)

AT THIS POINT if you drop the table :
master1 =# drop table tr_test ;
... never ending...

master2 =# drop table tr_test ;
... never ending...

Restarting both masters doesn't make any difference.
From postgres logs :
LOG: starting background worker process "bdr (6036347025262511738,1,12155,): postgres2: apply"
LOG: connection received: host= port=34641
LOG: replication connection authorized: user=postgres1
LOG: starting logical decoding for slot bdr_12155_6036347078061848225_1_12155__
DETAIL: streaming transactions committing after 0/1E87AC8, reading WAL from 0/1E87A90
LOG: logical decoding found consistent point at 0/1E87A90
DETAIL: running xacts with xcnt == 0
LOG: unexpected EOF on standby connection
LOG: disconnection: session time: 0:00:00.163 user=postgres1 database=postgres host= port=34641
ERROR: could not find primary key for table with oid 25281
LOG: worker process: bdr (6036347025262511738,1,12155,): postgres2: apply (PID 3649) exited with exit code 1

The logs keep growing, with the same messages repeating.

The conclusion is that BDR is not ready for production, something that is of course absolutely normal considering that it's based on 9.4, which is still beta2; but it sure is promising, and if it gets to its final stage it will boost PostgreSQL replication flexibility tremendously!

The official user guide can be found here.

Update: After Craig's comment that the 0.7 release is out, I plan on making a new post about version 0.7, trying to reproduce the problems I already found.

Thanks for Reading
-- Vasilis