Tuesday, 16 September 2014

Offsite replication problems and how to solve them.

 Those of us who use (and abuse) replication in daily basis know how cool and flexible it is. I've seen a lot of guides on how to setup streaming replication in 5 minutes, how to setup basic archiving and/or wal shipping replication but i haven't seen many guides combining these or implementing an offsite setup simulating latency, packet corruption, and basically what happens under network degradation.

In this post i will describe a resilient replication setup of 2 nodes and i will put it to the test. For this post i will use 2 debian VMs, PostgreSQL 9.4 beta2, OmniPITR 1.3.2 and netem.
Netem can be found on all current (2.6+) distributions and it can emulate variable delay, loss, duplication and re-ordering.

The Basics

Streaming replication is awesome, its fast , easy to setup, lightweight and near to realtime, but how it performs over the internet ?

I setup a simple streaming replica, set wal_segments and wal_keep_segments low (10 and 5). Now i wanna emulate how it will perform over a slow internet connection :
From lab2 and as root :
# tc qdisc add dev eth0 root tbf rate 20kbit buffer 1600 limit 3000

This will emulate an "almost" network outage limiting eth0 to 20kbit.
Next, hammer lab1 with transactions... a bit later :
FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000000000000000A has already been removed
Makes sense right ? lab2 couldn't keep up with lab1, lab1 rotated all xlogs and the replica is now broken. I know that this example is a bit extreme, these settings would never be used for an offsite replica, in all fairness they aren't even suitable for a local replica.
But ! network outages happen, and especially on geographically distributed databases this WILL happen and because :
Matter will be damaged in direct proportion to its value.

So, lets configure something that will tolerate such failures and it will do much, much more.

First of all, we want postgres to archive , we want wal files to be transferred compressed and on an encrypted channel. For all my wal management i will use OmniPITR , a wal management suite written by OmniTI that is simple to use, has almost no dependencies and makes everything so much better, i will use rsync over ssh to transfer my wals, bellow the archive_command and the recovery.conf entries for streaming + WAL shipping replication.
(please keep in mind that these settings are for the sake of this post, they are not to be used directly on production systems)

archive_command = '/home/postgres/omnipitr/bin/omnipitr-archive -D /opt/postgres/pgdata/ -dr gzip=postgres@lab2:/opt/postgres/walarchive -l /home/postgres/archive.log -v  "%p"'           

recovery.conf :
standby_mode = 'on'
primary_conninfo = 'user=postgres host=lab1 port=5432'
restore_command = '/home/postgres/omnipitr/bin/omnipitr-restore -l /home/postgres/restore.log -s gzip=/opt/postgres/walarchive -f /home/postgres/failover.now -p /home/postgres/pause.removal -t /tmp -ep hang -pp /opt/postgres/psql/bin/pg_controldata  -sr -r -v %f %p'

Again, with the same wal settings i hammered the database with transactions, the replica started having delays because of the 20kbit limitation but eventually it caught up and everything was ok.

OmniPITR is hands down awesome, it can do much more than just archive and restore wals. You can delay your replica,  you can take hot backups from the slave, that can even be encrypted on creation, you can output the backup from a slave directly to an offsite backup server with no need for extra disk space on the replica and more..
(On some occasions directly sending WALS to the slave might create issues, if you get into this situation, remember that you can always archive wals locally to the master and schedule a script to transfer all wals generated in time intervals that serve your needs.)

This all proves that WAL shipping replication on top of streaming is still very usable when it comes to remote/offsite slaves and you can always switch to only SR or only WAL shipping without problems.
Afterall, its good to know that in case of a network outage, your only limitation is disk space for archived wals.

Feedback and ideas for future posts are always welcome :)

Thanks for reading
- Vasilis

Tuesday, 26 August 2014

PostgreSQL and ElasticSearch

Recently i had to evaluate ElasticSearch for a possible installation, elasticsearch is basically a search server that provides a distributed full-text search engine using a restful web interface, stores documents in json, it is written in java it is fast and really works out of the box with almost minimum effort. After the installation, it's just reading the documentation and adding / searching documents, I didn't really experiment much with searching but the API looks really good.
One interesting question that i had to answer was about connectivity with postgres, and how to maintain a table in both datastores and that's what this post is all about.

The first (fast and easy) answer here was rivers, it creates a jdbc connection with another datastore and based on a query it can pump data from any database table. It is available for postgres, twitter, mongo etc.. Because its jdbc its relatively slow and elasticsearch will (re)pump the data once every restart so pay extra attention if you use this and read the documentation first.

One other way is to use LISTEN/NOTIFY commands of postgres which is basically a message queueing system. The idea is to raise a notification on every insert, a deamon would grab that and insert the record into elasticsearch..

For a single postgres table it would work like this :

create table messages (
 id serial primary key,
 date timestamp without time zone,
 carrier text,
 message text

CREATE OR REPLACE FUNCTION table_message_notify() RETURNS trigger AS $$
  PERFORM pg_notify('table_messages_notifier',CAST(NEW.id AS text));
$$ LANGUAGE plpgsql;

CREATE TRIGGER object_post_insert_notify AFTER insert ON messages FOR EACH ROW EXECUTE PROCEDURE table_message_notify();

This will simply send a notification on 'table_messages_notifier' channel after an insert that a new record has been inserted. Now you need something that would grab and handle these notifications, i tried with various ways like python requests, but after a while i just did it with python elasticsearch library, and my life was suddenly much easier :). Here's the python script that i end up having to work pretty well (managed to replicate about 50m rows with no errors).

NOTE that i've intentionally left garbage in the code just to show alternatives that i personally tried. Also, this was just a proof of concept and not an actual properly working solution, but it should be enough for someone who knows what he is doing to create a deamon that would actually work even in production.

Thanks for reading
-- Vasilis

Friday, 22 August 2014

PostgreSQL Bi-Directional Replication

A while ago I wanted to explore the options i had for multi master replication in postgres, I have previously worked with Oracle MMR and to be honest i don't like the idea of mmr, mostly because of the all conflicts that you sometime get, most of the times these conflicts are easy to resolve though and mmr, or in our case BDR can be a very nice way to scale up your database writes.
So, while searching i came across a project called postgresql BDR (bi-directional replication) developed by 2ndQuadrant, it allows users to create a geographically distributed asynchronous multi-master database using Logical Log Streaming Replication based on the changeset extraction feature introduced in PostgreSQL 9.4.
This post is about showing how to set this up.
For this setup i used 2 debian VMs (debian-7.5.0-amd64-netinst.iso)
the extra packages i installed were :
git, mc, joe,sudo, curl, libreadline-dev, zlib1g-dev, bison, flex
(by the way, comments about joe wont be tolerated !!)

With the vms ready and a user postgres1 created :

git clone git://git.postgresql.org/git/2ndquadrant_bdr.git
cd 2ndquadrant_bdr
git checkout bdr/0.6
./configure --prefix=$HOME/bdr
make install
cd ~/2ndquadrant_bdr/contrib/btree_gist && make && make install && cd ../../contrib/bdr && make && make install

add to user's profile:
export PATH=/home/postgres1/bdr/bin:$PATH

check version :
psql (PostgreSQL) 9.4beta1_bdr0601

sudo mkdir /opt/pgdata1
sudo mkdir /opt/pgdata2
sudo chown postgres1:postgres1 /opt/pgdata1/
sudo chown postgres1:postgres1 /opt/pgdata2/

as postgres1
initdb -D /opt/pgdata1/ -U postgres1 --auth-host=md5 --auth-local=peer

edit /opt/pgdata1/pg_hba.conf and add :
host    replication     postgres1          trust
edit /opt/pgdata1/postgresql.conf and change :
listen_addresses = '*'

wal_level = 'logical'
max_replication_slots = 3
max_wal_senders = 4
shared_preload_libraries = 'bdr'
bdr.connections = 'postgres1'
bdr.postgres2_dsn = 'dbname=postgres host= user=postgres port=5433'
track_commit_timestamp = on

Made a copy with bdr_init_copy to a second $PGDATA directory, made the corrections on the connection strings in postgresql.conf so the 2 clusters could connect with each other and started both postgres servers.

Everything worked, both databases were in read/write mode and i could perform all kinds of operations on both of them.

A few things that have to be noted are:
1. you cannot use normal sequences , you have to use global sequences or you will see all kinds of problems.
2. I noticed the following issues that i'm not sure why they happen :

master1 =# create table test_tr (id int primary key, name text);
master1 =# insert into test_tr values (1,'Vasilis');
master1 =# begin ; update test_tr set name = 'Vasilis2' where id =1;
Before i commit on master1 i do the following on master 2
master2 =# begin ; update test_tr set name = 'Vasilis3' where id =1;
master2 =# commit;
master1 =# commit;
master1 =# select * from test_tr;
id | name

1 | Vasilis3
(1 row)

In my understanding the first transaction would lock the row and the second would wait the first to commit.
#CASE 2.#
master1 =# create table tr_test (name text);
master1 =# insert into tr_test values ('Vasilis');
master1 =# begin;
master1 =# update tr_test set name = 'koko';
master1 =# select name from tr_test ;
(1 row)

master1 =# commit;
master1 =# select name from tr_test ;
(1 row)

master2 =# begin ; update tr_test set name ='llalla';
master2 =# select name from tr_test ;
(1 row)

master2 =# commit;
master2 =# select name from tr_test ;
(1 row)

AT THIS POINT if you drop the table :
master1 =# drop table tr_test ;
... never ending...

master2 =# drop table tr_test ;
... never ending...

RESTART both masters doesn't make any difference
From postgres logs :
LOG: starting background worker process "bdr (6036347025262511738,1,12155,): postgres2: apply"
LOG: connection received: host= port=34641
LOG: replication connection authorized: user=postgres1
LOG: starting logical decoding for slot bdr_12155_6036347078061848225_1_12155__
DETAIL: streaming transactions committing after 0/1E87AC8, reading WAL from 0/1E87A90
LOG: logical decoding found consistent point at 0/1E87A90
DETAIL: running xacts with xcnt == 0
LOG: unexpected EOF on standby connection
LOG: disconnection: session time: 0:00:00.163 user=postgres1 database=postgres host= port=34641
ERROR: could not find primary key for table with oid 25281
LOG: worker process: bdr (6036347025262511738,1,12155,): postgres2: apply (PID 3649) exited with exit code 1

logs will keep increasing, same message repeats..

Conclusion is that BDR is not ready for production , something that of course is absolutely normal considering that its based on 9.4 which is still beta2, but it sure is promising, and if it gets to its final stage it will boost PostgreSQL replication flexibility tremendously!

official user guide can be found here

update: After Craigs comment that 0.7 release is out, i plan making a new post about 0.7 version, trying to reproduce the problems i already found.

Thanks for Reading
-- Vasilis 

Wednesday, 22 January 2014

Setting up Postgres Automated Failover

Nowadays, failover is one of the simplest procedures in postgres. Still, there are not many ways to automate this procedure and perform it without any human interference. This blog post will guide you through a way to implement automated failover by using two open source projects, repmgr2 and pgbouncer.

PGbouncer will act like a connection pool and as an entry point to the database cluster.
Repmgr now, (since version 2) supports automatic failover, the slave will try to connect to the master for a predefined time and number of attempts, if it fails, it will run a script that can be anything. We are going to use this functionality to perform auto-failover, redirect our application connections to the new master database, and notify the dba that a failover has been performed.

Each server will have its own copy of pgbouncer configuration file(s) and when a server is being promoted, the new master also has the responsibility to overwrite the configuration file on the server that pgbouncer runs (failover.sh).

In my example I used:
3 nodes (3 debian net install virtual machines)
$PGDATA configured on each node
3 postgres installations. one on each node
repmgr2-beta2 or later

bouncer witness and pgbouncer server, will act as connection to our database cluster.
pgnode1 postgres database server 1
pgnode2 postgres database server 2

I will use these hosts as examples and as a point of reference for all the configuration files, lets start. . .

Install postgres on all nodes, set passwordless ssh between the three nodes for a user that has access to write in $PGDATA and run postgres binaries.

Install repmgr on all nodes, and pgbouncer only on bouncer.
On pgnode1 set the following in $PGDATA/postgresql.conf

wal_level = 'hot_standby'
archive_mode = on
archive_command = '/bin/false'
max_wal_senders = 10
wal_keep_segments = 5           # ----- FOR MY EXAMPLE 5, suggested by repmgr is bellow -------
#### wal_keep_segments = 5000   # 80 GB required on pg_xlog
hot_standby = on
shared_preload_libraries = 'repmgr_funcs'

and in $PGDATA/pg_hba.conf:
host    all             repmgr             trust
host    replication     repmgr           trust

Bring PostgreSQL up on pgnode1 and run:

createuser -s repmgr
createdb -O repmgr repmgr
psql -f </usr/share/postgresql/9.0>/contrib/repmgr_funcs.sql repmgr

On pgnode2 run:
repmgr -d repmgr -U repmgr -h pgnode1 standby clone
(this will use rsync)

Start postgres on pgnode2, you should have a new slave.
edit a repmgr.conf , whereever you like; in my case /etc/repmgr/repmgr.conf and add:

conninfo='host= dbname=repmgr user=repmgr'
follow_command='repmgr standby follow -f /etc/repmgr/repmgr.conf'

Each node will need its own repmgr.conf file: check github link for all my config files.
on pgnode1:
repmgr -f /etc/repmgr/repmgr.conf master register

on pgnode2:
repmgr -f /etc/repmgr/repmgr.conf standby register
on bouncer (witness server):repmgr -d repmgr -U repmgr -h -D PGDATA -f /etc/repmgr/repmgr.conf witness create

on pgnode2 and on bouncer run:
repmgrd -f /etc/repmgr/repmgr.conf > /var/log/postgresql/repmgr.log 2>&1 
(this is the deamon for repmgr)

Put the configuration file of pgbouncer from the master (pgnode1) to bouncer (pgbouncer server). Restart and test connectivity. If you stop pgnode1's postgres, pgnode2 will takeover, rsync configuration file for pgbouncer to bouncer, restart pgbouncer with ssh and mail a notification.

The main idea behind this implementation is that pgbouncer is the only entry point from the application to the database, each node upon promotion is responsible to “show” pgbouncer where it should now connect, this way, the application can remain unaware of  the current master db.
An alternative implementation could skip pgbouncer completely and each slave could directly change the configuration file of the application. In my example, I used a pgbouncer for 3 reasons: to ensure stability, to enforce connection pooling (and its advantages), and also to skip the assumption that the database stack has access to the application stack, which in my opinion shouldn’t.

  1. If the master is running repmgrd and db crashes then the deamon will exit because it can’t connect to the db, needs a way to restart, possibly with postgres startup.
  2. Bouncer in our example is a single point of failure, witness db can be configured as pgnode1/2 with repmgr and a second pgbouncer on a second machine. The application should take care on how to use these 2 pgbouncers.
  3. Failover.sh runs pg_ctl promote which is a postgresql 9.1 feature, for versions prior to 9.1 you may “touch” a trigger file.

Github link of all the configuration files used


Thanks for reading
--Vasilis Ventirozos

Managing xid wraparound without looking like a (mail) chimp

My colleague Payal came across an outage that happened to mailchimp's mandrill app yesterday, link can be found HERE . Since this was P...