Thursday, 7 May 2015

Keeping snapshots of shared buffers and prewarming accordingly.

One major problem I face during a failover is that the contents of shared buffers on the slave are invalid or irrelevant. pg_prewarm, which came with 9.4, is a good solution, but you would usually use it to prewarm a whole table; alternatively, you could manually get the pages from pg_buffercache and prewarm exactly the pages you had on the master prior to the failover. Not everyone can do this, especially in the middle of a failover (I reckon only a few can), so I wrote an extension called 'pgcozy' (comments about the name should be directed to @DenishPatel hehe).
This extension basically combines two well-known extensions, pg_buffercache and pg_prewarm.
pg_buffercache has helped me more times than I can count; it shows the contents (pages) of shared buffers and their popularity (usage count).
pg_prewarm is a new (9.4) extension that can be used at the block or relation level. In our case we will use it at the block level and put these two extensions to work.
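
To give an idea of the manual approach, it looks roughly like this (a sketch; it assumes both extensions are installed, that the same relations exist on the promoted slave, and that the block numbers come from the master's pg_buffercache):

-- on the master: which blocks of which relations are cached, and how popular they are
SELECT c.oid::regclass AS relation, b.relblocknumber AS block_no, b.usagecount AS popularity
FROM pg_buffercache b
JOIN pg_class c ON pg_relation_filenode(c.oid) = b.relfilenode
WHERE b.reldatabase = (SELECT oid FROM pg_database WHERE datname = current_database())
ORDER BY b.usagecount DESC;

-- on the promoted slave: prewarm one block (here block 3 of an index) into shared buffers
SELECT pg_prewarm('public.pgbench_accounts_pkey', 'buffer', 'main', 3, 3);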

pgcozy stores, on demand or on a schedule (e.g. via a crontab entry), a snapshot of the contents of shared buffers in jsonb; because that table is replicated, any slave will know the contents of the master's shared buffers.
It consists of three functions:
  1. an initialization function
  2. a snapshot function
  3. a warm function
But first, let's start with the installation.

Unzip it, then run: make ; make install

Then, from the database you want it in: create extension pgcozy;
It requires pg_buffercache and pg_prewarm to be preinstalled, so if you get an error about either of them, please install them first.
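
If they are missing, they ship with contrib, so installing them is just:

test=# create extension pg_buffercache;
test=# create extension pg_prewarm;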

After installation you need to initialize the schema and the table where it will store its snapshots. For now these names are hardcoded: the schema is pgcozy and the table is called snapshots.

test=# select pgcozy_init();
NOTICE:  pg_buffercache and pg_prewarm exists continuing...
NOTICE:  schema "pgcozy" does not exist, skipping
CONTEXT:  SQL statement "drop schema IF EXISTS pgcozy cascade"
PL/pgSQL function pgcozy_init() line 6 at SQL statement
NOTICE:  type "cozy_type" does not exist, skipping
CONTEXT:  SQL statement "drop type if exists cozy_type"
PL/pgSQL function pgcozy_init() line 10 at SQL statement
NOTICE:  Everything is done, check pgcozy schema for more details.

The above statement initializes the schema and the table; from here you should be good to start taking snapshots:

test=# select pgcozy_snapshot (0);
NOTICE:  Getting a new snapshot of all contents of pg_buffercache...
NOTICE:  Snapshot Taken..

(1 row)

test=# select pgcozy_snapshot (1);
NOTICE:  Getting a new snapshot...

(1 row)

test=# select * from pgcozy.snapshots;
 id | snapshot_date | snapshot
----+---------------+----------
  1 | 2015-05-04    |
  2 | 2015-05-04    |
(2 rows)

Because this test db is fresh it has no contents yet, but notice the argument of pgcozy_snapshot: you can choose 0-5.
0 means all buffered pages, while 1-5 refers to the page popularity (usage count) reported by pg_buffercache.
This means that you can choose which pages you want to keep in your snapshot.
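
For example, if you only care about the more popular pages and want a snapshot taken every hour, something like this would do (the crontab line is only illustrative, and I'm treating the argument as a popularity filter as described above):

test=# select pgcozy_snapshot(3);

-- and in the postgres user's crontab something like:
-- 0 * * * * psql -d test -c "select pgcozy_snapshot(3);"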

So let's initialize the db with some contents. I will use pgbench for this:

vasilis@cozy1:~$ ./pgbench -i -d test
NOTICE:  table "pgbench_history" does not exist, skipping
NOTICE:  table "pgbench_tellers" does not exist, skipping
NOTICE:  table "pgbench_accounts" does not exist, skipping
NOTICE:  table "pgbench_branches" does not exist, skipping
creating tables...
100000 of 100000 tuples (100%) done (elapsed 0.13 s, remaining 0.00 s).
set primary keys...
vasilis@cozy1:~$ ./pgbench test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 10
number of transactions actually processed: 10/10
latency average: 0.000 ms
tps = 576.335658 (including connections establishing)
tps = 699.154024 (excluding connections establishing)

I didn't do much here; I just initialized pgbench with default values and ran a pgbench test, again with default values.
Let's take a new snapshot now:

test=# select pgcozy_snapshot (0);
NOTICE:  Getting a new snapshot of all contents of pg_buffercache...
NOTICE:  Snapshot Taken..

(1 row)

test=# select * from pgcozy.snapshots;
 id | snapshot_date | snapshot
----+---------------+----------
  1 | 2015-05-04    |
  2 | 2015-05-04    |
  3 | 2015-05-04    | [{"block_no": 0, "popularity": 3, "table_name": "pgcozy.snapshots"}, {"block_no": 0, "popularity": 2, "table_name": "pgcozy.snapshots_uniq_idx"}, {"block_no": 1, "popularity": 2, "table_name": "pgcozy.snapshots_uniq_idx"}, {"block_no": 0, "popularity": 2, "table_name": "public.pgbench_branches_pkey"}, {"block_no": 0, "popularity": 1, "table_name": "public.pgbench_tellers_pkey"}, {"block_no": 3, "popularity": 5, "table_name": "public.pgbench_accounts_pkey"}, {"block_no": 0, "popularity": 1, "table_name": "public.pgbench_accounts_pkey"}]
(3 rows)

As you can see, our latest snapshot has contents. As I mentioned earlier, I used jsonb to keep my snapshots nice and organized. This is of course being replicated to a slave I have. I don't really care about the slave's buffer cache contents, so I will assume they are invalid because of the read-only operations that happen there... but then something happens and I have to fail over, so on the slave:

test=# select pgcozy_warm(0);
NOTICE:  warming up according to the latest pgcozy snapshot...
NOTICE:  Warming up 'pgcozy.snapshots' block 0 ...
NOTICE:  Warming up 'pgcozy.snapshots_uniq_idx' block 0 ...
NOTICE:  Warming up 'pgcozy.snapshots_uniq_idx' block 1 ...
NOTICE:  Warming up 'public.pgbench_branches_pkey' block 0 ...
NOTICE:  Warming up 'public.pgbench_tellers_pkey' block 0 ...
NOTICE:  Warming up 'public.pgbench_accounts_pkey' block 3 ...
NOTICE:  Warming up 'public.pgbench_accounts_pkey' block 0 ...
NOTICE:  Done Warming up according to the latest snapshot...

pgcozy_warm can be used to prewarm according to the latest snapshot (0) or to a specific snapshot id (check pgcozy.snapshots for that id).
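
So, for example, to warm up using snapshot id 3 from the listing above instead of the latest one:

test=# select pgcozy_warm(3);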

I find this simple set of functions quite useful because everything passes through shared buffers and it's really easy to put the wrong things in there. This way you can revert the contents of shared buffers to the way they were before any alteration.
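
A typical use would look something like this (just a sketch, using the objects from the pgbench example above):

test=# select pgcozy_snapshot(0);                     -- keep the current working set
test=# select count(*) from public.pgbench_accounts;  -- some heavy one-off job that churns shared buffers
test=# select pgcozy_warm(0);                         -- put the old working set back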

Some work still needs to be done to make it more generally useful, but for now it seems to work as it should; let me know what you think.

Thanks for reading
-- Vasilis

Tuesday, 16 September 2014

Offsite replication problems and how to solve them.

Those of us who use (and abuse) replication on a daily basis know how cool and flexible it is. I've seen a lot of guides on how to set up streaming replication in 5 minutes, or how to set up basic archiving and/or WAL shipping replication, but I haven't seen many guides that combine these or that implement an offsite setup simulating latency, packet corruption, and, basically, what happens under network degradation.

In this post I will describe a resilient two-node replication setup and put it to the test. For this post I will use 2 Debian VMs, PostgreSQL 9.4 beta2, OmniPITR 1.3.2 and netem.
netem is available on all current (2.6+) kernels and it can emulate variable delay, packet loss, duplication and re-ordering.

The Basics

Streaming replication is awesome: it's fast, easy to set up, lightweight and near realtime. But how does it perform over the internet?

I set up a simple streaming replica and set wal_segments and wal_keep_segments low (10 and 5). Now I want to emulate how it will perform over a slow internet connection.
From lab2, as root:
# tc qdisc add dev eth0 root tbf rate 20kbit buffer 1600 limit 3000

This will emulate an "almost" network outage by limiting eth0 to 20kbit.
Next, hammer lab1 with transactions... a bit later :
FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000000000000000A has already been removed
Makes sense, right? lab2 couldn't keep up with lab1, lab1 rotated all its xlogs, and the replica is now broken. I know this example is a bit extreme; these settings would never be used for an offsite replica, and in all fairness they aren't even suitable for a local replica.
But! Network outages happen, and especially on geographically distributed databases this WILL happen, because:
Matter will be damaged in direct proportion to its value.

So, let's configure something that will tolerate such failures, and that will do much, much more.

First of all, we want postgres to archive, and we want WAL files to be transferred compressed and over an encrypted channel. For all my WAL management I will use OmniPITR, a WAL management suite written by OmniTI that is simple to use, has almost no dependencies, and makes everything so much better. I will use rsync over ssh to transfer my WALs; below are the archive_command and the recovery.conf entries for streaming + WAL shipping replication.
(Please keep in mind that these settings are for the sake of this post; they are not to be used directly on production systems.)

archive_command = '/home/postgres/omnipitr/bin/omnipitr-archive -D /opt/postgres/pgdata/ -dr gzip=postgres@lab2:/opt/postgres/walarchive -l /home/postgres/archive.log -v  "%p"'           

recovery.conf :
standby_mode = 'on'
primary_conninfo = 'user=postgres host=lab1 port=5432'
restore_command = '/home/postgres/omnipitr/bin/omnipitr-restore -l /home/postgres/restore.log -s gzip=/opt/postgres/walarchive -f /home/postgres/ -p /home/postgres/pause.removal -t /tmp -ep hang -pp /opt/postgres/psql/bin/pg_controldata  -sr -r -v %f %p'

Again, with the same WAL settings, I hammered the database with transactions. The replica started falling behind because of the 20kbit limitation, but eventually it caught up and everything was ok.
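
While it was catching up, a quick way to see how far behind the replica is, from the master (9.4 catalog names):

select application_name, client_addr,
       pg_xlog_location_diff(pg_current_xlog_location(), replay_location) as replay_lag_bytes
from pg_stat_replication;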

OmniPITR is hands down awesome, and it can do much more than just archive and restore WALs. You can delay your replica, you can take hot backups from the slave that can even be encrypted on creation, you can stream the backup from a slave directly to an offsite backup server with no need for extra disk space on the replica, and more.
(On some occasions, sending WALs directly to the slave might create issues; if you get into this situation, remember that you can always archive WALs locally on the master and schedule a script that transfers all newly generated WALs at whatever interval serves your needs.)

This all shows that WAL shipping replication on top of streaming is still very usable when it comes to remote/offsite slaves, and you can always switch to streaming-only or WAL-shipping-only without problems.
After all, it's good to know that in case of a network outage your only limitation is disk space for archived WALs.

Feedback and ideas for future posts are always welcome :)

Thanks for reading
- Vasilis

Tuesday, 26 August 2014

PostgreSQL and ElasticSearch

Recently I had to evaluate ElasticSearch for a possible installation. ElasticSearch is basically a search server that provides a distributed full-text search engine through a RESTful web interface, stores documents as JSON, is written in Java, is fast, and really works out of the box with almost minimal effort. After the installation it's just a matter of reading the documentation and adding / searching documents. I didn't really experiment much with searching, but the API looks really good.
One interesting question that I had to answer was about connectivity with postgres and how to maintain a table in both datastores, and that's what this post is all about.

The first (fast and easy) answer here was rivers: a river creates a JDBC connection with another datastore and, based on a query, it can pump data from any database table. Rivers are available for postgres, twitter, mongo etc. Because it's JDBC it's relatively slow, and elasticsearch will (re)pump the data on every restart, so pay extra attention if you use this and read the documentation first.

The other way is to use the LISTEN/NOTIFY commands of postgres, which are basically a message queueing system. The idea is to raise a notification on every insert; a daemon grabs it and inserts the record into elasticsearch.

For a single postgres table it would work like this :

create table messages (
 id serial primary key,
 date timestamp without time zone,
 carrier text,
 message text
);

CREATE OR REPLACE FUNCTION table_message_notify() RETURNS trigger AS $$
BEGIN
  -- publish the id of the freshly inserted row on the notification channel
  PERFORM pg_notify('table_messages_notifier', CAST(NEW.id AS text));
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER object_post_insert_notify AFTER insert ON messages FOR EACH ROW EXECUTE PROCEDURE table_message_notify();

This simply sends a notification on the 'table_messages_notifier' channel after an insert, saying that a new record has been inserted. Now you need something to grab and handle these notifications. I tried various approaches, like python requests, but after a while I just did it with the python elasticsearch library, and my life was suddenly much easier :). The python script I ended up with worked pretty well (it managed to replicate about 50m rows with no errors).
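
Before involving elasticsearch at all, the notification side is easy to verify from plain psql (the insert values here are made up):

-- session 1 (the "daemon" side): subscribe to the channel
LISTEN table_messages_notifier;

-- session 2: insert a row, which fires the trigger
INSERT INTO messages (date, carrier, message) VALUES (now(), 'acme', 'hello world');

-- back in session 1, the next interaction with the server prints something like:
-- Asynchronous notification "table_messages_notifier" with payload "1" received from server process with PID 12345.
-- The payload is the new id, which the daemon uses to fetch the row and index it into elasticsearch.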

NOTE that I intentionally left garbage in that script just to show the alternatives I personally tried. Also, this was just a proof of concept and not a properly working solution, but it should be enough for someone who knows what they are doing to create a daemon that would actually work, even in production.

Thanks for reading
-- Vasilis

Friday, 22 August 2014

PostgreSQL Bi-Directional Replication

A while ago I wanted to explore the options for multi-master replication in postgres. I have previously worked with Oracle MMR and, to be honest, I don't like the idea of MMR much, mostly because of all the conflicts you sometimes get; most of the time these conflicts are easy to resolve though, and MMR, or in our case BDR, can be a very nice way to scale up your database writes.
So, while searching, I came across a project called PostgreSQL BDR (bi-directional replication), developed by 2ndQuadrant. It allows users to create a geographically distributed asynchronous multi-master database using logical log streaming replication, based on the changeset extraction feature introduced in PostgreSQL 9.4.
This post is about showing how to set this up.
For this setup I used 2 debian VMs (debian-7.5.0-amd64-netinst.iso).
The extra packages I installed were:
git, mc, joe, sudo, curl, libreadline-dev, zlib1g-dev, bison, flex
(by the way, comments about joe won't be tolerated !!)

With the VMs ready and a user postgres1 created:

git clone git://
cd 2ndquadrant_bdr
git checkout bdr/0.6
./configure --prefix=$HOME/bdr
make install
cd ~/2ndquadrant_bdr/contrib/btree_gist && make && make install && cd ../../contrib/bdr && make && make install

add to user's profile:
export PATH=/home/postgres1/bdr/bin:$PATH

check version :
psql (PostgreSQL) 9.4beta1_bdr0601

sudo mkdir /opt/pgdata1
sudo mkdir /opt/pgdata2
sudo chown postgres1:postgres1 /opt/pgdata1/
sudo chown postgres1:postgres1 /opt/pgdata2/

as postgres1
initdb -D /opt/pgdata1/ -U postgres1 --auth-host=md5 --auth-local=peer

edit /opt/pgdata1/pg_hba.conf and add :
host    replication     postgres1          trust
edit /opt/pgdata1/postgresql.conf and change :
listen_addresses = '*'

wal_level = 'logical'
max_replication_slots = 3
max_wal_senders = 4
shared_preload_libraries = 'bdr'
bdr.connections = 'postgres1'
bdr.postgres2_dsn = 'dbname=postgres host= user=postgres port=5433'
track_commit_timestamp = on

I made a copy with bdr_init_copy to a second $PGDATA directory, corrected the connection strings in postgresql.conf so the two clusters could connect to each other, and started both postgres servers.

Everything worked: both databases were in read/write mode and I could perform all kinds of operations on both of them.

A few things that have to be noted are:
1. You cannot use normal sequences; you have to use global sequences or you will see all kinds of problems.
2. I noticed the following issues that I'm not sure why they happen:

#CASE 1.#

master1 =# create table test_tr (id int primary key, name text);
master1 =# insert into test_tr values (1,'Vasilis');
master1 =# begin ; update test_tr set name = 'Vasilis2' where id =1;
Before I commit on master1, I do the following on master2:
master2 =# begin ; update test_tr set name = 'Vasilis3' where id =1;
master2 =# commit;
master1 =# commit;
master1 =# select * from test_tr;
 id |   name
----+----------
  1 | Vasilis3
(1 row)

My understanding was that the first transaction would lock the row and the second would have to wait for the first to commit.
#CASE 2.#
master1 =# create table tr_test (name text);
master1 =# insert into tr_test values ('Vasilis');
master1 =# begin;
master1 =# update tr_test set name = 'koko';
master1 =# select name from tr_test ;
(1 row)

master1 =# commit;
master1 =# select name from tr_test ;
(1 row)

master2 =# begin ; update tr_test set name ='llalla';
master2 =# select name from tr_test ;
(1 row)

master2 =# commit;
master2 =# select name from tr_test ;
(1 row)

AT THIS POINT if you drop the table :
master1 =# drop table tr_test ;
... never ending...

master2 =# drop table tr_test ;
... never ending...

RESTARTING both masters doesn't make any difference.
From the postgres logs:
LOG: starting background worker process "bdr (6036347025262511738,1,12155,): postgres2: apply"
LOG: connection received: host= port=34641
LOG: replication connection authorized: user=postgres1
LOG: starting logical decoding for slot bdr_12155_6036347078061848225_1_12155__
DETAIL: streaming transactions committing after 0/1E87AC8, reading WAL from 0/1E87A90
LOG: logical decoding found consistent point at 0/1E87A90
DETAIL: running xacts with xcnt == 0
LOG: unexpected EOF on standby connection
LOG: disconnection: session time: 0:00:00.163 user=postgres1 database=postgres host= port=34641
ERROR: could not find primary key for table with oid 25281
LOG: worker process: bdr (6036347025262511738,1,12155,): postgres2: apply (PID 3649) exited with exit code 1

The logs keep growing, and the same message keeps repeating.

The conclusion is that BDR is not ready for production, which of course is absolutely normal considering that it's based on 9.4, which is still at beta2. But it sure is promising, and if it gets to its final stage it will boost PostgreSQL replication flexibility tremendously!

official user guide can be found here

Update: after Craig's comment that the 0.7 release is out, I plan on making a new post about the 0.7 version, trying to reproduce the problems I already found.

Thanks for Reading
-- Vasilis 

Wednesday, 22 January 2014

Setting up Postgres Automated Failover

Nowadays, failover is one of the simplest procedures in postgres. Still, there are not many ways to automate this procedure and perform it without any human interference. This blog post will guide you through a way to implement automated failover by using two open source projects, repmgr2 and pgbouncer.

PgBouncer will act as a connection pool and as the entry point to the database cluster.
Repmgr (since version 2) supports automatic failover: the slave will try to connect to the master for a predefined time and number of attempts and, if it fails, it will run a script that can be anything. We are going to use this functionality to perform auto-failover, redirect our application connections to the new master database, and notify the dba that a failover has been performed.

Each server will have its own copy of the pgbouncer configuration file(s), and when a server is being promoted, the new master is also responsible for overwriting the configuration file on the server where pgbouncer runs (the bouncer node in this example).

In my example I used:
3 nodes (3 debian net install virtual machines)
$PGDATA configured on each node
3 postgres installations, one on each node
repmgr2-beta2 or later

bouncer: witness and pgbouncer server, will act as the entry point to our database cluster.
pgnode1: postgres database server 1
pgnode2: postgres database server 2

I will use these hostnames as examples and as a point of reference for all the configuration files. Let's start...

Install postgres on all nodes, and set up passwordless ssh between the three nodes for a user that has write access to $PGDATA and can run the postgres binaries.

Install repmgr on all nodes, and pgbouncer only on bouncer.
On pgnode1 set the following in $PGDATA/postgresql.conf

wal_level = 'hot_standby'
archive_mode = on
archive_command = '/bin/false'
max_wal_senders = 10
wal_keep_segments = 5           # ----- 5 for my example; the repmgr-suggested value is below -------
#### wal_keep_segments = 5000   # 80 GB required on pg_xlog
hot_standby = on
shared_preload_libraries = 'repmgr_funcs'

and in $PGDATA/pg_hba.conf:
host    all             repmgr             trust
host    replication     repmgr           trust

Bring PostgreSQL up on pgnode1 and run:

createuser -s repmgr
createdb -O repmgr repmgr
psql -f </usr/share/postgresql/9.0>/contrib/repmgr_funcs.sql repmgr

On pgnode2 run:
repmgr -d repmgr -U repmgr -h pgnode1 standby clone
(this will use rsync)

Start postgres on pgnode2; you should now have a new slave.
Edit a repmgr.conf wherever you like, in my case /etc/repmgr/repmgr.conf, and add:

conninfo='host= dbname=repmgr user=repmgr'
follow_command='repmgr standby follow -f /etc/repmgr/repmgr.conf'

Each node will need its own repmgr.conf file; check the github link for all my config files.
on pgnode1:
repmgr -f /etc/repmgr/repmgr.conf master register

on pgnode2:
repmgr -f /etc/repmgr/repmgr.conf standby register
on bouncer (witness server):
repmgr -d repmgr -U repmgr -h -D PGDATA -f /etc/repmgr/repmgr.conf witness create

on pgnode2 and on bouncer run:
repmgrd -f /etc/repmgr/repmgr.conf > /var/log/postgresql/repmgr.log 2>&1 
(this is the repmgr daemon)

Copy the pgbouncer configuration file from the master (pgnode1) to bouncer (the pgbouncer server), then restart pgbouncer and test connectivity. If you now stop postgres on pgnode1, pgnode2 will take over, rsync its pgbouncer configuration file to bouncer, restart pgbouncer over ssh, and mail a notification.
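
To make this concrete, the part that gets rewritten on failover is essentially the [databases] section of pgbouncer.ini; a minimal sketch (everything here is illustrative) pointing at the current master could be:

[databases]
test = host=pgnode1 port=5432 dbname=test

[pgbouncer]
listen_addr = *
listen_port = 6432
auth_type = trust
pool_mode = session

; after a failover the promoted node pushes the same file with host=pgnode2 and restarts pgbouncer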

The main idea behind this implementation is that pgbouncer is the only entry point from the application to the database; each node, upon promotion, is responsible for "showing" pgbouncer where it should now connect, and this way the application can remain unaware of which database is the current master.
An alternative implementation could skip pgbouncer completely and have each slave directly change the configuration file of the application. In my example I used pgbouncer for three reasons: to ensure stability, to get connection pooling (and its advantages), and to avoid the assumption that the database stack has access to the application stack, which in my opinion it shouldn't.

A few caveats:
  1. If the master is running repmgrd and the db crashes, the daemon will exit because it can't connect to the db; it needs a way to be restarted, possibly along with postgres startup.
  2. bouncer in our example is a single point of failure; the witness db can be configured like pgnode1/2 with repmgr, with a second pgbouncer on a second machine. The application should then take care of how to use these 2 pgbouncers.
  3. Promotion runs pg_ctl promote, which is a postgresql 9.1 feature; for versions prior to 9.1 you may "touch" a trigger file instead.

Github link to all the configuration files used


Thanks for reading
--Vasilis Ventirozos

Monday, 30 December 2013

pgreplay-ing logfiles

Recently I wanted to find something to rerun a workload (from logfiles) in postgres. The reasons? Benchmarking, testing, troubleshooting.
My colleague Denish reminded me of a project called pgreplay. pgreplay reads a PostgreSQL logfile, extracts the SQL statements and executes them in the same order, with (or without) the original timing, against a PG database.

I downloaded it, compiled it and went straight to testing.

The setup :

log_min_messages = error  (or more)
   (if you know that you have no cancel requests, 'log' will do)
log_min_error_statement = log  (or more)
log_connections = on
log_disconnections = on
log_line_prefix = '%m|%u|%d|%c|'  (if you don't use CSV logging)
log_statement = 'all'
lc_messages must be set to English (the encoding does not matter)
bytea_output = escape  (from version 9.0 on, only if you want to replay
                        the log on 8.4 or earlier)

It is highly recommended that you use CSV logging, because anything that
the PostgreSQL server or any loaded modules write to standard error will
be written to the stderr log and might confuse the parser.
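
For completeness, with CSV logging the whole run can be replayed against another cluster in one go; roughly like this (from memory, so double-check the flags against pgreplay's help):

$ pgreplay -c -h replayhost -p 5432 /path/to/postgresql.csv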

For small and simple workloads it seemed to work fine. Below is the output from a simple workload:

Replay statistics

Speed factor for replay: 1.000
Total run time: 29.700 seconds
Maximum lag behind schedule: 0 seconds
Calls to the server: 8
(0.269 calls per second)
Total number of connections: 1
Maximum number of concurrent connections: 1
Average number of concurrent connections: 1.000
Average session idle percentage: 99.680%
SQL statements executed: 6
(0 or 0.000% of these completed with error)
Maximum number of concurrent SQL statements: 1
Average number of concurrent SQL statements: 0.003
Average SQL statement duration: 0.016 seconds
Maximum SQL statement duration: 0.038 seconds
Statement duration histogram:
0 to 0.02 seconds: 83.333%
0.02 to 0.1 seconds: 16.667%
0.1 to 0.5 seconds: 0.000%
0.5 to 2 seconds: 0.000%
over 2 seconds: 0.000%

So far so good. I wanted to see what would happen when concurrency occurred, so I tried with pgbench; with 2 concurrent clients everything was ok, so I scaled it up a bit to 16 concurrent transactions, and then these nice messages started showing up:
"Execution is 30 minutes behind schedule"

The reason? Locks. Lots and lots of locks! But let's look closer at what happened...
While pgreplay was running, trying to handle my logfile the best way it could, a simple lock-investigating query was enough to show me that I would never get the successful prompt I wanted:

monkey=#   SELECT bl.pid     AS blocked_pid,
monkey-#          a.usename  AS blocked_user,
monkey-#          kl.pid     AS blocking_pid,
monkey-#          ka.usename AS blocking_user,
monkey-#          a.query    AS blocked_statement
monkey-#    FROM  pg_catalog.pg_locks         bl
monkey-#     JOIN pg_catalog.pg_stat_activity a  ON a.pid = bl.pid
monkey-#     JOIN pg_catalog.pg_locks         kl ON kl.transactionid = bl.transactionid AND kl.pid != bl.pid
monkey-#     JOIN pg_catalog.pg_stat_activity ka ON ka.pid = kl.pid
monkey-#    WHERE NOT bl.granted;

 blocked_pid | blocked_user | blocking_pid | blocking_user |                          blocked_statement
-------------+--------------+--------------+---------------+-----------------------------------------------------------------------
        7398 | vasilis      |         7403 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7398 | vasilis      |         7399 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7398 | vasilis      |         7404 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7398 | vasilis      |         7402 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7402 | vasilis      |         7398 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7402 | vasilis      |         7403 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7402 | vasilis      |         7399 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7402 | vasilis      |         7404 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7403 | vasilis      |         7398 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7403 | vasilis      |         7399 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7403 | vasilis      |         7404 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7403 | vasilis      |         7402 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7404 | vasilis      |         7398 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7404 | vasilis      |         7403 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7404 | vasilis      |         7399 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7404 | vasilis      |         7402 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7409 | vasilis      |         7404 | vasilis       | UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2;

A deadlock. The reason? Quoting from the pgreplay documentation:

While pgreplay makes sure that commands are sent to the server in the order in which they were originally executed, there is no way to guarantee that they will be executed in the same order during replay: Network delay, processor contention and other factors may cause a later command to "overtake" an earlier one. While this does not matter if the commands don't affect each other, it can lead to SQL statements hitting locks unexpectedly, causing replay to deadlock and "hang". This is particularly likely if many different sessions change the same data repeatedly in short intervals.

And that was my case: lots of processes were changing the same data.
For the record, I also set up a simple troubleshooting project on postgres, with 2 users doing their own thing, and pgreplay worked fine there.

Conclusion:
OK, pgreplay didn't do the job for me, and I am pretty sure that if I got the logs from a production DB I would run into deadlocks. Still, messing with it a bit was good, because I could use it for replaying simple workload logfiles on a different machine, or just because it's a cool project that does what it does, and you never know when you might need something like this.

Wednesday, 18 December 2013

A Problem Geeked Into Big Data

Today my article got published in OmniTI Seeds. It was a very interesting project, I had a lot of fun while doing it, and even if it's not exactly Postgres related, I wouldn't have been able to do it without Postgres. It began with the basic everyday operation that all DBAs do (log parsing / analysis) but it got to a whole different level. I hope you like it.

A Problem Geeked Into Big Data