Monday, 30 December 2013

pgreplay-ing logfiles

Recently I wanted to find a way to rerun a workload (from logfiles) in Postgres. The reasons? Benchmarking, testing, troubleshooting.
My colleague Denish reminded me of a project called pgreplay. pgreplay reads a PostgreSQL logfile, extracts the SQL statements and executes them in the same order, with (or without) the original timing, against a PG database.

I downloaded it, compiled it and went straight to testing.

The setup :

log_min_messages = error  (or more)
   (if you know that you have no cancel requests, 'log' will do)
log_min_error_statement = log  (or more)
log_connections = on
log_disconnections = on
log_line_prefix = '%m|%u|%d|%c|'  (if you don't use CSV logging)
log_statement = 'all'
lc_messages must be set to English (the encoding does not matter)
bytea_output = escape  (from version 9.0 on, only if you want to replay
                        the log on 8.4 or earlier)

It is highly recommended that you use CSV logging, because anything that
the PostgreSQL server or any loaded modules write to standard error will
be written to the stderr log and might confuse the parser.
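With logging set up like this, replaying is a two-step affair: parse the logfile into a replay file, then run that file against the target. A sketch of the invocations, as I understand the pgreplay options (-f parse only, -o write / -r read a replay file, -c for CSV logs; host, port and file names here are placeholders):

```shell
# parse a CSV logfile once into an intermediate replay file
pgreplay -f -c -o /tmp/replay.file postgresql.csv

# replay the file against the target database with the original timing
pgreplay -r -h targethost -p 5432 /tmp/replay.file
```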

For small and simple workloads it seemed to work fine. Below is the output from a simple workload:

Replay statistics

Speed factor for replay: 1.000
Total run time: 29.700 seconds
Maximum lag behind schedule: 0 seconds
Calls to the server: 8
(0.269 calls per second)
Total number of connections: 1
Maximum number of concurrent connections: 1
Average number of concurrent connections: 1.000
Average session idle percentage: 99.680%
SQL statements executed: 6
(0 or 0.000% of these completed with error)
Maximum number of concurrent SQL statements: 1
Average number of concurrent SQL statements: 0.003
Average SQL statement duration: 0.016 seconds
Maximum SQL statement duration: 0.038 seconds
Statement duration histogram:
0 to 0.02 seconds: 83.333%
0.02 to 0.1 seconds: 16.667%
0.1 to 0.5 seconds: 0.000%
0.5 to 2 seconds: 0.000%
over 2 seconds: 0.000%

So far so good. I wanted to see what would happen when concurrency occurred, so I tried it with pgbench. With 2 concurrent clients everything was OK, so I scaled it up a bit to 16 concurrent transactions, and then these nice messages started showing:
"Execution is 30 minutes behind schedule"

The reason? Locks. Lots and lots of locks! But let's look closer at what happened...
While pgreplay was running, trying to handle my logfile the best way it could, a simple lock-investigating query was enough to show me that I would never get the successful prompt I wanted:

monkey=#   SELECT  AS blocked_pid,
monkey-#          a.usename  AS blocked_user,
monkey-#    AS blocking_pid,
monkey-#          ka.usename AS blocking_user,
monkey-#          a.query    AS blocked_statement
monkey-#    FROM  pg_catalog.pg_locks         bl
monkey-#     JOIN pg_catalog.pg_stat_activity a  ON =
monkey-#     JOIN pg_catalog.pg_locks         kl ON kl.transactionid = bl.transactionid AND !=
monkey-#     JOIN pg_catalog.pg_stat_activity ka ON =
monkey-#    WHERE NOT bl.granted;

 blocked_pid | blocked_user | blocking_pid | blocking_user |                          blocked_statement                          
        7398 | vasilis      |         7403 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7398 | vasilis      |         7399 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7398 | vasilis      |         7404 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7398 | vasilis      |         7402 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7402 | vasilis      |         7398 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7402 | vasilis      |         7403 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7402 | vasilis      |         7399 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7402 | vasilis      |         7404 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7403 | vasilis      |         7398 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7403 | vasilis      |         7399 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7403 | vasilis      |         7404 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7403 | vasilis      |         7402 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7404 | vasilis      |         7398 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7404 | vasilis      |         7403 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7404 | vasilis      |         7399 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7404 | vasilis      |         7402 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7409 | vasilis      |         7404 | vasilis       | UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2;

Deadlock. The reason? (quoting the pgreplay documentation):

While pgreplay makes sure that commands are sent to the server in the order in which they were originally executed, there is no way to guarantee that they will be executed in the same order during replay: Network delay, processor contention and other factors may cause a later command to "overtake" an earlier one. While this does not matter if the commands don't affect each other, it can lead to SQL statements hitting locks unexpectedly, causing replay to deadlock and "hang". This is particularly likely if many different sessions change the same data repeatedly in short intervals.

And that was my case: lots of processes were changing the same data.
For the record, I also set up a simple troubleshooting project that runs on Postgres, with 2 users each doing their own thing, and pgreplay worked fine.

Conclusion:
OK, pgreplay didn't do the job for me; I'm pretty sure that if I got the logs from a production DB I would run into deadlocks. Still, messing with it for a bit was worthwhile: I could use it to replay simple workload logfiles against a different machine, and it's a cool project that does what it says, and you never know when you might need something like this.

Wednesday, 18 December 2013

A Problem Geeked Into Big Data

Today my article got published in OmniTI Seeds. It was a very interesting project and I had a lot of fun doing it, and even if it's not exactly Postgres related, I wouldn't have been able to do it without Postgres. It began with a basic everyday operation that all DBAs perform (log parsing / analysis) but it went to a whole different level. I hope you like it.

A Problem Geeked Into Big Data

cooking with fdw

Back in 9.1 a feature called foreign data wrappers was introduced: you can basically connect a PostgreSQL server to remote data stores (other databases, CSV files etc). Today I plan to play with FDWs and see what we can use them for, and how.

First of all there are many FDW's, a list can be found here.

For the sake of this post i used the following data wrappers :

     Name     | Version |   Schema   |                    Description                    
 file_fdw     | 1.0     | public     | foreign-data wrapper for flat file access
 postgres_fdw | 1.0     | public     | foreign-data wrapper for remote PostgreSQL servers
 redis_fdw    | 1.0     | public     | Foreign data wrapper for querying a Redis server

file_fdw and postgres_fdw are both available in contrib; redis_fdw can be found at the following link. You will also need to compile and install hiredis, which can be found here. I won't go into much detail on how to install these, but there is plenty of documentation available.

The plan is to connect a Redis server, a remote PostgreSQL server and a CSV file, and interact with all of them.

The first thing that has to be done, after the extensions have been created, is to create the foreign servers and tables. The following example shows what I did with the remote PostgreSQL server:

test_redis=# CREATE SERVER node_2
FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (host '', dbname 'postgres');
test_redis=# CREATE FOREIGN TABLE test_pg (id int)
SERVER node_2 OPTIONS (table_name 'test_pg');
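One detail the example above glosses over: postgres_fdw will not let you query the foreign table until the local role also has a user mapping carrying the credentials for the remote server. A minimal sketch (the role name and password are illustrative):

```sql
CREATE USER MAPPING FOR vasilis
    SERVER node_2
    OPTIONS (user 'vasilis', password 'secret');
```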

The CSV and Redis foreign tables were very similar to create, so when I was done I had this:

 public | db15       | foreign table | vasilis
 public | pgcsv      | foreign table | vasilis
 public | test       | table         | vasilis
 public | test2_pg   | foreign table | vasilis

one local table, and one foreign table for each of my foreign data wrappers. So, let's start playing around.

test_redis=# begin;
test_redis=# insert into test2_pg values ('koko','lala');
test_redis=# commit;

Yep, I could insert into a remote PG server, in a transactional way. Goodbye dblink? (Oh yes!)
Unfortunately this wouldn't work on Redis, but only because the FDW doesn't support it (yet?).
Now let's join them all and see how that works.

test_redis=# select * from db15,pgcsv,test2_pg,test where db15.key=test2_pg.key and pgcsv.key=db15.key and db15.key=test.key and test2_pg.key=db15.key;
 key  | value | id | key  |    date    |   time   | num | key  | value | key  | value 
 koko | koko1 |  1 | koko | 2013-12-01 | 00:01:01 | 234 | koko | lala  | koko | ninini

The statement is not something that is actually usable, but the execution plan demonstrates that all the foreign tables have been accessed:

test_redis=# explain analyze select * from db15,pgcsv,test2_pg,test where db15.key=test2_pg.key and pgcsv.key=db15.key and db15.key=test.key and test2_pg.key=db15.key;
                                                          QUERY PLAN                                                          
 Nested Loop  (cost=111.24..156.43 rows=1 width=244) (actual time=0.981..1.031 rows=1 loops=1)
   Join Filter: (test2_pg.key = db15.key)
   Rows Removed by Join Filter: 2
   ->  Nested Loop  (cost=101.24..143.39 rows=1 width=180) (actual time=0.857..0.862 rows=1 loops=1)
         ->  Hash Join  (cost=101.11..142.55 rows=5 width=116) (actual time=0.833..0.835 rows=1 loops=1)
               Hash Cond: (test2_pg.key = pgcsv.key)
               ->  Foreign Scan on test2_pg  (cost=100.00..137.90 rows=930 width=64) (actual time=0.767..0.768 rows=1 loops=1)
               ->  Hash  (cost=1.10..1.10 rows=1 width=52) (actual time=0.049..0.049 rows=3 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 1kB
                     ->  Foreign Scan on pgcsv  (cost=0.00..1.10 rows=1 width=52) (actual time=0.033..0.042 rows=3 loops=1)
                           Foreign File: /home/vasilis/lala.csv
                           Foreign File Size: 98
         ->  Index Scan using kokoako on test  (cost=0.12..0.16 rows=1 width=64) (actual time=0.019..0.021 rows=1 loops=1)
               Index Cond: (key = test2_pg.key)
   ->  Foreign Scan on db15  (cost=10.00..13.00 rows=3 width=64) (actual time=0.071..0.162 rows=3 loops=1)
         Foreign Redis Table Size: 3
 Total runtime: 2.184 ms
(17 rows)

This is awesome in many respects. Imagine a heterogeneous environment with Mongo, Redis, Postgres, Oracle, CSV logs etc. You can put a PostgreSQL server in the middle, acting as a bridge, and create supersets, subsets, aggregations, views, temp tables, report tables and many other things. You could even limit users to accessing only the "bridge" Postgres, so they see exactly what you want them to see from many different systems. You can use SQL on Redis, since Redis commands are so limited. And because I'm starting to get a bit paranoid, I will leave it up to you how to use this :)

Thanks for reading
-- Vasilis

Tuesday, 17 December 2013

scaling postgres with pgpool-II

Recently a new release of pgpool-II came out, 3.2.7 (released 2013-12-06). For connection pooling I mostly have experience with pgbouncer, and at OmniTI we use it a lot. I've heard that pgpool has some issues, and because of its increased complexity I don't know if I would recommend it purely as a connection pool, but pgpool does much more than that.

What is pgpool-II (from pg-pool wiki) :

pgpool-II is a middleware that works between PostgreSQL servers and a PostgreSQL database client. It is licensed under BSD license. It provides the following features.
  • Connection Pooling
pgpool-II saves connections to the PostgreSQL servers, and reuse them whenever a new connection with the same properties (i.e. username, database, protocol version) comes in. It reduces connection overhead, and improves system's overall throughput.
  • Replication
pgpool-II can manage multiple PostgreSQL servers. Using the replication function enables creating a realtime backup on 2 or more physical disks, so that the service can continue without stopping servers in case of a disk failure.
  • Load Balancing
If a database is replicated, executing a SELECT query on any server will return the same result. pgpool-II takes an advantage of the replication feature to reduce the load on each PostgreSQL server by distributing SELECT queries among multiple servers, improving system's overall throughput. At best, performance improves proportionally to the number of PostgreSQL servers. Load balance works best in a situation where there are a lot of users executing many queries at the same time.
  • Limiting Exceeding Connections
There is a limit on the maximum number of concurrent connections with PostgreSQL, and connections are rejected after this many connections. Setting the maximum number of connections, however, increases resource consumption and affect system performance. pgpool-II also has a limit on the maximum number of connections, but extra connections will be queued instead of returning an error immediately.
  • Parallel Query
Using the parallel query function, data can be divided among the multiple servers, so that a query can be executed on all the servers concurrently to reduce the overall execution time. Parallel query works the best when searching large-scale data.

The feature I'm mostly interested in, and the one I want to explore in this post, is load balancing. I like this feature because, without changing the application, you can elevate your replica one step and make it even more useful.

I set up 3 virtual machines that I called pgpool1, pgnode1 and pgnode2.
I did a basic Debian installation and compiled pgpool-II (3.3.2) and PostgreSQL 9.3.2.
As a first step I set up basic streaming replication between pgnode1 and pgnode2, and configured pgpool1 for load balancing; all the configurations can be found in my github. I won't go into much detail on how to set pgpool up: my configuration files should be a good guide, I found the pgpool documentation very informative and up to date, and many before me have blogged about how to configure it.
I will dwell more on the design and what you could do with this.
As we all know, streaming replication is almost instant, so separating "reads" from "writes" can be beneficial for a large variety of applications, especially ones with more reads than writes. So let's see how this works:

root@pgpool1:~# psql -p 9999 -U postgres
postgres=# create database monkey;
postgres=# \connect monkey;
You are now connected to database "monkey" as user "postgres".
monkey=# create table test (id int);
monkey=# insert into test values (1);
monkey=# select * from test;
 id 
----
  1
(1 row)

With full logging enabled in the dbs lets see the logs from pgnode1 and pgnode2 :

LOG:  duration: 342.044 ms  statement: create database monkey;
LOG:  duration: 4.646 ms  statement: create table test (id int);
LOG:  duration: 1.733 ms  statement: insert into test values (1);

LOG:  duration: 1.018 ms  statement: select * from test;

OK, it works as expected. Now, I had heard that this doesn't work if the SELECT has a space at the beginning, so I tried to test that as well:

on pgnode2 (replica) :
LOG:  duration: 0.757 ms  statement:   selecT * from test;

LOG:  duration: 0.215 ms  statement: SeLeCt   * from test /* this is my test */;

Something more complicated, maybe? Something like pgbench sounds appropriate.

I won't spam more logs; it worked as it should.
But what would happen with 2 replicas? Read operations should scale by being distributed among the slaves, so I created another database cluster on pgnode2 (port 5433) and made it a replica of pgnode1.
Now I had one master running on pgnode1 and two slaves running on pgnode2 (ports 5432 and 5433).
I changed the pgpool configuration to add the new db:

backend_hostname0 = ''
backend_port0 = 5432
backend_weight0 = 0
backend_data_directory0 = '/opt/pgdata'

backend_hostname1 = ''
backend_port1 = 5432
backend_weight1 = 1
backend_data_directory1 = '/opt/pgdata'

backend_hostname2 = ''
backend_port2 = 5433
backend_weight2 = 1
backend_data_directory2 = '/opt/pgdata2'

and from pgpool1 I ran:

while true; do psql -p 9999  -c 'selecT * from test;'  -U postgres ;done

and voilà, I had my selects distributed across my 2 slaves.
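A quick way to see what pgpool thinks of its backends is its SHOW family of pseudo-queries, which pgpool intercepts instead of forwarding to PostgreSQL; assuming the setup above, something like:

```shell
psql -p 9999 -U postgres -c 'show pool_nodes;'
```

should list each backend with its host, port, status and load-balancing weight.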

pgpool has A LOT of cool features. I'm not in a position to say whether they all work as they should, but in my load balancing tests everything was smooth. pgpool also has a web based GUI worth checking out, and some system views that can be used to monitor activity; it also supports automatic failover, watchdog and many other neat features that I might explore in another post.

Next time I will explore parallel query.

All the configurations can be found in my github (including the postgres configs that were used).

Thanks for reading
- Vasilis

Tuesday, 15 October 2013

Parallel pg_dump backups

As I said in a previous post, I'm planning to write about the features that 9.3 brought to the game. Today I will explore parallel pg_dump: how it works, how we can benefit from it, and how it compares to classic pg_dump.

First of all, pg_dump supports parallel dump only if -Fd (directory) is used; this is because the directory format is the only one that supports multiple processes writing data at the same time. The directory format writes one file per relation plus a toc.dat file; it is similar to -Fc in that the output is compressed, and it supports parallel and selective restore.
The pg_dump switch that implements parallel dump is -j <njobs>;
when used, pg_dump will create n connections + 1 (the master process).

The test:
For this test I created 10 tables with 20 million rows each. As a baseline I will use -Fc, and then -Fd -j, increasing the number of jobs by 1 each time. The disk used was a simple 7.2k rpm SATA3 disk connected over USB3.

> time pg_dump -Fc -f out.sql -d dump
real    4m58.831s
now with -Fd -j
> time pg_dump -Fd -j 1 -f out.sql1 -d dump
real    1m10.280s
> time pg_dump -Fd -j 2 -f out.sql2 -d dump
real    0m43.596s
> time pg_dump -Fd -j 3 -f out.sql3 -d dump
real    0m38.801s
> time pg_dump -Fd -j 4 -f out.sql4 -d dump
real    0m33.708s
> time pg_dump -Fd -j 5 -f out.sql5 -d dump
real    0m31.040s
> time pg_dump -Fd -j 6 -f out.sql6 -d dump
real    0m30.501s
> time pg_dump -Fd -j 60 -f out.sql60 -d dump
real    0m30.522s

Obviously the last run was just to show that you cannot use an arbitrary degree of parallelism: on an i7 2600 (4 cores + hyper-threading), going beyond 6 concurrent jobs made no difference. Still, we started at 1:10 and ended up at less than half that time.

Now, this feature is by nature not usable for everyone. In my test I had 10 equal tables with 20m rows each; if I had 1 large table and several small ones it wouldn't scale the same way. Also, I was talking with a friend about this feature and he said: "Who does pg_dump nowadays anyway? Everyone does live backups."
That's partially true, but pg_dump will always have its uses.
I think this feature is great. Maybe not the greatest for terabytes of enterprise databases, but it's good for those who use pg_dump for backups: test systems, installations tight on diskspace, databases with little network bandwidth.
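It's worth noting that the restore side scales the same way: pg_restore accepts -j for custom and directory format archives. So a directory dump like the ones above can be loaded back in parallel; a sketch (the target database name is illustrative):

```shell
# restore the directory-format dump into a fresh database with 4 workers
createdb dump_restored
pg_restore -j 4 -d dump_restored out.sql4
```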

Thanks for reading
- Vasilis

Tuesday, 8 October 2013

The cost of Reliability

On the 9th of September PostgreSQL 9.3 was released, with many exciting new features. This post is the first of many (?) that will explore these new features and try to evaluate them. I will start with data checksums, a feature that makes it possible for PostgreSQL to checksum data pages and report corruption.
I always liked how PostgreSQL doesn't cut corners when it comes to reliability, even when it comes at a cost in performance (and it usually does). But how does this work, and how much is the cost? That's why I'm writing this post :)

Checksums are set immediately prior to flush out of shared buffers
and checked when pages are read in again.
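Because the choice is baked in at cluster creation time, the only way to get checksums is to ask initdb for them; in 9.3 the switch is -k (long form --data-checksums). The data directories below are illustrative:

```shell
# cluster with data checksums enabled
initdb -k -D /opt/pgdata_sum

# cluster without checksums (the default)
initdb -D /opt/pgdata_nosum
```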

I used 2 db clusters, one with the feature enabled and one without, with the same config file, and ran 3 basic operations: 1 insert and 2 updates. The results:

No checksums:

monkey_no_sum=# insert into test_of_sum (id) select generate_series (1,10000000);
INSERT 0 10000000
Time: 26226.654 ms
monkey_no_sum=# update test_of_sum set name = 'The Monkey';
UPDATE 10000000
Time: 51903.185 ms
monkey_no_sum=# update test_of_sum set name = 'The Monkey', address = 'somewhere in London';
UPDATE 10000000
Time: 70125.720 ms

With checksums:

monkey_sum=# insert into test_of_sum (id) select generate_series (1,10000000);
INSERT 0 10000000
Time: 23724.373 ms
monkey_sum=# update test_of_sum set name = 'The Monkey';
UPDATE 10000000
Time: 74446.178 ms
monkey_sum=# update test_of_sum set name = 'The Monkey', address = 'somewhere in London';
UPDATE 10000000
Time: 78987.092 ms

Apart from the performance impact (obviously there is extra overhead when you checksum all data pages), there are other things that make this great feature not easy to use. First of all, it's a cluster-wide setting: if you want it only on a second db, that db cannot live in the same data dir. Also, you cannot disable it once the cluster has been created.

Don't get me wrong, I LOVE having this! I think it's absolutely brilliant to have the option to use it when I can. But that's the problem, I can't :(
If this feature were object-wide (tables / indexes) it would have applications in production systems, leaving it up to the DBA to decide where to enable it and where not to. Even if it were database-wide, I would enable it on a secondary database running on the same server whose performance I don't care about. Even being able to turn it on and off would be very useful.

There are good reasons why this feature doesn't work the way I would like: the whole thing happens in shared buffers, and shared buffers are a cluster-wide entity. I'm pretty positive, though, that this is just the beginning and that we will soon see an on-off switch.
Until then, enjoy having this on small dbs, or when reliability REALLY matters.

-- Thanks for reading

Tuesday, 27 August 2013

Backing up PostgreSQL in HDFS

There are a number of ways to back up a PostgreSQL database; some are standard, and some just demonstrate the power of open source and the things you can do if you put your creativity and imagination to use. At OmniTI we use the OmniPITR tool to manage WAL files and to run backups on secondary databases instead of the primary, to reduce load during backup. In this post I will combine OmniPITR and Hadoop to accomplish something very neat: storing your backups in HDFS (Hadoop Distributed File System).
You might be asking: why? HDFS is rock-solid reliable, it has an extremely low cost per byte, it can sustain 2 Gbit per computer, and it scales up to more than a TB per second. It is proven by the internet giants across a big variety of use-cases.
Let's say you have a 1TB database; an uncompressed backup will need 1TB of reliable storage just to keep one copy. HDFS has the great advantage of using cheap hardware while being fault tolerant at the same time.
Imagine adding cheap SATA disks to the company workstations and keeping the database backups there. Now imagine explaining that to the next SAN salesman who comes to sell you storage. FUN TIMES!

Let me give some details about the tools used:


Hadoop :

Apache Hadoop is open-source software for reliable, scalable, distributed computing.

The project includes these modules:
  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

OmniPITR :

OmniPITR, written and maintained by OmniTI, is a set of scripts that ease setting up WAL replication and making hot backups from both Master and Slave systems.
This set of scripts has been written to make the operation seamless, secure and as light on resource usage as possible.

Now, you must be wondering: why am I talking about it now? Depesz recently added a new feature called dst-pipes (pipe to program). This feature is limited only by the user's imagination; as an example of it, Depesz did gpg-encrypted backups in one operation! I am going to use the same feature to do something entirely different.

Setup :

Software Versions: PostgreSQL 9.3 (beta2) , Hadoop 1.2.1

For the setup I used a single VM for both Hadoop and PostgreSQL; the reason is that there are actually no limitations on the topology. Hadoop and Postgres are 2 entirely different ecosystems, so you have no constraints: the PostgreSQL side might be running in a cascading multi-slave environment, while Hadoop runs, as I said, on desktops. Other than Postgres and Hadoop, you have to git clone OmniPITR. Depesz has put effort into making it self-sufficient, so no extra perl modules are needed. The only thing I added to a minimal (net install) Debian installation was rsync.

Configuration :

In postgresql.conf, enable archiving and on the archive_command put something like :
archive_command = '/usr/local/omnipitr/bin/omnipitr-archive -dl /mnt/archives/xlog -s /usr/local/omnipitr/state -db /mnt/backup/xlog -l /usr/local/omnipitr/log/omnipitr-^Y-^m-^d.log -v "%p"'

Now, you may have a different archive_command; I really suggest using OmniPITR, it works well and has been tested in very large installations. You can do cool stuff like sending the archives to a remote location, gzipping them, making combinations of local gzipped copies and remote ones, etc. You can read about OmniPITR's features in the documentation on github.

Let's make a script:

#!/usr/bin/env bash
# read a stream from stdin and store it under the given name in HDFS
hadoop dfs -put - "/user/hduser/$1"

This script lets Hadoop take a file from stdin and put it into the HDFS namespace.

Bring up PostgreSQL with the above archive_command, and with Hadoop running on the node.

Let's try to backup straight into Hadoop:

/usr/local/omnipitr/omnipitr-backup-master -D $PGDATA -l dl-backup.log  -dp ~/test/ -x /mnt/backup/xlog -v

-dp is the key switch here: it uses the new OmniPITR feature, 'dst-pipes'. It will run the script we just made, piping the backup to it as input.

To verify the backup is actually there :

 $hadoop dfs -ls /user/hduser

Found 2 items
-rw-r--r--   1 vasilis supergroup  115322880 2013-08-22 23:18 /user/hduser/dblab1-data-2013-08-22.tar
-rw-r--r--   1 vasilis supergroup   33566720 2013-08-22 23:18 /user/hduser/dblab1-xlog-2013-08-22.tar

You might want to refer to the Hadoop documentation for how to manipulate these files, organize them etc., but the backup is there:

vasilis@dblab1:~/test$ hadoop dfs -copyToLocal /user/hduser/dblab1-data-2013-08-22.tar

vasilis@dblab1:~/test$ ls -l ~/dblab1-data-2013-08-22.tar
-rw-r--r-- 1 vasilis vasilis 115322880 Aug 23 18:06 /home/vasilis/dblab1-data-2013-08-22.tar

To restore the cluster, you would need something like :

hadoop dfs -cat /user/hduser/dblab1-data-2013-08-22.tar  |tar xv

For our example we had no need for compression; gzipping the output is supported by OmniPITR, and you may also do that in the script that pushes the tar to HDFS.

This is just the tip of the iceberg, just a proof of concept; the tools are there and everything is stable.

Use your imagination , save your company some money and be the hero !!!

Thanks for reading !!

Setting shared_buffers the hard way

One of the main performance parameters in PostgreSQL is shared_buffers, probably the most important one. There are guidelines and rules of thumb that say to just set it to 20-30% of your machine's total memory.
Don't get me wrong, these rules are generally good and a perfect starting point, but there are reasons why you should tune your shared buffers in more detail:

a. 30% might not be enough, and you will never know unless you know exactly how to set this parameter.
b. 30% might be a lot, and you spend resources in vain.
c. you want to be awesome and tune every bit of your DB the optimal way.

What are shared buffers, though?
Shared buffers define a block of memory that PostgreSQL uses to hold data blocks that are awaiting attention from the kernel buffer and CPU.
So basically PostgreSQL temporarily puts data blocks in memory in order to process them; EVERYTHING goes through the shared buffers.

Why not set shared_buffers to 80% of RAM on a dedicated DB server?
The OS also has a cache, and if you set shared_buffers too high you will most likely have an overlap, called double buffering: having data blocks in both caches.

So, in order to set your shared_buffers you need to know what's happening inside shared memory. PostgreSQL uses an implementation called the clock-sweep algorithm: every time you use a data block, a usage counter makes that block harder to evict. The block gets a popularity number from 1 to 5, with 5 meaning heavily used, so it will most likely stay in shared memory.
In theory you want the most popular data blocks in the shared buffers and the least popular ones out of them. To do that you have to be able to see what is inside the shared buffers, and that's exactly what the pg_buffercache package does. You will find pg_buffercache in contrib.
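Since it lives in contrib, enabling it takes one statement; pg_buffercache then exposes one row per 8kB buffer, so a simple count should match your shared_buffers setting:

```sql
CREATE EXTENSION pg_buffercache;

-- one row per buffer: the count equals shared_buffers (in 8kB pages)
SELECT count(*) FROM pg_buffercache;
```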

Let's create 2 tables, full join them, update them, and generally run operations on these 2 tables while monitoring the buffer cache.
I will give dramatic examples by setting shared_buffers too low, at the default, and too high, just to demonstrate what the pg_buffercache views show, and then I will find a good value for this specific workflow. I will run the same statements while I analyze what is happening inside the shared buffers.
The two most useful SQL statements over the pg_buffercache views are the following:

-- buffers per relation and size
SELECT c.relname,
pg_size_pretty(count(*) * 8192) as buffered,
round(100.0 * count(*) /
(SELECT setting FROM pg_settings
WHERE name='shared_buffers')::integer,1)
AS buffers_percent,
round(100.0 * count(*) * 8192 /
pg_relation_size(c.oid),1)
AS percent_of_relation
FROM pg_class c
INNER JOIN pg_buffercache b
ON b.relfilenode = c.relfilenode
INNER JOIN pg_database d
ON (b.reldatabase = d.oid AND d.datname = current_database())
WHERE pg_relation_size(c.oid) > 0
GROUP BY c.oid,c.relname;

-- buffers per usage count
SELECT c.relname, count(*) AS buffers, usagecount
FROM pg_class c
INNER JOIN pg_buffercache b
ON b.relfilenode = c.relfilenode
INNER JOIN pg_database d
ON (b.reldatabase = d.oid AND d.datname = current_database())
GROUP BY c.relname,usagecount
ORDER BY usagecount,c.relname;

PostgreSQL backup and recovery

One of the main tasks of any kind of administrator is to make sure that the data he's responsible for will be available if anything bad happens (asteroids, floods, locusts, hail). In order to do that you need a high availability solution for continuity, and a good backup plan to support your system in case of human error ("Hello Mr. admin, I just dropped the customer payment table"). Trust me, it happens... a lot...
I briefly covered high availability with the implementation of a hot standby in an earlier post, and now it's time to cover the backup options available in PostgreSQL.

There are two kinds of backups in PostgreSQL: physical and logical.

The good thing about logical backups is that they are simple to implement and maintain, and selective backup and restore (even across PG versions) is easy. The backup output usually consists of one file, which of course can be compressed. The major con of this method is that it lacks Point In Time Recovery (a database with the PITR feature can be restored or recovered to the state it had at any time since PITR logging was started for that database). It also lacks incrementality: each backup is a new WHOLE backup of your database. This makes these kinds of backups not really usable in very large production installations. Before you decide to say "not good", let me tell you about cases where these kinds of backups are better than incremental ones:
test servers without much data; small installations where the output is small and backups can be taken on a daily (or even shorter) basis; large installations that don't have many data changes over the day; data warehouses and reporting servers.

Once again PGlab1 will be my lab rat; don't worry, no animals or data were harmed.


export backupfile="backup_$(date "+%d%m%Y").tgz"
pg_ctl stop && tar cvfz ~/$backupfile $PGDATA && pg_ctl start

Basically I just stopped the database, took a tgz of my $PGDATA, and started the DB again.
Simple and effective. The restore can be done on a different path than $PGDATA; just make sure you provide -D to pg_ctl, or set PGDATA to the correct path, before you start.

pg_dump and pg_dumpall

pg_dump and pg_dumpall export one database or all of them, respectively, to a (by default) human readable SQL format.
The output can be compressed, and both support a lot of options like data only, schema only, etc.

I have a database, for this post's sake, called testbackup.

If I run pg_dump testbackup I will get on stdout, one by one, the SQL commands needed to remake the database from scratch, so just by redirecting it to a file you have a simple backup. I won't get into details about the format of this file; I will just say that at first you will see connection details like the encoding and the extensions that exist, then the table creation script, then the data (using a postgres command called COPY), and then the indexes and constraints.
NOTE that taking a pg_dump won't back up the users, and that's because users in postgres are global and they exist in the postgres database. To back up users you can use pg_dumpall -g (-g means globals).

Here's a script that I use to take this kind of backup:

export PGDATA=/opt/db/data
export PGPASSFILE=/opt/db/data/pgpass.conf
# $pgdump, $pgdumpall, $psql, $psqluser, $backupdir and $retention
# are set earlier in the full script

usersfilename="`date "+%d%m%Y"`.bck.users"
$pgdumpall -g -U $psqluser > $backupdir/$usersfilename && find $backupdir/*bck.users* -ctime +$retention -exec rm {} \;

for db in `echo "select datname from pg_database
where datname not in ('template0','template1','postgres');
" | $psql -A -t postgres`
do
backupname="${db}_`date "+%d%m%Y"`.bck"
logfilename="${db}_`date "+%d%m%Y"`.bck.log"
$pgdump -Fc -v -f $backupdir/$backupname -U $psqluser $db 2> $backupdir/$logfilename && find $backupdir/$db*bck* -ctime +$retention -exec rm {} \;
done

Notice that I use the -Fc switch in pg_dump; that means custom format, which can be used for selective restore with the pg_restore command. If I had one of these backups and I wanted to restore the table "customers" I would run:
pg_restore -Fc -t customers -U <username> -h <host name> -d <db name> <file name>
NOTE that there is a switch (-j) for parallelism.
More about pg_restore, pg_dump and pg_dumpall at: pg_restore, pg_dump, pg_dumpall

Now that we are done with database dump basics, let's move to live (or online) backup, PITR and timelines.
In order to get a backup that is incremental, you need a base backup plus all the changes that transactions make to the database; that is, you need the transaction logs, or as we call them in postgres, the WAL segments. I won't say much about how the transaction log mechanism works; this is a backup and restore post, so I will leave WAL for another post.

Standalone hot physical database backup

I will use the following directories and variables for example's sake:

export BACKUPNAME="backup_$(date "+%d%m%Y").tgz"

postgres@pglab1:/opt$ ls -l
total 8
drwxr-xr-x  2 postgres postgres 4096 Mar 27 12:32 BACKUP
drwx------ 15 postgres postgres 4096 Mar 27 11:40 PGDATA

mkdir /opt/BACKUP/archive

Set an archive_command in postgresql.conf and restart the server:
wal_level = archive 
archive_mode = on
archive_command = 'test -f /opt/BACKUP/archiving/archiving_active && cp %p /opt/BACKUP/archive/%f'

mkdir /opt/BACKUP/archiving/
touch /opt/BACKUP/archiving/archiving_active

now run :
psql -c "select pg_start_backup('BACKUP')"
tar -cvz --exclude=$PGDATA/pg_xlog -f ../BACKUP/$BACKUPNAME $PGDATA
psql -c "select pg_stop_backup(), current_timestamp"

Now, let's crash and restore.

rm -rf /opt/PGDATA/*  (yoohoo !!!)
Untar the backup (.tgz) into $PGDATA; pg_xlog will be missing, so create it as the postgres user.
Then in $PGDATA create a file called recovery.conf and add:
restore_command = 'cp /opt/BACKUP/archive/%f %p'

Start the database and watch the logfile; it should show something like:

2013-03-27 13:22:58 EET::@:[3047]: LOG:  archive recovery complete
2013-03-27 13:22:58 EET::@:[3045]: LOG:  database system is ready to accept connections
2013-03-27 13:22:58 EET::@:[3061]: LOG:  autovacuum launcher started

the recovery.conf will also be automatically renamed to recovery.done.

Hot physical backup & Continuous Archiving

Now this is what you would want for a mission critical production installation with GBs or TBs of data and a lot of concurrent users hitting the DB 24/7.
For example's sake I will delete my whole cluster and do the steps one at a time; the backup will be taken locally, which of course is not suggested, and at the end I will perform a PITR and also talk about timelines.

Edit postgresql.conf and enable archiving:

wal_level = archive
archive_mode = on
archive_command = 'cp %p /opt/BACKUP/archives/%f'
(NOTE that archive_command can be scp, a more advanced external script, or anything that transfers the archived WALs to the desired location.)
Restart the server, then run:
psql -c "select pg_start_backup('my backup')"

You can now tar, rsync or whatever you want to another node; something like
"rsync -cva --inplace --exclude=*pg_xlog* ${PGDATA} $OTHERNODE:$BACKUPNAME/$PGDATA"
would work.

For my example, I will just use tar like in the previous example:
tar -cvz --exclude=/opt/PGDATA/pg_xlog/ -f /opt/BACKUP/backup.tgz $PGDATA

psql -c "select pg_stop_backup(), current_timestamp"

At this moment I have a base backup, and the mechanism that archives all WAL segments; let's add some data and force some checkpoints.

Notice that the archives directory now has WALs:
postgres@pglab1:/opt/PGDATA/pg_xlog$ ls -l /opt/BACKUP/archives/
total 49156
-rw------- 1 postgres postgres 16777216 Mar 27 13:57 000000010000000000000001
-rw------- 1 postgres postgres 16777216 Mar 27 14:02 000000010000000000000002
-rw------- 1 postgres postgres      293 Mar 27 14:02 000000010000000000000002.00000020.backup
-rw------- 1 postgres postgres 16777216 Mar 27 14:04 000000010000000000000003

A WAL segment switch happens on either a size or a time threshold: with the default postgresql.conf values that means when the 16MB segment fills up, or when archive_timeout (if set) forces a switch, whichever happens first. These parameters can, and should, be changed for performance's sake depending on your workload, so monitor your checkpoint frequency too.
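As an aside, the archived file names above are not random: each 24-character name is made of three 8-hex-digit fields (timeline, log file, segment). A small sketch in plain shell, decoding one of the names from the listing above:

```shell
# Decode a WAL segment file name into its three fields.
wal=000000010000000000000003
timeline=$((16#${wal:0:8}))   # first 8 hex digits: timeline ID
log=$((16#${wal:8:8}))        # next 8: log file number
seg=$((16#${wal:16:8}))       # last 8: segment within the log file
echo "timeline=$timeline log=$log seg=$seg"   # → timeline=1 log=0 seg=3
```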

Now let's say that something really bad happened, like a mistaken but committed update on a table. To make it easier for me, I created that table with a datetime column defaulting to now(), so we have:

         datetime          | count
 2013-03-27 14:05:05.999257 |  1000
 2013-03-27 14:05:14.911462 |  1000
 2013-03-27 14:05:19.419173 |  1000
 2013-03-27 14:05:25.631254 |  1000
 2013-03-27 14:06:39.97177   |  1000
 2013-03-27 14:09:53.571976 |  1000

Let's also assume that we know the bad update was recorded at 2013-03-27 14:05:25.631254 and we want the database back to just before that time.

Edit a recovery.conf as we did before:
restore_command = 'cp /opt/BACKUP/archives/%f %p'
recovery_target_time = '2013-03-27 14:04:00'

and restart the db; check the logfile, you'll see something like:
LOG:  starting point-in-time recovery to 2013-03-27 14:04:00+02

Now let's recover from a crash.
Once again: rm -rf /opt/PGDATA/*

Untar the base backup, place recovery.conf with or without the recovery_target_time,
and start the database.
In my example I also did a PITR to 2013-03-27 14:01:00,
and the table now has:
 1 | 2013-03-27 13:56:49.163269


PostgreSQL documentation describes timelines much better than I could, so here it is, straight from the documentation:
The ability to restore the database to a previous point in time creates some complexities that are akin to science-fiction stories about time travel and parallel universes. For example, in the original history of the database, suppose you dropped a critical table at 5:15PM on Tuesday evening, but didn't realize your mistake until Wednesday noon. Unfazed, you get out your backup, restore to the point-in-time 5:14PM Tuesday evening, and are up and running. In this history of the database universe, you never dropped the table. But suppose you later realize this wasn't such a great idea, and would like to return to sometime Wednesday morning in the original history. You won't be able to if, while your database was up-and-running, it overwrote some of the WAL segment files that led up to the time you now wish you could get back to. Thus, to avoid this, you need to distinguish the series of WAL records generated after you've done a point-in-time recovery from those that were generated in the original database history.
To deal with this problem, PostgreSQL has a notion of timelines. Whenever an archive recovery completes, a new timeline is created to identify the series of WAL records generated after that recovery. The timeline ID number is part of WAL segment file names so a new timeline does not overwrite the WAL data generated by previous timelines. It is in fact possible to archive many different timelines. While that might seem like a useless feature, it's often a lifesaver. Consider the situation where you aren't quite sure what point-in-time to recover to, and so have to do several point-in-time recoveries by trial and error until you find the best place to branch off from the old history. Without timelines this process would soon generate an unmanageable mess. With timelines, you can recover to any prior state, including states in timeline branches that you abandoned earlier.
Every time a new timeline is created, PostgreSQL creates a "timeline history" file that shows which timeline it branched off from and when. These history files are necessary to allow the system to pick the right WAL segment files when recovering from an archive that contains multiple timelines. Therefore, they are archived into the WAL archive area just like WAL segment files. The history files are just small text files, so it's cheap and appropriate to keep them around indefinitely (unlike the segment files which are large). You can, if you like, add comments to a history file to record your own notes about how and why this particular timeline was created. Such comments will be especially valuable when you have a thicket of different timelines as a result of experimentation.
The default behavior of recovery is to recover along the same timeline that was current when the base backup was taken. If you wish to recover into some child timeline (that is, you want to return to some state that was itself generated after a recovery attempt), you need to specify the target timeline ID in recovery.conf. You cannot recover into timelines that branched off earlier than the base backup.
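The timeline ID mentioned in the quoted docs is visible right in the WAL segment file names: the first 8 hex digits are the timeline, so after a recovery that branches to timeline 2, the new segment names start with 00000002. A tiny sketch of that naming rule, reusing a segment name from the archive listing above:

```shell
# The first 8 hex digits of a WAL segment name are the timeline ID;
# after an archive recovery the new timeline's segments use the next value.
old=000000010000000000000003
new_tl=$(printf "%08X" $(( 16#${old:0:8} + 1 )))
echo "${new_tl}${old:8}"   # → 000000020000000000000003
```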

I know that this probably needs a review, and the plan is to do it at some point.
Thanks for reading

Migrating Oracle to PostgreSQL

I have an Oracle dump from a database that I want to migrate to postgres for a project. I set up a 100GB VM, installed Oracle 11gR2 and compiled postgres 9.2.3 on the same VM. I restored the dump to Oracle, and then I had to make a migration strategy.
I had some ideas, like getting the schema from the dump with expdp (sqlfile=) and manually translating it to postgres, but then I remembered a project I saw some time ago called ora2pg. This cool thing is basically a Perl program that needs (apart from Perl, duh) DBI and DBD for Oracle and postgres. I set it up and started experimenting, first on the DDLs. The output was actually very good! The only things I had to change were some function based indexes; the rest was parseable by postgres without problems. Next was the data. No problems there: it created a 20GB output file that all I had to do was throw at postgres. But because I wasn't very proactive with my VM's disk space, I faced some problems... 20GB from the export, plus the Oracle data
(36664764 /opt/oradata/) + the Oracle software + other things:
that was about 65% of my disk space, and I also had to maintain a second live copy of that data plus indexes in postgres. So I gzipped the whole output file and used "zcat | psql". Worked like a charm; actually it's still running as I write this, because...
And here are some things to take into consideration before starting!
The dumpfile is 20GB; what happens if something goes wrong?
You drop the database, fix the DDL script and reapply the data.
Do it one object at a time, not like I did, or at least separate the small from the big tables in groups and do the big tables one at a time; otherwise something like:
ERROR: invalid input syntax for integer: "26.19"
CONTEXT: COPY twitter_tmp, line 1, column network_score: "26.19"

might end up fucking you up!
ora2pg translated a NUMBER column as integer, something that came back to bite me in the ass 2 hours after the import started...
I changed the DDL script and reran it; if it fails again I have to start over. SO, don't be like me, act smart, do it one part at a time :)
In all fairness, I still have no idea what tables are in the database, and my job has been made easy-mode by ora2pg. So I can't blame it; I can only blame myself for not MAKING A MIGRATION STRATEGY after all.
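For the record, if the table definitions are already loaded, one way to fix the mistranslated column (table and column names taken from the COPY error above) is to widen the type in place before re-running just that table's data, something like:

```sql
-- network_score was created as integer by ora2pg;
-- an Oracle NUMBER with decimals needs a numeric type in postgres
ALTER TABLE twitter_tmp ALTER COLUMN network_score TYPE numeric;
```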
I will also show you the parameters that I changed to make the import faster; the difference is actually HUGE:

shared_buffers = 2GB
synchronous_commit = off
wal_buffers = 16MB
wal_writer_delay = 1000ms
checkpoint_segments = 64  

Anyway, so far I think I'm good with the table definitions (including constraints) and the data (hopefully), but there is a lot more to be done: views, code etc...

thanks for reading
to be continued...

PostgreSQL Partitioning

Partitions are very useful when it comes to big tables; the documentation suggests applying table partitioning when a table is bigger than 10GB. In postgres there are 2 kinds of partitions:
  • Range
  • List
The implementation just needs the following steps.
  1. Enable constraint exclusion in config file
  2. Create a master table
  3. Create child tables WITHOUT overlapping table constraints
  4. Create indexes , pk's
  5. Create function and trigger to insert data to child tables
First thing to notice here is that partitioning is basically using table inheritance. But enough with how things work in theory; let's create one.

First , check into postgresql.conf for this parameter:

constraint_exclusion = partition 

Now, let's create the master and the child tables:

CREATE TABLE sales (
    sales_id serial NOT NULL,
    sales_date date NOT NULL DEFAULT current_date,
    description text
);

CREATE TABLE sales_2013_p1 (
CHECK ( sales_date >= DATE '2013-01-01' AND sales_date < DATE '2013-05-01' )
 ) INHERITS (sales);

 CREATE TABLE sales_2013_p2 (
CHECK ( sales_date >= DATE '2013-05-01' AND sales_date < DATE '2013-09-01' )
 ) INHERITS (sales);

 CREATE TABLE sales_2013_p3 (
CHECK ( sales_date >= DATE '2013-09-01' AND sales_date < DATE '2014-01-01' )
 ) INHERITS (sales);

Notice the keyword INHERITS here :)
Next, PKs and indexes on the child tables:

ALTER TABLE sales_2013_p1 ADD CONSTRAINT sales_2013_p1_pkey PRIMARY KEY (sales_id, sales_date);
ALTER TABLE sales_2013_p2 ADD CONSTRAINT sales_2013_p2_pkey PRIMARY KEY (sales_id, sales_date);
ALTER TABLE sales_2013_p3 ADD CONSTRAINT sales_2013_p3_pkey PRIMARY KEY (sales_id, sales_date);

CREATE INDEX idx_2013_p1 ON sales_2013_p1 (sales_date);
CREATE INDEX idx_2013_p2 ON sales_2013_p2 (sales_date);
CREATE INDEX idx_2013_p3 ON sales_2013_p3 (sales_date);

And finally, a function that returns trigger, and the on-insert trigger itself:

CREATE OR REPLACE FUNCTION sales_trig_func()
RETURNS TRIGGER AS $$
BEGIN
    IF ( NEW.sales_date >= DATE '2013-01-01' AND NEW.sales_date < DATE '2013-05-01' ) THEN
        INSERT INTO sales_2013_p1 VALUES (NEW.*);
    ELSIF ( NEW.sales_date >= DATE '2013-05-01' AND NEW.sales_date < DATE '2013-09-01' ) THEN
        INSERT INTO sales_2013_p2 VALUES (NEW.*);
    ELSIF ( NEW.sales_date >= DATE '2013-09-01' AND NEW.sales_date < DATE '2014-01-01' ) THEN
        INSERT INTO sales_2013_p3 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'Date out of range!';
    END IF;
    RETURN NULL;
END;
$$
LANGUAGE plpgsql;

CREATE TRIGGER insert_on_sales
    BEFORE INSERT ON sales
    FOR EACH ROW EXECUTE PROCEDURE sales_trig_func();

Now that we have a table with a basic partitioning schema, let's assume that we want to add more partitions for 2014. We create a new child table; for example's sake I will just create one partition for the whole of 2014.

CREATE TABLE sales_2014 (
CHECK ( sales_date >= DATE '2014-01-01' AND sales_date < DATE '2015-01-01' )
 ) INHERITS (sales);

ALTER TABLE sales_2014 ADD CONSTRAINT sales_2014_pkey PRIMARY KEY (sales_id, sales_date);

CREATE INDEX idx_2014 ON sales_2014 (sales_date);

CREATE OR REPLACE FUNCTION sales_trig_func()
RETURNS TRIGGER AS $$
BEGIN
    IF ( NEW.sales_date >= DATE '2013-01-01' AND NEW.sales_date < DATE '2013-05-01' ) THEN
        INSERT INTO sales_2013_p1 VALUES (NEW.*);
    ELSIF ( NEW.sales_date >= DATE '2013-05-01' AND NEW.sales_date < DATE '2013-09-01' ) THEN
        INSERT INTO sales_2013_p2 VALUES (NEW.*);
    ELSIF ( NEW.sales_date >= DATE '2013-09-01' AND NEW.sales_date < DATE '2014-01-01' ) THEN
        INSERT INTO sales_2013_p3 VALUES (NEW.*);
    ELSIF ( NEW.sales_date >= DATE '2014-01-01' AND NEW.sales_date < DATE '2015-01-01' ) THEN
        INSERT INTO sales_2014 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'Date out of range!';
    END IF;
    RETURN NULL;
END;
$$
LANGUAGE plpgsql;

and we are done!

Now let's say that the 2013_p1 data are obsolete and we want to move them to a historical database: drop the table, correct and replace the function, and you are done.
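That retirement step can be sketched in SQL (table names from the example above; make sure any rows you still need have been copied elsewhere first):

```sql
-- detach the obsolete child from the master, then drop it
ALTER TABLE sales_2013_p1 NO INHERIT sales;
DROP TABLE sales_2013_p1;
```

Remember to also remove the 2013_p1 branch from sales_trig_func(), otherwise inserts for that date range will raise errors.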
This is how the master table would look after these operations:

partition=# \d+ sales
                                               Table "public.sales"
   Column    |  Type   |                         Modifiers                         | Storage  | Stats target | Description
 sales_id    | integer | not null default nextval('sales_sales_id_seq'::regclass) | plain    |              |
 sales_date  | date    | not null default ('now'::text)::date                      | plain    |              |
 description | text    |                                                           | extended |              |
Triggers:
    insert_on_sales BEFORE INSERT ON sales FOR EACH ROW EXECUTE PROCEDURE sales_trig_func()
Child tables: sales_2013_p2,
              sales_2013_p3,
              sales_2014
Has OIDs: no

The good thing about this approach is that partitions are generally easy to maintain and administer; child tables can have different indexes from each other, and you can of course delete large portions of data that may no longer be needed just by dropping a partition. Performance is VERY good on insert and select, and maintenance work like reindex is faster; a reindex in particular won't lock the whole master table for writes.

Thanks for reading

PostgreSQL, Binary replication in practice

A couple of days ago I started writing a short howto about streaming replication in PostgreSQL 9.2. Most of this is well documented, but in this howto I will also experiment with switchovers and switchbacks. It aims to show how easy it is to set this up right out of the box.

Streaming replication PostgreSQL 9.2

For my example I will use 2 debian VMs (pglab1 and pglab2).

- not mandatory -

Exchange ssh keys for passwordless ssh; this might be used if we need to scp scripts, rsync, or whatever.

If you don't know how to do it, follow these steps:
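A sketch of those steps (host and user names follow this howto's examples; ssh-copy-id is commented out since it needs the real node to be reachable):

```shell
# Generate a passphrase-less key pair in a scratch directory...
keydir="$(mktemp -d)"
ssh-keygen -q -t rsa -N "" -f "$keydir/id_rsa"
ls "$keydir"
# ...then push the public key to the other node and test the login:
# ssh-copy-id -i "$keydir/id_rsa.pub" postgres@pglab2
# ssh postgres@pglab2 hostname
```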

After you create a postgres cluster using initdb,
edit the master's postgresql.conf and change the following:
listen_addresses = '*'
wal_level = hot_standby #(could be archive too)
max_wal_senders = 5
hot_standby = on

create a replication user :
create user repuser replication password 'passwd';

edit pg_hba.conf and add :

host    all          all      <client_ip>/32   trust
host    replication  repuser  <slave_ip>/32    md5


Now ssh to the slave, and from $PGDATA run:

pg_basebackup -D /opt/data/ -v -h <master_ip> -U repuser
Enter the password; this will transfer a full copy of your cluster from the master. Check the documentation of pg_basebackup for compression and the other options available.

In $PGDATA create a file called recovery.conf containing:
standby_mode = on
primary_conninfo = 'host=<master_ip> port=5432 user=repuser password=passwd'

With the master up, start the slave; it should say:
LOG: database system is ready to accept read only connections

At this point you will have 2 nodes running, your master accepting read/write operations and your slave accepting only read operations (reporting goes here, maybe?). Now let's say the master crashes and you need to fail over and promote the slave to be the new master.


shutdown master 
on slave, execute:
pg_ctl promote

that's it, your slave (pglab2) is now master accepting all kinds of connections.

Now let's say that the ex-master (pglab1) is fixed and ready to come up again.

- Switch back to original Master -

on the current master (pglab2) :

echo "select pg_start_backup('clone',true);" | psql pgbench
rsync -av --exclude=pg_xlog /opt/data/ pglab1:/opt/data/
echo "select pg_stop_backup();" | psql pgbench

This will sync all data from my current master (pglab2) to my current, soon-to-be slave (pglab1), which should currently be down.

Edit recovery.done and fix the IP to point at the current master,
rename recovery.done to recovery.conf,
and start the ex-master (pglab1), now as a slave. To switch back, promote it with the slave (pglab2) down,
then recreate the slave (pglab2) with rsync, edit its recovery.conf and start it again;
the servers now have their original roles.

Note that you can have 2 master databases, but this might (and probably will) create a mess, so be sure that you bring down the correct server before promoting.

If you want to check whether a server is acting as master or slave, run:

select pg_is_in_recovery();
If it's true, you're on a slave.

Thanks for reading.

PostgreSQL proper installation

Recently at work I got assigned to upgrade some (about 10)
postgres 8.1.x servers to 9. I have always liked compiling, basically because of the flexibility that compiling offers, and that's what I proposed to the guys in charge of the project. They gave me a test system (VM) to play with; in all fairness they were a bit skeptical about the idea of compiling the RDBMS, mostly for reliability reasons (don't ask me why). I explained that upgrading from source would be much easier later on, and that the work PostgreSQL developers have been doing the last year cannot be ignored (pg_basebackup for example). Since the latest CentOS package was 9.1.x they agreed, and I started working.
PostgreSQL is so easy to compile: no strange dependencies, not many libraries needed. And about those reliability concerns? The developers are neat, and I quote:

"While upgrades always have some risk, PostgreSQL minor releases fix only frequently-encountered security and data corruption bugs to reduce the risk of upgrading. The community considers not upgrading to be riskier than upgrading."

You should never find an unexpected change that breaks an application in a minor PostgreSQL upgrade. Bug, security, and corruption fixes are always done in a way that minimizes the odds of introducing an externally visible behavior change, and if that's not possible, the reason why and the suggested workarounds will be detailed in the release notes. What you will find is that some subtle problems, resulting from resolved bugs, can clear up even after a minor version update. It's not uncommon to discover a report of a problem to one of the PostgreSQL mailing lists is resolved in the latest minor version update compatible with that installation, and upgrading to that version is all that's needed to make the issue go away.

So don't be afraid to upgrade. DO IT.

If you want to compile postgres you will need (apart from the basic development tools) the following 2 libraries:

zlib1g-dev (compression library, needed if you want to compress directly from postgres, pg_dump -Z for example)

libreadline6-dev (used by the client to support history in psql)

You could call both of these libraries optional, and you actually can compile postgres without them, but meh, don't do that...

Other than that it's pretty much straightforward:
./configure && make && make install (check the documentation for extra flags, change of prefix etc.)

Postgres will install the binaries by default in /usr/local/pgsql/.
Packages, on the other hand, will install into /usr/bin (MEGA SUPER LAME)

SO! Let's see the current situation. I inherited a postgresql server with a package installation: binaries in /usr/bin and the postgres home in /var/lib.
If it had been compiled I could go to /usr/local/pgsql, rename it to /usr/local/pgsql8, remove it from the path and let it be; but I couldn't do that, so I fixed the path, putting /usr/local/pgsql before /usr/bin so that the first postgres binaries in path order were the 9's, and I was OK. But still! Packages tend to integrate deep into the OS, making any other option (like running 2 postgres servers at the same time) more difficult than it should be.
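That PATH fix can be sketched as below (the /usr/local/pgsql prefix is the compile-time default mentioned above; the ordering takes effect whether or not the directory exists yet):

```shell
# Put the compiled binaries ahead of the packaged ones in /usr/bin.
export PATH=/usr/local/pgsql/bin:$PATH
echo "$PATH" | cut -d: -f1   # → /usr/local/pgsql/bin
```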

I like adding some of the contrib packages to postgres for logging, monitoring and tuning reasons; pg_buffercache, pgbench, pg_stat_statements and pg_freespacemap are some examples.
In a compiled environment there is nothing easier than that: nothing extra to download, not much to do, just compile the contrib module and add it to the database with CREATE EXTENSION.
Now let's say that after a while a new release of postgres comes out and I want to upgrade, just to stay current, or because a new wow feature was added, or a bug was resolved. All I have to do is compile the new version (assuming it's a minor release), replace the binaries and restart the server. A package-installed postgres server would have to wait until the distribution released a package of the new version, usually a few versions behind current, which means what? A month after the release date? Maybe more!?

Basically the only thing that I can imagine being easier with a package installation is the init scripts, and OK, I don't think that is a huge pro compared to a compiled version.

SO my suggestion is , compile the damn thing.. COMPILE !

Thanks for reading

Managing xid wraparound without looking like a (mail) chimp

My colleague Payal came across an outage that happened to mailchimp's mandrill app yesterday, link can be found HERE . Since this was P...