Tuesday, 17 December 2013

scaling postgres with pgpool-II

Recently a new release of pgpool-II was released, 3.2.7 (released 2013-12-06). For connection pooling I mostly have experience with pgbouncer and in OmniTI we use it a lot. I heard that pgpool has some issues and because of its increased complexity i don't know if i would recommend it as a connection pool but pgpool does much more than that.

What is pgpool-II (from pg-pool wiki) :

pgpool-II is a middleware that works between PostgreSQL servers and a PostgreSQL database client. It is licensed under BSD license. It provides the following features.
  • Connection Pooling
pgpool-II saves connections to the PostgreSQL servers, and reuse them whenever a new connection with the same properties (i.e. username, database, protocol version) comes in. It reduces connection overhead, and improves system's overall throughput.
  • Replication
pgpool-II can manage multiple PostgreSQL servers. Using the replication function enables creating a realtime backup on 2 or more physical disks, so that the service can continue without stopping servers in case of a disk failure.
  • Load Balancing
If a database is replicated, executing a SELECT query on any server will return the same result. pgpool-II takes an advantage of the replication feature to reduce the load on each PostgreSQL server by distributing SELECT queries among multiple servers, improving system's overall throughput. At best, performance improves proportionally to the number of PostgreSQL servers. Load balance works best in a situation where there are a lot of users executing many queries at the same time.
  • Limiting Exceeding Connections
There is a limit on the maximum number of concurrent connections with PostgreSQL, and connections are rejected after this many connections. Setting the maximum number of connections, however, increases resource consumption and affect system performance. pgpool-II also has a limit on the maximum number of connections, but extra connections will be queued instead of returning an error immediately.
  • Parallel Query
Using the parallel query function, data can be divided among the multiple servers, so that a query can be executed on all the servers concurrently to reduce the overall execution time. Parallel query works the best when searching large-scale data.

The feature that im mostly interested and that i want to explore in this post is load ballancing. I like this feature because without changing the application you can elevate your replica one step and make it even more useful.

I setup 3 virtual machines that i called pgpool1, pgnode1, pgnode2.
I setup a basic installation of debian and i compiled pgpool-II (3.3.2) and PostgreSQL 9.3.2.
As a first step i setup a basic streaming replication between pgnode1 and pgnode2 and i set pgpool1 for Load ballancing, all configurations can be found in my github. I wont go much into detail on how to set pgpool up, my configuration files should be a good guide and i found that pgpool documentation is very informative and up to date, also many before me have blogged about how to configure it.
I will stand more on the design and what could you do with this.
As we all know streaming replication is almost instant so separating "read" from "write" can be beneficial in a large variety of applications, but especially for more reads than writes, so lets see how this works :

root@pgpool1:~# psql -p 9999 -U postgres
postgres=# create database monkey;
postgres=# \connect monkey;
You are now connected to database "monkey" as user "postgres".
monkey=# create table test (id int);
monkey=# insert into test values (1);
monkey=# select * from test;
(1 row)

With full logging enabled in the dbs lets see the logs from pgnode1 and pgnode2 :

LOG:  duration: 342.044 ms  statement: create database monkey;
LOG:  duration: 4.646 ms  statement: create table test (id int);
LOG:  duration: 1.733 ms  statement: insert into test values (1);

LOG:  duration: 1.018 ms  statement: select * from test;

OK it works as expected, now i heard that this doesnt work if select has a space in the begining so i tried to test that as well :

on pgnode2 (replica) :
LOG:  duration: 0.757 ms  statement:   selecT * from test;

LOG:  duration: 0.215 ms  statement: SeLeCt   * from test /* this is my test */;

something more complicated maybe? something like pgbench sounds appropriate.

I wont spam more logs, it worked as it should.
But what would happen with 2 replicas ? i know that read operations should be scaled and that they should be distributed among the slaves, i created another database cluster on pgnode2 (port 5433) and made it replica to pgnode1,
so now i had one master running on pgnode1 and two slaves running on pgnode2 (5432 and 5433).
I changed pgpool configuration to add the new db :

backend_hostname0 = ''
backend_port0 = 5432
backend_weight0 = 0
backend_data_directory0 = '/opt/pgdata'

backend_hostname1 = ''
backend_port1 = 5432
backend_weight1 = 1
backend_data_directory1 = '/opt/pgdata'

backend_hostname2 = ''
backend_port2 = 5433
backend_weight2 = 1
backend_data_directory2 = '/opt/pgdata2'

and from pgpool1 i run :
while true; do psql -p 9999  -c 'selecT * from test;'  -U postgres ;done

and VoilĂ  , i had my selects distributed on my 2 slaves.

pgpool has A LOT of cool features, im not in position to say if they all work as they should but from my tests on load ballancing everything was smooth, pgpool has also a web based gui worth checking and some system views that can be used to monitor the activity, also supports automatic failover, watchdog and many many other neat features that i might explore in another post.

Next time i will explore Parallel query.

all the configurations can be found in my github (including the postgres configs that were used).

Thanks for reading
- Vasilis

No comments:

Post a Comment