Friday, 16 June 2017

PostgreSQL on ZFS with BPF tracing on top.

At OmniTI we love solaris, my personal favourite features are ZFS and DTrace. Unfortunately not many run postgres on solaris so i have decided to implement similar features in linux. Instead of Dtrace i'll install BPF, in-kernel bytecode that can be used for tracing introduced in recent kernels (4.X). 
This post will be a part of a three series post. In this post we'll start with setup, in part #2 with ZFS and how to use it for backups / snapshots. In part #3 we'll dig into BPF a bit more.

Step 1 is to setup a new ubuntu. I setup a VM using ubuntu-16.04.2-server-amd64.iso.

As root :
Add the repo for bcc :
> echo "deb [trusted=yes] xenial-nightly main" | sudo tee /etc/apt/sources.list.d/iovisor.list

sudo apt-get update

Install all necessary and some optional packages :

> apt-get install -y sudo wget apt-transport-https joe less build-essential libreadline-dev \
zlib1g-dev flex bison libxml2-dev libxslt-dev libssl-dev openssh-server screen git \
binutils bcc bcc-tools libbcc-examples python-bcc zfsutils-linux \
systemtap systemtap-sdt-dev

Make sure kernel is properly compiled :

> cat /boot/config-`uname -r` |grep BPF

Test BCC (stands for BPF Compiler Collection)

> python /usr/share/bcc/examples/tracing/
Tracing... Hit Ctrl-C to end.
     kbytes              : count     distribution
         0 -> 1          : 7        |************                            |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 22       |****************************************|
         8 -> 15         : 19       |**********************************      |
        16 -> 31         : 8        |**************                          |
        32 -> 63         : 6        |**********                              |
        64 -> 127        : 1        |*                                       |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 1        |*                                       |

Now its time to install postgres on a zfs partition, in my case i had a disk (sdb) attached on my VM :

> fdisk /dev/sdb

Welcome to fdisk (util-linux 2.27.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

Device does not contain a recognized partition table.
Created a new DOS disklabel with disk identifier 0x4226e0bf.

Command (m for help): n
Partition type
   p   primary (0 primary, 0 extended, 4 free)
   e   extended (container for logical partitions)
Select (default p): p
Partition number (1-4, default 1):
First sector (2048-41943039, default 2048):
Last sector, +sectors or +size{K,M,G,T,P} (2048-41943039, default 41943039):

Created a new partition 1 of type 'Linux' and of size 20 GiB.

Command (m for help): w
The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.

To create the pool :

> sudo zpool create pg_zfs /dev/sdb1

> zpool status
  pool: pg_zfs
 state: ONLINE
  scan: none requested

pg_zfs      ONLINE       0     0     0
 sdb1      ONLINE       0     0     0

errors: No known data errors

> mount |grep pg_
pg_zfs on /pg_zfs type zfs (rw,relatime,xattr,noacl)

> ls /pg_zfs/ -l
total 0

> cd /pg_zfs/
> mkdir pgsql
> mkdir pgdata
> chown postgres:postgres pgsql/
> chown postgres:postgres pgdata/

Now with everything ready compile postgres from source :

> wget -c

> tar zxfv postgresql-10beta1.tar.gz

> cd postgresql-10*
> ./configure --prefix=/pg_zfs/pgsql/ --enable-dtrace
> make -j 4 world
> make -j 4 install-world
> export PATH=$PATH:/pg_zfs/pgsql/bin
> export LD_LIBRARY_PATH=/pg_zfs/pgsql/lib
> export PGDATA=/pg_zfs/pgdata
> initdb
> pg_ctl start

At this point, postgres binaries and datafiles are on zfs. Now to check the probes we have available :

/usr/share/bcc/tools/tplist -l /pg_zfs/pgsql/bin/postgres  |awk {'print $2'}

To make sure tracing works properly, while running a statement on a different terminal :

> sudo /usr/share/bcc/tools/dbslower postgres -p 1208
Tracing database queries for pids 1208 slower than 1 ms...
TIME(s)        PID          MS QUERY
2.729496       1208   2399.665 insert into test select * from generate_series (1,100000);

To be continued ...
-- Vasilis Ventirozos

Saturday, 10 June 2017

An unusual upgrade

I have mentioned in previous posts that in my 4 years with OmniTI, we've tackled a lot of migrations. Most of them are usually the "typical" procedure. The methodology we use is more or less explained here. Last week we had a usecase for a kind of "unusual" upgrade, a 9.2 compiled with 
"--disable-integer-datetimes" meaning that all datetimes were represented as floating point internally, something that was the default at up to 8.3. This changed at (i think) 8.4 where datetimes were represented as int64 which offers more precision. 
The requirement was to migrate the database to a new one that will use integer datetimes with the minimum possible downtime. Obviously a direct upgrade wouldn't work and pg_dump / restore was not an option so we decided to approach and test this scenario differently.

The general idea is the following :

Upgrade to a 9.6 that was compiled with "--disable-integer-datetimes" and then using something like pglogical or mimeo to replicate to another 9.6 that would use integer datetimes. For this, i used 2 containers and pagila test database to make this simulation as much realistic as i could. In this post i will describe the i steps I followed.

Installed both 9.2 and 9.6 on the same box :

9.2.21 with the following options :
./configure --prefix=/home/postgres/pgsql92/ --disable-integer-datetimes
make -j 8 world
sudo make install-world

9.6.2 with the following options :
./configure --prefix=/home/postgres/pgsql96/ --disable-integer-datetimes
make -j 8 world
sudo make install-world

initiated a new cluster and started 9.2, loaded pagila testdb (schema and data), started the database.
From now on this will act like my production database.

downloaded and installed pglogical 2.0.1 9.6 :
wget -c
uncompress :
tar jxfv pglogical-2.0.1.tar.bz2
compile and install :
make USE_PGXS=1 clean all
sudo PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/postgres/pgsql96/bin make USE_PGXS=1 install

Note: i installed 9.6 before upgrading because pglogical requires some changes in the parameters (shown later) and a library loaded, so in order not to restart twice i had it installed before the upgrade.

initiated a new 9.6 cluster so i can upgrade, stopped 9.2 and upgraded 9.2 to 9.6 :

pre upgrade check :
pgsql96/bin/pg_upgrade -b /home/postgres/pgsql92/bin/ -B /home/postgres/pgsql96/bin/ -c -d /home/postgres/pgdata92/ -D /home/postgres/pgdata96/ -v
stopped 9.2 :
-- outage starts --
pgsql92/bin/pg_ctl -D /home/postgres/pgdata92/ stop
pgsql96/bin/pg_upgrade -b /home/postgres/pgsql92/bin/ -B /home/postgres/pgsql96/bin/  -d /home/postgres/pgdata92/ -D /home/postgres/pgdata96/ -v -k

added the following in postgresql.conf :
wal_level = 'logical'
max_worker_processes = 10
max_replication_slots = 10
max_wal_senders = 10
shared_preload_libraries = 'pglogical'
track_commit_timestamp = on

on master hba.conf (and slave, cause why not) added :
host    replication     postgres             trust  
(security was not a concern so "trust" was ok)

started 9.6
-- outage stops --
analyzed 9.6 and cleaned up 9.2
and issued "create extension pg_logical;" to postgres

At this point i had my "production" db upgraded to 9.6 with pglogical installed and everything ready for logical replication.

On the second box that would have postgres 9.6,compiled without the "--disable-integer-datetimes" flag. I installed pglogical with exactly the same way i did for the first box and at this point i was ready to replicate :

on production (provider) I created a new node and added a set with all objects in public schema:
SELECT pglogical.create_node( node_name := 'provider1', dsn := 'host= port=5432 dbname=monkey' );
SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']);

Keep in mind :
ERROR:  table payment_p2007_01 cannot be added to replication set default
DETAIL:  table does not have PRIMARY KEY and given replication set is configured to replicate UPDATEs and/or DELETEs
HINT:  Add a PRIMARY KEY to the table

ALL tables that are going to be replicated need to have a primary key.
after adding a pk to the tables that didn't have one i went to the slave and i did :

Create a node for the subscriber :
SELECT pglogical.create_node(node_name := 'subscriber1',dsn := 'host= port=5432 dbname=monkey');

Started the replication process which sync'd (schema and data): 
SELECT pglogical.create_subscription( subscription_name := 'subscription1',provider_dsn := 'host= port=5432 dbname=monkey', synchronize_structure := true);

To verify that these 2 databases have different storage types :

postgres@old_server:~/pgdata96$ pg_controldata |grep Date
Date/time type storage:               floating-point numbers

postgres@new_server:~/pgdata$ pg_controldata |grep Date
Date/time type storage:               64-bit integers

The database was transferred and from now on replicated.
At this point, if this was the real deal it would be preferred to first transfer the schema and then start replication just to be sure that all objects will transfer but in my case i didn't really care about that.

Keep in mind that since postgres 10 floating point datetimes are no longer supported. If you tried to compile it it would give :

postgres@a1bdb0750dc5:~/postgresql-10beta1$ ./configure --disable-integer-datetimes
checking build system type... x86_64-pc-linux-gnu
checking host system type... x86_64-pc-linux-gnu
checking which template to use... linux
configure: error: --disable-integer-datetimes is no longer supported 

This is a pretty specialized scenario, but from what i saw there are some databases out there having their date times stored as floating points that hopefully could benefit from this migration procedure.

Thanks for reading
-- Vasilis Ventirozos

Friday, 9 June 2017

Tip for faster wal replay on a slave

I've been in situations where i need a slave db to replay a lot of wal files fast, and by a lot i mean tens of thousands. This could happen because of a reporting database refreshing or simply because a slave was down for an extended period of time. It's known that lowering shared_buffers speeds up wal replay for obvious reasons, but by how much ?

I did a benchmark on an old server and the results are interesting :

With 32GB of shared buffers and with 6390Mb of wals (1840 wal files)
it took 1408 seconds to complete the replay.

With 64MB of shared buffers and with 6510Mb of wals (1920 wal files)
it took 1132 seconds to complete the replay.

My test was done by stopping the slave, inserting 50 mil rows to a test table, wait for the wal transfer to complete, then stop the master and start the slave and watch OmniPITR logs.

The performance gain in wal replay was about 20% in postgres 10beta1 which doesn't sound bad, especially in times of need.

Thanks for reading
-- Vasilis