PostgreSQL on ZFS with BPF tracing on top.
At OmniTI we love solaris, my personal favourite features are ZFS and DTrace. Unfortunately not many run postgres on solaris so i have decided to implement similar features in linux. Instead of Dtrace i'll install BPF, in-kernel bytecode that can be used for tracing introduced in recent kernels (4.X).
This post will be a part of a three series post. In this post we'll start with setup, in part #2 with ZFS and how to use it for backups / snapshots. In part #3 we'll dig into BPF a bit more.
Step 1 is to setup a new ubuntu. I setup a VM using ubuntu-16.04.2-server-amd64.iso.
As root :
Add the repo for bcc :
> echo "deb [trusted=yes] https://repo.iovisor.org/apt/xenial xenial-nightly main" | sudo tee /etc/apt/sources.list.d/iovisor.list
sudo apt-get update
Install all necessary and some optional packages :
> apt-get install -y sudo wget apt-transport-https joe less build-essential libreadline-dev \
zlib1g-dev flex bison libxml2-dev libxslt-dev libssl-dev openssh-server screen git \
binutils bcc bcc-tools libbcc-examples python-bcc zfsutils-linux \
systemtap systemtap-sdt-dev
Make sure kernel is properly compiled :
> cat /boot/config-`uname -r` |grep BPF
CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_NETFILTER_XT_MATCH_BPF=m
CONFIG_NET_CLS_BPF=m
CONFIG_NET_ACT_BPF=m
CONFIG_BPF_JIT=y
CONFIG_HAVE_BPF_JIT=y
CONFIG_BPF_EVENTS=y
CONFIG_TEST_BPF=m
Test BCC (stands for BPF Compiler Collection)
> python /usr/share/bcc/examples/tracing/bitehist.py
Tracing... Hit Ctrl-C to end.
^C
kbytes : count distribution
0 -> 1 : 7 |************ |
2 -> 3 : 0 | |
4 -> 7 : 22 |****************************************|
8 -> 15 : 19 |********************************** |
16 -> 31 : 8 |************** |
32 -> 63 : 6 |********** |
64 -> 127 : 1 |* |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 1 |* |
Now its time to install postgres on a zfs partition, in my case i had a disk (sdb) attached on my VM :
> fdisk /dev/sdb
Welcome to fdisk (util-linux 2.27.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.
Device does not contain a recognized partition table.
Created a new DOS disklabel with disk identifier 0x4226e0bf.
Command (m for help): n
Partition type
p primary (0 primary, 0 extended, 4 free)
e extended (container for logical partitions)
Select (default p): p
Partition number (1-4, default 1):
First sector (2048-41943039, default 2048):
Last sector, +sectors or +size{K,M,G,T,P} (2048-41943039, default 41943039):
Created a new partition 1 of type 'Linux' and of size 20 GiB.
Command (m for help): w
The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.
To create the pool :
> sudo zpool create pg_zfs /dev/sdb1
> zpool status
pool: pg_zfs
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
pg_zfs ONLINE 0 0 0
sdb1 ONLINE 0 0 0
errors: No known data errors
> mount |grep pg_
pg_zfs on /pg_zfs type zfs (rw,relatime,xattr,noacl)
> ls /pg_zfs/ -l
total 0
> cd /pg_zfs/
> mkdir pgsql
> mkdir pgdata
> chown postgres:postgres pgsql/
> chown postgres:postgres pgdata/
Now with everything ready compile postgres from source :
> wget -c https://ftp.postgresql.org/pub/source/v10beta1/postgresql-10beta1.tar.gz
> tar zxfv postgresql-10beta1.tar.gz
> cd postgresql-10*
> ./configure --prefix=/pg_zfs/pgsql/ --enable-dtrace
> make -j 4 world
> make -j 4 install-world
> export PATH=$PATH:/pg_zfs/pgsql/bin
> export LD_LIBRARY_PATH=/pg_zfs/pgsql/lib
> export PGDATA=/pg_zfs/pgdata
> initdb
> pg_ctl start
At this point, postgres binaries and datafiles are on zfs. Now to check the probes we have available :
/usr/share/bcc/tools/tplist -l /pg_zfs/pgsql/bin/postgres |awk {'print $2'}
postgresql:clog__checkpoint__start
postgresql:clog__checkpoint__done
postgresql:multixact__checkpoint__start
postgresql:multixact__checkpoint__done
postgresql:subtrans__checkpoint__start
postgresql:subtrans__checkpoint__done
postgresql:twophase__checkpoint__start
postgresql:twophase__checkpoint__done
postgresql:transaction__start
postgresql:transaction__commit
postgresql:transaction__abort
postgresql:wal__buffer__write__dirty__start
postgresql:wal__buffer__write__dirty__done
postgresql:wal__switch
postgresql:checkpoint__start
postgresql:checkpoint__done
postgresql:wal__insert
postgresql:statement__status
postgresql:buffer__flush__start
postgresql:buffer__flush__done
postgresql:buffer__read__start
postgresql:buffer__read__done
postgresql:buffer__write__dirty__start
postgresql:buffer__write__dirty__done
postgresql:buffer__sync__done
postgresql:buffer__sync__start
postgresql:buffer__sync__written
postgresql:buffer__checkpoint__start
postgresql:buffer__checkpoint__sync__start
postgresql:buffer__checkpoint__done
postgresql:lock__wait__start
postgresql:lock__wait__done
postgresql:deadlock__found
postgresql:lwlock__wait__start
postgresql:lwlock__wait__done
postgresql:lwlock__acquire
postgresql:lwlock__condacquire
postgresql:lwlock__condacquire__fail
postgresql:lwlock__acquire__or__wait
postgresql:lwlock__acquire__or__wait__fail
postgresql:lwlock__release
postgresql:smgr__md__read__start
postgresql:smgr__md__read__done
postgresql:smgr__md__write__start
postgresql:smgr__md__write__done
postgresql:query__parse__start
postgresql:query__parse__done
postgresql:query__rewrite__start
postgresql:query__rewrite__done
postgresql:query__plan__start
postgresql:query__plan__done
postgresql:query__start
postgresql:query__done
postgresql:query__execute__start
postgresql:query__execute__done
postgresql:sort__start
postgresql:sort__done
To make sure tracing works properly, while running a statement on a different terminal :
> sudo /usr/share/bcc/tools/dbslower postgres -p 1208
Tracing database queries for pids 1208 slower than 1 ms...
TIME(s) PID MS QUERY
2.729496 1208 2399.665 insert into test select * from generate_series (1,100000);
Thanks for reading.
Vasilis Ventirozos
OmniTI Computer Consulting
This post will be a part of a three series post. In this post we'll start with setup, in part #2 with ZFS and how to use it for backups / snapshots. In part #3 we'll dig into BPF a bit more.
Step 1 is to setup a new ubuntu. I setup a VM using ubuntu-16.04.2-server-amd64.iso.
As root :
Add the repo for bcc :
> echo "deb [trusted=yes] https://repo.iovisor.org/apt/xenial xenial-nightly main" | sudo tee /etc/apt/sources.list.d/iovisor.list
sudo apt-get update
Install all necessary and some optional packages :
> apt-get install -y sudo wget apt-transport-https joe less build-essential libreadline-dev \
zlib1g-dev flex bison libxml2-dev libxslt-dev libssl-dev openssh-server screen git \
binutils bcc bcc-tools libbcc-examples python-bcc zfsutils-linux \
systemtap systemtap-sdt-dev
Make sure kernel is properly compiled :
> cat /boot/config-`uname -r` |grep BPF
CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_NETFILTER_XT_MATCH_BPF=m
CONFIG_NET_CLS_BPF=m
CONFIG_NET_ACT_BPF=m
CONFIG_BPF_JIT=y
CONFIG_HAVE_BPF_JIT=y
CONFIG_BPF_EVENTS=y
CONFIG_TEST_BPF=m
Test BCC (stands for BPF Compiler Collection)
> python /usr/share/bcc/examples/tracing/bitehist.py
Tracing... Hit Ctrl-C to end.
^C
kbytes : count distribution
0 -> 1 : 7 |************ |
2 -> 3 : 0 | |
4 -> 7 : 22 |****************************************|
8 -> 15 : 19 |********************************** |
16 -> 31 : 8 |************** |
32 -> 63 : 6 |********** |
64 -> 127 : 1 |* |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 1 |* |
Now its time to install postgres on a zfs partition, in my case i had a disk (sdb) attached on my VM :
> fdisk /dev/sdb
Welcome to fdisk (util-linux 2.27.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.
Device does not contain a recognized partition table.
Created a new DOS disklabel with disk identifier 0x4226e0bf.
Command (m for help): n
Partition type
p primary (0 primary, 0 extended, 4 free)
e extended (container for logical partitions)
Select (default p): p
Partition number (1-4, default 1):
First sector (2048-41943039, default 2048):
Last sector, +sectors or +size{K,M,G,T,P} (2048-41943039, default 41943039):
Created a new partition 1 of type 'Linux' and of size 20 GiB.
Command (m for help): w
The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.
To create the pool :
> sudo zpool create pg_zfs /dev/sdb1
> zpool status
pool: pg_zfs
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
pg_zfs ONLINE 0 0 0
sdb1 ONLINE 0 0 0
errors: No known data errors
> mount |grep pg_
pg_zfs on /pg_zfs type zfs (rw,relatime,xattr,noacl)
> ls /pg_zfs/ -l
total 0
> cd /pg_zfs/
> mkdir pgsql
> mkdir pgdata
> chown postgres:postgres pgsql/
> chown postgres:postgres pgdata/
Now with everything ready compile postgres from source :
> wget -c https://ftp.postgresql.org/pub/source/v10beta1/postgresql-10beta1.tar.gz
> tar zxfv postgresql-10beta1.tar.gz
> cd postgresql-10*
> ./configure --prefix=/pg_zfs/pgsql/ --enable-dtrace
> make -j 4 world
> make -j 4 install-world
> export PATH=$PATH:/pg_zfs/pgsql/bin
> export LD_LIBRARY_PATH=/pg_zfs/pgsql/lib
> export PGDATA=/pg_zfs/pgdata
> initdb
> pg_ctl start
At this point, postgres binaries and datafiles are on zfs. Now to check the probes we have available :
/usr/share/bcc/tools/tplist -l /pg_zfs/pgsql/bin/postgres |awk {'print $2'}
postgresql:clog__checkpoint__start
postgresql:clog__checkpoint__done
postgresql:multixact__checkpoint__start
postgresql:multixact__checkpoint__done
postgresql:subtrans__checkpoint__start
postgresql:subtrans__checkpoint__done
postgresql:twophase__checkpoint__start
postgresql:twophase__checkpoint__done
postgresql:transaction__start
postgresql:transaction__commit
postgresql:transaction__abort
postgresql:wal__buffer__write__dirty__start
postgresql:wal__buffer__write__dirty__done
postgresql:wal__switch
postgresql:checkpoint__start
postgresql:checkpoint__done
postgresql:wal__insert
postgresql:statement__status
postgresql:buffer__flush__start
postgresql:buffer__flush__done
postgresql:buffer__read__start
postgresql:buffer__read__done
postgresql:buffer__write__dirty__start
postgresql:buffer__write__dirty__done
postgresql:buffer__sync__done
postgresql:buffer__sync__start
postgresql:buffer__sync__written
postgresql:buffer__checkpoint__start
postgresql:buffer__checkpoint__sync__start
postgresql:buffer__checkpoint__done
postgresql:lock__wait__start
postgresql:lock__wait__done
postgresql:deadlock__found
postgresql:lwlock__wait__start
postgresql:lwlock__wait__done
postgresql:lwlock__acquire
postgresql:lwlock__condacquire
postgresql:lwlock__condacquire__fail
postgresql:lwlock__acquire__or__wait
postgresql:lwlock__acquire__or__wait__fail
postgresql:lwlock__release
postgresql:smgr__md__read__start
postgresql:smgr__md__read__done
postgresql:smgr__md__write__start
postgresql:smgr__md__write__done
postgresql:query__parse__start
postgresql:query__parse__done
postgresql:query__rewrite__start
postgresql:query__rewrite__done
postgresql:query__plan__start
postgresql:query__plan__done
postgresql:query__start
postgresql:query__done
postgresql:query__execute__start
postgresql:query__execute__done
postgresql:sort__start
postgresql:sort__done
To make sure tracing works properly, while running a statement on a different terminal :
> sudo /usr/share/bcc/tools/dbslower postgres -p 1208
Tracing database queries for pids 1208 slower than 1 ms...
TIME(s) PID MS QUERY
2.729496 1208 2399.665 insert into test select * from generate_series (1,100000);
Thanks for reading.
Vasilis Ventirozos
OmniTI Computer Consulting
"Unfortunately not many run postgres on solaris". Given that Joyent make a living out of doing this - Postgres on SmartOS, can you comment further on your doubts.
ReplyDeleteWe do the same at OmniTI, all our internally hosted clients are on OmniOS, i know what Joyent does and how they make a living but still, you can't deny that the strong majority would go for linux.
DeleteIndeed. Thanks for your thoughts
ReplyDelete/d