Friday, 19 May 2017

Reusing an old master (as slave) after a switchover

Today's blog post is old news, but it's not very well known, and it has to do with how to add an old master back after a slave has been promoted. Fujii Masao explained the situation in his patch back in the day.

So in today's post I will demonstrate a combination of replication slots (for retaining the writes on a promoted slave) and how to re-introduce an old master to the replication.

Say that we have a master-slave setup. Stop the master gracefully, then create and activate a physical replication slot on the slave:

monkey=# SELECT * FROM pg_create_physical_replication_slot('this_is_a_replica_slot');
       slot_name        | xlog_position
------------------------+---------------
 this_is_a_replica_slot |
(1 row)

postgres@bf9823730feb:~$ pg_receivexlog -D . -S this_is_a_replica_slot -v -h 10.0.0.3 -U repuser
pg_receivexlog: starting log streaming at 0/4000000 (timeline 1) ^C
pg_receivexlog: received interrupt signal, exiting
pg_receivexlog: not renaming "000000010000000000000004.partial", segment is not complete
pg_receivexlog: received interrupt signal, exiting


This way, all the changes after the slave promotion will be retained. Now, let's promote the slave and check the replication slot status.

postgres@bf9823730feb:~$ pg_ctl promote
server promoting

postgres@bf9823730feb:~$ psql -c "SELECT slot_name, database,active,  pg_xlog_location_diff(pg_current_xlog_insert_location(), restart_lsn) AS retained_bytes FROM pg_replication_slots;" postgres
       slot_name        | database | active | retained_bytes
------------------------+----------+--------+----------------
 this_is_a_replica_slot |          | f      |           9056

As you can see, the new master is now retaining all writes. Now, on the old master, let's create a recovery.conf that looks like this:

standby_mode = 'on'
primary_conninfo = 'user=repuser host=10.0.0.3 port=5432 application_name=a_slave'
trigger_file = '/home/postgres/pgdata/finish.recovery'
primary_slot_name = 'this_is_a_replica_slot'
recovery_target_timeline = 'latest'

where host is obviously the new master. 
Start the old master and you should see something similar to:

LOG:  entering standby mode
LOG:  consistent recovery state reached at 0/5000098
LOG:  database system is ready to accept read only connections
LOG:  invalid record length at 0/5000098: wanted 24, got 0
LOG:  fetching timeline history file for timeline 2 from primary server
LOG:  started streaming WAL from primary at 0/5000000 on timeline 1
LOG:  replication terminated by primary server
DETAIL:  End of WAL reached on timeline 1 at 0/5000098.
LOG:  new target timeline is 2
LOG:  restarted WAL streaming at 0/5000000 on timeline 2
LOG:  redo starts at 0/5000098
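
Once the old master connects, a quick check on the new master should show the slot active and the standby streaming (a sketch, using the slot and application_name from above):

psql -c "SELECT slot_name, active FROM pg_replication_slots;" postgres
psql -c "SELECT application_name, state, sync_state FROM pg_stat_replication;" postgres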


This would also work if you had archiving enabled instead of replication slots. Given that you have plenty of available disk space on the new master, it should be fine to keep the old master down for any reasonable amount of time and re-introduce it to the replication without issues.

The only two things you have to make sure of are that postgres on the old master is shut down gracefully and that the slave has caught up before the promotion.
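
A quick way to verify the latter on the slave, before promoting it, could be something like this (a sketch using the 9.x function names):

psql -c "SELECT pg_last_xlog_receive_location() AS received,
                pg_last_xlog_replay_location()  AS replayed;" postgres

If the two locations match after the master has been shut down cleanly, the slave has everything.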


Thanks for reading.




Tuesday, 9 May 2017

PostgreSQL statistics as data

Postgres provides a lot of information when it comes to statistics. The only problem is that all the statistics are frozen in time the moment you query the system catalog views. Sometimes you need to see what's happening over time or in the past, and this is easily solvable with a script and a crontab entry. At OmniTI we use a project called system_monitoring a lot. This is basically a perl scheduler that runs as a daemon, exporting useful information from postgres (and more) to tab delimited text files for later investigation. If you have decent one-liner skills this is very helpful: you can answer questions about what happened yesterday night, what was running, who locked who etc. It's very good, but data analysis can be a pain in the ass sometimes, especially if you need to reimport the data into a database and analyze it with SQL. So I decided to approach this in a different way, by storing the statistics in a schema. That's why I wrote statz.
Statz is a very simple python script that will gather all postgres statistics for a period of time at a given interval, aggregate them, and keep both raw and aggregated data in its own schema. The raw data it keeps is just a select now(),* from the most important statistics tables:

  • pg_stat_user_indexes
  • pg_stat_activity
  • pg_locks
  • pg_stat_user_tables
  • pg_stat_database
  • pg_stat_bgwriter
But it's relatively easy to add more, even application specific data.

How to use :

statz needs a database to connect to (for now it's the same as the database it monitors), an interval and a total duration (interval and total duration should be in seconds).
So, something like:

./statz.py -c dbname=monkey -i 5 -t 120 

Would initialize a schema called statz (for now this is hardcoded);
if the schema exists, it will drop it (cascade) and recreate it.
In a loop that will last for <total time>, sleeping for <interval>, it will keep the output of select now(),* from the tables listed above, populating the following tables:

statz.backend_activity
statz.bgwriter_activity
statz.database_activity
statz.index_activity
statz.lock_activity
statz.table_activity
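
Each loop iteration effectively does something like the following for every monitored view (a sketch of the idea, not statz's literal code; it assumes the statz tables were created with the now() column first):

psql monkey <<'SQL'
-- one snapshot iteration, roughly (sketch)
INSERT INTO statz.database_activity SELECT now(), * FROM pg_stat_database;
INSERT INTO statz.table_activity    SELECT now(), * FROM pg_stat_user_tables;
INSERT INTO statz.backend_activity  SELECT now(), * FROM pg_stat_activity;
SQL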

In the same loop it will aggregate and populate the following aggregated tables:
statz.database_activity_agg
statz.table_activity_agg



The aggregated tables look like this.
For the database:


snap_date                 | 2017-05-08 16:35:17
interval                  | 00:00:02
step                      | 00:00:10
datname                   | monkey
commits                   | 863
rows_returned             | 21761
rows_fetched              | 9102
rows_inserted             | 1662
rows_updated              | 2251
rows_deleted              | 0
blocks_read               | 89
blocks_hit_cached         | 57572
commits_per_sec           | 432
rows_returned_per_sec     | 10881
rows_fetched_per_sec      | 4551
rows_inserted_per_sec     | 831
rows_updated_per_sec      | 1126
rows_deleted_per_sec      | 0
blocks_read_per_sec       | 45
blocks_hit_cached_per_sec | 28786
txn_per_sec               | 432
cache_hit_ratio           | 99.8456

and for each table :

snap_date                  | 2017-05-08 16:35:49
interval                   | 00:00:02
step                       | 00:00:42
table_name                 | public.pgbench_tellers
seq_scans                  | 0
seq_rows_read              | 0
index_scans                | 5807
index_rows_fetched         | 5807
rows_inserted              | 0
rows_updated               | 5807
rows_deleted               | 0
rows_hot_updated           | 5759
live_row_count             | 0
n_dead_tup                 | 634
seq_scans_per_sec          | 0
seq_rows_read_per_sec      | 0
index_scans_per_sec        | 5807
index_rows_fetched_per_sec | 2904
rows_inserted_per_sec      | 0
rows_updated_per_sec       | 2904
rows_deleted_per_sec       | 0
rows_hot_updated_per_sec   | 2880
live_row_count_per_sec     | 0
n_dead_tup_per_sec         | 317


I have also included a couple of views that are useful for fast monitoring :

statz.db_stats_per_sec 
statz.table_stats_per_sec

Keep in mind that the numbers are calculated using the last inserted row, and since this happens in the same loop that populates the historical tables, the information will be refreshed every <interval>.
Which makes this a great tool for things like :

watch -n 2 "psql -x -c 'select * from statz.database_activity_agg order by step desc limit 1;' monkey"
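
Under the hood the per-second numbers boil down to deltas between consecutive snapshots. Roughly, and assuming the raw now() column ended up being called snap_date, the commits/sec figure could be recomputed like this (a sketch):

psql monkey <<'SQL'
-- recompute commits/sec from the two latest raw snapshots (sketch)
WITH s AS (
    SELECT snap_date, xact_commit
    FROM statz.database_activity
    WHERE datname = 'monkey'
    ORDER BY snap_date DESC
    LIMIT 2
)
SELECT max(snap_date) AS snap_date,
       (max(xact_commit) - min(xact_commit))
         / EXTRACT(EPOCH FROM (max(snap_date) - min(snap_date))) AS commits_per_sec
FROM s;
SQL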

You can even create and send graphs generated from the collected data.

Use cases


Statz gathers a lot of data, so it's not meant to run 24/7. That's why it has an interval and a total time, and that's why it will destroy old statistical data when re-run.
Say that you observed that during the night queries are getting slower, or wal file generation is increased, or any other abnormality (locks / unknown statements / rollbacks etc). You can schedule statz to run for this period of time and analyze the data the next morning. Keeping snapshots of pg_stat_activity is also helpful for slow query investigation, where you can do things like:

select (snap_date - xact_start) as age, usename, query from statz.backend_activity order by 1 desc;

This script is also useful for analyzing application specific usage patterns, for getting statistics from benchmarks, and for fine tuning checkpoints, finding missing indexes etc.

And because it all happens in a single transaction, snap_date can be used as a key linking all statz tables.
This means that if, for example, you see the database doing a lot of commits, you can use snap_date to find the busy table in statz.table_activity, or check the locks or the statements running at that specific time in the rest of the tables.
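
For example, a drill-down from the busiest database snapshot to the tables that were being written at that moment could look like this (a sketch using the aggregated tables and columns shown above):

psql monkey <<'SQL'
-- find the snapshot with the highest commit rate and the tables that were busy then
WITH busy AS (
    SELECT snap_date
    FROM statz.database_activity_agg
    ORDER BY commits_per_sec DESC
    LIMIT 1
)
SELECT t.table_name, t.rows_inserted_per_sec, t.rows_updated_per_sec
FROM statz.table_activity_agg t
JOIN busy USING (snap_date)
ORDER BY t.rows_updated_per_sec DESC;
SQL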

The link for statz can be found here.

Next steps

I plan on the following things :
  • Refactor schema / statements for performance.
  • More aggregating statements for the rest of the tables.
  • Possibly (live) graphs or maybe a dashboard.
  • Possibly rewrite some parts for better general use.
  • Lately I'm reading a book about machine learning in python; I might experiment with that a bit.

Thanks for reading.

Wednesday, 25 January 2017

PostgreSQL upgrades - Methodology

Yesterday, I came across a conversation in the postgres slack channel: someone was interested in upgrading his database, and I realised that there is a pattern I see all the time. Those who are interested in upgrading major postgres versions only ask questions about pg_upgrade. Don't get me wrong, pg_upgrade is awesome, it's well documented, and it explains things much better than I can.
During my time at OmniTI we've done many... MANY upgrades, some of them pretty challenging,
so I have come to realise that the upgrade tool is not as important as the plan and the methodology, especially during an upgrade or a migration. For this reason this post won't be about how to use pg_upgrade; instead it will be all about methodology, minimising downtime and reducing risk to almost zero.
For this post I will be showing the steps I would follow in order to upgrade an 8.4.22 to 9.6.1. I will be using docker containers built from a dockerfile that I wrote for lab use (more info about this dockerfile can be found in this post). I will include a couple of "traps" that could potentially cause trouble to my upgrade, just so I can show how to work around them.
A typical postgres installation would have one master, one or more slaves and maybe a dev/test server.
For the needs of this post, I will concentrate on the master and on the test server. As a matter of fact, the test server will be much more important, because that's where we will develop the upgrade plan.
So, we need 2 containers:
An upgrade target host
An upgrade plan development server (aka a test machine)

So, let's assume that we have a master with the following objects that we want to upgrade:


postgres=# select version();

                                                          version
---------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 8.4.22 on x86_64-unknown-linux-gnu, compiled by GCC gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609, 64-bit
(1 row)

postgres=# \dt

                 List of relations
 Schema |          Name          | Type  |  Owner
--------+------------------------+-------+----------
 public | another_small_table    | table | postgres
 public | big_reporting_table    | table | postgres
 public | really_important_table | table | postgres
 public | small_oltp_table       | table | postgres
(4 rows)

postgres=# \dv

              List of relations
 Schema |       Name       | Type |  Owner
--------+------------------+------+----------
 public | my_stat_activity | view | postgres
(1 row)

First order of business would be to create a replica (or to dump and restore, if size permits) on the test server, which should run exactly the same postgres version.
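
If dump and restore is the route, it can be as simple as this (a sketch; the host name and connection options are just examples):

pg_dumpall -h production -U postgres > production_dump.sql
psql -f production_dump.sql postgres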

Next, you should check if the test server has the same configuration and if all contrib modules are installed (not calling them extensions because ... 8.4.x).
So compare the postgresql.conf of these 2 servers and compare the output of ls -l <postgres binary installation directory>/lib.
In my case I didn't have pg_buffercache, so I installed it.
At this point it is a good idea to write down the extensions you will need for 9.6, in my case just pg_buffercache.
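
The comparison mentioned above can be scripted, for example like this (a sketch; host names and paths are examples and will differ per installation):

diff <(ssh production "cat /path/to/pgdata/postgresql.conf") /path/to/pgdata/postgresql.conf
diff <(ssh production "ls -1 /path/to/pgsql/lib") <(ls -1 /path/to/pgsql/lib)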

To sum up, the steps are :


On the test server :



  1. Install exactly the same version as production
  2. Transfer the configs from production
  3. Install all the extensions that exist in production
  4. Create a replica from production
  5. Promote the replica to be a standalone master


At this point we should have a test environment that looks like production. From this moment on, everything should be timed and documented.


On the Test server



  1. Install the version you want to upgrade to, in my case 9.6.1, and don't put its binaries in the path (yet), while documenting and timing all the steps. DO NOT overwrite your old binaries.
  2. Port your configuration file to the new version. DO NOT copy the old config file. It's not just that some parameters have changed, or that some parameters are not compatible between an old and a new pg version; there are new settings commented out that are important and new defaults that you should tune before putting it in production.
  3. Initdb a cluster using the 9.6 binaries and test your configuration files, remembering to start it on a different port than 8.4. Keep the config files somewhere safe, ideally at the same place where you develop your migration plan.
  4. Stop 8.4 and 9.6 clusters
  5. Delete the previously created (for config test) cluster and re-initialize a new $PGDATA directory using the 9.6 binary. This will be your upgraded $PGDATA directory
  6. Test pg_upgrade (9.6 binary) like this : 

pg_upgrade -b ~/pgsql8/bin/ -B ~/pgsql9/bin/ -d ~/pgdata8 -D ~/pgdata9 -c
Performing Consistency Checks
-----------------------------
Checking cluster versions                                   ok
Checking database user is the install user                  ok
Checking database connection settings                       ok
Checking for prepared transactions                          ok
Checking for reg* system OID user data types                ok
Checking for contrib/isn with bigint-passing mismatch       ok
Checking for roles starting with 'pg_'                      ok
Checking for invalid "line" user columns                    ok
Checking for large objects                                  ok
Checking for presence of required libraries                 ok
Checking database user is the install user                  ok
Checking for prepared transactions                          ok

*Clusters are compatible*


Everything looks good, let's try to upgrade:


time pg_upgrade -b ~/pgsql8/bin/ -B ~/pgsql9/bin/ -d ~/pgdata8 -D ~/pgdata9 -k

Performing Consistency Checks
-----------------------------
Checking cluster versions                                   ok
Checking database user is the install user                  ok
Checking database connection settings                       ok
Checking for prepared transactions                          ok
Checking for reg* system OID user data types                ok
Checking for contrib/isn with bigint-passing mismatch       ok
Checking for roles starting with 'pg_'                      ok
Checking for invalid "line" user columns                    ok
Checking for large objects                                  ok
Creating dump of global objects                             ok
Creating dump of database schemas
                                                            ok
Checking for presence of required libraries                 ok
Checking database user is the install user                  ok
Checking for prepared transactions                          ok

If pg_upgrade fails after this point, you must re-initdb the
new cluster before continuing.

Performing Upgrade
------------------
Analyzing all rows in the new cluster                       ok
Freezing all rows on the new cluster                        ok
Deleting files from new pg_clog                             ok
Copying old pg_clog to new server                           ok
Setting next transaction ID and epoch for new cluster       ok
Deleting files from new pg_multixact/offsets                ok
Setting oldest multixact ID on new cluster                  ok
Resetting WAL archives                                      ok
Setting frozenxid and minmxid counters in new cluster       ok
Restoring global objects in the new cluster                 ok
Restoring database schemas in the new cluster
  postgres
*failure*

Consult the last few lines of "pg_upgrade_dump_11564.log" for
the probable cause of the failure.
Failure, exiting

real 0m7.011s
user 0m0.000s
sys 0m0.050s

This is why we HAVE to test and time things on a test server. If we check the log we'll see:


pg_restore: [archiver (db)] Error from TOC entry 142; 1259 16390 VIEW my_stat_activity postgres

pg_restore: [archiver (db)] could not execute query: ERROR:  column pg_stat_activity.procpid does not exist
LINE 14: ...CT (now() - pg_stat_activity.query_start) AS age, pg_stat_ac...

This error was intended to happen, just to show that things can go wrong, and that you don't want these kinds of things to happen in production. It's just a view that uses columns from pg_stat_activity that were renamed later on. The same thing can happen with some old extensions (like tsearch), so that's one more reason to test.


In order to fix this I would get the view definition, adjust it to work in 9.6, test it, and when I felt confident that the view works I would write a script that would look something like:

pg_ctl -D ~/pgdata8 -w start &&
psql -c "drop view my_stat_activity" postgres &&
pg_ctl -D ~/pgdata8 stop &&
rm -rf ~/pgdata9/* &&
~/pgsql9/bin/initdb -D ~/pgdata9/ &&
time pg_upgrade -b ~/pgsql8/bin/ -B ~/pgsql9/bin/ -d ~/pgdata8 -D ~/pgdata9 -k &&
cp ~/configs9/*.conf ~/pgdata9/ &&
~/pgsql9/bin/pg_ctl -D ~/pgdata9/ -w start &&
psql -c "create view my_stat_activity as select now()-query_start as age, pid, state, query from pg_stat_activity where query != '<IDLE>' order by 1 desc;" postgres &&
psql -c "create extension pg_buffercache" postgres

Don't pay too much attention to the code above, since it's not even tested; just keep a similar mindset: it's a really good idea to automate / script everything, as scripts are easy to test, easy to predict and easy to keep and change.

The steps for the code above would be :


  • start pg8
  • drop problematic view
  • stop pg8
  • empty pg9 data dir
  • initdb pg9 data dir
  • upgrade (this should be timed)
  • copy configs to pg9 config
  • start pg9
  • create view
  • create extensions


The upgrade is not done yet, though. The next step would be to analyze the database, and this can take a long time!

The last 2 steps would be to delete the old cluster and analyze the new one (because optimizer statistics are not kept by pg_upgrade). pg_upgrade provides 2 scripts that could be used for these steps:

analyze_new_cluster.sh

delete_old_cluster.sh

delete_old_cluster.sh can be run now, but before you run analyze_new_cluster.sh you could prioritise the work and maybe even run analyze commands in parallel. My dummy database has the following tables:


                 List of relations

 Schema |          Name          | Type  |  Owner
--------+------------------------+-------+----------
 public | another_small_table    | table | postgres
 public | big_reporting_table    | table | postgres
 public | really_important_table | table | postgres
 public | small_oltp_table       | table | postgres

At this point I could say that I want really_important_table analyzed asap, followed by another_small_table and small_oltp_table. big_reporting_table can be last, or run in a different process. My suggestion is to prioritise the important tables first by creating one or more scripts.

Remember that we are still working on the test server, so test and time these analyze scripts, e.g.:

time psql -c "analyze really_important_table;" postgres

time psql -c "analyze small_oltp_table;" postgres
time psql -c "analyze another_small_table;" postgres
time psql -c "analyze big_reporting_table;" postgres

In my opinion some of the most important things in an upgrade are :


  • Minimising risk of losing data
  • Minimising downtime
  • Providing accurate estimations on downtime to whoever needs to know.
  • Providing accurate estimations on analyze. Assuming you'll use hard links (-k), this is the step that will take the longest, and while the database lacks statistics, performance will be bad. (READ HOW -k WORKS BEFORE USING IT)
  • Not having any surprises.

I can't emphasise enough the importance of scripting, timing and testing any upgrade procedure. In my opinion it's more important than the production upgrade itself. Do these steps again and again till you are confident that your procedure:


  • Has been tested well enough
  • Has been documented well enough
  • Can't go wrong
  • Is so easy that you could even train your dog to do it.
This upgraded test database should now be used to thoroughly test the application against the new pg version.


When you start upgrading production, it's a good idea to keep two things in mind:


  • Keep a backup handy. If the database is large this is not as easy as it sounds, and restoration can take a long time, but if you can afford the safety of a backup, do it!!
  • ALWAYS have a rollback plan. Make sure a replica of the master is up with no replication lag, bring it down at the time of the upgrade, and be ready to move the application there just in case something goes south. Even if you don't have a replica in your stack, the upgrade is a good opportunity to get one, even temporarily until the upgrade is done. (protip: if you fail your application over to a slave, keep connections on your failed-to-upgrade master restricted, making sure no one will commit any transactions there).


Thanks for reading.

-- Vasilis

Tuesday, 20 December 2016

Traveling in time(lines) with pg_rewind

pg_rewind was introduced in 9.5; with this feature it became possible to make a server that is no longer a master follow a promoted standby that has a new timeline.
There are a few use cases that could benefit from this feature, to name a couple:

Testing Failovers: It's pretty common for one of our customers to ask for a failover simulation just to make sure that everything works. By using pg_rewind this is much easier and much faster, because we don't have to re-sync the whole database, which is sometimes multi-TB in size.

Minor version upgrades: during an upgrade you have to put the new binaries in place and restart the database. By using pg_rewind, if the application is flexible enough, you could upgrade the slave, switch over, upgrade the master and pg_rewind the old master to follow the promoted slave, possibly minimising the downtime even more.

Requirements.

pg_rewind has a few requirements:

1. data page checksums enabled (initdb -k)
or
2. the wal_log_hints parameter enabled
3. the old master (the one we want to re-sync) has to be properly shut down, or you'll get:
target server must be shut down cleanly 
Failure, exiting

How does it work ?

pg_rewind searches the old master's data directory, finds the data blocks changed during the switchover and then copies only the necessary blocks from the promoted slave. Keep in mind that the configuration files will also be copied from the new master, so look out for misconfigurations before you start it! Also, you have to have all the wal files since the last checkpoint from the old master. Changes are identified by comparing the state of the data blocks present in the data directory with the changes logged in the wal files. Once the differing blocks are identified, the wals are replayed.
For more info refer to the documentation here

How to do it

Recently I posted about how to use Docker to set up a lab environment; the link can be found here. I'm going to use two containers for this blogpost, using the image created from the dockerfile that I have. Keep in mind that in my Dockerfile I make sure that initdb is run with the -k option, that all the parameters are properly set for replication, and that archiving is enabled but not doing anything:

psql -c "show archive_command";
 archive_command
-----------------
 /bin/true
(1 row)

Before I set up my slave, I create a wal_archive directory and change the archive_command to simply archive there:
mkdir ~/wal_archive
psql -c "alter system set archive_command to 'cp %p /home/postgres/wal_archive/%f' ;" postgres
ALTER SYSTEM
psql -c "select pg_reload_conf() " postgres
 pg_reload_conf
----------------
 t
(1 row)
psql -c "select pg_switch_xlog() ;" postgres
 pg_switch_xlog
----------------
 0/7000078
(1 row)
psql -c "checkpoint;" postgres
CHECKPOINT

Now, from the slave:

mkdir ~/wal_archive ; ~/mk_replica.sh
waiting for server to shut down.... done
server stopped
29284/29284 kB (100%), 1/1 tablespace
server starting
LOG:  redirecting log output to logging collector process
HINT:  Future log output will appear in directory "pg_log".
-[ RECORD 1 ]----+--------
lag_bytes        | 0
pid              | 106
application_name | a_slave

Next, we promote the slave:
pg_ctl promote ; sleep 5 ; psql -c "select pg_is_in_recovery();" postgres
server promoting
pg_is_in_recovery
-------------------
 f
(1 row)

And now, time to pg_rewind, from the old master :

pg_ctl stop
waiting for server to shut down.... done
server stopped
pg_rewind --target-pgdata=/home/postgres/pgdata --source-server="host=10.0.0.3 port=5432 user=postgres dbname=postgres"

servers diverged at WAL position 0/F000098 on timeline 1
rewinding from last common checkpoint at 0/F000028 on timeline 1
Done!

Put a recovery.conf on the rewound master (now a slave) and start it up; check that it got connected to the new master and you should be all good.
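
A minimal recovery.conf sketch for the rewound master, assuming the promoted slave is at 10.0.0.3 and the repuser replication user from the Dockerfile (remember that pg_rewind also copied the new master's configuration files, so review them before starting):

cat > ~/pgdata/recovery.conf <<'EOF'
standby_mode = 'on'
primary_conninfo = 'user=repuser host=10.0.0.3 port=5432 application_name=old_master'
recovery_target_timeline = 'latest'
EOF
pg_ctl -D ~/pgdata start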

postgres=# SELECT pg_xlog_location_diff(pg_current_xlog_insert_location(), flush_location) AS lag_bytes,pid, application_name FROM pg_stat_replication;
 lag_bytes | pid | application_name
-----------+-----+------------------
         0 | 139 | old_master



Most of the production database clusters that I've come across don't use data checksums, which is understandable because of the performance penalty that checksums have (as an alternative to checksums, at OmniTI we use ZFS on OmniOS A LOT, and we love it).
Still, pg_rewind is a nice feature. Maybe it's not for every database, but for those who will use it, it might save a ton of time waiting to resync an old master.


Thanks for reading
-- Vasilis Ventirozos

Wednesday, 14 December 2016

Adding columns with default value fast

I recently came across a discussion in the #postgresql channel about how to add a new column that will represent record insert timestamps to a table, with minimal locking.

Normally you would do something like this :

monkey=# alter table alter_test add column insert_timestamp timestamp without time zone default now();
ALTER TABLE
Time: 13643.983 ms
monkey=# create index concurrently insert_timestamp_idx on alter_test (insert_timestamp) where insert_timestamp is not null;
CREATE INDEX
Time: 1616.108 ms

This, though, would require an AccessExclusiveLock on the table, and the application would have to wait for the duration of the alter.
What you could do instead is:

monkey=# alter table alter_test add column insert_timestamp timestamp without time zone;
ALTER TABLE
Time: 1.135 ms
monkey=# alter table alter_test alter column insert_timestamp set default now();
ALTER TABLE
Time: 1.238 ms
monkey=# create index concurrently insert_timestamp_idx on alter_test (insert_timestamp) where insert_timestamp is not null;
CREATE INDEX
Time: 1618.788 ms

As you can see the timings are completely different: in the first case the database has to change data in all pre-existing rows, which isn't really needed since the value will be a dummy anyway (it will default to the transaction time).
The second way also requires an AccessExclusiveLock, but it won't have to change any data in the old rows, so the lock should be almost instant (assuming no blockers exist).
In the next step you alter the table adding the default value, and from then on the new column will have the default value. If you don't want to have nulls, or you want to add a not null constraint to the column, you can update the old (null) rows in a more progressive and non-intrusive way later on.
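
The backfill of those pre-existing NULL rows could then be done in small batches, something like this (a sketch; the batch size and the ctid-based batching are just one way to do it, and the backfilled value is as much a dummy as the one the first approach would have written):

# backfill old NULL rows in small batches so no long-lived lock is taken
while true; do
  rows=$(psql -At -c "WITH batch AS (
                          SELECT ctid FROM alter_test
                          WHERE insert_timestamp IS NULL
                          LIMIT 10000)
                      UPDATE alter_test t SET insert_timestamp = now()
                      FROM batch WHERE t.ctid = batch.ctid
                      RETURNING 1;" monkey | wc -l)
  [ "$rows" -eq 0 ] && break
  sleep 1    # let concurrent traffic and autovacuum breathe between batches
done
# once no NULLs remain, a NOT NULL constraint can be added if needed:
# psql -c "ALTER TABLE alter_test ALTER COLUMN insert_timestamp SET NOT NULL;" monkey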

Thanks for reading
- Vasilis Ventirozos

Tuesday, 6 December 2016

Custom Postgres installation with Docker


Lately, I wanted to make a new test / lab environment that would be dispensable, fast to deploy and easily customisable. VMs are all nice for lab use and they have served me well over the years, but they tend to be slow(er) to deploy and heavy on resources.



So I decided to play around with Docker. I've played with Docker in the past but I hadn't done anything too serious; I just took the official postgres Dockerfile and ran it to see how it works. This is how I started my whole Docker exploration: I took the official postgres docker file that can be found here and made an image. Immediately I saw that it wasn't exactly what I wanted; I wanted something less generic. The things that I would like are:

  • latest postgres version, compiled from source
  • easily customisable pgdata and prefix
  • custom postgresql.conf settings (mostly having it replication ready)
  • custom shell environment
  • some extensions
  • data page checksums
  • easy replica deployment
So I started creating my own Dockerfile that would fill my needs, and at the same time I would learn some more about Docker, a two-birds-one-stone kind of thing. After several hours and some testing I came up with something that works (link at the bottom). It still needs some work but I plan on maintaining it for my own use, so feel free to use it if you like what I've done there.

Let me explain a bit how it works.
With:
ENV PGBINDIR=/home/postgres/pgsql
ENV PGDATADIR=/home/postgres/pgdata
I can customise the installation and the pgdata directories. I install all the packages I need; note that comments on joe will be deleted! Then I add the postgres user to sudoers (this dockerfile is not meant to be secure, ease of use is what I am looking for when it comes to a lab postgres instance).
I -always- get the latest postgres version sources, compile and install (including extensions, documentation etc), set some environment variables, configure postgres (postgresql.conf, pg_hba.conf), put a sample recovery.done there so slaves can get it from pg_basebackup, install some extensions in template1, initdb with checksums, create a replication user and finally copy a very simple replica creation script to the docker image.
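
In shell terms, the build boils down to something like the following (a rough sketch of the kind of steps the Dockerfile runs; the exact configure options, versions and user setup live in the real Dockerfile linked at the bottom):

# roughly what the image build does (sketch)
./configure --prefix=$PGBINDIR && make world && make install-world      # server, contrib, docs
$PGBINDIR/bin/initdb -k -D $PGDATADIR                                   # -k enables data checksums
$PGBINDIR/bin/pg_ctl -D $PGDATADIR -w start
psql -c "CREATE USER repuser WITH REPLICATION;" postgres                # replication user for the slaves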

How to make it work
With Docker installed it's pretty simple:

To create docker's own playground network:
docker network create --subnet=10.0.0.0/16 garden
To build the image, from the directory where you have the Dockerfile downloaded:
docker build -t pg_dev .
To run a container with a specific ip in our kindergarten docker network:
docker run --name master --net garden --ip 10.0.0.2 -i -t pg_dev /bin/bash
docker run --name slave1 --net garden --ip 10.0.0.3 -i -t pg_dev /bin/bash

and docker ps should look like this :

CONTAINER ID        IMAGE               COMMAND             CREATED              STATUS              PORTS               NAMES
7a38d4b11769        pg_dev              "/bin/bash"         About a minute ago   Up About a minute   5432/tcp            slave1
2c0266f942ea        pg_dev              "/bin/bash"         About a minute ago   Up About a minute   5432/tcp            master

Keep in mind that I started playing around with Docker just 2 days ago. There might be a better way to assign static ips that I'm not aware of (yet).

Now, say that you have put some data in the master and you want to create a replica on slave1; something like this should work:

From any slave container home dir :
postgres@7a38d4b11769:~$ ./mk_replica.sh
waiting for server to shut down.... done
server stopped
29044/29044 kB (100%), 1/1 tablespace
server starting
LOG:  redirecting log output to logging collector process
HINT:  Future log output will appear in directory "pg_log".

-[ RECORD 1 ]----+--------
lag_bytes        | 0
pid              | 37
application_name | a_slave


The Dockerfile can be found here and hopefully it can help some people set up a lab fast and easily.


Thanks for reading.
- Vasilis Ventirozos

Thursday, 19 May 2016

Repairing clog corruptions

Yesterday, I got a page from a client about a possible corruption issue for one of his clients, so I decided to share how I dealt with it. Before starting with how I approached the problem, I want to say that corruption doesn't just happen in postgres. At OmniTI we manage A LOT of databases, all with different major versions and on different operating systems, and in my time I haven't seen (many) cases where corruption happened because of a postgres bug (I've seen indexes getting corrupted, but I don't remember ever seeing data being corrupted). What I want to say is that corruption doesn't come out of nowhere: hardware failures, power outages and disks getting full are common causes.
A replica and backups should always be in place and the server should be properly monitored. Unfortunately this server was not managed by us, so none of the above was in place.

At first I saw in the logs entries like :


From the logs:
2016-05-18 15:50:06 EDT::@:[10191]: ERROR:  could not access status of transaction 128604706
2016-05-18 15:50:06 EDT::@:[10191]: DETAIL:  Could not open file "pg_clog/007A": No such file or directory.
2016-05-18 15:50:06 EDT::@:[10191]: CONTEXT:  automatic vacuum of table "pe.pg_toast.pg_toast_4409334"
...
much more to be expected...
...

At this point you know that something went wrong. Ideally you want to copy your whole $PGDATA somewhere else and work there, while at the same time you start considering uncompressing your most recent backup. In my case, there was no backup and not enough disk space to copy $PGDATA.
FUN TIMES :)

I started by fixing all the missing clogs that I found in the logs:
dd if=/dev/zero of=/var/db/pgdata/pg_clog/0114 bs=256k count=1
dd if=/dev/zero of=/var/db/pgdata/pg_clog/00D1 bs=256k count=1
dd if=/dev/zero of=/var/db/pgdata/pg_clog/0106 bs=256k count=1
...
...
keep creating until logs are clean, they can be a lot, in my case they were more than 100....
...
...
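
If you are wondering which segment a given error maps to, the transaction id in the message points straight at it; assuming the default 8kB block size, each 256kB pg_clog segment covers 1,048,576 transactions, so a quick sanity check looks like this:

xid=128604706                          # the xid from the error message above
printf "%04X\n" $(( xid / 1048576 ))   # prints 007A, the file the log complained about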


From the logs I also found the tables:

pg_toast.pg_toast_18454
pg_toast.pg_toast_35597
pg_toast.pg_toast_35607
pg_toast.pg_toast_4409334
pg_toast.pg_toast_4409344
pg_toast.pg_toast_8817516


db=# select relname,relnamespace from pg_class where oid in (18454,35597,35607,4409334,4409344,8817516) order by relnamespace;
   relname   | relnamespace
-------------+--------------
 table_case1 |        16872
 table_case1 |        16874
 table_case2 |        16874
 table_case2 |      4409063
 table_case1 |      4409063
 table_case2 |      8817221
(6 rows)

db=# select oid,nspname from pg_namespace where oid in (16872,16874,16874,4409063,8817221) order by oid;
   oid   |  nspname
---------+------------
   16872 | schema1
   16874 | schema2
 4409063 | schema3
 8817221 | schema4
(4 rows)

With a join I found the schema.table pairs (a sketch of the join follows this list):
schema1.table_case1
schema2.table_case1
schema2.table_case2
schema3.table_case2
schema3.table_case1
schema4.table_case2
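
For reference, the join mentioned above could look something like this (a sketch, using the oids taken from the toast table names):

psql db <<'SQL'
SELECT n.nspname || '.' || c.relname AS qualified_name
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.oid IN (18454, 35597, 35607, 4409334, 4409344, 8817516)
ORDER BY 1;
SQL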

Basically we have an application with multiple schemas and 2 kinds of tables were corrupted across 4 schemas.

For table_case1 (simple case, table not referenced by anyone):
For each schema:

set search_path to schema1;
create table badrows (badid int);

DO $f$
declare
    curid INT := 0;
    vcontent TEXT;
    badid INT;
begin
FOR badid IN SELECT id FROM table_case1 LOOP
    curid = curid + 1;
    if curid % 100000 = 0 then
        raise notice '% rows inspected', curid;
    end if;
    begin
        SELECT id
        INTO vcontent
        FROM table_case1 where id = badid;
    exception
        when others then
            insert into badrows values (badid);
raise notice 'data for id % is corrupt', badid;
            continue;
    end;
end loop;
end;
$f$;

(This script was taken from Josh Berkus's blog and modified to fit my needs.)

create table table_case1_clean as select * from table_case1
where id not in (select badid from badrows);

TRUNCATE table_case1;
vacuum full verbose table_case1;
insert into table_case1 select * from table_case1_clean;
vacuum full analyze verbose table_case1;
drop table badrows;


For table_case2 (this one is referenced by 2 other tables):
For each org (schema):

set search_path to schema2;

create table badrows (badid int);

DO $f$
declare
    curid INT := 0;
    vcontent TEXT;
    badid INT;
begin
FOR badid IN SELECT id FROM table_case2 LOOP
    curid = curid + 1;
    if curid % 100000 = 0 then
        raise notice '% rows inspected', curid;
    end if;
    begin
        SELECT id
        INTO vcontent
        FROM table_case2 where id = badid;
    exception
        when others then
            insert into badrows values (badid);
            raise notice 'data for id % is corrupt', badid;
            continue;
    end;
end loop;
end;
$f$;


create table table_case2_clean as select * from table_case2
where id not in (select badid from badrows);

alter table reference_table1 drop constraint reference_table1_fkey;
delete from reference_table1 where table_case2_id in (select badid from badrows) ;

alter table reference_table2 drop constraint reference_table2_fkey;
delete from reference_table2 where table_case2_id in (select badid from badrows);

TRUNCATE table_case2;
vacuum full verbose table_case2;
insert into table_case2 select * from table_case2_clean;
vacuum full analyze verbose table_case2;

ALTER TABLE ONLY reference_table1
    ADD CONSTRAINT reference_table1_fkey FOREIGN KEY (table_case2_id) REFERENCES table_case2(id) ON DELETE CASCADE;

ALTER TABLE ONLY reference_table2
    ADD CONSTRAINT reference_table2_fkey FOREIGN KEY (table_case2_id) REFERENCES table_case2(id);

drop table badrows;

(Please ignore or report any typos here; I replaced the real table names while writing this post, so I might have messed up some names.)


What we basically did here was to recreate the table without the corrupted rows.
After this, the tables should be corruption free with the minimum possible data loss.
To ensure that you are corruption free you should either pg_dump and restore, or vacuum full everything; a normal vacuum will NOT show corruptions:


pe=# vacuum verbose schema1.table_case1;
INFO:  vacuuming " schema1.table_case1"
INFO:  index " schema1.table_case1_pkey" now contains 12175 row versions in 36 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.15 sec.
INFO:  " table_case1": found 0 removable, 12175 nonremovable row versions in 258 out of 258 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.17 sec.
INFO:  vacuuming "pg_toast.pg_toast_18454"
INFO:  index "pg_toast_18454_index" now contains 51370 row versions in 143 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "pg_toast_18454": found 0 removable, 51370 nonremovable row versions in 12331 out of 12331 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.05s/0.03u sec elapsed 0.09 sec.
VACUUM
pe=#
pe=#
pe=# vacuum full verbose  schema1.table_case1;
INFO:  vacuuming " schema1.table_case1"
ERROR:  missing chunk number 0 for toast value 9270408 in pg_toast_18454



Rows were lost; in my case that was acceptable, and maybe your case is not the same as mine, but I hope this will provide some guidance in case you get into a similar situation.


Thanks for reading
- Vasilis