Tuning checkpoints

We recently had the chance to help a customer with some IO-related issues that turned out to be caused by unconfigured checkpoints. This may not always be obvious, but it is actually fairly common.

Let's start with how things roughly work.
Postgres' smallest IO unit is a disk block, 8kB by default. Each time Postgres needs a block, it fetches it from disk and loads it into an area of RAM called shared_buffers.
When Postgres needs to write, it works in the same manner:
  • Fetches the block(s) from disk if they are not already in shared_buffers.
  • Changes the page in shared_buffers.
  • Marks the page as changed (dirty) in shared_buffers.
  • Writes the change to a sequential ledger of changes, the WAL, to ensure durability.
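
If you want to see those dirty pages piling up, the pg_buffercache contrib extension exposes the contents of shared_buffers. A minimal sketch, assuming the extension is available and you have the privileges to read it:

CREATE EXTENSION pg_buffercache;

-- pages sitting in shared_buffers that have not been written back to the data files yet
SELECT count(*) FILTER (WHERE isdirty) AS dirty_buffers,
       count(*)                        AS total_buffers
FROM pg_buffercache;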

This basically means that the writes are not yet "on disk" in the data files. That part is taken care of by a Postgres process called the checkpointer. Checkpoints are how Postgres guarantees that data files and index files have been updated with all the changes that happened before that checkpoint. In case of a crash, Postgres will go back to the latest checkpoint record and start a REDO operation from the WAL. Checkpoints are triggered every checkpoint_timeout (default 5min) or when the accumulated changes reach max_wal_size (default 1GB). This is an IO-intensive operation, and Postgres tries to spread this IO out with checkpoint_completion_target (default 0.5).

checkpoint_timeout*: maximum time between checkpoints, in seconds.
min_wal_size: WAL kept below this size is recycled rather than removed at a checkpoint.
max_wal_size**: maximum size the WAL is allowed to grow to between checkpoints.
checkpoint_completion_target: spreads the checkpoint writes over a longer period of time, making the final fsync() much cheaper.
* Affects recovery time; change only after reviewing the documentation.
** This is a soft limit; WAL can exceed this value in special cases.
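
You can check what your instance is currently running with by querying pg_settings, for example:

SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('checkpoint_timeout', 'checkpoint_completion_target',
               'min_wal_size', 'max_wal_size');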

The best way to start is to set checkpoint_timeout to something reasonable and max_wal_size high enough that it won't be reached before the timeout. To get a feeling for what is reasonable, you can do the following: schedule (a cronjob will do) something like this to run at short intervals, say every minute:

psql -XqtA -c "copy (select now()::timestamp(0), pg_current_wal_insert_lsn()) to stdout with csv;" monkey >> current_lcn
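
A minimal crontab entry for this could look like the sketch below; in a real crontab you would want absolute paths for psql and the output file, and monkey is just the example database name from above:

# sample the WAL insert location once a minute
* * * * * psql -XqtA -c "copy (select now()::timestamp(0), pg_current_wal_insert_lsn()) to stdout with csv;" monkey >> current_lcn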

Leave it running for as long as you see fit. From the result you can extract the difference between two locations in bytes like this:

monkey=# SELECT pg_size_pretty(pg_wal_lsn_diff('0/B1277248','0/59CEA2F8'));
 pg_size_pretty
----------------
 1398 MB
(1 row)

(protip: file_fdw + window function)
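
That protip in practice could look something like this sketch: load the collected CSV with file_fdw and use lag() to get the WAL produced between consecutive samples. The server and table names and the file path are made up; adjust the filename option to wherever current_lcn lives.

CREATE EXTENSION file_fdw;
CREATE SERVER wal_log FOREIGN DATA WRAPPER file_fdw;

CREATE FOREIGN TABLE wal_positions (
    taken_at timestamp,
    lsn      pg_lsn
) SERVER wal_log
OPTIONS (filename '/path/to/current_lcn', format 'csv');

-- WAL generated between consecutive samples
SELECT taken_at,
       pg_size_pretty(pg_wal_lsn_diff(lsn, lag(lsn) OVER (ORDER BY taken_at))) AS wal_per_minute
FROM wal_positions
ORDER BY taken_at;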

This function calculates the difference between the later and the earlier location in bytes, so having one location per minute lets you calculate (or graph) the rate of change over time.
With that number, your storage capabilities, and your recovery needs in mind, you should be able to come up with a good starting point.
max_wal_size should be set high enough that it won't be reached before the timeout; the rate-of-change figure we calculated earlier is a good indication of where to start. min_wal_size has to follow common sense and leave a small portion of WAL files to be recycled for the next checkpoint.
checkpoint_completion_target is often just set to 0.9, but if you want to be more "precise" it follows this rule:
(checkpoint_timeout - 2min) / checkpoint_timeout
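
For example, with checkpoint_timeout set to 30min that works out to (30 - 2) / 30 ≈ 0.93. Putting it all together, a starting point in postgresql.conf could look something like this; the values are purely illustrative and should come out of your own rate-of-change measurement and your recovery requirements:

# illustrative starting values only -- derive yours from the WAL rate measured above
checkpoint_timeout = 30min            # affects crash recovery time
max_wal_size = 16GB                   # comfortably above the WAL produced in one checkpoint_timeout
min_wal_size = 2GB                    # keep some WAL segments around for recycling
checkpoint_completion_target = 0.9    # ~(30 - 2) / 30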

To watch how the checkpointer is doing after the changes, you can query pg_stat_bgwriter. This is a very interesting view that probably deserves its own blogpost because of the data you can extract from it, but today we are going to concentrate on two columns: checkpoints_timed and checkpoints_req.

checkpoints_timed counts the checkpoints triggered by checkpoint_timeout, and checkpoints_req counts the checkpoints that happened because max_wal_size was hit.
What you want to see there is the majority of the checkpoints being timed.
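
A quick way to check the ratio is something like the query below; if you want to measure only the period after your changes, you can reset the counters first with pg_stat_reset_shared('bgwriter'):

SELECT checkpoints_timed,
       checkpoints_req,
       round(100.0 * checkpoints_timed
             / nullif(checkpoints_timed + checkpoints_req, 0), 1) AS pct_timed
FROM pg_stat_bgwriter;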

And because graphs are cool, you can see an example of a production system before and after checkpoint configuration. The graph shows write IOPS over time on an AWS RDS database.
In RDS specifically, I find it weird that the checkpoint defaults are so low considering the high shared_buffers default it has.

[graph: write IOPS over time on RDS, before and after the checkpoint changes]

Thanks for reading
-- Vasilis Ventirozos
-- credativ LLC



Comments

  1. I have a big buffers_clean and small buffers_checkpoint any idea?
    buffers_checkpoint | buffers_clean | buffers_backend
    2939561 | 19872289 | 6015787

  2. There are checkpoints_timed and checkpoints_req columns in the view, not checkpoint_timed and checkpoint_req.

