Parallel pg_dump backups

As I mentioned in a previous post, I'm planning to write about the features that 9.3 brought to the game. Today I will explore parallel pg_dump: how it works, how we can benefit from it, and how it compares to classic pg_dump.

First of all, pg_dump supports parallel dumps only when -Fd (directory) is used, because the directory format is the only one that allows multiple processes to write data at the same time. A directory dump writes one file per relation plus a toc.dat file; it is similar to -Fc in that the output is compressed, and it supports parallel and selective restore.
The pg_dump switch that enables parallel dumps is -j <njobs>; when used, pg_dump will open njobs connections plus one for the master process.
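
For illustration, a parallel directory dump and a quick inspection of it could look like this (the directory name out.dir and database name mydb are just examples):

> pg_dump -Fd -j 4 -f out.dir -d mydb    # 4 worker connections + 1 master
> pg_restore -l out.dir                  # list the TOC entries of the dump

Inside out.dir you will find toc.dat plus one compressed data file per relation, which is exactly what lets several workers write independently.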

The test:
For this test I created 10 tables with 20 million rows each. As a baseline I will use -Fc, and then -Fd -j, increasing the number of jobs by 1 each time. The disk used was a simple 7.2k rpm SATA3 disk connected over USB3.
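
The exact schema doesn't matter much for the timings; a minimal sketch of how such tables could be generated (the table names and columns here are my own illustration, not necessarily the exact schema used):

> createdb dump
> for i in $(seq 1 10); do
>   psql -d dump -c "create table t${i} as select g as id, md5(g::text) as data from generate_series(1,20000000) g;"
> done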

> time pg_dump -Fc -f out.sql -d dump
real    4m58.831s
---------
Now with -Fd -j:
> time pg_dump -Fd -j 1 -f out.sql1 -d dump
real    1m10.280s
> time pg_dump -Fd -j 2 -f out.sql2 -d dump
real    0m43.596s
> time pg_dump -Fd -j 3 -f out.sql3 -d dump
real    0m38.801s
> time pg_dump -Fd -j 4 -f out.sql4 -d dump
real    0m33.708s
> time pg_dump -Fd -j 5 -f out.sql5 -d dump
real    0m31.040s
> time pg_dump -Fd -j 6 -f out.sql6 -d dump
real    0m30.501s
> time pg_dump -Fd -j 60 -f out.sql60 -d dump
real    0m30.522s

Obviously, the last run was just to show that you can't throw an arbitrary degree of parallelism at it: on an i7 2600 (4 cores + 4 hyper-threads), going beyond 6 concurrent jobs made no difference. Still, we started at 1m10s with a single job and ended up at less than half that time.

Now, this feature is by nature not equally useful for everyone: in my test I had 10 equal tables with 20M rows each; if I had one large table and several small ones it wouldn't scale the same way, since a single table can only be dumped by one worker. Also, I was talking with a friend about this feature and he said: "Who runs pg_dump nowadays anyway? Everyone is doing live backups."
That's partially true, but pg_dump will always have its uses.
I think this feature is great; maybe not the greatest for terabytes of enterprise databases, but it's good for those who use pg_dump for backups: test systems, installations tight on disk space, databases with little network bandwidth.
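
And since the directory format also supports parallel and selective restore, the same dumps pay off again at restore time. A sketch, reusing one of the dumps from the test above (the target database name and the table name t1 are illustrative):

> createdb dump_restored
> pg_restore -j 4 -d dump_restored out.sql4    # parallel restore of the whole dump

Or, to selectively restore just one table from the same dump:

> pg_restore -t t1 -d dump_restored out.sql4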



Thanks for reading
- Vasilis
