Backing up PostgreSQL in HDFS
There are a number of ways to back up a PostgreSQL database; some are standard, and some simply demonstrate the power of open source and what you can do if you put your creativity and imagination to use. At OmniTI, we use the OmniPITR tool to manage WAL files and to run backups on secondary databases instead of the primary, reducing the load on the primary during backups. In this post, I will combine OmniPITR and Hadoop to accomplish something very neat: storing your backups in HDFS (Hadoop Distributed File System).
You might be asking: why? HDFS is rock-solid reliable, it has an extremely low cost per byte, and it can sustain around 2 Gbit per node, scaling in aggregate to more than a TB per second. It has been proven by internet giants across a wide variety of use cases.
Let's say that you have a 1TB database running; an uncompressed backup will need 1TB of reliable storage just to keep one copy. HDFS has the great advantage of using cheap hardware while being fault tolerant at the same time.
Imagine adding cheap SATA disks to the company workstations and keeping the database backups there. Now, imagine explaining that to the next SAN salesman who comes to sell you storage. - FUN TIMES !
Let me give some details about the tools used:
Hadoop:
It's open-source software for reliable, scalable, distributed computing.
The project includes these modules:
- Hadoop Common: The common utilities that support the other Hadoop modules.
- Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
- Hadoop YARN: A framework for job scheduling and cluster resource management.
- Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
OmniPITR:
OmniPITR, written and maintained by OmniTI, is a set of scripts to ease setting up WAL replication and making hot backups from both Master and Slave systems.
This set of scripts has been written to make the operation seamless, secure, and as light on resource usage as possible.
Now, you must be wondering: why am I talking about it now? Depesz has recently added a new feature called dst-pipes (pipe to program). This feature is limited only by the user's imagination; as an example of it, Depesz did gpg-encrypted backups in one operation! I am going to use the same feature to do something entirely different.
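Just to give an idea of what a dst-pipes target looks like, here is a minimal sketch in the spirit of that gpg example; the script name, recipient, and output path are made up for illustration and are not part of OmniPITR:

#!/usr/bin/env bash
# gpg_bck.sh (hypothetical): encrypt the backup stream arriving on stdin
# and write it to local disk under the name passed as the first argument.
gpg --encrypt --recipient backups@example.com --output /mnt/backups/"$1".gpg -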
Setup:
Software versions: PostgreSQL 9.3 (beta2), Hadoop 1.2.1
For the setup, I have used a single VM for both Hadoop and PostgreSQL. The reason I am using just one VM is that there are really no limitations in the topology: Hadoop and Postgres are two entirely different ecosystems, so nothing constrains you. PostgreSQL might be running in a cascading multi-slave environment, while Hadoop is running, as I said, on desktops. Other than Postgres and Hadoop, you have to git clone OmniPITR. Depesz has put effort into making it self-sufficient, so no extra Perl modules are needed. The only thing I had to add to a minimal (net install) Debian installation was rsync.
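For reference, fetching everything onto such a minimal Debian box boils down to something like the following; the GitHub URL and the /usr/local/omnipitr target directory are assumptions that simply match the paths used later in this post:

# Install the one extra package needed, then fetch OmniPITR.
# Repository location and target directory are examples; adjust to taste.
apt-get install rsync git
git clone https://github.com/omniti-labs/omnipitr.git /usr/local/omnipitr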
Configuration:
In postgresql.conf, enable archiving, and in archive_command put something like:
archive_command = '/usr/local/omnipitr/bin/omnipitr-archive -dl /mnt/archives/xlog -s /usr/local/omnipitr/state -db /mnt/backup/xlog -l /usr/local/omnipitr/log/omnipitr-^Y-^m-^d.log -v "%p"'
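Note that archive_command alone is not enough; PostgreSQL 9.3 also needs the usual archiving settings in postgresql.conf. A minimal sketch (the values are examples, and you would pick wal_level according to whether you also feed streaming replicas):

# postgresql.conf - minimal archiving-related settings (sketch)
wal_level = archive        # or hot_standby if streaming replicas are in play
archive_mode = on          # changing this requires a restart
archive_command = '...'    # the omnipitr-archive invocation shown above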
Now, you may have a different archive_command; I really suggest using OmniPITR, as it works well and has been tested in very large installations. You can do cool stuff like sending the archives to a remote location, gzipping them, keeping combinations of local gzipped copies and remote ones, etc. You can read about OmniPITR's features in the documentation on GitHub.
Let's
make a script:
#!/usr/bin/env
bash
hadoop
dfs -put - /user/hduser/$1
This script lets Hadoop read a file from stdin and put it into the HDFS namespace under the name given as its first argument.
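Before wiring the script into OmniPITR, it is worth a quick standalone test; the test filename here is just an example:

chmod +x ~/test/hadoop_bck.sh
# Pipe something trivial through the script and confirm it lands in HDFS.
echo "hello hdfs" | ~/test/hadoop_bck.sh ping.txt
hadoop dfs -cat /user/hduser/ping.txt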
Bring up PostgreSQL with the above archive_command and with Hadoop running on the node.
Let's try to back up straight into Hadoop:
/usr/local/omnipitr/omnipitr-backup-master -D $PGDATA -l dl-backup.log -dp ~/test/hadoop_bck.sh -x /mnt/backup/xlog -v
-dp is the key switch here: it uses the new OmniPITR feature, 'dst-pipes'. OmniPITR will run the given program and stream the backup to its stdin, which in this case is the script we just made.
To verify the backup is actually there:
$ hadoop dfs -ls /user/hduser
Found 2 items
-rw-r--r--   1 vasilis supergroup  115322880 2013-08-22 23:18 /user/hduser/dblab1-data-2013-08-22.tar
-rw-r--r--   1 vasilis supergroup   33566720 2013-08-22 23:18 /user/hduser/dblab1-xlog-2013-08-22.tar
You might want to refer to the Hadoop documentation for how to manipulate and organize these files, but the backup is there:
vasilis@dblab1:~/test$ hadoop dfs -copyToLocal /user/hduser/dblab1-data-2013-08-22.tar
vasilis@dblab1:~/test$ ls -l ~/dblab1-data-2013-08-22.tar
-rw-r--r-- 1 vasilis vasilis 115322880 Aug 23 18:06 /home/vasilis/dblab1-data-2013-08-22.tar
To restore the cluster, you would need something like:
hadoop dfs -cat /user/hduser/dblab1-data-2013-08-22.tar | tar xv
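A slightly fuller restore sketch might look like the following; the target directory, the extra xlog tarball extraction, and the recovery.conf remark are my assumptions about a typical restore rather than anything OmniPITR mandates:

# Restore both tarballs into a fresh directory, then fix ownership.
# Paths are examples only.
mkdir -p /var/lib/postgresql/9.3/restore && cd /var/lib/postgresql/9.3/restore
hadoop dfs -cat /user/hduser/dblab1-data-2013-08-22.tar | tar xv
hadoop dfs -cat /user/hduser/dblab1-xlog-2013-08-22.tar | tar xv
chown -R postgres:postgres /var/lib/postgresql/9.3/restore
# For point-in-time recovery you would additionally set up a recovery.conf
# whose restore_command pulls WAL segments from your archive location.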
For our example, we had no need for compression; gzipping the output is supported by OmniPITR, and you may also do that in the script that pushes the tar to HDFS.
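If you would rather compress on the way into HDFS, a variant of the earlier script along these lines should do it (a sketch, reusing the same assumed HDFS path):

#!/usr/bin/env bash
# hadoop_bck_gz.sh (hypothetical variant): gzip the backup stream from stdin
# before storing it in HDFS under the given name with a .gz suffix.
gzip -c | hadoop dfs -put - /user/hduser/"$1".gz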
This is just the tip of the iceberg, a proof of concept; the tools are there and everything is stable.
Use your imagination, save your company some money, and be the hero !!!
Thanks for reading !!