
Showing posts from 2013

pgreplay-ing logfiles

Recently I wanted to find something to rerun a workload (from logfiles) in Postgres. The reasons? Benchmarking, testing, troubleshooting. My colleague Denish reminded me of a project called pgreplay . pgreplay reads a PostgreSQL logfile, extracts the SQL statements and executes them in the same order and with (or without) the original timing against a PG database. I downloaded it, compiled it and went straight to testing. The setup:
log_min_messages = error  (or more)    (if you know that you have no cancel requests, 'log' will do)
log_min_error_statement = log  (or more)
log_connections = on
log_disconnections = on
log_line_prefix = '%m|%u|%d|%c|'  (if you don't use CSV logging)
log_statement = 'all'
lc_messages must be set to English (the encoding does not matter)
bytea_output = escape  (from version 9.0 on, only if you want to replay                   ...
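To give an idea of the replay step itself, here is a minimal sketch of how a run could look once the logfile has been collected; the host, port and logfile path are placeholders, and the flags are the host/port replay options as described in the pgreplay documentation, so double-check them against your version.

# parse the collected logfile and replay it against a test server,
# keeping the original timing between statements
pgreplay -h testbox -p 5432 /var/lib/pgsql/data/pg_log/postgresql.log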

A Problem Geeked Into Big Data

Today my article got published in OmniTI Seeds . It was a very interesting project, I had a lot of fun doing it, and even if it's not exactly Postgres related, I wouldn't have been able to do it without Postgres. It began from a basic everyday operation that all DBAs do (log parsing / analysis) but it got to a whole different level. I hope you like it. A Problem Geeked Into Big Data

cooking with fdw

Back in 9.1 a feature called foreign data wrappers was introduced: you could basically connect a PostgreSQL server to remote data stores (other DBs, CSV files etc). Today I plan to play with FDWs and see what we can use them for and how. First of all, there are many FDWs; a list can be found here . For the sake of this post I used the following data wrappers:

     Name     | Version |   Schema   |                    Description
--------------+---------+------------+----------------------------------------------------
 file_fdw     | 1.0     | public     | foreign-data wrapper for flat file access
 postgres_fdw | 1.0     | ...
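As a taste of what the post walks through, here is a minimal sketch of wiring up postgres_fdw; the server address, database, user mapping and table names are made up for the example.

CREATE EXTENSION postgres_fdw;

CREATE SERVER remote_pg
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host '192.168.0.10', dbname 'monkey', port '5432');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER remote_pg
    OPTIONS (user 'vasilis', password 'secret');

-- a local "window" onto a table that actually lives on the remote server
CREATE FOREIGN TABLE remote_sales (
    sales_id   integer,
    sales_date date
) SERVER remote_pg OPTIONS (table_name 'sales');

SELECT count(*) FROM remote_sales;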

scaling postgres with pgpool-II

Recently a new version of pgpool-II, 3.2.7, was released (2013-12-06). For connection pooling I mostly have experience with pgbouncer, and at OmniTI we use it a lot. I have heard that pgpool has some issues, and because of its increased complexity I don't know if I would recommend it purely as a connection pool, but pgpool does much more than that. What is pgpool-II (from the pgpool wiki): pgpool-II is a middleware that works between PostgreSQL servers and a PostgreSQL database client. It is licensed under the BSD license. It provides the following features. Connection Pooling: pgpool-II saves connections to the PostgreSQL servers, and reuses them whenever a new connection with the same properties (i.e. username, database, protocol version) comes in. It reduces connection overhead, and improves the system's overall throughput. Replication: pgpool-II can manage multiple PostgreSQL servers. Using the replication function enables creating a realtime backup on 2 or more physic...
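For context, the connection pooling part boils down to a handful of pgpool.conf settings; the values below are only an illustrative sketch (hostnames and sizes are placeholders), not a recommended configuration.

# pgpool.conf (sketch)
listen_addresses = '*'
port = 9999
backend_hostname0 = '192.168.0.100'
backend_port0 = 5432
backend_weight0 = 1
num_init_children = 32        # pooled client connections
max_pool = 4                  # pools kept per child process
connection_cache = on         # reuse backend connections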

Parallel pg_dump backups

As I said in a previous post, I'm planning to write about the features that 9.3 brought to the game. Today I will explore parallel pg_dump: how it works, how we can benefit from it and how it compares to classic pg_dump. First of all, pg_dump supports parallel dump only if -Fd (directory) is used; this is because the directory dump is the only format that allows multiple processes to write data at the same time. The directory format writes one file per relation plus a toc.dat file; it is similar to -Fc, with the output being compressed, and it supports parallel and selective restore. The pg_dump switch that implements parallel dump is -j <njobs>; when used, pg_dump will create n connections + 1 (the master process). The test: for this test I created 10 tables with 20 million rows each. As a baseline I will use -Fc and then I will use -Fd -j, increasing the number of jobs by 1 each time. The disk that was used was a simple 7.2k rpm SATA3 disk connected over USB3. >...
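A sketch of the two kinds of invocation compared in the test (paths and database names are placeholders):

# classic custom-format dump (the baseline)
pg_dump -Fc -f /backups/monkey.dmp monkey

# directory-format dump with 4 parallel jobs
pg_dump -Fd -j 4 -f /backups/monkey_dir monkey

# the directory format can also be restored in parallel
pg_restore -j 4 -d monkey_restore /backups/monkey_dir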

The cost of Reliability

On the 9th of September PostgreSQL 9.3 was released with many exciting new features. This post is the first out of many (?) that will explore these new features and try to evaluate them. I will start with data checksums, a feature that makes it possible for PostgreSQL to checksum data pages and report corruptions. I have always liked how PostgreSQL doesn't cut corners when it comes to reliability, even when it comes with a cost in performance (and it usually does). But how does this work and how much is this cost? That's why I'm writing this post :) Checksums are set immediately prior to flushing pages out of shared buffers and checked when pages are read in again. I used 2 db clusters, one with the feature enabled and one without it, the same config file, and 3 basic operations: 1 insert, 2 updates. The results:
No checksums:
monkey_no_sum=# insert into test_of_sum (id) select generate_series (1,10000000);
INSERT 0 10000000
Time: 26226.654 ms
monkey_no_sum=# update test_of_sum se...
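For reference, data checksums are enabled for the whole cluster at initdb time; a minimal sketch of how the two clusters used in the test could be created (data directory paths are placeholders):

# cluster without checksums (the default)
initdb -D /data/monkey_no_sum

# cluster with data checksums enabled (new in 9.3)
initdb -k -D /data/monkey_sum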

Backing up PostgreSQL in HDFS

There are a number of ways to back up a PostgreSQL database; some are standard and some just demonstrate the power of open source and the things you can do if you put your creativity and imagination to use. At OmniTI , we use the OmniPITR tool to manage WAL files and run backups on secondary databases instead of the primary, to reduce load during backup. In this post, I will discuss using OmniPITR and Hadoop to accomplish something very neat: storing your backups in HDFS (Hadoop Distributed File System).  You might be asking why? HDFS is rock-solid reliable, it has extremely low cost per byte and it can get 2Gbit per computer, scalable up to more than a TB per second. It has been proven by internet giants running a big variety of different use-cases. Let's say that you have a 1TB database running; an uncompressed backup will need 1TB of reliable storage just to keep one copy. HDFS has the great advantage of using cheap hardware and being fault tolerant at the same time...
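The general idea can be sketched with plain Hadoop CLI calls; this is not the exact OmniPITR setup from the post, just an illustration of pushing a compressed base backup into HDFS (the data directory and HDFS path are made up).

# on the standby: tar the data directory, compress it on the fly
# and stream it straight into HDFS without a local staging copy
tar -czf - /var/lib/pgsql/9.3/data | \
    hdfs dfs -put - /backups/pg/base_backup_$(date +%Y%m%d).tar.gz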

Setting shared_buffers the hard way

One of the main performance parameters in PostgreSQL is shared_buffers, probably the most important one. There are guidelines and rules of thumb that say just set it to 20-30% of your machine's total memory. Don't get me wrong, these rules are generally good and a perfect starting point, but there are reasons why you should tune your shared buffers in more detail:
a. 30% might not be enough and you will never know, if you don't know exactly how to set this parameter
b. 30% might be a lot and you spend resources in vain.
c. you want to be awesome and tune every bit of your DB the optimal way.
What is shared buffers though? Shared buffers defines a block of memory that PostgreSQL will use to hold requests that are awaiting attention from the kernel buffer and CPU. So basically PostgreSQL will temporarily put data blocks in memory in order to process them - EVERYTHING goes through the shared buffers. Why not set shared_buffers to 80% of ram in a DB dedicated se...
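One common way to look at what is actually sitting in shared buffers is the pg_buffercache contrib extension; a rough sketch (the join is simplified and assumes the default 8kB block size):

CREATE EXTENSION pg_buffercache;

-- top relations by how much of shared_buffers they currently occupy
SELECT c.relname,
       count(*) AS buffers,
       pg_size_pretty(count(*) * 8192) AS buffered
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;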

PostgreSQL backup and recovery

One of the main tasks of any kind of administrator is to make sure that the data that he's responsible for will be available if anything bad happens (asteroids, flood, locusts, hail). In order to do that you will need a high availability solution to obtain continuity and a good backup plan to support your system in case of human error (-Hello Mr. admin, I just dropped the customer payment table). Trust me, it happens... a lot... I briefly covered high availability with the implementation of a hot standby in an earlier post and now it's time to cover the backup options available in PostgreSQL. There are 2 kinds of backups in PostgreSQL: physical and logical. The good thing with logical backups is that they are simple to implement and maintain, and selective backup and restore (even into later PG versions) is easy. Usually the backup output consists of one file that of course can be compressed. The major con of this method is that it lacks Point In Time Recovery. (A database wi...
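A minimal sketch of the logical side described above (database, table and file names are placeholders):

# logical backup of a single database, compressed custom format
pg_dump -Fc -f /backups/salesdb.dmp salesdb

# selective restore: just one table out of that dump
pg_restore -d salesdb_restore -t customer_payments /backups/salesdb.dmp

# logical backup of the whole cluster, globals included
pg_dumpall > /backups/cluster.sql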

Migrating Oracle to PostgreSQL

I have an Oracle dump from a database that I want to migrate to Postgres for a project. I set up a 100GB VM, installed Oracle 11gR2 and compiled Postgres 9.2.3 on the same VM. I restored the dump to Oracle and then had to come up with a migration strategy. I had some ideas, like getting the schema from the dump with expdp (sqlfile=) and manually translating it to Postgres, but I remembered a project that I saw some time ago called ora2pg. This cool thing is basically a Perl program that, apart from Perl (duh), needs DBI and the DBD drivers for Oracle and Postgres. I set it up and started experimenting, first on the DDLs. The output was actually very good! The only things that I had to change were some function based indexes, and it was parseable by Postgres without problems. Next was the data. No problems there, it created a 20GB output file that all I had to do was throw at Postgres. Just because I wasn't very proactive with my VM's disk space I faced some problems... 20Gb fr...
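For the curious, the ora2pg workflow is driven by a config file plus a run per export type; the sketch below uses directive names as I remember them from the ora2pg documentation (DSN, credentials and filenames are placeholders), so treat it as an assumption and check your version.

# ora2pg.conf (sketch)
ORACLE_DSN      dbi:Oracle:host=localhost;sid=ORCL
ORACLE_USER     system
ORACLE_PWD      secret
TYPE            TABLE        # first pass: export the DDL
OUTPUT          schema.sql

# run it, then switch TYPE to COPY (or INSERT) for the data pass
ora2pg -c ora2pg.conf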

PostgreSQL Partitioning

Partitions are very useful when it comes to big tables; the documentation suggests applying table partitioning when a table is bigger than 10GB. In Postgres there are 2 kinds of partitions:
Range
List
The implementation will just need the following steps:
Enable constraint exclusion in the config file
Create a master table
Create child tables WITHOUT overlapping table constraints
Create indexes, PKs
Create a function and trigger to insert data into the child tables
First thing to notice here is that partitioning basically uses table inheritance. But enough with how things work in theory, let's create one. First, check postgresql.conf for this parameter:
constraint_exclusion = partition
Now, let's create the master and child tables:
CREATE TABLE sales (
    sales_id serial NOT NULL,
    sales_date DATE NOT NULL DEFAULT CURRENT_DATE,
    description text
);
CREATE TABLE sales_2013_p1 ( CHECK ( sale...
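Since the excerpt cuts off before the trigger part, here is a rough sketch of the insert-routing function and trigger for a setup like the one above; the child table, check constraint and date range are illustrative, not the exact ones from the post.

CREATE TABLE sales_2013_p1 (
    CHECK ( sales_date >= DATE '2013-01-01' AND sales_date < DATE '2013-07-01' )
) INHERITS (sales);

CREATE OR REPLACE FUNCTION sales_insert_trigger()
RETURNS trigger AS $$
BEGIN
    IF NEW.sales_date >= DATE '2013-01-01' AND NEW.sales_date < DATE '2013-07-01' THEN
        INSERT INTO sales_2013_p1 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'sales_date out of range, add a partition';
    END IF;
    RETURN NULL;  -- suppress the insert into the master table
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER insert_sales_trigger
    BEFORE INSERT ON sales
    FOR EACH ROW EXECUTE PROCEDURE sales_insert_trigger();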

PostgreSQL, Binary replication in practice

A couple of days ago I started writing a short howto about streaming replication in PostgreSQL 9.2. Most of these things are well documented, but in this howto I will also try to experiment with switchovers and switchbacks. It aims to show how easy it is to set it up right out of the box. Streaming replication PostgreSQL 9.2. For my example I will use 2 Debian VMs, 192.168.0.100 (pglab1) and 192.168.0.101 (pglab2).
- not mandatory - exchange ssh keys for passwordless ssh; might be useful if we need to scp scripts, rsync or whatever. If you don't know how to do it, follow these steps: http://www.debian-administration.org/articles/152
- ON MASTER (PGLAB1) - after you create a postgres cluster using initdb, edit the master's postgresql.conf and change the following:
listen_addresses = '*'
wal_level = hot_standby #(could be archive too)
max_wal_senders = 5
hot_standby = on
create a replication user:
create user repuser replication password 'passw...
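To complete the picture for the standby side, which the excerpt does not reach, here is a rough sketch of how the slave could be seeded and pointed at the master on 9.2; paths, the password and the pg_hba.conf entry are placeholders.

# ON SLAVE (PGLAB2) - clone the master's data directory over the replication protocol
pg_basebackup -h 192.168.0.100 -U repuser -D /var/lib/postgresql/9.2/main -X stream -P

# recovery.conf on the slave
standby_mode = 'on'
primary_conninfo = 'host=192.168.0.100 port=5432 user=repuser password=yourpassword'

# pg_hba.conf on the master must allow the replication connection
host  replication  repuser  192.168.0.101/32  md5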

PostgreSQL proper installation

Recently at work, I got assigned to upgrade some (about 10) Postgres 8.1.x installations to 9. I have always liked compiling, basically because I like the flexibility that compiling from source offers, and that's what I proposed to the guys in charge of the project. They gave me a test system (VM) to play with; in all fairness they were a bit skeptical about the idea of compiling the RDBMS, mostly for reliability reasons (don't ask me why). I explained that upgrading from source would be much easier later on and that in the last year the PostgreSQL developers have done so much work that it cannot be ignored (pg_basebackup for example). Since the latest CentOS package was 9.1.x they agreed and I started working. PostgreSQL is so easy to compile: no strange dependencies, not many libraries needed. And about those reliability concerns? Because the developers are so neat, I quote from http://www.postgresql.org/support/versioning "While upgrades always have some risk, PostgreSQL minor releases fix ...
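For reference, the whole build really does boil down to a few commands; a sketch with an example prefix and data directory (adjust versions and paths to taste):

# build and install PostgreSQL from source
./configure --prefix=/opt/pgsql/9.2.3
make
make install          # or: make install-world for contrib and docs too

# initialize a cluster and start it
/opt/pgsql/9.2.3/bin/initdb -D /opt/pgsql/9.2.3/data
/opt/pgsql/9.2.3/bin/pg_ctl -D /opt/pgsql/9.2.3/data -l logfile start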