Friday, 25 January 2019

PostgreSQL 12 : Where in copy from

5 days ago a commit made it to 12devel that implements WHERE clause in COPY FROM.
Today we're gonna see how it works and see how someone could achieve the same by using file_fdw.

To begin with, lets create a table and put some data in it.

create table test (
id int,
date timestamp without time zone,
);

insert into test (id,date) select generate_series (1,1000000)            ,'2015-01-01';
insert into test (id,date) select generate_series (1000001,2000000),'2016-01-01';
insert into test (id,date) select generate_series (2000001,3000000),'2017-01-01';
insert into test (id,date) select generate_series (3000001,4000000),'2018-01-01';
insert into test (id,date) select generate_series (4000001,5000000),'2019-01-01';

now, lets make this a bit larger than 170MB, dump the data in csv and truncate the table :

monkey=# insert into test select * from test;
INSERT 0 5000000
monkey=# insert into test select * from test;
INSERT 0 10000000
monkey=# select count(*) from test ;
  count
----------
 20000000
(1 row)

monkey=# copy test to '/home/vasilis/test.csv' with csv;
COPY 20000000

monkey=# truncate test;
TRUNCATE TABLE

vasilis@Wrath > ls -lh ~/test.csv
-rw-r--r-- 1 vasilis vasilis 759M Jan 25 12:24 /home/vasilis/test.csv

Our test file is about 750Mb, now with an empty table , lets import only the rows that are up to 2 years old :

monkey=# copy test from '/home/vasilis/test.csv' with csv where date >= now() - INTERVAL '2 year' ;
COPY 8000000
Time: 17770.267 ms (00:17.770)

It worked , awesome !

Now, lets compare to the alternative, file_fdw :

monkey=# CREATE EXTENSION file_fdw;
CREATE EXTENSION
Time: 7.288 ms

monkey=# CREATE SERVER pgtest FOREIGN DATA WRAPPER file_fdw;
CREATE SERVER
Time: 3.957 ms

monkey=# create foreign table file_test (id int, date timestamp without time zone, name text) server pgtest options (filename '/home/vasilis/test.csv', format 'csv');
CREATE FOREIGN TABLE
Time: 16.463 ms

monkey=# truncate test ;
TRUNCATE TABLE
Time: 15.123 ms

monkey=# insert into test select * from file_test where date >= now() - INTERVAL '2 year' ;
INSERT 0 8000000
Time: 21368.595 ms (00:21.369)


I automated the test, and copy was (obviously) always faster. I'd imagine that the key benefit here is not performance, which is inherited by COPY itself but extending the functionality of COPY, which is good, especially for workloads that export data , ETL and load to another DB, or for restoring specific rows from backups (assuming pg_dump backups is still a thing).




Thanks for reading
Vasilis Ventirozos
credativ LLC

No comments:

Post a Comment

Managing xid wraparound without looking like a (mail) chimp

My colleague Payal came across an outage that happened to mailchimp's mandrill app yesterday, link can be found HERE . Since this was P...