PostgreSQL 12 : Where in copy from

5 days ago a commit made it to 12devel that implements WHERE clause in COPY FROM.
Today we're gonna see how it works and see how someone could achieve the same by using file_fdw.

To begin with, lets create a table and put some data in it.

create table test (
id int,
date timestamp without time zone,
);

insert into test (id,date) select generate_series (1,1000000)            ,'2015-01-01';
insert into test (id,date) select generate_series (1000001,2000000),'2016-01-01';
insert into test (id,date) select generate_series (2000001,3000000),'2017-01-01';
insert into test (id,date) select generate_series (3000001,4000000),'2018-01-01';
insert into test (id,date) select generate_series (4000001,5000000),'2019-01-01';

now, lets make this a bit larger than 170MB, dump the data in csv and truncate the table :

monkey=# insert into test select * from test;
INSERT 0 5000000
monkey=# insert into test select * from test;
INSERT 0 10000000
monkey=# select count(*) from test ;
  count
----------
 20000000
(1 row)

monkey=# copy test to '/home/vasilis/test.csv' with csv;
COPY 20000000

monkey=# truncate test;
TRUNCATE TABLE

vasilis@Wrath > ls -lh ~/test.csv
-rw-r--r-- 1 vasilis vasilis 759M Jan 25 12:24 /home/vasilis/test.csv

Our test file is about 750Mb, now with an empty table , lets import only the rows that are up to 2 years old :

monkey=# copy test from '/home/vasilis/test.csv' with csv where date >= now() - INTERVAL '2 year' ;
COPY 8000000
Time: 17770.267 ms (00:17.770)

It worked , awesome !

Now, lets compare to the alternative, file_fdw :

monkey=# CREATE EXTENSION file_fdw;
CREATE EXTENSION
Time: 7.288 ms

monkey=# CREATE SERVER pgtest FOREIGN DATA WRAPPER file_fdw;
CREATE SERVER
Time: 3.957 ms

monkey=# create foreign table file_test (id int, date timestamp without time zone, name text) server pgtest options (filename '/home/vasilis/test.csv', format 'csv');
CREATE FOREIGN TABLE
Time: 16.463 ms

monkey=# truncate test ;
TRUNCATE TABLE
Time: 15.123 ms

monkey=# insert into test select * from file_test where date >= now() - INTERVAL '2 year' ;
INSERT 0 8000000
Time: 21368.595 ms (00:21.369)


I automated the test, and copy was (obviously) always faster. I'd imagine that the key benefit here is not performance, which is inherited by COPY itself but extending the functionality of COPY, which is good, especially for workloads that export data , ETL and load to another DB, or for restoring specific rows from backups (assuming pg_dump backups is still a thing).




Thanks for reading
Vasilis Ventirozos
credativ LLC

Comments

  1. Nice Updates
    Yaaron Studios is one of the rapidly growing editing studios in Hyderabad. We are the best Video Editing services in Hyderabad. We provides best graphic works like logo reveals, corporate presentation Etc. And also we gives the best Outdoor/Indoor shoots and Ad Making services.

    Best video editing services in Hyderabad,ameerpet
    Best Graphic Designing services in Hyderabad,ameerpet­
    Best Ad Making services in Hyderabad,ameerpet­

    ReplyDelete

Post a Comment

Popular posts from this blog

Accessing PostgreSQL data from AWS Lambda

Tuning checkpoints

AWS Aurora Postgres, not a great first impression