AWS Aurora Postgres, not a great first impression
Recently we had a customer facing memory issues with an Aurora PostgreSQL instance. While loading data or creating GiST indexes they would get the following error:
2018-01-04 14:03:06 UTC:171.32.45.250(35508):root@test:[20457]:ERROR: out of memory
2018-01-04 14:03:06 UTC:171.32.45.250(35508):root@test:[20457]:DETAIL: Failed on request of size 23.
2018-01-04 14:03:06 UTC:171.32.45.250(35508):root@test:[20457]:STATEMENT: INSERT INTO test_table1 VALUES
('domain1.com','Manual','[198.168.1.148,198.168.1.250]','t','f','2016-02-16 15:58:58.016626','',''), ('domain2.com','Manual','[192.168.1.1,192.168.1.174]','t','f','2016-02-16 15:58:58.023136','',''),
...
At the same time they saw a major drop in freeable memory.
The logs showed that:
TopMemoryContext: 1184792 total in 13 blocks; 14024 free (5 chunks); 1170768 used
CFuncHash: 8192 total in 1 blocks; 776 free (0 chunks); 7416 used
Type information cache: 24472 total in 2 blocks; 2840 free (0 chunks); 21632 used
TopTransactionContext: 8192 total in 1 blocks; 7384 free (1 chunks); 808 used
MessageContext: 8511544 total in 11 blocks; 2077632 free (0 chunks); 6433912 used
Operator class cache: 8192 total in 1 blocks; 776 free (0 chunks); 7416 used
smgr relation context: 0 total in 0 blocks; 0 free (0 chunks); 0 used
smgr relation table: 24576 total in 2 blocks; 13008 free (4 chunks); 11568 used
TransactionAbortContext: 32768 total in 1 blocks; 32728 free (0 chunks); 40 used
Portal hash: 8192 total in 1 blocks; 776 free (0 chunks); 7416 used
PortalMemory: 8192 total in 1 blocks; 7880 free (0 chunks); 312 used
PortalHeapMemory: 1024 total in 1 blocks; 840 free (0 chunks); 184 used
ExecutorState: 24576 total in 2 blocks; 3328 free (13 chunks); 21248 used
GiST scan context: 49245329112 total in 5881 blocks; 138720 free (3880 chunks); 49245190392 used
ExprContext: 8192 total in 1 blocks; 8056 free (0 chunks); 136 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 8192 total in 1 blocks; 7464 free (0 chunks); 728 used
Relcache by OID: 24576 total in 2 blocks; 12960 free (4 chunks); 11616 used
CacheMemoryContext: 516096 total in 6 blocks; 74648 free (1 chunks); 441448 used
test_idx1: 1024 total in 1 blocks; 8 free (0 chunks); 1016 used
test_idx2: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
pg_index_indrelid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
pg_constraint_conrelid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
pg_attrdef_adrelid_adnum_index: 1024 total in 1 blocks; 264 free (0 chunks); 760 used
pg_db_role_setting_databaseid_rol_index: 1024 total in 1 blocks; 264 free (0 chunks); 760 used
pg_user_mapping_user_server_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_user_mapping_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
pg_type_oid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
pg_type_typname_nsp_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_ts_template_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
pg_ts_template_tmplname_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_ts_parser_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
pg_ts_parser_prsname_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_ts_dict_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
pg_ts_dict_dictname_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_ts_config_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
pg_ts_config_cfgname_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_ts_config_map_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
pg_transform_type_lang_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_transform_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
pg_statistic_relid_att_inh_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
pg_class_relname_nsp_index: 1024 total in 1 blocks; 264 free (0 chunks); 760 used
pg_range_rngtypid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
pg_proc_oid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
pg_proc_proname_args_nsp_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
pg_opfamily_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
pg_opfamily_am_name_nsp_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
pg_operator_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
pg_operator_oprname_l_r_n_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
pg_namespace_oid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
pg_namespace_nspname_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
pg_language_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
pg_language_name_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
pg_foreign_table_relid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
pg_foreign_server_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
pg_foreign_server_name_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
pg_foreign_data_wrapper_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
pg_foreign_data_wrapper_name_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
pg_event_trigger_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
pg_event_trigger_evtname_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
pg_enum_typid_label_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_enum_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
pg_default_acl_role_nsp_obj_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
pg_conversion_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
pg_constraint_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
pg_conversion_name_nsp_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_conversion_default_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
pg_collation_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
pg_collation_name_enc_nsp_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
pg_opclass_am_name_nsp_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
pg_cast_source_target_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_attribute_relid_attnam_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_amop_fam_strat_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
pg_amop_opr_fam_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
pg_aggregate_fnoid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
pg_trigger_tgrelid_tgname_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_rewrite_rel_rulename_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_amproc_fam_proc_index: 3072 total in 2 blocks; 1976 free (0 chunks); 1096 used
pg_opclass_oid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
pg_index_indexrelid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
pg_attribute_relid_attnum_index: 1024 total in 1 blocks; 264 free (0 chunks); 760 used
pg_class_oid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
pg_authid_oid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
pg_auth_members_member_role_index: 1024 total in 1 blocks; 264 free (0 chunks); 760 used
pg_tablespace_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
pg_shseclabel_object_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
pg_replication_origin_roname_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
pg_database_datname_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
pg_replication_origin_roiident_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
pg_auth_members_role_member_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
pg_database_oid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
pg_authid_rolname_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
WAL record construction: 49768 total in 2 blocks; 6584 free (0 chunks); 43184 used
GWAL record construction: 1024 total in 1 blocks; 456 free (0 chunks); 568 used
PrivateRefCount: 8192 total in 1 blocks; 2840 free (0 chunks); 5352 used
Aurora WAL Context: 8192 total in 1 blocks; 8152 free (0 chunks); 40 used
Aurora File Context: 8192 total in 1 blocks; 7512 free (0 chunks); 680 used
MdSmgr: 0 total in 0 blocks; 0 free (0 chunks); 0 used
LOCALLOCK hash: 8192 total in 1 blocks; 776 free (0 chunks); 7416 used
Timezones: 104120 total in 2 blocks; 2840 free (0 chunks); 101280 used
ErrorContext: 8192 total in 1 blocks; 8152 free (4 chunks); 40 used
Grand total: 49256003552 bytes in 6012 blocks; 2466216 free (3912 chunks); 49253537336 used
2018-01-04 14:03:06 UTC:172.31.45.250(35508):root@test:[20457]:ERROR: out of memory
2018-01-04 14:03:06 UTC:172.31.45.250(35508):root@test:[20457]:DETAIL: Failed on request of size 23.
2018-01-04 14:03:06 UTC:172.31.45.250(35508):root@test:[20457]:STATEMENT: INSERT INTO test_table1 VALUES ('domain1.com','Manual','[198.168.1.148,198.168.1.250]','t','f','2016-02-16 15:58:58.016626','',''), ('domain2.com','Manual','[192.168.1.1,192.168.1.174]','t','f','2016-02-16 15:58:58.023136','',''), ...
The "GiST scan context" in the dump above was holding about 49 GB on its own, so my first assumption was that this had something to do with GiST, or with the ip4r extension and its datatypes combined with a GiST index.
Unfortunately, all of this investigation happened over a Slack chat, so I couldn't collect all the data I would have liked. The table was empty and this was affecting production, so I suggested a VACUUM FULL VERBOSE, which fixed the issue.
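For the record, the suggested command was along these lines (the table name here is simply the one from the error log above):

-- VACUUM FULL rewrites the table and rebuilds all of its indexes,
-- including the GiST one, which is presumably why it cleared the problem.
VACUUM FULL VERBOSE test_table1;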
It has been bothering me and my colleagues that we never found the root cause, so I started an Aurora instance in AWS and tried to reproduce the problem.
I wrote a shell script that generated a lot of random rows and loaded about 4 million rows into a table that looked like this:
CREATE TYPE inet_range AS RANGE (subtype = inet);

create table gist_test (
    testtext1 citext not null,
    testtext2 citext not null,
    ip_range inet_range not null,
    testbool1 boolean not null default false,
    testbool2 boolean not null default false,
    created_at timestamp without time zone not null default now(),
    testemail1 citext,
    testemail2 citext
);

create index idx_test1 on gist_test (testtext1);
create index idx_test2 on gist_test using GIST(ip_range);
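My generator was a throwaway shell script, but plain SQL gets you a similar volume of random data; something along these lines (a sketch, not the exact script I used):

-- Sketch: ~4 million rows of random-ish data matching the table above.
-- The subquery produces one random prefix per generated row, so both
-- range bounds share it and the resulting inet_range is always valid.
INSERT INTO gist_test (testtext1, testtext2, ip_range, testbool1, testbool2, testemail1, testemail2)
SELECT md5(random()::text),
       md5(random()::text),
       inet_range((prefix || '.1')::inet, (prefix || '.254')::inet),
       random() < 0.5,
       random() < 0.5,
       '',
       ''
FROM (SELECT '10.' || (random()*254)::int || '.' || (random()*254)::int AS prefix
      FROM generate_series(1, 4000000)) AS r;

The exact values don't matter much; the point was just a few million rows with smallish inet ranges for the GiST index to work on.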
I did a bunch of tests, like bloating the GiST index up to 80% (sketched right after the INSERT below) and trying plain INSERTs or COPY to import data from external files; everything worked fine. Lastly, I tried importing some data (about 300k rows) from my local machine with a plain INSERT:
insert into gist_test (testtext1,testtext2,ip_range,testbool1,testbool2,testemail1,testemail2) values
('HyBrlWugbsGH4fusJ9czVneB+58IZaoG8Vx3NV0jgyQ','0SaIKhTlZyD2zlUFpSK0b2Y9gep6hzxO0q2crgm3iRM','[184.224.199.243,184.224.199.255]','t','f','',''),
...
...
('AZgoJrP+QUvhe3TkmaxqoX8ZKv3BQ4a2muvPGuJIqq8','rIWnCq4q7qwLWlPmbYqnk3voOQdA0cPB4AerbeV4lpA','[6.116.212.51,6.116.212.255]','t','f','','');
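As an aside, the bloat test mentioned above was done roughly like this: full-row updates so the GiST index accumulates dead entries, while keeping an eye on the index size (a sketch of the idea, not the exact statements I ran):

-- Each pass rewrites every row; the non-HOT updates leave dead entries
-- behind in idx_test2, so the GiST index grows. Repeat a few passes
-- until the index is sufficiently bloated.
UPDATE gist_test SET ip_range = ip_range;
-- Watch how much the index has grown compared to its freshly built size.
SELECT pg_size_pretty(pg_relation_size('idx_test2'));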
Aurora went mental; I couldn't cancel the client, which just kept returning:
Could not send cancel request:
PQcancel() -- connect() failed: Connection refused
and after a while:
psql: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The service was obviously dead.
The logs said:
2018-01-10 14:53:09 UTC::@:[6393]:FATAL: Storage initialization failed. result = 1436, errno = 0
2018-01-10 14:53:09 UTC::@:[6391]:LOG: Aurora Runtime process (PID 6393) exited with exit code 1
2018-01-10 14:53:09 UTC::@:[6391]:LOG: terminating any other active server processes
2018-01-10 14:53:09 UTC::@:[6391]:FATAL: Can't handle storage runtime process crash
2018-01-10 14:53:09 UTC::@:[6391]:LOG: database system is shut down
2018-01-10 14:53:09 UTC::@:[6751]:LOG: Waiting for runtime initialization complete...
I tried this test three times. Two out of three the database restarted with the above message, and the third time I had to manually reboot the instance because it went completely unresponsive.
Now, at home I only have 1 Mbit of upload bandwidth, so I couldn't possibly send data fast enough to stress the server, and CloudWatch was showing that everything was nice and low, except memory.
So basically, this is my progress so far. I might re-test this on a normal RDS instance, or try without the indexes, or I will simply file a case with Amazon :)
Obviously this post is not meant to say that Aurora is bad, but if I were about to migrate to Aurora to take advantage of its features, I'd make sure to double and triple check that the application works as it should. This product is very new (I think it was released in November 2017), so some problems should be expected, especially if your schema is a bit out of the ordinary.
Thanks for reading,
Vasilis Ventirozos
OmniTI Computer Consulting
That looks like a memory leak. Does it happen only on Amazon? Could you try to reproduce it on a plain PostgreSQL instance?
It only happens on Amazon. I tried doing this on a normal Postgres and it didn't break a sweat, which makes sense; memory leaks like this have been extinct in Postgres for a very long time now.
Hi Vasilis, I have faced the same problem after migrating to Aurora (coming from RDS).
I issued a bug report with Amazon, who solved it, but in a private patch for us. It will be publicly available with Aurora 1.1, which is to be released sometime in January.
In the meantime you should file a bug report and get the same private patch.
Wow, thank you very much for your reply, I will definitely do that :)
My employer is also evaluating Aurora and we noticed that the default memory limits are much more aggressive than on classic RDS. We were told that the cause of this is that Aurora does not rely on, or even use, the OS cache, which means that instance RAM that is not provisioned to shared_buffers and not currently used for work_mem by queries is simply wasted. So the default Aurora memory settings often left less than 5% freeable memory, and it's quite easy to crash it with a few memory-hungry queries.
(Full disclosure: I am the Product Manager for Amazon Aurora with PostgreSQL compatibility). We set the default shared_buffers to 75% of total memory because, as you state, there is no file system cache with Aurora Storage. If you need to support large index creates, or queries that do sorts on large tables, etc., then you should consider increasing the work_mem parameter. However, you may need to concurrently decrease shared_buffers to avoid running out of memory. We are working on recommendations / best practices for setting Amazon Aurora PostgreSQL memory-related parameters based on your workload, and will publish those soon.
Thanks for your reply Kevin. Increasing work_mem sounds reasonable, but changing shared_buffers would be tricky since it requires a restart. I've seen workloads where creating indexes on temp tables is part of the application lifecycle, so I was wondering if you could share any insight on how performance with a low(er) shared_buffers setting would compare to the default.
Depending on the workload, performance with lower shared_buffers might be affected, or it might not. If the working set of data no longer fits into memory when you reduce shared_buffers, then performance will be affected; if the working set didn't fit into memory before reducing shared_buffers, as would be typical with an analytics-oriented workload, then it probably won't have much of an impact.
I am one of the Product Managers for Amazon Aurora with PostgreSQL compatibility. Your problem description is similar to a GIST-related problem we have fixed in an upcoming patch. Please contact aurora-postgresql-support@amazon.com directly and I will be happy to help work through your issue.