AWS Aurora Postgres, not a great first impression

Recently we had a customer facing memory issues with an Aurora PostgreSQL instance. While loading data or creating GiST indexes, they got the following error:


2018-01-04 14:03:06 UTC:171.32.45.250(35508):root@test:[20457]:ERROR:  out of memory
2018-01-04 14:03:06 UTC:171.32.45.250(35508):root@test:[20457]:DETAIL:  Failed on request of size 23.
2018-01-04 14:03:06 UTC:171.32.45.250(35508):root@test:[20457]:STATEMENT:  INSERT INTO test_table1 VALUES
('domain1.com','Manual','[198.168.1.148,198.168.1.250]','t','f','2016-02-16 15:58:58.016626','',''),   ('domain2.com','Manual','[192.168.1.1,192.168.1.174]','t','f','2016-02-16 15:58:58.023136','',''),
...

At the same time they saw a major drop in freeable memory (CloudWatch FreeableMemory graph not shown here).

The logs showed:


TopMemoryContext: 1184792 total in 13 blocks; 14024 free (5 chunks); 1170768 used
  CFuncHash: 8192 total in 1 blocks; 776 free (0 chunks); 7416 used
  Type information cache: 24472 total in 2 blocks; 2840 free (0 chunks); 21632 used
  TopTransactionContext: 8192 total in 1 blocks; 7384 free (1 chunks); 808 used
  MessageContext: 8511544 total in 11 blocks; 2077632 free (0 chunks); 6433912 used
  Operator class cache: 8192 total in 1 blocks; 776 free (0 chunks); 7416 used
  smgr relation context: 0 total in 0 blocks; 0 free (0 chunks); 0 used
    smgr relation table: 24576 total in 2 blocks; 13008 free (4 chunks); 11568 used
  TransactionAbortContext: 32768 total in 1 blocks; 32728 free (0 chunks); 40 used
  Portal hash: 8192 total in 1 blocks; 776 free (0 chunks); 7416 used
  PortalMemory: 8192 total in 1 blocks; 7880 free (0 chunks); 312 used
    PortalHeapMemory: 1024 total in 1 blocks; 840 free (0 chunks); 184 used
      ExecutorState: 24576 total in 2 blocks; 3328 free (13 chunks); 21248 used
        GiST scan context: 49245329112 total in 5881 blocks; 138720 free (3880 chunks); 49245190392 used
        ExprContext: 8192 total in 1 blocks; 8056 free (0 chunks); 136 used
        ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
        ExprContext: 8192 total in 1 blocks; 7464 free (0 chunks); 728 used
  Relcache by OID: 24576 total in 2 blocks; 12960 free (4 chunks); 11616 used
  CacheMemoryContext: 516096 total in 6 blocks; 74648 free (1 chunks); 441448 used
    test_idx1: 1024 total in 1 blocks; 8 free (0 chunks); 1016 used
    test_idx2: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
    pg_index_indrelid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
    pg_constraint_conrelid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
    pg_attrdef_adrelid_adnum_index: 1024 total in 1 blocks; 264 free (0 chunks); 760 used
    pg_db_role_setting_databaseid_rol_index: 1024 total in 1 blocks; 264 free (0 chunks); 760 used
    pg_user_mapping_user_server_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
    pg_user_mapping_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_type_oid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
    pg_type_typname_nsp_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
    pg_ts_template_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_ts_template_tmplname_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
    pg_ts_parser_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_ts_parser_prsname_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
    pg_ts_dict_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_ts_dict_dictname_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
    pg_ts_config_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_ts_config_cfgname_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
    pg_ts_config_map_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
    pg_transform_type_lang_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
    pg_transform_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_statistic_relid_att_inh_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
    pg_class_relname_nsp_index: 1024 total in 1 blocks; 264 free (0 chunks); 760 used
    pg_range_rngtypid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
    pg_proc_oid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
    pg_proc_proname_args_nsp_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
    pg_opfamily_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_opfamily_am_name_nsp_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
    pg_operator_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_operator_oprname_l_r_n_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
    pg_namespace_oid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
    pg_namespace_nspname_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
    pg_language_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_language_name_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_foreign_table_relid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_foreign_server_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_foreign_server_name_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_foreign_data_wrapper_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_foreign_data_wrapper_name_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_event_trigger_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_event_trigger_evtname_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_enum_typid_label_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
    pg_enum_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_default_acl_role_nsp_obj_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
    pg_conversion_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_constraint_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_conversion_name_nsp_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
    pg_conversion_default_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
    pg_collation_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_collation_name_enc_nsp_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
    pg_opclass_am_name_nsp_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
    pg_cast_source_target_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
    pg_attribute_relid_attnam_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
    pg_amop_fam_strat_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
    pg_amop_opr_fam_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
    pg_aggregate_fnoid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_trigger_tgrelid_tgname_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
    pg_rewrite_rel_rulename_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
    pg_amproc_fam_proc_index: 3072 total in 2 blocks; 1976 free (0 chunks); 1096 used
    pg_opclass_oid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
    pg_index_indexrelid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
    pg_attribute_relid_attnum_index: 1024 total in 1 blocks; 264 free (0 chunks); 760 used
    pg_class_oid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
    pg_authid_oid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
    pg_auth_members_member_role_index: 1024 total in 1 blocks; 264 free (0 chunks); 760 used
    pg_tablespace_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_shseclabel_object_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
    pg_replication_origin_roname_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_database_datname_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
    pg_replication_origin_roiident_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_auth_members_role_member_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
    pg_database_oid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
    pg_authid_rolname_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
  WAL record construction: 49768 total in 2 blocks; 6584 free (0 chunks); 43184 used
    GWAL record construction: 1024 total in 1 blocks; 456 free (0 chunks); 568 used
  PrivateRefCount: 8192 total in 1 blocks; 2840 free (0 chunks); 5352 used
  Aurora WAL Context: 8192 total in 1 blocks; 8152 free (0 chunks); 40 used
  Aurora File Context: 8192 total in 1 blocks; 7512 free (0 chunks); 680 used
  MdSmgr: 0 total in 0 blocks; 0 free (0 chunks); 0 used
  LOCALLOCK hash: 8192 total in 1 blocks; 776 free (0 chunks); 7416 used
  Timezones: 104120 total in 2 blocks; 2840 free (0 chunks); 101280 used
  ErrorContext: 8192 total in 1 blocks; 8152 free (4 chunks); 40 used
Grand total: 49256003552 bytes in 6012 blocks; 2466216 free (3912 chunks); 49253537336 used
2018-01-04 14:03:06 UTC:172.31.45.250(35508):root@test:[20457]:ERROR:  out of memory
2018-01-04 14:03:06 UTC:172.31.45.250(35508):root@test:[20457]:DETAIL:  Failed on request of size 23.
2018-01-04 14:03:06 UTC:172.31.45.250(35508):root@test:[20457]:STATEMENT:  INSERT INTO test_table1 VALUES
    ('domain1.com','Manual','[198.168.1.148,198.168.1.250]','t','f','2016-02-16 15:58:58.016626','',''),
    ('domain2.com','Manual','[192.168.1.1,192.168.1.174]','t','f','2016-02-16 15:58:58.023136','',''),
...

My first assumption was that this had something to do with GiST, or with the ip4r extension / datatypes combined with a GiST index.
Unfortunately, all of this investigation happened over a Slack chat, so I couldn't collect all the data I would have liked. The table was empty and this was affecting production, so I suggested a VACUUM FULL VERBOSE, which fixed the issue.
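
For the record, that workaround was nothing fancier than the following (the table name is taken from the OOM log above; keep in mind that VACUUM FULL takes an ACCESS EXCLUSIVE lock, so it blocks reads and writes while it runs):

-- Rewrites the table and rebuilds all of its indexes from scratch,
-- printing progress details because of VERBOSE.
VACUUM FULL VERBOSE test_table1;
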
It kept bothering me and my colleagues that we never found the root cause, so I started an Aurora instance in AWS and tried to reproduce the problem.

I wrote a shell script that generated a lot of random rows and loaded about 4 million of them into a table that looked like this (the load itself is sketched right after the schema):

CREATE EXTENSION IF NOT EXISTS citext;   -- the text columns below use citext

CREATE TYPE inet_range AS RANGE (
  subtype = inet
);

CREATE TABLE gist_test (
  testtext1  citext NOT NULL,
  testtext2  citext NOT NULL,
  ip_range   inet_range NOT NULL,
  testbool1  boolean NOT NULL DEFAULT false,
  testbool2  boolean NOT NULL DEFAULT false,
  created_at timestamp without time zone NOT NULL DEFAULT now(),
  testemail1 citext,
  testemail2 citext
);

CREATE INDEX idx_test1 ON gist_test (testtext1);
CREATE INDEX idx_test2 ON gist_test USING gist (ip_range);
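
Something roughly equivalent to that shell script can also be done straight in SQL with generate_series. This is just a sketch, not the exact script or data I used: the text and network values are made up, and both range bounds are built from the same /24 so the lower bound is always below the upper one.

-- Load ~4 million rows of random data. md5(random()::text) stands in for
-- the random text columns, and every row gets an inet range inside one
-- random 10.x.y.0/24 network.
INSERT INTO gist_test (testtext1, testtext2, ip_range, testbool1, testbool2, testemail1, testemail2)
SELECT
  md5(random()::text),
  md5(random()::text),
  inet_range((net || '.1')::inet, (net || '.254')::inet),
  random() < 0.5,
  random() < 0.5,
  '',
  ''
FROM (
  SELECT '10.' || (random() * 254)::int || '.' || (random() * 254)::int AS net
  FROM generate_series(1, 4000000)
) AS nets;

Since idx_test2 already exists at this point, every inserted row also goes through a GiST insert, so the load exercises the GiST code path as well as the btree one.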


I did a bunch of tests, like bloating the GiST index up to 80% and trying plain INSERTs as well as COPY to import data from external files; everything worked fine. Finally, I tried importing some data (about 300k rows) from my local machine with a plain INSERT:


insert into gist_test (testtext1,testtext2,ip_range,testbool1,testbool2,testemail1,testemail2) values
('HyBrlWugbsGH4fusJ9czVneB+58IZaoG8Vx3NV0jgyQ','0SaIKhTlZyD2zlUFpSK0b2Y9gep6hzxO0q2crgm3iRM','[184.224.199.243,184.224.199.255]','t','f','',''),
...
...
('AZgoJrP+QUvhe3TkmaxqoX8ZKv3BQ4a2muvPGuJIqq8','rIWnCq4q7qwLWlPmbYqnk3voOQdA0cPB4AerbeV4lpA','[6.116.212.51,6.116.212.255]','t','f','','');

Aurora went mental. I couldn't cancel the client, getting:

Could not send cancel request:
PQcancel() -- connect() failed: Connection refused

and after a while:

psql: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

The service was obviously dead. The logs said:

2018-01-10 14:53:09 UTC::@:[6393]:FATAL: Storage initialization failed. result = 1436, errno = 0
2018-01-10 14:53:09 UTC::@:[6391]:LOG: Aurora Runtime process (PID 6393) exited with exit code 1
2018-01-10 14:53:09 UTC::@:[6391]:LOG: terminating any other active server processes
2018-01-10 14:53:09 UTC::@:[6391]:FATAL: Can't handle storage runtime process crash
2018-01-10 14:53:09 UTC::@:[6391]:LOG: database system is shut down
2018-01-10 14:53:09 UTC::@:[6751]:LOG: Waiting for runtime initialization complete...


I ran this test three times: twice the database restarted with the above message, and once I had to manually reboot the instance because it went completely unresponsive.
At home I only have 1 Mbit of upload, so I couldn't possibly have been sending data fast enough to hurt the server; CloudWatch showed that everything was nice and low, except memory (freeable-memory graph not shown here).




So basically, this is my progress so far. I might re-test this on a normal RDS instance, try it without the indexes, or simply file a case with Amazon :)
Obviously this post is not meant to say that Aurora is bad, but if I were about to migrate to Aurora and take advantage of its features, I'd make sure to double and triple check that the application works as it should. This product is very new (I think it was released in November 2017), so some problems should be expected, especially if your schema is a bit out of the ordinary.
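
The without-the-indexes re-test is trivial to script (a sketch; insert_batch.sql is just a placeholder for the same 300k-row INSERT file):

-- Drop the GiST index, replay the same batch, then recreate the index.
-- If the load succeeds without it, the GiST index on the custom range
-- type is clearly the trigger.
DROP INDEX idx_test2;
\i insert_batch.sql
CREATE INDEX idx_test2 ON gist_test USING gist (ip_range);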


Thanks for reading,

Vasilis Ventirozos
OmniTI Computer Consulting

Comments

  1. That looks like a memory leak. Does it happen only on Amazon? Could you try to reproduce it on a plain PostgreSQL instance?

    1. It only happens on Amazon. I tried doing this on a normal Postgres and it didn't break a sweat, which makes sense; such memory leaks have been extinct in Postgres for a very long time now.

  2. Hi Vasilis, I have faced the same problem after migrating to Aurora (coming from RDS).

    I filed a bug report with Amazon, who solved it, but in a private patch for us. It will be publicly available with Aurora 1.1, which is to be released sometime in January.

    In the meantime you should file a bug report and get the same private patch.

    1. Wow, thank you very much for your reply, I will definitely do that :)

  3. My employer is also evaluating Aurora and we noticed that the default memory limits are much more aggressive than on classic RDS. We were told that the cause of this is that Aurora does not rely on, or even use, the OS cache, which means that instance RAM that is not provisioned to shared_buffers and not currently used for work_mem by queries is simply wasted. The default Aurora memory limits often left less than 5% freeable memory, so it's quite easy to crash it with a few memory-hungry queries.

    1. (Full disclosure: I am the Product Manager for Amazon Aurora with PostgreSQL compatibility). We set the default shared_buffers to 75% of total memory because, as you state, there is no file system cache with Aurora Storage. If you need to support large index creates, or queries that do sorts on large tables, etc, then you should consider increasing the work_mem parameter. However, you may need to concurrently decrease shared_buffers to avoid running out of memory. We are working on recommendations / best practices for setting Amazon Aurora PostgreSQL memory-related parameters based on your workload, and will publish those soon.

    2. Thanks for your reply Kevin. Increasing work_mem sounds reasonable, but changing shared_buffers would be tricky since it requires a restart. I've seen workloads where creating indexes on temp tables is part of the application lifecycle, so I was wondering if you could share any insight on how performance with a low(er) shared_buffers setting would compare to the default.

    3. Depending on the workload, performance with lower shared_buffers might be affected, or it might not. If the working set of data no longer fits into memory when you reduce shared_buffers, then performance will be affected; if the working set didn't fit into memory before reducing shared_buffers, as would be typical with an analytics-oriented workload, then it probably won't have much of an impact.

  4. This comment has been removed by the author.

  5. I am one of the Product Managers for Amazon Aurora with PostgreSQL compatibility. Your problem description is similar to a GIST-related problem we have fixed in an upcoming patch. Please contact aurora-postgresql-support@amazon.com directly and I will be happy to help work through your issue

