Since this was PostgreSQL related i wanted to post about the technical aspect of it.
According to the company :
“Mandrill uses a sharded Postgres setup as one of our main datastores,”
the email explains.
“On Sunday, February 3, at 10:30pm EST, 1 of our 5 physical Postgres instances saw a significant spike in writes.”
The email continues:
“The spike in writes triggered a Transaction ID Wraparound issue. When this occurs, database activity is completely halted. The database sets itself in read-only mode until offline maintenance (known as vacuuming) can occur.”
So, lets see what that "transaction id wraparound issue" is and how someone could prevent similar outages from ever happening.
PostgreSQL uses MVCC to control transaction visibility, basically by comparing transaction IDs (XIDs). A row with an insert XID greater than the current transaction XID shouldn't be visible to the current transaction. But since transaction IDs are not unlimited a cluster will eventually run out after
(2^32 transactions 4+ billion) causing transaction ID wraparound: transaction counter wraps around to zero, and all past transaction would appear to be in the future
This is being taken care of by vacuum that will mark rows as frozen, indicating that they were inserted by a transaction that committed far in the past that can be visible to all current and future transactions. To control this behavior, postgres has a configurable called autovacuum_freeze_max_age, which defaults at 200.000.000 transactions, a very conservative default that must be tuned in larger production systems.
It sounds complicated but its relatively easy not to get to that point,for most people just having autovacuum on will prevent this situation from ever happening. You can simply schedule manual vacuums by getting a list of the tables "closer" to autovacuum_freeze_max_age with a simple query like this:
SELECT 'vacuum analyze ' || c.oid::regclass ||' /* manual_vacuum */ ;' FROM pg_class c LEFT JOIN pg_class t ON c.reltoastrelid = t.oid WHERE c.relkind = 'r' ORDER BY greatest(age(c.relfrozenxid),age(t.relfrozenxid)) desc LIMIT 10;
Even if you want to avoid manual vacuums, you can simply create a report or a monitoring metric based on age of relfrozenxid of pg_class combined with pg_settings, eg :
SELECT oid::regclass::text AS table,pg_size_pretty(pg_total_relation_size(oid)) AS table_size,age(relfrozenxid) AS xid_age, (SELECT setting::int FROM pg_settings WHERE name = 'autovacuum_freeze_max_age') - age(relfrozenxid) AS tx_for_wraparound_vacuum FROM pg_class WHERE relfrozenxid != 0 AND oid > 16384 ORDER BY tx_for_wraparound_vacuum ;
But, lets assume that you got to the point that you started seeing "autovacuum: VACUUM tablename (to prevent wraparound)"
This vacuum will more likely happen when you don't want it and even if you kill it, it will keep respawning.
If you already set autovacuum_freeze_max_age to a more viable production setting, we usually set it at 1.5bil, you can change autovacuum_freeze_max_age to a higher value, say 3 billion and immediately kick of a manual vacuum on the table with a high maintenance_work_mem value. This vacuum will be a "normal vacuum" hence live. If the table is so big that each manual vacuum take days to complete, then you should've partitioned it...Proper schema and proper tuning of autovacuum are really important, especially in write heavy workloads.
And to get back to the outage, the mail from the company insinuates that it was a problem with postgres, reality is that it wasn't, it was clearly an OPS oversight.
Thanks for reading