Managing xid wraparound without looking like a (mail) chimp

My colleague Payal came across an outage that happened to Mailchimp's Mandrill app yesterday; the link can be found HERE.
Since this was PostgreSQL related, I wanted to post about the technical aspect of it.
According to the company:

“Mandrill uses a sharded Postgres setup as one of our main datastores,”
the email explains.
“On Sunday, February 3, at 10:30pm EST, 1 of our 5 physical Postgres instances saw a significant spike in writes.” 
The email continues:
“The spike in writes triggered a Transaction ID Wraparound issue. When this occurs, database activity is completely halted. The database sets itself in read-only mode until offline maintenance (known as vacuuming) can occur.”

So, let's see what this "transaction ID wraparound" issue is and how someone could prevent similar outages from ever happening.

PostgreSQL uses MVCC to control transaction visibility, basically by comparing transaction IDs (XIDs). A row with an insert XID greater than the current transaction's XID shouldn't be visible to the current transaction. But since transaction IDs are not unlimited, a cluster will eventually run out of them after about 2^32 (4+ billion) transactions, causing transaction ID wraparound: the transaction counter wraps around to zero, and all past transactions would suddenly appear to be in the future.
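
For example, a quick way to check how much XID headroom each database in the cluster has left is to look at the age of datfrozenxid in pg_database; as this number approaches ~2 billion, the cluster is approaching a forced shutdown:

 SELECT datname, age(datfrozenxid) AS xid_age
 FROM pg_database
 ORDER BY xid_age DESC;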

This is taken care of by vacuum, which will mark rows as frozen, indicating that they were inserted by a transaction that committed far enough in the past to be visible to all current and future transactions. To control this behavior, Postgres has a configurable called autovacuum_freeze_max_age, which defaults to 200,000,000 transactions, a very conservative default that must be tuned in larger production systems.
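
You can check the value you are currently running with, along with the rest of the freeze-related settings, straight from pg_settings:

 SELECT name, setting
 FROM pg_settings
 WHERE name IN ('autovacuum_freeze_max_age',
                'vacuum_freeze_min_age',
                'vacuum_freeze_table_age');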

It sounds complicated, but it's relatively easy not to get to that point; for most people, just having autovacuum on will prevent this situation from ever happening. On top of that, you can schedule manual vacuums by getting a list of the tables "closest" to autovacuum_freeze_max_age with a simple query like this:


 SELECT 'vacuum analyze ' || c.oid::regclass ||' /* manual_vacuum */ ;'  
 FROM pg_class c LEFT JOIN pg_class t ON c.reltoastrelid = t.oid  
 WHERE c.relkind = 'r'   
 ORDER BY greatest(age(c.relfrozenxid),age(t.relfrozenxid)) desc  
 LIMIT 10;  
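
If you are running this from psql (9.6 or newer), you can also execute the generated statements in place by ending the query with \gexec instead of a semicolon.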

Even if you want to avoid manual vacuums, you can simply create a report or a monitoring metric based on the age of relfrozenxid in pg_class, combined with pg_settings, e.g.:


 SELECT oid::regclass::text AS table,
        pg_size_pretty(pg_total_relation_size(oid)) AS table_size,
        age(relfrozenxid) AS xid_age,
        (SELECT setting::int FROM pg_settings
          WHERE name = 'autovacuum_freeze_max_age') - age(relfrozenxid)
          AS tx_for_wraparound_vacuum
 FROM pg_class
 WHERE NOT relfrozenxid = 0   -- skip indexes, views etc. (xid has no <> operator on older versions)
 AND oid >= 16384             -- user objects only
 ORDER BY tx_for_wraparound_vacuum;
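
The tx_for_wraparound_vacuum column is simply how many transactions each table has left before autovacuum forces an anti-wraparound vacuum on it, so alerting when it drops below a comfortable threshold is an easy way to catch the problem early.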

But let's assume that you got to the point where you start seeing "autovacuum: VACUUM tablename (to prevent wraparound)".

This vacuum will most likely happen when you don't want it to, and even if you kill it, it will keep respawning.
If you have already set autovacuum_freeze_max_age to a more viable production setting (we usually set it to 1.5 billion), you can raise autovacuum_freeze_max_age further (the parameter caps at 2 billion) and immediately kick off a manual vacuum on the table with a high maintenance_work_mem value. This vacuum will be a "normal" vacuum, hence live. If the table is so big that each manual vacuum takes days to complete, then you should have partitioned it... A proper schema and proper tuning of autovacuum are really important, especially in write-heavy workloads.
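
As a minimal sketch, assuming a (hypothetical) table called bigtable and 2GB of memory to spare, that sequence could look like this:

 -- autovacuum_freeze_max_age can only be changed at server start, so this needs a restart
 ALTER SYSTEM SET autovacuum_freeze_max_age = 1800000000;
 -- after the restart, in a regular session:
 SET maintenance_work_mem = '2GB';
 -- FREEZE makes this (still online) vacuum aggressive, so relfrozenxid actually advances
 VACUUM (FREEZE, VERBOSE) bigtable;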

And to get back to the outage: the mail from the company insinuates that it was a problem with Postgres; in reality it wasn't, it was clearly an ops oversight.


Thanks for reading
Vasilis Ventirozos
credativ LLC






Comments

  1. > "autovacuum: VACUUM tablename (to prevent wraparound)"
    > This vacuum will more likely happen when you don't want it and even if you kill it, it will keep respawning.

    This doesn't fully describe what probably happened. Although the wraparound vacuum can cause some disruption due to the high I/O loads it generates, it won't cause the "halting" that Mailchimp stated.

    If this wraparound vacuum isn't able to get its job done, then PostgreSQL *will* actually shut down -- for example when autovacuum is disabled entirely or autovacuum_freeze_max_age is set so high that once it kicks in, it no longer has enough time to freeze the rows. From the documentation:

    > If these warnings are ignored, the system will shut down and refuse to start any new transactions once there are fewer than 1 million transactions left until wraparound:
    >
    > ERROR: database is not accepting commands to avoid wraparound data loss in database "mydb"
    > HINT: Stop the postmaster and vacuum that database in single-user mode.

    However, reaching this point requires some serious neglect.

    I think the advice in this blog post to raise autovacuum_freeze_max_age is only for people who know PostgreSQL in depth and perform monitoring for wraparound. Otherwise, keep it at the default, even for production databases, or you risk the same fate as Mailchimp.

    Replies
    1. I agree with most of what you are saying, especially that what Mailchimp shared is not the full story (we all know that the database doesn't come down because of a vacuum), but if someone is running a production system then they should know these things. After all, I never suggested putting 1.5 billion into autovacuum_freeze_max_age without monitoring :)

  2. The recommendation to increase autovacuum_freeze_max_age to 1.5B feels ill-placed unless it is clear that the reader really (really) knows what they are doing.

    Besides, someone would have to (1) be on a long vacation and (2) not have alarms for some metrics that are mentioned in every Postgres FAQ, for the database to shut down in the above scenario.

    Replies
    1. Well, if someone does the rest (monitoring, alarming, manual vacuums), then no matter how long they're on vacation they won't have this specific issue. Also, if you read my post again you'll see that I said "we usually set it at 1.5 billion", which goes hand in hand with setting up monitoring and alarming; I never suggested that people go start changing settings without knowing what they're doing.


