Incident documentation/20170201-Phabricator

From Wikitech
Jump to: navigation, search

This is a template for an Incident Report. Replace notes with your own description.

Summary

Phabricator service was interrupted at 09:39:04 UTC on 2017-02-01. The master (db1043) and one of the slave (db1048) database servers crashed, however, db2012 didn't crash.


Timeline

Time Actor Activity
09:39 OOM Killer on db1043 Killed process 3272 (mysqld) total-vm:75835656kB, anon-rss:64323888kB, file-rss:0kB
09:39 Nemo_bis #wikimedia-tech: 1) #1290: The MariaDB server is running with the --read-only option so it cannot execute this statement 2) A Troublesome Encounter! Woe! This request had its journey cut short by unexpected circumstances (Can Not Connect to MySQL).
09:40 Reedy #1290: The MariaDB server is running with the --read-only option so it cannot execute this statement
09:40 MariaDB (log) Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'Could not find first log file name in binary log index file', Internal MariaDB error code: 1236
09:41 icinga-wm PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 2 down 1
09:41 icinga-wm PROBLEM - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 2 down 1
moments later marostegui Restored the service
10:05 marostegui Filed phab:T156905
... .... Then, later in the day:
18:58 icinga-wm PROBLEM - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 2 down 1
18:58 icinga-wm PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 2 down 1
18:59 icinga-wm PROBLEM - https://phabricator.wikimedia.org on phab2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:08 jynus scheduling 10 minutes of emergency downtime on phabricator
19:10 ostriches phabricator: now in read-only mode
19:11 jynus remaining 7 minute with phabricator up, but read-only
19:20 jynus reloading haproxy on dbproxy1003
19:25 twentyafterfour running puppet on iridium to activate the config change (this switched phabricator to elasticsearch back-end)
19:27 twentyafterfour disabled read-only in phabricator
19:59 twentyafterfour Freshening phabricator's elasticsearch index, currently 50% complete
database load graphs
db1043 graphs showing activity leading up to the crash

Conclusions

The Phabricator fulltext search backend combined with our current mariadb version / configuration can generate pathological queries which do not complete in a reasonable amount of time. This is compounded by user behavior: users will usually retry failed searches multiple times, generating more load on the database server, which eventually crumbles under the load of many long-running queries.

We have been aware of shortcomings with the way Phabricator uses MySQL for full-text search for many months: see the main status updates on 2015-05-15 (switch to MySQL backend), 2016-12-18 (incomplete index). For this reason, Wikimedia release engineering has been working on porting our Phabricator search index over to ElasticSearch (T146843). Given various recent incidents related to Phabricator databases, we expected this particular failure to repeat itself if we simply restored service. It was thus decided to immediately switch over to using ElasticSearch back-end for Phabricator search, in order to avoid another mysql failure.

After the immediate emergency was over, further investigation (with help from upstream developers, thanks epriestley!) revealed that optimizing the table that contains the fulltext index could help the situation, however, the performance after optimization was still not great. More improvement is needed and we will probably stick with ElasticSearch for now barring any problems that might cause.

Actionables

  • Switch to elasticsearch back-end for Phabricator's Full Text search engine. (already done)