Incidents/2017-02-01 Phabricator
Summary
Phabricator service was interrupted at 09:39:04 UTC on 2017-02-01. The master database server (db1043) and one of the slave database servers (db1048) crashed; db2012 did not crash.
Timeline
Time | Actor | Activity
09:39 | OOM Killer on db1043 | Killed process 3272 (mysqld) total-vm:75835656kB, anon-rss:64323888kB, file-rss:0kB |
09:39 | Nemo_bis | #wikimedia-tech: 1) #1290: The MariaDB server is running with the --read-only option so it cannot execute this statement 2) A Troublesome Encounter! Woe! This request had its journey cut short by unexpected circumstances (Can Not Connect to MySQL). |
09:40 | Reedy | #1290: The MariaDB server is running with the --read-only option so it cannot execute this statement |
09:40 | MariaDB (log) | Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'Could not find first log file name in binary log index file', Internal MariaDB error code: 1236 |
09:41 | icinga-wm | PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 2 down 1 |
09:41 | icinga-wm | PROBLEM - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 2 down 1 |
moments later | marostegui | Restored the service |
10:05 | marostegui | Filed phab:T156905 |
... | ... | Later in the day:
18:58 | icinga-wm | PROBLEM - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 2 down 1 |
18:58 | icinga-wm | PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 2 down 1 |
18:59 | icinga-wm | PROBLEM - https://phabricator.wikimedia.org on phab2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds |
19:08 | jynus | scheduling 10 minutes of emergency downtime on phabricator |
19:10 | ostriches | phabricator: now in read-only mode |
19:11 | jynus | remaining 7 minutes with phabricator up, but read-only
19:20 | jynus | reloading haproxy on dbproxy1003 |
19:25 | twentyafterfour | running puppet on iridium to activate the config change (this switched phabricator to elasticsearch back-end) |
19:27 | twentyafterfour | disabled read-only in phabricator |
19:59 | twentyafterfour | Freshening phabricator's elasticsearch index, currently 50% complete |
Conclusions
The Phabricator fulltext search backend combined with our current mariadb version / configuration can generate pathological queries which do not complete in a reasonable amount of time. This is compounded by user behavior: users will usually retry failed searches multiple times, generating more load on the database server, which eventually crumbles under the load of many long-running queries.
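For illustration, the queries in question are boolean-mode MATCH ... AGAINST fulltext queries against Phabricator's search corpus table. The sketch below is hypothetical and assumes the stock Phabricator search schema (phabricator_search.search_documentfield with a FULLTEXT index on corpus); it is not a query captured from this incident.

```sql
-- A minimal sketch of the kind of boolean-mode fulltext query the MySQL
-- search backend generates. Table and column names assume the stock
-- Phabricator schema and are illustrative only.
SELECT phid
FROM phabricator_search.search_documentfield
WHERE MATCH (corpus) AGAINST ('+database +crash*' IN BOOLEAN MODE)
LIMIT 100;
```

With a large, fragmented corpus table, queries of this shape can run far longer than expected, and retries from users stack up additional copies of them.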
We have been aware of shortcomings in the way Phabricator uses MySQL for full-text search for many months: see the main status updates of 2015-05-15 (switch to MySQL backend) and 2016-12-18 (incomplete index). For this reason, Wikimedia Release Engineering has been working on porting our Phabricator search index over to ElasticSearch (T146843). Given various recent incidents related to Phabricator databases, we expected this particular failure to repeat itself if we simply restored service. It was thus decided to switch Phabricator search over to the ElasticSearch back-end immediately, in order to avoid another MySQL failure.
After the immediate emergency was over, further investigation (with help from upstream developers, thanks epriestley!) revealed that optimizing the table that contains the fulltext index could help the situation; however, performance after optimization was still not great. More improvement is needed, and we will probably stick with ElasticSearch for now, barring any problems that it might cause.
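The optimization referred to above amounts to rebuilding the fulltext table. A hedged sketch, again assuming the stock Phabricator table name rather than anything specific to this installation:

```sql
-- Rebuild the fulltext table and its FULLTEXT index to reduce fragmentation.
-- This can lock the table for the duration, so it should be run during low
-- traffic or with Phabricator in read-only mode.
OPTIMIZE TABLE phabricator_search.search_documentfield;
```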
Actionables
- Switch to the ElasticSearch back-end for Phabricator's full-text search engine. (already done)