Incidents/20110926-s7switchover

What

The database master switchover for cluster s7 caused an outage that started at 18:23 and lasted about 9 minutes.

Cause

To prepare for the MediaWiki 1.18 release, Asher has been applying the required schema changes to the slave databases over the past week. The final step was to promote one of the slaves in each cluster to be the new master. This outage happened during the switchover of the s7 cluster. Usually this goes through automatically without incident, but cluster s7 also contains the CentralAuth database, which is used by all wikis. Before changing masters, Asher set cluster s7 to read-only in the MediaWiki db configuration. As soon as he changed the s7 master to a new database server at 18:23, the new master was bombarded with MediaWiki "slave lag" queries, i.e. MediaWiki instances trying to determine the replication lag of the slaves relative to the master in order to select which db slave to use.

Impact

Normally, the slave lag check is disabled when MediaWiki is in database read-only mode, but as was discovered later, this does not apply to CentralAuth queries, possibly because these are also issued by wikis outside the s7 cluster. As a result, the new master was immediately overloaded with these queries, which interrupted normal queries and replication traffic and brought down the sites.
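
To make the intended behavior concrete, here is a minimal sketch in Python of lag-based slave selection that skips the lag check entirely when the cluster is marked read-only. It is not MediaWiki's actual load balancer code; the names `Replica`, `pick_replica`, `get_lag` and `max_lag` are hypothetical and exist only for illustration. The failure described above corresponds to a caller (here, CentralAuth lookups from other clusters) that never sees the read-only flag and so always takes the lag-checking path.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Replica:
    """Hypothetical stand-in for one slave database server."""
    name: str


def pick_replica(
    replicas: Sequence[Replica],
    get_lag: Callable[[Replica], float],
    read_only: bool,
    max_lag: float = 30.0,
) -> Optional[Replica]:
    """Pick a slave to read from (illustrative only, not MediaWiki's logic).

    get_lag stands in for the "slave lag" query; it is assumed to be
    expensive because it has to contact the database servers.
    """
    if not replicas:
        return None

    # In read-only mode no new writes arrive, so the lag check is
    # skipped and no lag queries are issued at all.
    if read_only:
        return replicas[0]

    best: Optional[Replica] = None
    best_lag: Optional[float] = None
    for replica in replicas:
        lag = get_lag(replica)  # one lag query per slave
        if lag > max_lag:
            continue  # too far behind the master to be useful
        if best_lag is None or lag < best_lag:
            best, best_lag = replica, lag
    return best
```

With `read_only=True` the function never calls `get_lag`, which is the behavior the switchover relied on. Any code path that passes `read_only=False` for a database on s7 would still issue one lag query per slave on every selection.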

Mitigation

Asher investigated the situation and killed all outstanding queries on the new master, which fixed the problem at 18:32. He also filed Bugzilla ticket #31170 to improve MediaWiki's behavior during db read-only situations. An additional measure to avoid this problem in the future could be to split off CentralAuth into its own cluster on dedicated hardware (we could use some old DB servers for that).