Incidents/2017-01-26 API Slowdown

Work in progress: still under heavy research.

On 2017-01-26, from 17:51 to 18:15 (all times UTC), the MediaWiki Action API on Wikimedia wikis suffered a slowdown and an increase in 500 responses. Scheduled maintenance was under way at the time, but it should not have had any user impact. The underlying cause is still being researched.

Summary

A core router has been rebooting/behaving strangely since 12 January (task T155875)
Impact on DBs and other services was mitigated by moving essential services (e.g. the s1 master) away from the affected rack (task T155875)
Maintenance on the router started at 17:51. MediaWiki should have just depooled the affected services (DBs) and continued unaffected, as usual, but this did not work, or did not work as expected
API latency/throughput impact can be seen at:
Depooling affected API servers resolved the issue
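
The expectation in the summary above, that affected databases are skipped automatically and requests continue on the remaining ones, can be illustrated with a minimal sketch. This is not MediaWiki's actual load balancer code: the function `pick_replica`, the weights, the reachability check and the host name `db-healthy-a` are all hypothetical, and only the db1055/db1056 names are taken from the timeline.

```python
import random


def pick_replica(weights, is_reachable, rng=random):
    """Pick a replica at random by weight, skipping unreachable hosts.

    weights: dict of host name -> positive load weight (hypothetical values).
    is_reachable: callable taking a host name and returning True or False.
    Returns the chosen host name, or None if no host is reachable.
    """
    candidates = dict(weights)
    while candidates:
        names = list(candidates)
        choice = rng.choices(names, weights=[candidates[n] for n in names])[0]
        if is_reachable(choice):
            return choice
        # "Automatic depool": drop the unreachable host for this request
        # and try again with the remaining candidates.
        del candidates[choice]
    return None


# Hosts behind the switch under maintenance (names from the timeline).
unreachable = {"db1055", "db1056"}
# Hypothetical weights; "db-healthy-a" is an invented host name.
weights = {"db1055": 100, "db1056": 100, "db-healthy-a": 50}

chosen = pick_replica(weights, lambda host: host not in unreachable)
```

In this sketch, the manual depool at 18:15 corresponds to removing the hosts from `weights` up front, so they are never tried at all.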

Timeline

17:46 paravoid: stopping pybal on lvs1001/lvs1002/lvs1003
17:51 paravoid: replacing asw-c2-eqiad
17:57 elukey: boostrapping aqs1007-a cassandra instance
18:14 paravoid: rebooting newly provisioned asw-c2-eqiad to enable mixed mode
18:15 jynus@tin: Synchronized wmf-config/db-eqiad.php: Depool db1055, 56, 57, 59 (duration: 00m 54s)
18:32 paravoid: starting pybal on lvs1001/lvs1002/lvs1003

Conclusions

More research is needed to understand why the issue happened, how the MediaWiki model works, and whether it has a bug in this particular scenario.

Actionables

task T156475