Incidents/2017-01-26 API Slowdown
- Very much a work in progress; still under heavy research.
On 2017-01-26, from 17:51 to 18:15 (all times UTC), there was a slowdown and an increase in 500 responses on the MediaWiki Action API of Wikimedia wikis. Although scheduled maintenance was ongoing at the time, it should have caused no user-visible impact; the underlying cause is still being researched.
Summary
- A core router had been rebooting/behaving strangely since 12 January (task T155875).
- Impact on DBs and other services was mitigated by moving essential services (e.g. the s1 master) away from the affected rack (task T155875).
- Maintenance on the router started at 17:51. MediaWiki should have simply depooled the affected services (DBs) and continued unaffected, as usual, but this did not work, or did not work as expected (a simplified sketch of the intended mechanism follows this list).
- API latency/throughput impact can be seen at:
- https://grafana.wikimedia.org/dashboard/db/api-summary?from=1485431717947&to=1485475144648
- https://grafana.wikimedia.org/dashboard/db/navigation-timing?var-metric=saveTiming&from=1485431717947&to=1485475144648
- https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?from=1485431717947&to=1485475144648
- Manually depooling the affected API servers (see the 18:15 db-eqiad.php sync in the timeline below) resolved the issue.
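The automatic behaviour expected here is, conceptually, weighted replica selection that skips hosts failing a reachability check. The sketch below is a deliberately simplified, hypothetical illustration: the function, hosts, weights, and reachability callback are all invented, and this is not MediaWiki's actual LoadBalancer code.

```php
<?php
// Hypothetical, simplified illustration of weight-based replica selection
// with automatic depooling. Hosts, weights, and the reachability callback
// are invented; this is not MediaWiki's actual LoadBalancer code.

/**
 * Pick a replica at random, proportionally to its weight, skipping any
 * host that fails the reachability check. Returns null if every
 * candidate is down.
 */
function pickReplica( array $loads, callable $isReachable ): ?string {
	// Zero-weight hosts are already (manually) depooled.
	$candidates = array_filter( $loads, fn ( $w ) => $w > 0 );

	while ( $candidates ) {
		// Weighted random pick among the remaining candidates.
		$r = mt_rand( 1, array_sum( $candidates ) );
		foreach ( $candidates as $host => $weight ) {
			$r -= $weight;
			if ( $r <= 0 ) {
				if ( $isReachable( $host ) ) {
					return $host;
				}
				// Automatic depool: drop the dead host and retry.
				unset( $candidates[$host] );
				break;
			}
		}
	}
	return null;
}

// Example: pretend db1055 sits behind the switch being replaced.
$loads = [ 'db1055' => 100, 'db1056' => 100, 'db1057' => 50 ];
$picked = pickReplica( $loads, fn ( string $h ) => $h !== 'db1055' );
echo 'Selected replica: ' . ( $picked ?? 'none available' ) . "\n";
```

In production the weights live in wmf-config/db-eqiad.php and failure handling is considerably more involved; the sketch only illustrates why an unreachable replica should, in principle, stop receiving traffic.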
Timeline
- 17:46 paravoid: stopping pybal on lvs1001/lvs1002/lvs1003
- 17:51 paravoid: replacing asw-c2-eqiad
- 17:57 elukey: bootstrapping aqs1007-a cassandra instance
- 18:14 paravoid: rebooting newly provisioned asw-c2-eqiad to enable mixed mode
- 18:15 jynus@tin: Synchronized wmf-config/db-eqiad.php: Depool db1055, 56, 57, 59 (duration: 00m 54s) (a sketch of this change follows the timeline)
- 18:32 paravoid: starting pybal on lvs1001/lvs1002/lvs1003
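For context on the 18:15 entry: a manual depool is an edit to wmf-config/db-eqiad.php that zeroes the read weight of the affected replicas, followed by a sync to the application servers. The fragment below is a minimal sketch assuming a weight-map layout; the variable name, section key, and prior weights are assumptions for illustration, not the file's actual contents.

```php
<?php
// Minimal sketch of the manual depool synced at 18:15. The variable name,
// section key, and prior weights are assumptions for illustration only.
$sectionLoads = [
	's1' => [
		// ...other s1 replicas keep their existing weights...
		'db1055' => 0, // weight 0 = no new read traffic
		'db1056' => 0,
		'db1057' => 0,
		'db1059' => 0,
	],
];
```

Since this configuration is read on every request, the change takes effect as soon as the file has been synced to the application servers; no service restart is needed.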
Conclusions
More research is needed to understand why the issue happened, how the MediaWiki depooling model works, and whether that model has a bug in this particular scenario.