Summary

After the train deployment of 1.34-wmf.8, HTTP 5xx reqs/min started to rise. This affected the “morelike” endpoints, which power the “related article”/“read more” section in the mobile apps and on mobile web.
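
For context, a morelike query can be issued through the MediaWiki search API using the CirrusSearch `morelike:` keyword. The following is only a minimal sketch of such a request URL: the helper name `morelike_url`, the default host, and the result limit are assumptions made for illustration, and the exact parameters sent by the apps and mobile web are not stated in this report.

```python
from urllib.parse import urlencode


def morelike_url(title, limit=3, host="en.wikipedia.org"):
    """Build a search API URL asking for pages similar to `title`.

    Hypothetical helper: host and limit are illustrative defaults only.
    """
    params = {
        "action": "query",
        "list": "search",
        "srsearch": f"morelike:{title}",  # CirrusSearch "more like this" keyword
        "srlimit": limit,
        "format": "json",
    }
    return f"https://{host}/w/api.php?{urlencode(params)}"


if __name__ == "__main__":
    print(morelike_url("Albert Einstein"))
```

A URL of this shape could serve as the basis for a manual smoke check of the feature after a deployment.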

Impact

Readers of the mobile web Wikipedias and users of the mobile apps (Android/iOS) stopped seeing the “related article”/“read more” section when viewing pages. It is unclear whether the outage caused more visible problems. TODO: estimate how many queries were lost and how many users were affected.

Detection

Detected by Icinga (HTTP 5xx reqs/min).

Timeline

(all times in UTC)

Conclusions

Careful warmup is necessary when activating the RelatedArticles feature. The system should have detected that this feature was broken; dcausse noticed the problem by looking at the search dashboards, but by then it was already too late.

What went well?

  • PoolCounter prevented the Elasticsearch cluster in eqiad from going down, so normal search could still be served (only latencies were affected).

What went poorly?

  • There was confusion because CirrusSearch extension registration was activated in this version, which caused logspam.
  • Warming up the caches was too slow (it took 50 minutes).
  • Only one Elasticsearch cluster was used and it was struggling; the Elasticsearch cluster in codfw could have been used to speed up the warmup.

Where did we get lucky?

  • dcausse remembered task T224879. Because the outage was caused by a bugfix (a feature being reactivated), it could have caused much more confusion had this not been known.

Links to relevant documentation

Where is the documentation that someone responding to this alert should have (runbook, plus supporting docs). If that documentation does not exist, there should be an action item to create it.

Actionables

  • Status: TODO T225225: Create a browser test for this feature so that problems with it are detected earlier.
  • Status: TODO Monitor and alert when morelike API usage goes down (TODO: create task)
  • Status: TODO T185473 (nice to have): Create a dedicated API endpoint for morelike queries
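
The usage-drop alert proposed above could be sketched as a comparison of the current morelike request rate against a recent baseline. This is a minimal illustration under stated assumptions, not an existing check: the function name `usage_dropped`, the use of a median baseline, and the 50% threshold are all hypothetical, and the real alert would need to be defined against whatever metrics the search dashboards expose.

```python
from statistics import median


def usage_dropped(history, current, min_ratio=0.5):
    """Return True when `current` is below `min_ratio` of the median of `history`.

    `history` is a list of recent requests-per-minute samples (hypothetical
    input); an empty or all-zero history yields no alert, since there is no
    meaningful baseline to compare against.
    """
    if not history:
        return False
    baseline = median(history)
    if baseline <= 0:
        return False
    return current < min_ratio * baseline


if __name__ == "__main__":
    recent = [100, 110, 90, 105]
    print(usage_dropped(recent, 20))
    print(usage_dropped(recent, 95))
```

A median baseline is used here so that a single spike in the history does not shift the threshold; a check like this targets the silent failure mode, where the feature is broken but produces no 5xx errors.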