Incidents/20190606-CirrusSearch-MoreLike

Summary

After the train deployment of 1.34-wmf.8, HTTP 5xx reqs/min started to rise. This affected the “morelike” endpoints (responsible for the “related article”/“read more” section in the mobile apps and on mobile web).

Impact

Readers of the mobile web Wikipedias and users of the mobile apps (Android/iOS) stopped seeing the “related article”/“read more” section when viewing pages (it is unclear whether this caused more visible problems).

Detection

By Icinga (HTTP 5xx reqs/min alert)

Timeline

(all times in UTC)

May 30: After the wmf.7 train deploy, we stop receiving the usual volume of morelike queries (a drop from 160M-190M requests/day to 3M; see http://discovery.wmflabs.org/metrics/#morelike_search).
June 3: Task T224879 is filed and a bug in the RelatedArticles extension is fixed, but the backport of the bugfix is not deployed.
May 30 - June 6: RelatedArticles are not shown on mobile web, so the various cache layers gradually empty.
June 6: After wmf.8 is deployed, the RelatedArticles extension works again on English Wikipedia, so morelike API requests from mobile web flow again to CirrusSearch and Elasticsearch. These services are unable to handle the load, causing 5xx errors.
13:48 (incident starts) PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
13:54 dcausse ships https://gerrit.wikimedia.org/r/#/c/513556/ because he believes the pool counter failures are due to the recent activation of extension registration on CirrusSearch, which changes how globals are created (affecting the pool counter key used by Cirrus for English Wikipedia).
14:03 dcausse remembers T224879 but finds no better alternative than waiting for the caches to warm up.
14:29 RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
14:49 Cirrus errors are gone.
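
The alert messages above report the share of datapoints over a threshold (1000.0 for CRITICAL, 250.0 in the recovery message). The following is a minimal sketch of that kind of check, not the actual Icinga plugin: the function names are hypothetical, and the 1% trigger is an assumption inferred from the "Less than 1.00% above the threshold" wording.

```python
def percent_above(samples, threshold):
    """Return the percentage of samples strictly above the threshold."""
    if not samples:
        return 0.0
    over = sum(1 for value in samples if value > threshold)
    return 100.0 * over / len(samples)


def classify(samples, warning=250.0, critical=1000.0, min_percent=1.0):
    """Map a window of 5xx reqs/min samples to OK, WARNING or CRITICAL.

    The thresholds mirror the values quoted in the alert text; min_percent
    is an assumed trigger level, not a confirmed configuration value.
    """
    if percent_above(samples, critical) >= min_percent:
        return "CRITICAL"
    if percent_above(samples, warning) >= min_percent:
        return "WARNING"
    return "OK"


# Illustrative window: one datapoint out of nine above 1000 reqs/min.
window = [100.0] * 8 + [1200.0]
status = classify(window)
```

With one datapoint out of nine above the critical threshold, the share is 1/9, which matches the 11.11% figure in the 13:48 alert.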

Conclusions

Careful warmup is necessary when activating the RelatedArticles feature. The system should have detected that this feature was broken; dcausse noticed the problem by looking at the search dashboards, but by then it was already too late.


What went well?

The PoolCounter prevented the Elasticsearch cluster in eqiad from falling over, so normal search could still be served (only latencies were affected).

What went poorly?

There was confusion because CirrusSearch extension registration was activated in this version, causing logspam.
Warming up the caches was too slow (50 minutes).
One Elasticsearch cluster was struggling; we could have used the Elasticsearch cluster in codfw to speed up the warmup.

Where did we get lucky?

dcausse remembered task T224879. Because the outage was caused by a bugfix (a feature being reactivated), it could have caused much more confusion had that task not been known.

Links to relevant documentation

Where is the documentation that someone responding to this alert should have (runbook, plus supporting docs). If that documentation does not exist, there should be an action item to create it.

Actionables

Status: TODO T225225: Create a browser test for this feature to detect problems with it earlier.
Status: TODO Monitor and alert when morelike API usage goes down (TODO create task)
Status: TODO T185473 (nice to have): Create a dedicated API endpoint for morelike queries
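
As an illustration of the "monitor and alert when morelike API usage goes down" actionable, here is a minimal sketch of a drop check. It assumes daily morelike request counts are available from some metrics source; the function name, the median baseline and the 50% ratio are hypothetical choices, not an existing alert definition.

```python
from statistics import median


def usage_dropped(today_count, baseline_counts, min_ratio=0.5):
    """Return True when today's count falls below min_ratio of the baseline.

    baseline_counts holds daily request counts from a recent healthy period;
    the median is used so that a single odd day does not skew the baseline.
    """
    if not baseline_counts:
        return False  # no baseline yet, so no basis for alerting
    return today_count < min_ratio * median(baseline_counts)


# Figures from the timeline: 160M-190M requests/day normally, 3M after wmf.7.
healthy_days = [160_000_000, 175_000_000, 190_000_000]
alert = usage_dropped(3_000_000, healthy_days)
```

A check of this shape would have fired on May 30, a week before the cache layers emptied and the traffic returned.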