Incidents/20190327-elasticsearch

Summary

Multiple issues were encountered during the upgrade of Elasticsearch / Cirrus from 5.6.4 to 6.5.4. All of these issues are reported here.

  1. A new feature in ES6 puts indices into read-only mode when disk space is low; this led to higher-than-usual update lag
  2. The mechanism used to freeze updates failed in an unexpected way, leading to a longer-than-expected read-only period and some update lag

For context, the ES6 upgrade process (simplified) is as follows (the detailed process is documented in its cookbook; a minimal sketch of the loop follows the list):

  • for each group of 3 servers:
    • freeze writes
    • depool
    • upgrade elasticsearch and plugins
    • pool
    • unfreeze writes
    • wait for cluster to recover
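
The Python sketch below restates that loop. The helper callables are hypothetical placeholders, not the cookbook's actual functions; the real procedure lives in the spicerack cookbook and also handles health checks, retries and error handling.

    # Minimal sketch of the per-group upgrade loop described above.
    # The helper callables are hypothetical placeholders; the real
    # procedure is the spicerack cookbook.
    from typing import Callable, Iterable, Sequence

    def upgrade_by_group(
        groups: Iterable[Sequence[str]],
        freeze_writes: Callable[[], None],
        thaw_writes: Callable[[], None],
        depool: Callable[[Sequence[str]], None],
        pool: Callable[[Sequence[str]], None],
        upgrade: Callable[[str], None],
        wait_for_recovery: Callable[[], None],
    ) -> None:
        for group in groups:      # each group is 3 servers
            freeze_writes()       # create the freeze-everything document
            depool(group)
            for server in group:
                upgrade(server)   # upgrade elasticsearch and its plugins
            pool(group)
            thaw_writes()         # delete the freeze-everything document
            wait_for_recovery()   # wait for the cluster to recover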

Writes to an Elasticsearch cluster are frozen by creating a freeze-everything document in the mw_cirrus_metastore index, and thawed by deleting it.
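
As an illustration of that mechanism, here is a minimal Python sketch using the plain document API via requests. Only the index name (mw_cirrus_metastore) and the document id (freeze-everything) come from this report; the endpoint, the _doc type and the document body are assumptions, and the real tooling goes through spicerack.

    # Minimal sketch of freezing/thawing writes, NOT the actual tooling.
    # Only the index name (mw_cirrus_metastore) and the document id
    # (freeze-everything) come from this report; the endpoint, the _doc
    # type and the document body are assumptions.
    import requests

    ES = "http://localhost:9200"  # assumed cluster endpoint
    FREEZE_DOC = f"{ES}/mw_cirrus_metastore/_doc/freeze-everything"

    def freeze_writes() -> None:
        """Pause CirrusSearch updates by creating the freeze document."""
        requests.put(FREEZE_DOC, json={"reason": "cluster maintenance"}).raise_for_status()

    def thaw_writes() -> None:
        """Resume CirrusSearch updates by deleting the freeze document."""
        requests.delete(FREEZE_DOC).raise_for_status()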

Impact

  • Wikidata saw higher-than-usual update lag from around 2019-03-26 18:00 to around 2019-03-28 08:00. We don't measure the actual lag, but judging from the shape of the consumer lag, the maximum lag was probably under 18 hours.

Note that updates to the search cluster are asynchronous by design. For most wikis, updates are mainly incremental and high update lag is not perceived by users. In normal operation the lag is a few minutes; update lag of multiple hours is expected, especially during maintenance operations. Wikidata has workflows that make this lag more apparent than on other wikis.

Detection

  • frozen updates were detected by an Icinga check
  • read-only indices were not detected by any automated check, but were reported by Addshore

Timeline

This is a step-by-step outline of what happened to cause the incident and how it was remedied.

All times in UTC.

2019/03/26

  • 12:58: start of ES6 upgrade on codfw
  • 18:05: thawing writes fails with an HTTP 404
  • 18:10: manual checks show that writes are thawed anyway (we now know that this was not actually the case)
  • 21:05: Icinga raises an alert about frozen writes: "PROBLEM - ElasticSearch health check for frozen writes - 9643 on search.svc.codfw.wmnet is CRITICAL"
  • 21:07: first Icinga recovery for above page
  • approx 21:30: several attempts are made to delete the freeze document; each time, manually deleting the "freeze-everything" document leads to it reappearing
  • 22:01: final Icinga page for frozen writes critical
  • 22:03: final Icinga recovery notice for frozen writes
  • 22:12: re-creating and then deleting the "freeze-everything" document deletes it for good (this was done from a python3 REPL with the same spicerack code as used by the cookbook; sketched below)
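
For reference, the 22:12 workaround boils down to the sequence below, sketched against the same assumed requests-based endpoint as above; the actual fix was run from a python3 REPL with the spicerack code.

    # Sketch of the 22:12 workaround: re-create the stuck freeze document,
    # then delete it, after which the deletion finally sticks. The endpoint
    # and the document body are assumptions.
    import requests

    FREEZE_DOC = "http://localhost:9200/mw_cirrus_metastore/_doc/freeze-everything"

    requests.put(FREEZE_DOC, json={"reason": "cluster maintenance"}).raise_for_status()
    requests.delete(FREEZE_DOC).raise_for_status()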

2019/03/27

  • 07:44: a node in eqiad reaches 95% disk usage and triggers cluster.routing.allocation.disk.watermark.flood_stage, causing all indices to go read-only (task TP8289)
  • 07:49: elastic1017's disk usage is back to OK, but the indices stay read-only (they must be re-activated manually)
  • 08:27: Wikidata users report problem on-wiki
  • 10:50: addshore pings search platform and creates task T219364
  • 11:06: dcausse fixes the index and cluster settings to make the indices read-write again (see the sketch after this list)
  • 20:14: thawing writes fails with an HTTP 404 after the last group of servers is updated
  • 20:17: writes are thawed manually, following the workaround discovered the previous day
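
Below is a minimal sketch of what the 11:06 fix amounts to: the read-only block that flood_stage puts on indices (index.blocks.read_only_allow_delete) has to be cleared explicitly once disk usage is back under the watermark. The endpoint is an assumption, and the commands actually used may have differed.

    # Sketch of clearing the read-only block left behind by the flood_stage
    # watermark: setting index.blocks.read_only_allow_delete to null removes
    # it from all indices. The endpoint is an assumption.
    import requests

    ES = "http://localhost:9200"  # assumed cluster endpoint

    resp = requests.put(
        f"{ES}/_all/_settings",
        json={"index.blocks.read_only_allow_delete": None},  # null clears the block
    )
    resp.raise_for_status()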

Useful graphs

Conclusions

  • We're missing an alert on update lag.
  • The frozen-writes issue is as yet unexplained. We suspect an issue related to the mixed cluster state (both 5.6.4 and 6.5.4 deployed at the same time), but can't verify it. It is unlikely that the same issue will occur during the next version upgrade (though we will probably have a different one).


What went well?

  • automated monitoring worked for frozen writes
  • the asynchronous update pipeline is robust; this incident had a fairly low impact (except for Wikidata)

What went poorly?

  • The documentation associated with the frozen indexing alert is out of date -- Search#Pausing Indexing needs an update (DONE).
  • It is unclear why Elasticsearch's replication was not working (possibly because of the cluster's major version skew?). It is unlikely that we'll get a definitive answer to this question.

Where did we get lucky?

  • thanks to addshore for realizing we had update lag on Wikidata!
  • No errors served, just possible staleness in the index (which is part luck, part good design)

Links to relevant documentation

Actionables