Incidents/2018-09-18 train


Covers deployment of 1.32-wmf.22 (task T191068).

Summary

The 1.32-wmf.22 deployment went mostly fine, though it left a few post-train actionables.

Timeline

  • task T204669 Slow access to Special:Contributions on mediawiki.org
    • Requests to https://www.mediawiki.org/wiki/Special:Contributions were terribly slow. The DBAs ruled out any issue with the database. Brad Jorsch pointed to the actor storage refactoring happening in MediaWiki. It had been enabled on Monday for group0, a small set of testing wikis which includes mediawiki.org. A few related issues were filed and the feature flag was promptly disabled. Actionables are now child tasks of task T188327 Deploy refactored actor storage.
  • Echo and Translate caused some PHP notices, apparently without much user impact. They were quickly hotfixed by the code owners.
  • task T204757 OAuth had a fatal error due to a misnamed method in a refactoring patch. Gergő Tisza had a patch ready before I had finished writing up the task!
  • task T204961, a known issue. Upon deployment, some requests made to ORES time out.
  • task T204871 Each scap action triggers a spike of "web request took longer than 60 seconds and timed out" errors. Requests taking longer than usual have probably always happened, but previously there was no timeout on web requests (enabled since task T97192#4561879).
  • task T204907 scap was checking canaries from the dormant datacenter (eqiad) instead of the active one (codfw). The list of hosts is hardcoded in puppet and was not changed during the switchover.
  • Canaries would not catch the "web request took longer than 60 seconds" errors, which happen after scap's canary check window of 20 seconds.
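
The last point is a matter of simple timing, sketched below. This is only an illustration using the two durations quoted above; the function name is hypothetical and is not part of scap.

```python
# Durations taken from the report; everything else here is illustrative.
CANARY_WINDOW_SECONDS = 20    # scap canary check window
REQUEST_TIMEOUT_SECONDS = 60  # web request timeout


def visible_to_canary_check(seconds_after_deploy, window=CANARY_WINDOW_SECONDS):
    """Return True if an error logged this long after the deploy
    falls inside the canary check window."""
    return seconds_after_deploy <= window


# A request that starts right at deploy time and runs into the timeout
# only logs its error REQUEST_TIMEOUT_SECONDS later, so that is the
# earliest moment a timeout error can show up.
earliest_timeout_error = REQUEST_TIMEOUT_SECONDS
print(visible_to_canary_check(earliest_timeout_error))
```

Under these assumptions, a timeout error cannot appear before the window has already closed, so it takes either a longer window or a different signal to catch it.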

Conclusions

  • Introducing new features behind a feature flag, or limiting them to a group of wikis, is priceless.
  • We now have a proper web request timeout, which highlights an issue we have probably had for ages.
  • Adding forceprofile=1 to a wiki URL is a fast way to pinpoint slow PHP code.
  • PHP notices are now reported in logstash at error level (thanks to Timo). They come with a stack trace and context, which is far better than the hhvm logbucket that simply relays hhvm stdout/stderr without much information.
  • The train is so automated that Antoine did not even notice the canary checks were pointing at the wrong hosts.
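
As a small illustration of the forceprofile=1 tip, the sketch below appends the parameter to a wiki URL using only the Python standard library. The helper name is hypothetical, and what the profiling output looks like is not covered here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_forceprofile(url):
    """Return url with forceprofile=1 set in its query string,
    keeping any other query parameters already present."""
    parts = urlsplit(url)
    # Drop an existing forceprofile value so the parameter appears once.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "forceprofile"
    ]
    query.append(("forceprofile", "1"))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_forceprofile("https://www.mediawiki.org/wiki/Special:Contributions"))
# https://www.mediawiki.org/wiki/Special:Contributions?forceprofile=1
```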

Links to relevant documentation

Actionables

  • phab:T204871 Deployments of MediaWiki with scap cause a spam of "web request took longer than 60 seconds and timed out"
    • To be raised at Scrum of Scrums for investigation.
  • phab:T204907 Scap is checking canary servers in dormant instead of active-dc
    • Switchover process needs an update. Long term the list of canaries should be in conftool.
  • phab:T204961 ORES requests for wikidatawiki models=damaging end up with HTTP request timed out
    • Acknowledged by Amir Sarabadani
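
For T204907, the idea of deriving the canary list from the active datacenter rather than hardcoding it could look roughly like the sketch below. It is a minimal illustration only: the mapping, the host names and the function are hypothetical, and it does not use the conftool or scap APIs.

```python
# Hypothetical per-datacenter canary lists; the host names are placeholders.
CANARIES_BY_DC = {
    "eqiad": ["canary1.eqiad.example", "canary2.eqiad.example"],
    "codfw": ["canary1.codfw.example", "canary2.codfw.example"],
}


def canaries_for(active_dc, canaries_by_dc=CANARIES_BY_DC):
    """Return the canary hosts of the active datacenter.

    Fails loudly when the datacenter is unknown instead of silently
    falling back to another list.
    """
    try:
        return list(canaries_by_dc[active_dc])
    except KeyError:
        raise ValueError(
            f"no canaries configured for datacenter {active_dc!r}"
        ) from None


# During the switchover described above, codfw was the active datacenter.
print(canaries_for("codfw"))
```

With a lookup like this, a datacenter switchover only has to change one value, the active datacenter, and the canary checks follow it.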