Heterogeneous deployment/Train deploys
Weekly steps
Monday: Sync up with your deployment partner
As of October 2019, two people are assigned to each week's train: one as primary and one as backup. These are rough guidelines for sharing the work, and should be improved as we learn more.
- On Monday, communicate with your partner and establish how you'll collaborate over the course of the week.
- Posting updates on IRC while your partner is working, and on the train blocker ticket when they are offline, seems to be a useful pattern.
- Liberal use of video chat for pairing on hard problems is encouraged.
- It seems to work well to have the primary do the work of cutting the branch, syncing wikis, etc., while the backup keeps an eye on logs, works on improvements to deploy tooling, and is generally an extra pair of eyes for the whole process.
- If you are in doubt about any part of the process and it's during your partner's working hours, consult them first and get their help in resolving your questions.
- If one member of the pair is in the European window and one is in the American window, both train deployment windows should be reserved on the Deployments calendar. This gives a backup deployer a defined window for moving the train forward outside the primary's working hours, if it becomes necessary.
- If the train is blocked or there are any other issues, communicate the transfer of responsibility on the train blocker ticket by assigning it to the responsible party and leaving a note.
Tuesday: New branch creation and deploy
Before the deploy window
All pre-deploy steps have been automated.
- Branch cut happens on releases-jenkins. (Note: the link to the branch cut job will report "Not Found" until you log in to releases-jenkins.) The changes that are part of a given branch can be found on the corresponding change log page on mediawiki.org.
- scap stage-train auto is run by a cron job.
Refer to #Troubleshooting_automated_jobs if something goes wrong.
During the deploy window
Step | Action | Where | Details
---|---|---|---
0-0 | Create and auto-merge/deploy the group0 patch | deploy1002 | Run scap train. At the prompt, select the index corresponding to group 0 ([2]) and press enter. Now wait for scap to finish the deployment. An example session follows the table.
0-1 | Verify production has indeed switched | MediaWiki.org | Verify that mediawikiwiki has switched to the new version (Installed software, Product: MediaWiki, Version: VERSION)
0-2 | Monitor production logs | logstash etc. | Monitor IRC and logstash and/or logspam-watch for problems, see #Places to Watch for Breakage
0-3 | Update roadmap page | mw:MediaWiki 1.43/Roadmap | Change the Deployed to group (if you're using VisualEditor) or the 3rd parameter of the WMFReleaseTableRow template (if you're using the wikitext editor) to 0 (deployed to group0). Example wikitext follows the table.

Example session for step 0-0:
USERNAME@deploy1002:/srv/mediawiki-staging/$ scap train
____
|DD|_____T_
|_ |wmf.26|<
@-@-@-oo
=========================================================================
START  testwikis      group0         group1         group2
       1.41.0-wmf.26  1.41.0-wmf.25  1.41.0-wmf.25  1.41.0-wmf.25
[0]    [1]            [2]            [3]            [4]
What station do you want the train to be at (0-4)?

Example wikitext for step 0-3:
{{WMFReleaseTableHead}}
{{WMFReleaseTableRow|12|2018-07-10|0}}
Wednesday: group0 to group1 deploy
Meta / coordination
- Attend the Train Log Triage meeting with members of the Core Platform Team and others.
Step | Action | Where | Details
---|---|---|---
1-0 | Create and auto-merge/deploy the group1 patch | deploy1002 | Run scap train. At the prompt, select the index corresponding to group 1 ([3]) and press enter. Now wait for scap to finish the deployment. An example session follows the table.
1-1 | Verify production has indeed switched | English Wiktionary | Verify that the English Wiktionary (and other group1 wikis) have switched to the new version (Installed software, Product: MediaWiki, Version: VERSION)
1-2 | Monitor production logs | logstash etc. | Monitor IRC and logstash and/or logspam-watch for problems, see #Places to Watch for Breakage
1-3 | Update roadmap page | mw:MediaWiki 1.43/Roadmap | Change the Deployed to group (if you're using VisualEditor) or the 3rd parameter of the WMFReleaseTableRow template (if you're using the wikitext editor) to 1 (deployed to group1). Example wikitext follows the table.

Example session for step 1-0:
USERNAME@deploy1002:/srv/mediawiki-staging/$ scap train
____
|DD|_____T_
|_ |wmf.26|<
@-@-@-oo
=========================================================================
START  testwikis      group0         group1         group2
       1.41.0-wmf.26  1.41.0-wmf.26  1.41.0-wmf.25  1.41.0-wmf.25
[0]    [1]            [2]            [3]            [4]
What station do you want the train to be at (0-4)?

Example wikitext for step 1-3:
{{WMFReleaseTableHead}}
{{WMFReleaseTableRow|12|2018-07-10|1}}
...
{{WMFReleaseTableFooter}}
Thursday: group{0,1} to all deploy
Step | Action | Where | Details
---|---|---|---
2-0 | Create and auto-merge/deploy the group2 patch | deploy1002 | Run scap train. At the prompt, select the index corresponding to group 2 ([4]) and press enter. Now wait for scap to finish the deployment. An example session follows the table.
2-1 | Verify production has indeed switched | English Wikipedia | Verify that the English Wikipedia (and other group2 wikis) have switched to the new version (Installed software, Product: MediaWiki, Version: VERSION)
2-2 | Monitor production logs | logstash etc. | Monitor IRC and logstash and/or logspam-watch for problems, see #Places to Watch for Breakage
2-3 | Update roadmap page | mw:MediaWiki 1.43/Roadmap | Change the Deployed to group (if you're using VisualEditor) or the 3rd parameter of the WMFReleaseTableRow template (if you're using the wikitext editor) to 2 (deployed to all). Example wikitext follows the table.

Example session for step 2-0:
USERNAME@deploy1002:/srv/mediawiki-staging/$ scap train
____
|DD|_____T_
|_ |wmf.26|<
@-@-@-oo
=========================================================================
START  testwikis      group0         group1         group2
       1.41.0-wmf.26  1.41.0-wmf.26  1.41.0-wmf.26  1.41.0-wmf.25
[0]    [1]            [2]            [3]            [4]
What station do you want the train to be at (0-4)?

Example wikitext for step 2-3:
{{WMFReleaseTableHead}}
{{WMFReleaseTableRow|12|2018-07-10|2}}
...
{{WMFReleaseTableFooter}}
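The station numbers used across the three deploy days follow directly from the order in which the scap train prompt lists them. As a memory aid, here is a minimal Python sketch of that mapping. The names STATIONS, WEEKDAY_TARGET, station_index and prompt_answer_for are hypothetical and not part of scap; the order is taken from the example sessions above and should be checked against the prompt you actually see.

```python
# Station order as printed by the scap train prompt in the example sessions.
# Assumption: the live prompt lists stations in this same order.
STATIONS = ["START", "testwikis", "group0", "group1", "group2"]

# Which station each deploy day moves the train to, per the weekly steps.
WEEKDAY_TARGET = {"Tuesday": "group0", "Wednesday": "group1", "Thursday": "group2"}


def station_index(station: str) -> int:
    """Return the number to enter at the 'What station ...' prompt."""
    if station not in STATIONS:
        raise ValueError(f"unknown station: {station!r}")
    return STATIONS.index(station)


def prompt_answer_for(weekday: str) -> int:
    """Return the prompt answer for a deploy day such as 'Tuesday'."""
    return station_index(WEEKDAY_TARGET[weekday])
```

For example, prompt_answer_for("Tuesday") gives the index for group0, matching the [2] selected in step 0-0.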
Breakage
There will be times when this process does not go smoothly. There are guidelines for what to do when that happens.
In general, if an unexplained error occurs within 1 hour of a train deployment, always roll back the train. Rolling back is especially important when there are many other possible causes in play, because it eliminates the train as one of them.
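The rule of thumb above can be written down as a small check. This is only an illustrative Python sketch: the function name should_roll_back and its parameters are hypothetical, and treating "within 1 hour" as an inclusive bound is an assumption. It does not replace the deployer's judgement.

```python
from datetime import datetime, timedelta

# The guideline's window: errors within 1 hour of a train deployment.
ROLLBACK_WINDOW = timedelta(hours=1)


def should_roll_back(deployed_at: datetime, error_seen_at: datetime,
                     explained: bool) -> bool:
    """Return True when the guideline says to roll back the train.

    Assumption: an error exactly 1 hour after the deploy still counts
    as "within 1 hour". Explained errors are outside this rule.
    """
    if explained:
        return False
    elapsed = error_seen_at - deployed_at
    return timedelta(0) <= elapsed <= ROLLBACK_WINDOW
```

An unexplained error 30 minutes after the deploy returns True; the same error 90 minutes later returns False, which only means this particular rule no longer forces a rollback.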
Rollback
Rolling back a wikiversions change should be quick. Roll back production before you send patches up to Gerrit, since waiting on Jenkins may take a while:
USERNAME@deploy1002:/srv/mediawiki-staging$ git revert $(git log -1 --format=%H -- wikiversions.json)
USERNAME@deploy1002:/srv/mediawiki-staging$ scap sync-wikiversions 'Revert "group[0|1] wikis to [VERSION]"'
# Now that you've synced the revert, push the patch up to Gerrit. First run git commit --amend to get the Change-Id
# Ideally, you should also add the train blocker task id to the Bug: field for this commit
USERNAME@deploy1002:/srv/mediawiki-staging$ git commit --amend --no-edit
# [VERSION] below is the new version, e.g.: 1.43.0-wmf.6
USERNAME@deploy1002:/srv/mediawiki-staging$ git push origin HEAD:refs/for/master%topic=[VERSION],l=Code-Review+2
Example:
USERNAME@deploy1002:/srv/mediawiki-staging$ git push origin HEAD:refs/for/master%topic=1.34.0-wmf.0,l=Code-Review+2
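Because the push target embeds the version in the topic, a typo there is easy to make. The following Python sketch builds the argument list for the push shown above. The helper name gerrit_push_args and the VERSION_RE pattern are hypothetical; the pattern is an assumption inferred from example versions such as 1.43.0-wmf.6.

```python
import re

# Assumed shape of a train version string, inferred from the examples
# in this section (e.g. 1.43.0-wmf.6); adjust if the scheme differs.
VERSION_RE = re.compile(r"^\d+\.\d+\.\d+-wmf\.\d+$")


def gerrit_push_args(version: str) -> list:
    """Return the git argument list for pushing the revert to Gerrit."""
    if not VERSION_RE.match(version):
        raise ValueError(f"does not look like a train version: {version!r}")
    # Same refspec as the command above: topic set to the version,
    # with a Code-Review+2 label applied on push.
    refspec = f"HEAD:refs/for/master%topic={version},l=Code-Review+2"
    return ["git", "push", "origin", refspec]
```

The returned list can be joined with spaces to compare against the example command, or passed to subprocess.run from the staging directory.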
Alternatively, if the rollback doesn't need to happen immediately and you can afford a few minutes, you can simply run the scap train command again to go back to a previous stage (however when deploying to group0 this would also make test servers go back!):
USERNAME@deploy1002:/srv/mediawiki-staging$ scap train
- Wait for the patch to merge, then fetch it back down to the deployment server
- #Update roadmap.
Troubleshoot Kubernetes deployment
To get events for the service mw-api-ext on eqiad:
kube-env mw-api-ext eqiad
kubectl get events
See Kubernetes/Troubleshooting#Troubleshooting_a_deployment.
Places to Watch for Breakage
Train deployers are effectively the first line of defense for train deploys, so they should check for breakage while rolling out the train.
Given limited resources, it is not possible to monitor every dashboard during the train. A limited set of signals is actively monitored, and a much larger set may be consulted when needed.
See MediaWiki_Engineering/Guides/Monitor_production_errors for a detailed breakdown of the log triage process.
Places we monitor
These are the places Release Engineering actively monitors during the train.
- IRC
- Primary channel is #wikimedia-operations. This is where official deployment communications happen, alerts are broadcast, etc.
- For more channels see MediaWiki on IRC and IRC/Channels
- Logs
- Current mwlog (mwlog1001 or mwlog2002, depending on primary datacenter):
- logspam-watch
- Logfiles can be found in /srv/mw-log
- Logstash
- mediawiki-errors dashboard gives the full firehose of almost all errors
- MediaWiki New Errors ECS is a workboard with known issues filtered out, useful for surfacing new breakage
- See the Wikimedia-production-error workboard for known issues
- Grafana
Other places to look
These links are not actively monitored by Release Engineering, but may be useful for troubleshooting and investigation of problems with the train.
- Logstash mw-client-errors dashboard
- New errors appearing more than 1000 times in a 12 hour period should be considered blockers
- See also Grafana dashboard with summary of average error rate over time
- Grafana
- Varnish http-errors dashboard (HTTP 5XX % should have 3+ 0s after the decimal point, e.g. 0.0001%)
- Frontend Responses NGINX vs Varnish
- Production Logging
- Minerva Client Errors - Browser JS errors count (only wikipedias on mobile)
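Two of the signals above come with numeric thresholds: new client errors seen more than 1000 times in a 12 hour period, and an HTTP 5XX percentage with at least three zeros after the decimal point. A minimal Python sketch of both checks follows. The function names are hypothetical, and reading "3+ 0s after the decimal point" as "strictly below 0.001%" is an interpretation, not an official definition.

```python
# Thresholds quoted in the list above.
BLOCKER_COUNT = 1000        # occurrences of a new error
BLOCKER_WINDOW_HOURS = 12   # period over which occurrences are counted


def is_client_error_blocker(count_in_12h: int) -> bool:
    """True if a new client error appeared more than 1000 times in 12 hours."""
    return count_in_12h > BLOCKER_COUNT


def five_xx_rate_ok(percent: float) -> bool:
    """True if the HTTP 5XX percentage has 3+ zeros after the decimal point.

    Assumption: that phrase is read as 0 <= percent < 0.001,
    so 0.0001 passes and 0.001 does not.
    """
    return 0 <= percent < 0.001
```

The count passed to is_client_error_blocker is expected to already cover the 12 hour window; the sketch does not query Logstash or Grafana.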
If the train is blocked
- A task will be assigned to you, for example T191059 (1.32.0-wmf.13 deployment blockers) (you can see that week's task at https://train-blockers.toolforge.org)
- Any open subtasks block the train from moving forward. This means no further deployments until the blockers are resolved.
Checklist
If there are blocking tasks, please do the following:
- Make sure all tasks blocking the train are set to UBN! priority in Phabricator
- Comment on the task asking for an ETA or whether this can be solved by reverting a recent commit.
- Send e-mail to:
- ops@lists.wikimedia.org
- wikitech-l@lists.wikimedia.org
- Ping private #engineering-all Slack channel
- Subject: [Train] {version} status update
- Body:
  The {version} version of MediaWiki is blocked[0].

  The new version is deployed to {group(s){0,1,2}}[1], but can proceed no further until these issues are resolved:

  * {Phab task name} - {phab task link}

  Once these issues are resolved train can resume. If these issues are resolved on a Friday the train will resume Monday.

  Thank you for your help resolving these issues!

  -- Your humble train toiler

  [0]. <{link to phab task for train}>
  [1]. <https://versions.toolforge.org/>
- Add relevant people (see Developers/Maintainers) to the blocking task
- Ping relevant people in IRC
- Once train is unblocked be sure to thank the folks who helped unblock it
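The status email in the checklist above is a fixed template with a few placeholders, so it can be filled in mechanically. Here is a hedged Python sketch that does so. The function train_status_email and its parameter names are hypothetical, the paragraph breaks are an assumption about how the body is laid out, and the sketch only builds the text without sending anything.

```python
def train_status_email(version, deployed_to, blockers, train_task_url):
    """Return (subject, body) for the blocked-train status email.

    blockers is a list of (task_name, task_link) pairs; deployed_to is
    free text such as "group0" or "group0 and group1".
    """
    subject = f"[Train] {version} status update"
    blocker_lines = "\n".join(f"* {name} - {link}" for name, link in blockers)
    body = (
        f"The {version} version of MediaWiki is blocked[0].\n\n"
        f"The new version is deployed to {deployed_to}[1], but can proceed no "
        "further until these issues are resolved:\n\n"
        f"{blocker_lines}\n\n"
        "Once these issues are resolved train can resume. If these issues are "
        "resolved on a Friday the train will resume Monday.\n\n"
        "Thank you for your help resolving these issues!\n\n"
        "-- Your humble train toiler\n\n"
        f"[0]. <{train_task_url}>\n"
        "[1]. <https://versions.toolforge.org/>\n"
    )
    return subject, body
```

Calling it with a version, the groups reached so far, the list of blocking tasks and the train blocker task link yields text that can be pasted into a mail client and reviewed before sending.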
Troubleshooting automated jobs
What you're seeing | Likely problem | How to fix it
---|---|---
You received an email that indicates the automated branch cut job has failed. | The job has failed. | Follow the link in the email to the failed build. Inspect the console and continue below to troubleshoot.
The failed build console includes the message <url> was rejected by a test failure | The branch-cut change for mediawiki/core has failed in CI. | Follow the link to the change in Gerrit. Remove any existing +2 vote and re-vote +2 to trigger gate-and-submit. If the change is merged, all is well (but you should report the flaky behavior). If it fails again, continue below to troubleshoot.
The branch-cut change has failed in CI again (above). | This is a real test failure. | Yell for help from developers in Slack (#engineering-all) and/or on IRC (#wikimedia-releng ?). After a fix has been merged into the mainline branch and backported to the version branch, click rebuild last in Jenkins to rerun the branch-cut job.
You received an email with subject line FAIL: train-presync | The systemd timer that runs scap stage-train auto has failed. | Continue below to troubleshoot.
The email contains .gitmodules does not exist. Did the train branch commit get merged? | The automated branch cut job has failed. | Head to the top of this table and troubleshoot the branch cut failure. Once you've solved the issue, re-run scap stage-train --yes auto on the deployment server.
The email contains ERROR: git am: error: Failed to merge in the changes | Security patches have failed to apply cleanly. | Ping the Phabricator task for the security patch and ask for a rebase. Once they've resolved the issue, re-run scap stage-train --yes auto on the deployment server. This command will check out the code on the deployment server and deploy to test wikis.
The email contains ssh: connect to host <host> port 22: Connection timed out | ? | ?
The email contains error: insufficient permission for adding an object to repository database .git/objects | ? | ?
Something else. | ??? | Get help from your backup conductor and fellow RelEngineers to troubleshoot the failure. Once you have solved the issue, be sure to update this section with: what you saw, the root problem, how you fixed it.
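The rows of the table above that have a known cause amount to a lookup from a message fragment to a likely problem. A minimal Python sketch of that lookup follows. The names KNOWN_FAILURES and likely_problem are hypothetical, and the two rows whose cause is still listed as "?" are deliberately left out, so they fall through to None like any other unrecognised message.

```python
# (message fragment, likely problem) pairs copied from the table rows
# that have a documented cause.
KNOWN_FAILURES = [
    ("was rejected by a test failure",
     "The branch-cut change for mediawiki/core has failed in CI."),
    (".gitmodules does not exist",
     "The automated branch cut job has failed."),
    ("ERROR: git am: error: Failed to merge in the changes",
     "Security patches have failed to apply cleanly."),
]


def likely_problem(message: str):
    """Return the documented likely problem, or None if nothing matches."""
    for fragment, problem in KNOWN_FAILURES:
        if fragment in message:
            return problem
    return None
```

A None result corresponds to the "Something else" row: get help, and add the new case to the table once the root problem is known.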
Incident documentation
- If there were problems during the train, follow instructions at Incident documentation on incident reports and post-mortem review.
- Use the Create report form to create a new page, train-[VERSION]. Example: Incident documentation/20181212-Train-1.33.0-wmf.8.
- For the Timeline section, events from SAL and the Phabricator task are a good start.
See also
- For information about the current status of the versions deployed to the various wikis, see https://versions.toolforge.org/