Release Engineering/Drafts/Deployments/How to deploy the train

From Wikitech


Bring new code in a fast, safe and efficient way!
Deployments

Initial setup

If this is your first time running the train, you need to do some initial configuration. Start by SSHing into deploy1002.eqiad.wmnet.

First, check out the mediawiki/tools/release repo:

USERNAME@deploy1002:~$ git clone https://gerrit.wikimedia.org/r/mediawiki/tools/release

Next, make sure you're using a full-featured Git prompt for Bash, by adding the following to your ~/.profile:

GIT_PS1_SHOWUNTRACKEDFILES=1
GIT_PS1_SHOWDIRTYSTATE=1
GIT_PS1_SHOWUPSTREAM="auto verbose"
. /etc/bash_completion.d/git-prompt
PS1='\u@\h \w$(__git_ps1 " (%s)") \$ '

Breakage

There will be times when this process does not go smoothly. There are guidelines for what to do when that happens.

In general, if an unexplained error occurs within 1 hour of a train deployment, always roll back the train. This matters most when several possible causes of breakage are in play at once: rolling back eliminates the train as one of them.
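
The one-hour rule can be sketched as a small shell check. This is only an illustration: should_roll_back is a hypothetical helper, not part of scap or the release tools, and it assumes "within 1 hour" means the error was first seen at most 3600 seconds after the deployment.

```shell
# Hypothetical helper (not part of scap or the release tools).
# $1 = Unix timestamp of the train deployment
# $2 = Unix timestamp at which the unexplained error was first seen
# Succeeds when the error appeared within one hour after the deployment.
should_roll_back() {
  elapsed=$(( $2 - $1 ))
  [ "$elapsed" -ge 0 ] && [ "$elapsed" -le 3600 ]
}

# Example with arbitrary timestamps: the error appears 20 minutes after the deploy.
if should_roll_back 1600000000 1600001200; then
  echo "roll back the train"
else
  echo "investigate other causes first"
fi
```

The check only covers the timing part of the guideline. Whether an error counts as "unexplained" remains a human judgment.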

Rollback

Rolling back wikiversions changes should be quick. Roll back production before you send patches to Gerrit, since waiting on CI may take a while:

USERNAME@deploy1002:/srv/mediawiki-staging$ git revert $(git log -1 --format=%H -- wikiversions.json)
USERNAME@deploy1002:/srv/mediawiki-staging$ scap sync-wikiversions 'Revert "group[0|1] wikis to [VERSION]"'
USERNAME@deploy1002:/srv/mediawiki-staging$ # Now that you've synced the revert, push patches to gerrit. You have to run git commit --amend to get the changeid:
USERNAME@deploy1002:/srv/mediawiki-staging$ git commit --amend
USERNAME@deploy1002:/srv/mediawiki-staging$ git push origin HEAD:refs/for/master%topic=[VERSION],l=Code-Review+2

Example:

USERNAME@deploy1002:/srv/mediawiki-staging$ git push origin HEAD:refs/for/master%topic=1.34.0-wmf.0,l=Code-Review+2
  • Wait for the patch to merge, then fetch it back down to the deployment server.
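
To avoid typos in the push target, the refspec from the rollback steps can be built from the version string. The sketch below is illustrative: rollback_refspec is a hypothetical helper, not something scap provides, and it prints the push command for review rather than running it.

```shell
# Hypothetical helper (not part of scap): build the Gerrit refspec used in
# the rollback steps from a version string such as 1.34.0-wmf.0.
rollback_refspec() {
  # %% in the printf format emits a literal % sign.
  printf 'HEAD:refs/for/master%%topic=%s,l=Code-Review+2\n' "$1"
}

# Print the command so it can be checked before it is run by hand.
echo "git push origin $(rollback_refspec 1.34.0-wmf.0)"
```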

Places to Watch for Breakage

Train deployers should check for breakage as they are rolling out the train, as they are effectively the first line of defense for train deploys. Some of the places to watch for breakage:

If the train is blocked

  • A task will be assigned to you, for example T191059 (1.32.0-wmf.13 deployment blockers). You can see that week's task at https://train-blockers.toolforge.org.
  • Any open subtasks block the train from moving forward. This means no further deployments until the blockers are resolved.

Checklist

If there are blocking tasks, please do the following:

  • Make sure all tasks blocking the train are set to Unbreak Now! priority in Phabricator.
  • Comment on the task asking for an ETA, or whether it can be solved by reverting a recent commit.
  • Send e-mail to:
    • ops@lists.wikimedia.org
    • wikitech-l@lists.wikimedia.org
    • Ping private #engineering-all Slack channel
    • Subject: [Train] {version} status update: {brief summary}
    • Body
      The {version} version of MediaWiki is blocked[0].
      
      The new version is deployed to {group(s){0,1,2}}[1], but can proceed no
      further until these issues are resolved:
      
      * {Phab task name} - {phab task link}
      
      Once these issues are resolved train can resume. If these issues are
      resolved on a Friday the train will resume Monday.
      
      Thank you for your help resolving these issues!
      
      -- Your humble train toiler
      
      [0]. <{link to phab task for train}>
      [1]. <https://versions.toolforge.org/>
      
  • Tag relevant teams and people (see Developers/Maintainers) on the blocking task
  • Ping relevant people on IRC
  • Once the train is unblocked, be sure to thank the folks who helped unblock it
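
The subject line of the status e-mail follows a fixed pattern, so it can be filled in mechanically. A minimal sketch, assuming a hypothetical helper named train_subject; the summary text in the example is made up:

```shell
# Hypothetical helper: fill in the subject line from the checklist above.
# $1 = version, $2 = brief summary
train_subject() {
  printf '[Train] %s status update: %s\n' "$1" "$2"
}

# Example with an invented summary.
train_subject 1.34.0-wmf.0 "blocked at group1"
```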

Weekly steps

Monday: Sync up with your deployment partner

There are two people assigned to each week's train: one as primary and one as backup.

The primary train conductor is the assignee of the train blocker task in Phabricator. The backup conductor is listed as Backup Conductor at the top of the task. On Monday, communicate briefly with your partner and establish how you'll collaborate over the course of the week.

See Release Engineering/Drafts/Deployments/How to pair on the train for an overview of helpful practices.

Tuesday: New branch creation and deploy

Before the deploy window

Depending on how practiced you are and where you choose to run commands (full clones of mediawiki-core from outside the cluster can take a while), the steps will typically take 45 to 90 minutes.

Short-form instructions
Step host command example
P-0 Verify branch cut job worked Your laptop The branch cut is done by a periodic Jenkins job that runs on Tuesdays at 02:00 UTC on the releases-jenkins instance. Navigate to Gerrit to find the branch commit that the job created.

If there are no open commits shown in Gerrit using the link above, you can troubleshoot via the releases-jenkins job.
The job also builds and posts the changelog for you.

P-1 Note the MW core commit from which you've just created the branch IRC

(#wikimedia-operations)

!log [VERSION] was branched at [BRANCH POINT] for [TASK]
!log 1.35.0-wmf.14 was branched at fb16374c5bdb9d14729f358fb81638fc91640b4f for T233862
P-2 Merge the branch commit Gerrit (example) C+2 on the patch. It takes about 25 minutes for the branch to be tested and merged.
P-3 Enter screen (or tmux if you prefer)


Note[1]

deploy1002.eqiad.wmnet
USERNAME@deploy1002:~$ screen -D -RR train
P-4 Set local ssh-agent in session deploy1002
USERNAME@deploy1002:~$ eval $(ssh-agent)
USERNAME@deploy1002:~$ ssh-add .ssh/id_ed25519
P-5 Clone new branch in production (once the branch commit from P-2 has landed) deploy1002
USERNAME@deploy1002:/srv/mediawiki-staging$ scap prep [VERSION]
USERNAME@deploy1002:/srv/mediawiki-staging$ scap prep 1.34.0-wmf.0
P-6 Apply security patches deploy1002
USERNAME@deploy1002:/srv/mediawiki-staging$ scap apply-patches --train [VERSION]
  • See https://phabricator.wikimedia.org/T276237 for information about currently deployed security patches.
  • If a patch fails to apply, investigate whether it's due to a conflict (git status) or the patch having been merged since the new branch cut (search git log for the commit, etc.). If it turns out to be the latter, remove the patch file from the /srv/patches/[VERSION] directory.
  • If you need extra help, contact Security Team (Wikimedia Foundation, MediaWiki, Office Wiki), currently Brian (bawolff) and Sam (Reedy) in IRC.
P-7 Create and auto-merge/deploy the testwikis patch


🐌 Note: this step takes about 40 minutes.

deploy1002
USERNAME@deploy1002:/srv/mediawiki-staging/$ ~/release/bin/deploy-promote testwikis [VERSION]
Promote testwikis from [PREVIOUS-VERSION] to [VERSION] [y/N]
Now wait for jenkins to merge the patch, then press enter to continue with git pull && scap sync-wikiversions
P-8 Verify version change on testwiki testwiki Verify version change on testwiki (Installed software, Product: MediaWiki, Version: [VERSION]) and l10n cache (Special:Version should not look like Special:Version?uselang=qqx).
This is done automatically by the deploy script, but is worth verifying manually.
P-9 Decide what old stuff to prune deploy1002
USERNAME@deploy1002:/srv/mediawiki-staging/$ find . -mindepth 2 -maxdepth 2 -type f -path './php-*/README.md' -ctime +7 -exec dirname {} \;
P-10 Clean up old stuff

🐌 Note: this step runs a scap sync of the directory, and can take 30 minutes.

deploy1002
USERNAME@deploy1002:/srv/mediawiki-staging/$ scap clean --delete [some old version from find -ctime +7 output above]
USERNAME@deploy1002:/srv/mediawiki-staging/$ scap clean --delete 1.34.0-wmf.0
Wait for the deploy window
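
The !log line from step P-1 has a fixed shape, so it can be generated rather than typed by hand. A minimal sketch, assuming a hypothetical helper named branch_log_line; it only formats the message, and posting it to #wikimedia-operations is still done manually.

```shell
# Hypothetical helper: format the P-1 log line.
# $1 = version, $2 = branch point (commit hash), $3 = train blocker task
branch_log_line() {
  printf '!log %s was branched at %s for %s\n' "$1" "$2" "$3"
}

# Reproduces the example given in step P-1.
branch_log_line 1.35.0-wmf.14 fb16374c5bdb9d14729f358fb81638fc91640b4f T233862
```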

During the deploy window

Short-form instructions
Step host command example
0-0 Create and auto-merge/deploy the group0 patch deploy1002
USERNAME@deploy1002:/srv/mediawiki-staging/$ ~/release/bin/deploy-promote group0
Promote group0 from [PREVIOUS-VERSION] to [VERSION] [y/N]
Now wait for jenkins to merge the patch, then press enter to continue with git pull && scap sync-wikiversions
0-1 Verify production has indeed switched MediaWiki.org Verify that mediawikiwiki has switched to the new version (Installed software, Product: MediaWiki, Version: [VERSION])
0-2 Monitor production logs logstash etc. Monitor IRC and Logstash and/or logspam-watch for problems; see #Places to Watch for Breakage
0-3 Update roadmap page mw:MediaWiki 1.36/Roadmap Change the Deployed to group (if you're using VisualEditor) or the 3rd parameter of the WMFReleaseTableRow template (if you're using the wikitext editor) to 0 (deployed to group0)
{{WMFReleaseTableHead}}
{{WMFReleaseTableRow|12|2018-07-10|0}}
0-4 Kill ssh-agent deployment server
USERNAME@deploy1002:~$ pgrep -u "$USER" -laf ssh-agent # list all of your ssh-agent processes
USERNAME@deploy1002:~$ pkill -u "$USER" -f ssh-agent   # kill all your ssh-agent processes
USERNAME@deploy1002:~$ pgrep -u "$USER" -laf ssh-agent # did they all die?
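
For the roadmap update in step 0-3, the wikitext row can be generated so that only the third parameter changes from day to day. This is a sketch with a hypothetical helper, roadmap_row. The meaning of the first two parameters is not described on this page, so they are passed through unchanged, and the accepted group values (0, 1, 2) are taken from the examples in this document.

```shell
# Hypothetical helper: print a WMFReleaseTableRow line.
# $1, $2 = first and second template parameters, passed through as-is
# $3     = group the version is deployed to (0, 1 or 2)
roadmap_row() {
  case $3 in
    0|1|2) printf '{{WMFReleaseTableRow|%s|%s|%s}}\n' "$1" "$2" "$3" ;;
    *) echo "group must be 0, 1 or 2" >&2; return 1 ;;
  esac
}

# Row for a version that has just reached group0.
roadmap_row 12 2018-07-10 0
```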

Wednesday: group0 to group1 deploy

Meta / coordination

Attend the Train Log Triage meeting with members of the Core Platform Team and others.

Short-form instructions
Step host command example
1-0 Create and auto-merge/deploy the group1 patch deploy1002
USERNAME@deploy1002:/srv/mediawiki-staging/$ ~/release/bin/deploy-promote group1
Promote group1 from [PREVIOUS-VERSION] to [VERSION] [y/N]
Now wait for jenkins to merge the patch, then press enter to continue with git pull && scap sync-wikiversions
1-1 Verify production has indeed switched English Wiktionary Verify that the English Wiktionary (and other group1 wikis) have switched to the new version (Installed software, Product: MediaWiki, Version: [VERSION])
1-2 Monitor production logs logstash etc. Monitor IRC and Logstash and/or logspam-watch for problems; see #Places to Watch for Breakage
1-3 Update roadmap page mw:MediaWiki 1.36/Roadmap Change the Deployed to group (if you're using VisualEditor) or the 3rd parameter of the WMFReleaseTableRow template (if you're using the wikitext editor) to 1 (deployed to group1)
{{WMFReleaseTableHead}}
{{WMFReleaseTableRow|12|2018-07-10|1}}
...
{{WMFReleaseTableFooter}}

Thursday: group{0,1} to all deploy

Short-form instructions
Step host command example
2-0 Create and auto-merge/deploy the group2 patch deploy1002
USERNAME@deploy1002:/srv/mediawiki-staging/$ ~/release/bin/deploy-promote all
Promote all from [PREVIOUS-VERSION] to [VERSION] [y/N]
Now wait for jenkins to merge the patch, then press enter to continue with git pull && scap sync-wikiversions
2-1 Verify production has indeed switched English Wikipedia Verify that the English Wikipedia (and other group2 wikis) have switched to the new version (Installed software, Product: MediaWiki, Version: [VERSION])
2-2 Monitor production logs logstash etc. Monitor IRC and Logstash and/or logspam-watch for problems; see #Places to Watch for Breakage
2-3 Update roadmap page mw:MediaWiki 1.36/Roadmap Change the Deployed to group (if you're using VisualEditor) or the 3rd parameter of the WMFReleaseTableRow template (if you're using the wikitext editor) to 2 (deployed to all)
{{WMFReleaseTableHead}}
{{WMFReleaseTableRow|12|2018-07-10|2}}
...
{{WMFReleaseTableFooter}}
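
The weekly schedule above maps each deploy day to the group promoted during that day's window. A minimal sketch of that mapping, assuming a hypothetical helper named train_target_for_day; it covers only the deploy-window promotions, not the testwikis step done before Tuesday's window.

```shell
# Hypothetical helper: which group is promoted during the deploy window
# on a given weekday, following the weekly steps on this page.
train_target_for_day() {
  case $1 in
    Tuesday)   echo group0 ;;
    Wednesday) echo group1 ;;
    Thursday)  echo all ;;
    *) echo "no train promotion scheduled on $1" >&2; return 1 ;;
  esac
}

train_target_for_day Thursday
```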

Incident documentation

See also

Footnotes

  1. If you need to leave in the middle, press ctrl-a d to detach, and run screen -r train to reattach later.