User:Mobrovac/My Guide To The Galaxy

TODO

general intro
before first deployment

set up
mw-vagrant
beta cluster
new service request

deployment
service operation

You should start thinking about the deployment process more than a month before the date on which you want your service deployed in production for the first time. A number of steps need to be completed before this can happen. This collection of documents will guide you through the process.

MediaWiki Vagrant

MediaWiki Vagrant is a very convenient way for developers to rapidly set up a development environment containing a MediaWiki instance and any needed dependencies in a virtualised environment. Your service needs to be present there as well. Luckily, setting it up is very easy. First, in the vagrant directory, create the directory for your service's module and place this code inside <vagrant-dir>/puppet/modules/<service-name>/manifests/init.pp:

# == Class: <service-name>
#
# <a-short-description-of-the-service-here>
#
# === Parameters
#
# [*port*]
#   Port the service listens on for incoming connections.
#
# [*log_level*]
#   The lowest level to log (trace, debug, info, warn, error, fatal)
#
class <service-name>(
    $port,
    $log_level = undef,
) {

    service::node { '<service-name>':
        port      => $port,
        log_level => $log_level,
        config    => {},
    }

}

This is the minimum amount of code your service's Puppet module should have. As you can see, this definition does not provide any extra configuration for the service. If that is needed, simply add the configuration stanzas to the config hash as key/value pairs. Note that only configuration specific to your service should be listed here and not the whole configuration file, i.e. only the configuration parameters that your service code accesses via app.conf.*.

In order to configure the port (and any other parameters that you might have declared for the class), add the following contents to puppet/hieradata/common.yaml:

<service-name>::port: <service-port>

The last step is to create the role so that users may (de)activate it easily. Place the following Puppet code in puppet/modules/role/manifests/<service-name>.pp:

# == Class: role::<service-name>
# This role installs <service-name>
#
class role::<service-name> {
    include ::<service-name>
}

Finally, the service's port must be exposed to the host environment; create the file puppet/modules/role/settings/<service-name>.yaml with:

forward_ports:
  <service-port>: <service-port>

You are done! You can now submit the patch for review and anybody will be able to profit from the service in the MediaWiki-Vagrant environment.

First Deployment

Repositories

We require that all services are hosted on our Gerrit servers. Gerrit does not have to be your primary development tool, although you are strongly encouraged to use it as such.

Because Node.js services use npm dependencies which can be binary, these need to be pre-built. Therefore, two repositories are needed: one for the source code of your service, and a second, so-called deploy repository. Both should be available as Wikimedia Gerrit repositories with the paths mediawiki/services/your-service-name and mediawiki/services/your-service-name/deploy, respectively. When requesting them, ask for the former to be a clone of the service template (or of your own service repository) and for the latter to be empty.

It is important to note that the deploy repository is only to be updated directly before (re-)deploying the service, and not on each patch merge entering the master branch of the regular repository. In other words, the deploy repository mirrors the code deployed in production at all times.

The remainder of this guide assumes these two repositories have been created and that you have cloned them using your Gerrit account, i.e. not anonymously, with the following outline:

~/code/
  |- your-service
  -- deploy

This guide refers to these two repositories as the source repository and the deploy repository, respectively.
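
As a rough sketch of how that outline might be produced, the helper below prints the two clone commands for a given Gerrit user and service name instead of running them, so they can be inspected first. The function name clone_commands is hypothetical, and the SSH URL form (port 29418 on gerrit.wikimedia.org) is an assumption about the usual Gerrit set-up; check it against the clone URL that Gerrit shows for your repositories.

```shell
# Hypothetical helper: print (not run) the clone commands for both repos.
# Usage: clone_commands <gerrit-user> <service-name>
clone_commands() {
    user=$1
    name=$2
    # Assumed URL form; verify against the clone URL shown in Gerrit.
    base="ssh://${user}@gerrit.wikimedia.org:29418/mediawiki/services/${name}"
    # The source repository goes to ~/code/your-service in the outline above.
    printf 'git clone %s ~/code/your-service\n' "$base"
    # The deploy repository goes to ~/code/deploy.
    printf 'git clone %s/deploy ~/code/deploy\n' "$base"
}

clone_commands YOU your-service-name
```

Cloning over SSH with your own user name, rather than anonymously over HTTPS, is what allows you to push changes for review later on.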

Source Repo Configuration

The service template includes an automation script which updates the deploy repository, but it needs to be configured properly in order to work.

package.json

The first part of the configuration involves keeping your source repository's package.json updated. Look for its deploy stanza. Depending on the machine on which your service will be deployed, set target to either ubuntu or debian; debian is the most likely choice and is the default if the field is missing.

If you want to use a version of Node.js other than the official distribution package, set the value of the node stanza to the desired version, following nvm's version naming. To explicitly force the official distribution package, use the version "system".

The important thing is keeping the dependencies field up to date at all times. There you should list all of the extra packages that are needed in order to build the npm module dependencies. The _all field denotes packages which should be installed regardless of the target distribution, but you can add other, distribution-specific package lists, e.g.:

"deploy": {
  "target": "ubuntu",
  "node": "system",
  "dependencies": {
    "ubuntu": ["pkg1", "pkg2"],
    "debian": ["pkgA", "pkgB"],
    "_all": ["pkgOne", "pkgTwo"]
  }
}

In this example, with the current configuration, packages pkg1, pkg2, pkgOne and pkgTwo are going to be installed before building the dependencies. If, instead, the target is changed to debian, then pkgA, pkgB, pkgOne and pkgTwo are selected.

As a rule of thumb, whenever you need to install extra packages into your development environment for satisfying node module dependencies, add them to deploy.dependencies to ensure the successful build and update of the deploy repository.

Local Git

The script needs to know where to find your local copy of the deploy repository. To that end, when in your source repository, run:

$ git config deploy.dir /absolute/path/to/deploy/repo

Using the aforementioned local outline, you would type:

$ git config deploy.dir /home/YOU/code/deploy
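
Since the build script depends on this setting, it may be worth verifying it before the first build. The following is a minimal sketch, not part of the service template; the function name check_deploy_dir is hypothetical. It only checks that deploy.dir is set, is an absolute path and points to an existing directory.

```shell
# Hypothetical sanity check: is deploy.dir set to an absolute, existing path?
# Run from inside the source repository; prints the path on success.
check_deploy_dir() {
    dir=$(git config --get deploy.dir) || {
        echo "deploy.dir is not set" >&2
        return 1
    }
    case "$dir" in
        /*) ;;   # absolute path, as required above
        *)  echo "deploy.dir must be absolute: $dir" >&2; return 1 ;;
    esac
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    printf '%s\n' "$dir"
}
```

With the local outline used in this guide, running check_deploy_dir inside ~/code/your-service should print /home/YOU/code/deploy.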

Deploy Repo Set-up

If you haven't yet done so, initialise the deploy repository:

$ cd ~/code/deploy
$ git review -s
$ touch README.md
$ git add README.md
$ git commit -m "Initial commit"
$ git push -u origin master  # or git review -R if this fails
# go to Gerrit and +2 your change, if needed and then:
$ git pull

Next, you need to prepare the deploy repository for usage with Scap3. Create the scap directory inside your deploy repository and fill the contents of scap/scap.cfg with:

[global]
git_repo: <service-name>/deploy
git_deploy_dir: /srv/deployment
git_repo_user: deploy-service
ssh_user: deploy-service
server_groups: canary, default
canary_dsh_targets: target-canary
dsh_targets: targets
git_submodules: True
service_name: <service-name>
service_port: <service-port>
lock_file: /tmp/scap.<service-name>.lock

[wmnet]
git_server: tin.eqiad.wmnet

This represents the basic configuration needed by Scap3 to deploy the service. We still need to tell Scap3 on which nodes to deploy and which checks to perform after the deployment on each of the nodes. First, the list of nodes. Two files need to be created: scap/target-canary and scap/targets. In the former, you need to put the FQDN of the node that will act as the canary deployment node, i.e. the node that will first receive the new code, while in the latter file put the remainder of the nodes. For example, if your target nodes are in the SCB cluster, these files should look like this:

$ cat target-canary 
scb1001.eqiad.wmnet

$ cat targets
scb1002.eqiad.wmnet
scb2001.codfw.wmnet
scb2002.codfw.wmnet
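
The split between the canary and the remaining nodes can also be scripted. The sketch below assumes that the first node given is the canary and that all others belong in scap/targets; the function name write_scap_targets is hypothetical and not part of Scap3.

```shell
# Hypothetical helper: the first node becomes the canary, the rest go to
# scap/targets. Usage: write_scap_targets <deploy-repo-dir> <node>...
write_scap_targets() {
    repo=$1
    shift
    [ "$#" -ge 1 ] || { echo "need at least one node" >&2; return 1; }
    mkdir -p "$repo/scap"
    # One FQDN per line, as in the listings above.
    printf '%s\n' "$1" > "$repo/scap/target-canary"
    shift
    # With no remaining nodes this leaves an empty scap/targets file.
    : > "$repo/scap/targets"
    for node in "$@"; do
        printf '%s\n' "$node" >> "$repo/scap/targets"
    done
}
```

Calling write_scap_targets ~/code/deploy scb1001.eqiad.wmnet scb1002.eqiad.wmnet scb2001.codfw.wmnet scb2002.codfw.wmnet would reproduce the two files shown above.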

Finally, enable the automatic checker script to check the service after each deployment by placing the following in scap/checks.yaml:

checks:
  endpoints:
    type: nrpe
    stage: promote
    command: check_endpoints_<service-name>

Commit your changes, send them to Gerrit for review and merge them.
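
The snippets in this guide use <service-name> and <service-port> as placeholders, and it is easy to leave one behind when copying them. As a small precaution before committing, a check along the following lines can be run against the deploy repository. It is only a sketch; the function name scap_config_is_filled_in is hypothetical.

```shell
# Hypothetical pre-commit check: fail if any file under scap/ still contains
# a <service-name> or <service-port> placeholder copied from this guide.
# Usage: scap_config_is_filled_in <deploy-repo-dir>
scap_config_is_filled_in() {
    [ -d "$1/scap" ] || { echo "no scap directory in $1" >&2; return 1; }
    # grep prints the offending lines (with file name and line number).
    if grep -rn -e '<service-name>' -e '<service-port>' "$1/scap"; then
        echo "placeholders still present" >&2
        return 1
    fi
}
```

For example, scap_config_is_filled_in ~/code/deploy prints nothing and succeeds once every placeholder has been replaced.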

The deployment process includes a script that builds the deployment repository using Docker containers, so make sure you have the latest version installed. Additionally, you need to add your user to the `docker` group after installation so that you don't need to use `sudo` when running the build script:

$ sudo usermod -a -G docker <your-user>

You need to log out of all of the terminals in order for the change to take effect.
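
To see whether the new group membership is already active in the current session, the group list reported by id can be inspected. The helper below is a minimal sketch; the function name in_docker_group is hypothetical.

```shell
# Hypothetical helper: succeed if "docker" appears in a space-separated
# group list, such as the output of `id -nG`.
in_docker_group() {
    for group in $1; do
        [ "$group" = docker ] && return 0
    done
    return 1
}

if in_docker_group "$(id -nG)"; then
    echo "docker group is active in this session"
else
    echo "docker group not active yet; log out and back in"
fi
```

Comparing whole group names, rather than searching for the substring, avoids a false match on groups that merely contain the word docker.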

New Service Request

There are various prerequisites that need to be taken care of on the operational side before your service can see the light of day in production: machine allocation, IPs, LVS, etc. To express the intent of deployment, you need to complete a new service request by filing a task against the service-deployment-requests project in Phabricator. Be prepared to give the following information:

name: the name of the service to be deployed
description: a paragraph explaining clearly what the service does and why it is needed
timeline: the desired deployment timeline; note that you should allow at least two to three weeks
point person: the person responsible for the service; this is the person that will get called when there are problems with the service when running in production
technologies: additional information about the service itself, including, but not limited to, the language used for development and any frameworks used
request flow diagram: a link to a request flow diagram that explains the interaction between your service and any other parts of the operational stack inside the production cluster, such as requests made to MediaWiki, RESTBase, etc.

For some example tickets see task T105538, task T117560, task T128463.

Role and Module Creation

While you are waiting for the service request to be completed, do not fear: you still have useful things to do. You may start by creating your service's Puppet role and module in the operations/puppet repository. First, add your service's deploy repository to the list of repositories deployed in production by appending the following block to hieradata/common/role/deployment.yaml (note the extra spaces at the beginning of each line):

  <service-name>/deploy:
    upstream: https://gerrit.wikimedia.org/r/mediawiki/services/<service-name>/deploy
    checkout_submodules: true

Next, create modules/<service-name>/manifests/init.pp and put the following content in it:

# == Class: <service-name>
#
# Describe the service here ...
#
# === Parameters
#
# [*param_name1*]
#   Description of param_name1
#
# [*param_name2*]
#   Description of param_name2
#
class <service-name>(
    $param_name1 = 'def_val1',
    $param_name2 = 'def_val2',
) {

    service::node { '<service-name>':
        port            => <service-port>,
        config          => {
            param_name1 => $param_name1,
            param_name2 => $param_name2,
        },
        healthcheck_url => '',
        has_spec        => true,
        deployment      => 'scap3',
    }

}

Note that only configuration specific to your service should be listed here and not the whole configuration file, i.e. only the configuration parameters that your service code accesses via app.conf.*. Instead of in-lining it directly in the module, you can also store the configuration in the form of an ERB YAML template in modules/<service-name>/templates/config.yaml.erb. Then, simply pass it as the config parameter of the service::node resource like so:

        config          => template('<service-name>/config.yaml.erb'),

You will also need a role for your service. Put the following code fragment into manifests/role/<service-name>.pp:

# Role class for <service-name>
class role::<service-name> {

    system::role { 'role::<service-name>':
        description => 'short description',
    }

    include ::<service-name>
}

You can now submit the patch for review. Don't forget to mention the service request bug in your commit message.

Access Rights

As the service owner and maintainer, you need to be able to log onto the nodes where your service is running. Once the exact list of target nodes is known, you need to file an access request ticket with the following information:

Title: Access Request for <list-of-maintainers> for <service-name>
Description: <list-of-maintainers> needs access to <list-of-nodes> for operating <service-name>. We need to be able to read the logs at /srv/log/<service-name> and be able to start/stop/restart it. The task asking for the service's deployment is {<service-request-task-number>}

This request implies sudo rights on the target nodes, so you will need the approval from your manager on the task.

Beta Cluster

TODO: at a later time...

Deployment

Regular Deployment

There are a lot of moving parts in our production stack -- MediaWiki, its extensions, various back-end services, HTTPS handlers, caches, just to name a few. It is thus important that you communicate your deployment schedules on the Deployments page.

The deployment process starts with updating the deploy repository. Go into your source repository and update it with:

$ ./server.js build --deploy-repo --force --review

The build script will update the pointer of the deploy repository's submodule, create a Docker container in which it will install the module dependencies and send the changes to Gerrit. Review them and merge. Next, log onto `deployment.eqiad.wmnet` and update the repo there:

$ cd /srv/deployment/<service-name>/deploy
$ git pull && git submodule update --init

In the #wikimedia-operations IRC channel, announce the deployment by logging it into the Server Admin Log with !log <service-name> deploying <deploy-repo-sha1>. Now, proceed with the deployment from deployment.eqiad.wmnet:

$ deploy

Scap3 will deploy the code, restart the service and check its port and health. If it detects problems on the canary node, it will suggest performing a roll-back. Otherwise, it will proceed to deploy the code to the rest of the nodes, which completes the deployment process.

Dealing with Problems

Deployment Debugging

Scap3 includes a utility which can be used to monitor the output of the commands executed on the target nodes. Fire up a second terminal, connect to deployment.eqiad.wmnet and execute the deploy-log command from /srv/deployment/<service-name>/deploy before starting the deployment. The output should help you figure out what went wrong.

If you did not start an instance of deploy-log before the deployment and it went badly, you can still retrieve the logs by running deploy-log --latest.

Reverting a Deployment

Sometimes the deployment process goes well, but the code that was deployed isn't functioning properly. To revert a deployment and bring the code on the target nodes to a previous state, find the deploy repository's SHA1 that contained the good code and then deploy it with:

$ deploy --rev <sha1>
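
Finding the right SHA1 usually means looking at the recent history of the deploy repository. The helper below is a sketch built on plain git log; the function name recent_deploy_commits is hypothetical and not part of Scap3.

```shell
# Hypothetical helper: list the most recent commits of the deploy repository
# so that a known-good SHA1 can be picked for `deploy --rev`.
# Usage: recent_deploy_commits <deploy-repo-dir> [count]
recent_deploy_commits() {
    # Each line shows: abbreviated SHA1, commit date, subject.
    git -C "$1" log -n "${2:-5}" --format='%h %cd %s' --date=short
}
```

On the deployment host, recent_deploy_commits /srv/deployment/<service-name>/deploy lists the last five commits; the first column of the chosen line is the value to pass to deploy --rev.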

Service Operation

Starting, Stopping, Restarting

If you have sudo rights on the target machines, then that's as simple as logging onto each of the targets and issuing the respective commands:

$ sudo service <service-name> start
$ sudo service <service-name> stop
$ sudo service <service-name> restart

Monitoring

Logs

The service's logs are stored locally in /srv/log/<service-name>/main.log. To take a look, simply tail it:

$ tail -f /srv/log/<service-name>/main.log

Since the log entries are JSON-formatted, you may want to see them in a more presentable form. Use bunyan for that:

$ tail -f /srv/log/<service-name>/main.log | /srv/deployment/<service-name>/deploy/node_modules/.bin/bunyan
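
When only problems are of interest, the entries can be filtered before they are formatted. The sketch below assumes bunyan-style JSON lines with a numeric level field, where 50 stands for error and 60 for fatal; the function name errors_only is hypothetical.

```shell
# Hypothetical filter: keep only error (50) and fatal (60) log entries.
# Assumes one bunyan-style JSON object per line with a numeric "level" field.
errors_only() {
    grep -E '"level":(50|60)[,}]'
}
```

For instance, errors_only < /srv/log/<service-name>/main.log prints only the error and fatal entries, and the result can be piped into bunyan as shown above.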