Machine Learning/Onboarding

Background

You will need a number of accounts, group memberships and keys to become a productive member of the Machine Learning team. Here is an overview of what you should do in your first week. Please update this document as you go along! Last but not least, and most important: welcome to the team!

First Steps

This section covers the first logistical steps for joining the Wikimedia staff. Take your time to explore and look around, don't rush!

Wikimedia tech employee orientation

Starting point for each new Wikimedia employee: https://office.wikimedia.org/wiki/New_tech_employee_orientation

Wikimedia account

Your manager and Wikimedia's IT department will help you open several accounts, including your work email. After this step you will be able to communicate and take part in the day-to-day discussions between staff members. Be patient, and don't be scared by the huge amount of information and email that you'll receive!

IRC

Most of our communication happens on IRC, so you should set up an IRC nick:

  1. Install an IRC client -- ask team members for recommendations (for example quassel, irssi, pidgin or HexChat, or Textual or Adium if you're on a Mac)
  2. Follow the instructions on meta:IRC/Cloaks to request an IRC cloak
  3. Connect to #wikimedia-ml
  4. Other channels you might be interested in: #wikimedia-cloud, #wikimedia-operations, #wikimedia-office

Mailing lists

If you are an SRE, join the SRE at large mailing list. It is used to disseminate information about maintenance work and downtime to all SREs (both those on the main SRE team and those embedded in other teams). Send an email to sre-mgmt@wikimedia.org to request access.

Office wiki

Make sure you have an employee account and that you can use the office wiki. Your office wiki user will be given to you once you have your Wikimedia email address.

https://office.wikimedia.org/wiki/Getting_Started_With_User_Info_and_Talk_Pages

Headset

Please buy a high-quality headset -- your colleagues will love you for this. For more tips see https://office.wikimedia.org/wiki/Office_IT/Projects/Telepresence

Culture

We are part of a movement with a unique culture. It's worth taking the time to read a bit about how our biggest project works. This policy could be a useful start, as it introduces the core concepts from a concrete point of view: https://en.wikipedia.org/wiki/Wikipedia:Biographies_of_living_persons

Getting permits

Please follow the next subsections (order matters!) to get permissions for various fundamental services. Access to Production will be covered in a separate section.

SSH keys

You are going to need three pairs of SSH keys:

  • One to access the Cloud VPS / Openstack / Horizon environment (where we can create VMs, mostly for testing).
  • One to access the Production environment.
  • One to submit patches and pull repositories from Gerrit.

To generate an SSH key pair, see SRE/Production access#Generating your SSH key.
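As a minimal sketch of the "one key pair per environment" rule above, the following Python helper builds (but does not run) one ssh-keygen command per environment. The key type (ed25519) and the file names are assumptions made for illustration; the page linked above remains the reference for the recommended key type and options.

```python
import shlex
from pathlib import Path

# One key pair per environment, as listed above.
# The short names used in the file names are hypothetical.
ENVIRONMENTS = {
    "cloud_vps": "Cloud VPS / Horizon (testing)",
    "production": "Production",
    "gerrit": "Gerrit",
}


def keygen_commands(ssh_dir: Path, key_type: str = "ed25519") -> list[list[str]]:
    """Return one ssh-keygen argument list per environment, without running it."""
    commands = []
    for name, comment in ENVIRONMENTS.items():
        # Assumed naming scheme: id_<type>_wmf_<environment>
        key_path = ssh_dir / f"id_{key_type}_wmf_{name}"
        commands.append(
            ["ssh-keygen", "-t", key_type, "-f", str(key_path), "-C", comment]
        )
    return commands


if __name__ == "__main__":
    # Print the commands so they can be reviewed before being run by hand.
    for argv in keygen_commands(Path.home() / ".ssh"):
        print(shlex.join(argv))
```

Printing the commands instead of executing them keeps the passphrase prompt in your own terminal.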

Wikitech/Cloud VPS

Cloud VPS is a cluster of virtual machines. Access is completely decoupled from production, and different SSH keys should be used.

Cloud VPS is not production, but we have several tools hosted on the cluster. Access to Cloud VPS requires a Wikimedia developer account:

  1. Create a Wikimedia Developer account
  2. Log in
  3. Set up your SSH keys (you should have generated a key pair beforehand).
  4. Upload your public SSH key. Please keep in mind that Cloud VPS is a testing environment, so this SSH key should only be used for testing. If you need access to machines in the production cluster, use a different SSH key (see the section below about production access).
  5. Configure your ~/.ssh/config with bastion hosts. See Sample ssh config for a template.
  6. Ask someone in the team to add you to the relevant projects in Cloud VPS ("machine learning" project admins in Horizon for access to ml sandbox).
  7. Get familiar with the Cloud VPS environment and with how to use the Horizon interface to spin up nodes, remove nodes, etc.

At the end, ask any member of the team to add your new user to the Machine Learning project in Horizon, so that you can create VMs and inspect them.

To test your setup, try to ssh to ml-sandbox.machine-learning.eqiad1.wikimedia.cloud.

LDAP

In order to access sites like turnilo and yarn (Data Engineering's UIs), you need to be added to the wmf group in LDAP. You can ask for this by opening a new task using the LDAP-Access-Requests tag in Phabricator.

MediaWiki

You may have received an email from the IT Services (ITS) department listing the accounts that your manager requested for you. One of them will be your Wikimedia account, whose name has "WMF" at the end; in that case you don't need to create an account here. It can be used to access the different wikis and also to log in to Phabricator (see the section below).

  1. Create an account
  2. Log in

Phabricator

https://phabricator.wikimedia.org is the Phabricator instance that we use. Follow this page to log in for the first time (please use the sunflower icon, as suggested by the tutorial, to use single sign-on).

Gerrit

  1. Gerrit is the code review system we use, built on top of Git.
  2. Log in to Gerrit using your Wikimedia developer account credentials.
  3. Add the public SSH key dedicated to Gerrit in https://gerrit.wikimedia.org/r/settings/#SSHKeys
  4. To verify everything works, clone a repository from Gerrit using SSH.

After this, ensure that you have access to the relevant repos in Gerrit (and/or GitLab). For example, you will need to be added to the machinelearning/liftwing/inference-services repo so you can +2 and merge patches there.
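The clone check in step 4 could be sketched as follows. The helper only builds the git command; it assumes that Gerrit is reachable over SSH on port 29418 (Gerrit's default SSH port), so verify the exact clone URL in the Gerrit web UI. The username is a placeholder, and the repository is the one named above.

```python
import shlex

GERRIT_HOST = "gerrit.wikimedia.org"
GERRIT_SSH_PORT = 29418  # Gerrit's default SSH port; assumed to apply here


def gerrit_clone_command(username: str, repo: str) -> list[str]:
    """Build a `git clone` command for a Gerrit repository over SSH."""
    url = f"ssh://{username}@{GERRIT_HOST}:{GERRIT_SSH_PORT}/{repo}"
    return ["git", "clone", url]


if __name__ == "__main__":
    # "your-gerrit-username" is a placeholder, not a real account.
    command = gerrit_clone_command(
        "your-gerrit-username",
        "machinelearning/liftwing/inference-services",
    )
    print(shlex.join(command))
```

If the clone succeeds without a password prompt, the SSH key uploaded in step 3 is being picked up.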

Accessing production infrastructure

With great power comes great responsibility. Please read Wikimedia's SSH access guidelines carefully and familiarize yourself with your new SSH config before proceeding. Moreover, we manage very sensitive data, so please read Analytics/Data access to familiarize yourself with our procedures. Access requests are filed as tickets for the ops team to see and need to be approved by a manager (example: https://phabricator.wikimedia.org/T96053).

When opening the request for a new Machine Learning engineer in Phabricator, specify that:

  • You need access to the POSIX groups ml-team-admins and analytics-privatedata-users.
  • You need a Kerberos principal for your user (to explore data provided by the Data Engineering team).

Talk with Luca or Tobias about how to submit your SSH public key. You will likely need to proxy your SSH connection through a known machine to access some of the hosts above. You should not use the same SSH key for Cloud VPS (testing) and production machines.

The easiest approach is to ask a team member for their .ssh/config file and copy the proxy setup from it.

Please bear in mind that different processes are required to access production machines and testing machines (Cloud VPS).

Sample ssh config

See SSH access#SSH configuration for sample SSH config. If you're in the Machine Learning team you will probably SSH into both Cloud VPS and Production, so add relevant config for both in your ~/.ssh/config file.
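To illustrate the shape such a file takes, here is a minimal sketch that renders one stanza per environment, each with its own key. The bastion host names (example.org), the key file names and the single shared username are all hypothetical placeholders; the host patterns are derived from the host names mentioned on this page. The real template linked above also configures the bastion hosts themselves, which this sketch leaves out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HostBlock:
    pattern: str                      # Host pattern matched by ssh
    identity_file: str                # private key used only for this environment
    proxy_jump: Optional[str] = None  # bastion to hop through, if any


def render_ssh_config(user: str, blocks: list[HostBlock]) -> str:
    """Render ~/.ssh/config stanzas, one per HostBlock."""
    stanzas = []
    for block in blocks:
        lines = [
            f"Host {block.pattern}",
            f"    User {user}",
            f"    IdentityFile {block.identity_file}",
            "    IdentitiesOnly yes",  # offer only the key listed above
        ]
        if block.proxy_jump:
            lines.append(f"    ProxyJump {block.proxy_jump}")
        stanzas.append("\n".join(lines))
    return "\n\n".join(stanzas) + "\n"


if __name__ == "__main__":
    # All bastion names and key paths below are placeholders.
    print(render_ssh_config("your-shell-username", [
        HostBlock("*.wikimedia.cloud", "~/.ssh/id_ed25519_wmf_cloud_vps",
                  "cloud-bastion.example.org"),
        HostBlock("*.eqiad.wmnet", "~/.ssh/id_ed25519_wmf_production",
                  "prod-bastion.example.org"),
        HostBlock("gerrit.wikimedia.org", "~/.ssh/id_ed25519_wmf_gerrit"),
    ]))
```

Keeping one stanza per environment makes it harder to offer the testing key to a production host by accident.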

Logging in

Once your SSH setup is in place and your credentials have been approved by Ops (via the Phabricator task created earlier), you will be able to explore the Machine Learning infrastructure.

To test that everything works, try to ssh to ml-serve1001.eqiad.wmnet and stat1008.eqiad.wmnet.
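A small, hedged sketch of that check: the script below tries a non-interactive login to each host and reports whether it worked. It assumes that your ~/.ssh/config already handles the proxying and that your key is loaded in an SSH agent, because BatchMode disables passphrase prompts. The helper names are made up for this example.

```python
import subprocess


def ssh_check_command(host: str, timeout_s: int = 10) -> list[str]:
    """Build an ssh command that logs in, runs `true`, and exits."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                # never prompt; fail instead
        "-o", f"ConnectTimeout={timeout_s}",  # give up on unreachable hosts
        host,
        "true",
    ]


def can_log_in(host: str) -> bool:
    """Return True if the non-interactive login succeeds."""
    try:
        result = subprocess.run(
            ssh_check_command(host), capture_output=True, text=True
        )
    except FileNotFoundError:
        # No ssh client found on this workstation.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    for host in ("ml-serve1001.eqiad.wmnet", "stat1008.eqiad.wmnet"):
        print(host, "ok" if can_log_in(host) else "FAILED")
```

A failure here only says that the login did not succeed; run plain ssh -v against the host to see why.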

Misc

Operating system

The machines we deploy on run Debian, so it may be convenient to have Ubuntu, Debian, or another Linux/UNIX-based distribution installed on your workstation.

It will considerably facilitate your work. macOS is also a very suitable choice for a workstation.

Further reading

Gerrit

Misc