Data Platform Engineering/Onboarding

From Wikitech


We are thrilled to welcome you to the team! This onboarding page will help you to set up the accounts, memberships, and other tools you will need as part of the Data Engineering team. Here's an overview of things you should do in the first week.

First Steps

This section covers the first logistical steps for joining the Wikimedia staff. Take your time to explore and look around; there's no hurry!

Wikimedia tech employee orientation

The starting point for staff in the Wikimedia Foundation Technology department: https://office.wikimedia.org/wiki/Technology

Wikimedia account

Your manager and Wikimedia's IT department will help you open several accounts, including your work email. Right after this step, you will be able to communicate and participate in the day-to-day discussions between staff members. You will receive several emails; be patient and take your time to process all of the new information. Remember to contact your onboarding buddy if you have questions.

E-mail lists

Reading mailing lists is important. All projects we build or use are open-source, and like most open-source projects, they have communities that come together on mailing lists. There is much knowledge to be gained from these mailing lists.

Once you have a Wikimedia e-mail address you should subscribe yourself to these e-mail lists:

  1. Mediawiki
  2. Mobile
  3. Mediawiki API


Most of our communication happens on IRC, so you should set up an IRC nick:

  1. Install an IRC client -- ask team members for recommendations (some would be quassel, irssi, pidgin, Hexchat, textual or adium if you're on a Mac)
  2. Follow instructions on meta:IRC/Cloaks to request an IRC cloak
  3. Connect to #wikimedia-analytics
  4. Other channels you might be interested in: #wikimedia-cloud, #wikimedia-operations, #wikimedia-office

Office wiki

Make sure you have an employee account and that you can use the office wiki. Your office wiki user will be given to you once you get your Wikimedia e-mail address.



Please buy a high-quality headset -- your colleagues will love you for this. For more tips see https://office.wikimedia.org/wiki/Office_IT/Projects/Telepresence


We are part of a movement with a unique culture. It's worth taking the time to read a bit about how our biggest project works. This policy could be a useful start, as it introduces the core concepts from a concrete point of view: https://en.wikipedia.org/wiki/Wikipedia:Biographies_of_living_persons

Getting permits

Please follow the next subsections (order matters!) to get permission for various fundamental services. Access to Production will be covered in a separate section.

Wikitech/Cloud VPS

Cloud VPS is a cluster of virtual machines. Access is completely decoupled from production and different ssh keys should be used.

Cloud VPS is not production, but we have several tools hosted on the cluster. Access to Cloud VPS requires a Wikimedia developer account:

  1. Create account
  2. Log in
  3. You need to set up ssh keys
  4. Upload your public SSH key. Keep in mind that Cloud VPS is a testing environment, so this SSH key should only be used for testing. If you need access to machines in the production cluster, use a different SSH key (see the section below about production access).
  5. Configure your ~/.ssh/config with bastion hosts. See Sample ssh config for a template.
  6. Ask someone in the team to add you to the relevant projects in Cloud VPS.
  7. Get familiar with the Cloud VPS environment, how to use the Horizon interface to spin up nodes, remove nodes, etc
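Because the Cloud VPS key and the production key must differ, it can be handy to check that two public key files do not contain the same key material. The following is a minimal sketch, not an official tool. It computes the SHA256 fingerprint in the unpadded base64 form that OpenSSH prints and compares the two. The function names are made up for this illustration.

```python
import base64
import hashlib


def fingerprint(public_key_line: str) -> str:
    """Return the OpenSSH-style SHA256 fingerprint of one public key line."""
    # A .pub file holds one line: "<key type> <base64 blob> [comment]".
    fields = public_key_line.split()
    if len(fields) < 2:
        raise ValueError("not a public key line")
    blob = base64.b64decode(fields[1])
    digest = hashlib.sha256(blob).digest()
    # OpenSSH shows the digest as base64 without trailing '=' padding.
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")


def keys_are_distinct(cloud_vps_key: str, production_key: str) -> bool:
    """True when the two public key lines carry different key material."""
    return fingerprint(cloud_vps_key) != fingerprint(production_key)
```

To use it, read each `.pub` file into a string (for example with `pathlib.Path(...).expanduser().read_text()`) and pass both strings to `keys_are_distinct`. The comment field at the end of the line is ignored, so renaming a key does not make it count as a different key.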


https://phabricator.wikimedia.org is the version of Phabricator that we use. Follow this page to log in for the first time (please use the sunflower icon as suggested by the tutorial to leverage the single sign on).


In order to access sites like turnilo and yarn, you need to be added to the wmf group in LDAP. You can ask for this by opening a new task using the LDAP-Access-Requests tag in Phabricator.


  1. Create an account
  2. Log in

Accessing production infrastructure

We manage very sensitive data. Please read Analytics/Data access to familiarize yourself with our procedures.

Shell access to Wikimedia cluster and production infrastructure

Tickets are filed for the ops team to see and need to be approved by a manager (example: https://phabricator.wikimedia.org/T96053).

You will also need to receive and acknowledge a legal disclaimer about data deletion. This is an important legal requirement for which we need to ping legal every time someone gains access to data with sudo permissions.

When opening the request for a new Analytics Dev in Phabricator, specify that a Kerberos principal is needed for your user.

Talk with Andrew Otto or Ben Tullis about how to submit your SSH public key. You will likely need to proxy your SSH connection through a known machine to access some of the production hosts. You should not use the same SSH key for Cloud VPS (testing) and production machines.

The easiest approach is to ask a team member for their .ssh/config file and copy the proxy setup from it.

Please bear in mind that different processes are required to access production machines and testing machines (Cloud VPS).

Sample ssh config

See SSH access#SSH configuration for sample SSH config. If you're in the data engineering team you will probably SSH into both Cloud VPS and Production, so add relevant config for both in your ~/.ssh/config file.
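To illustrate the overall shape of such a config, here is a minimal sketch that renders two sets of `Host` stanzas, one per environment, each with its own key and bastion. All host names, the user name, and the key paths are placeholders invented for this example. Take the real values from the SSH access page. The `Host`, `User`, `IdentityFile`, `IdentitiesOnly`, and `ProxyJump` keywords are standard OpenSSH options.

```python
from dataclasses import dataclass


@dataclass
class HostEntry:
    pattern: str              # Host pattern this stanza applies to
    user: str                 # shell user name
    identity_file: str        # private key used for this environment
    proxy_jump: str = "none"  # bastion to tunnel through; "none" disables it


def render(entries: list[HostEntry]) -> str:
    """Render the entries as Host stanzas in ssh_config syntax."""
    stanzas = []
    for e in entries:
        stanzas.append(
            f"Host {e.pattern}\n"
            f"    User {e.user}\n"
            f"    IdentityFile {e.identity_file}\n"
            f"    IdentitiesOnly yes\n"
            f"    ProxyJump {e.proxy_jump}\n"
        )
    return "\n".join(stanzas)


# Placeholder values only. ssh uses the first value it finds for each option,
# so every bastion stanza comes before the wildcard that would also match it.
entries = [
    HostEntry("bastion.cloud.example.org", "shell-user", "~/.ssh/id_cloud_vps"),
    HostEntry("*.cloud.example.org", "shell-user", "~/.ssh/id_cloud_vps",
              proxy_jump="bastion.cloud.example.org"),
    HostEntry("bastion.prod.example.org", "shell-user", "~/.ssh/id_production"),
    HostEntry("*.prod.example.org", "shell-user", "~/.ssh/id_production",
              proxy_jump="bastion.prod.example.org"),
]

print(render(entries))
```

Listing each bastion before its wildcard keeps the bastion from being told to jump through itself. Giving each environment its own `IdentityFile` keeps the testing and production keys separate.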

Logging in

Once you have your SSH setup in place and your credentials have been approved by Ops (using the Phabricator task created before), you will be able to explore the Analytics infrastructure. Please start from Analytics and check the instructions for individual projects.

Talk with the people of your team on IRC about their work and pointers to their projects, so you will get a more precise idea about who does what. Be patient, it will take a while to get a good overall picture!


  1. Gerrit is the code review workflow we use, built on top of Git.
  2. Log in to Gerrit using your Wikimedia developer account credentials.
  3. To verify everything works, clone a repo from https://gerrit.wikimedia.org/r/#/admin/projects/?filter=analytics using SSH.
  4. Take a look at how to deal with gerrit in different work scenarios: http://etherpad.wikimedia.org/p/analytics-gerrit
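As a rough sketch of the SSH clone step above, the helper below assembles the `git clone` command for a Gerrit project. It assumes Gerrit's default SSH port, 29418, and uses a placeholder user name and repository name. Substitute your own shell user and a real project from the list linked above.

```python
import subprocess

GERRIT_HOST = "gerrit.wikimedia.org"
GERRIT_SSH_PORT = 29418  # Gerrit's default SSH port; assumed to apply here


def clone_command(username: str, project: str) -> list[str]:
    """Build the argument list for cloning a Gerrit project over SSH."""
    url = f"ssh://{username}@{GERRIT_HOST}:{GERRIT_SSH_PORT}/{project}"
    return ["git", "clone", url]


def clone(username: str, project: str) -> None:
    """Run the clone; raises CalledProcessError if git exits non-zero."""
    subprocess.run(clone_command(username, project), check=True)


# Placeholder names, shown without running anything:
print(" ".join(clone_command("your-shell-user", "analytics/some-repo")))
```

If the clone fails with a permission error, the usual suspects are a public key that has not been uploaded to Gerrit yet or a user name that differs from your Gerrit SSH user name.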


Google Calendar

Add the WMF Data Engineering Team Calendar to your default view. Ask a teammate to open https://calendar.google.org and add you:

  • My Calendars -> Settings
  • Click WMF Data Engineering calendar -> Share This Calendar
  • Add the new person

Optional accounts

You could consider creating accounts for:

Operating System

The machines we deploy on run Debian, so it may be convenient to have Ubuntu, Debian, or another Linux or UNIX-based system installed on your workstation.

It will make your work considerably easier. macOS is also a very suitable choice for a workstation.


This is a collection of things you might find useful in your work.

Design Documents

Take a look at the shared drive with technical design documents: https://drive.google.com/drive/u/0/folders/0AB5b7sFjfnJXUk9PVA

Sync tools

You may find the following tools useful for syncing files between your local machine and remote machines (one-way or two-way). Some of them also let you mount remote directories as if they were local directories:

  1. sshfs
  2. rsync
  3. lsyncd
  4. unison
  5. scp
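As one example from the list above, a one-way sync with rsync could be wrapped as in the sketch below. The host name and paths are placeholders. The flags used (`--archive`, `--compress`, `--verbose`, `--dry-run`) are standard rsync options. The helper defaults to a dry run so that nothing is copied until you have reviewed what would change.

```python
import subprocess


def rsync_command(local_dir: str, remote_host: str, remote_dir: str,
                  dry_run: bool = True) -> list[str]:
    """Build an rsync argument list for a one-way local-to-remote sync."""
    cmd = ["rsync", "--archive", "--compress", "--verbose"]
    if dry_run:
        # Only report what would be transferred; change nothing.
        cmd.append("--dry-run")
    # A trailing slash on the source copies the directory's contents
    # rather than creating the directory itself inside the destination.
    source = local_dir.rstrip("/") + "/"
    cmd += [source, f"{remote_host}:{remote_dir}"]
    return cmd


def sync(local_dir: str, remote_host: str, remote_dir: str,
         dry_run: bool = True) -> None:
    """Run rsync; raises CalledProcessError if it exits non-zero."""
    subprocess.run(rsync_command(local_dir, remote_host, remote_dir, dry_run),
                   check=True)


# Placeholder host and paths, shown without running anything:
print(" ".join(rsync_command("~/code/my-project", "some-host.example.org",
                             "my-project")))
```

rsync connects over SSH, so the bastion and key settings in your ~/.ssh/config apply to it as well.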

IDEs and editors

For Java development you might also want to look at IDEA. For remote development you may find vim useful (or a combination of a sync tool and your favorite editor or IDE). Other editors you might find useful include Sublime Text and Emacs.


You may find the following tools useful to search through configuration files or code:

  1. Ack (mainly for grepping code; see the video presentation)
  2. grep
  3. GNU find
  4. codesearch for mediawiki

Environment simulation

It may be useful to familiarize yourself with Vagrant and Puppet so that you can recreate smaller environments and conditions on your machine to test software you're developing or contributing to.

Further Reading and Reference Material

Check out our Learning Materials page, where you will find some excellent talks and presentations about our data stack.