Ops Clinic Duty

Ops Clinic Duty was established to ensure that tickets (and thus requests and projects) are triaged and processed in a timely fashion, and that the projects and responsibilities supported by operations receive feedback and regular updates.

This is a duty that is fulfilled by a member of the Wikimedia Operations Team.

Schedule & Assignments

  • Monday to Monday
  • Generally volunteered/assigned during Operations Meetings, sometimes via mailing list.
  • During Clinic Duty the operations team member on duty should remain available in IRC & email.
  • Folks will follow up with the person on Ops Clinic Duty about existing tasks, as well as how to create new ones.
  • This duty is fairly intensive, and will interrupt a person's normal workflow during the week they are on duty.
  • This duty shouldn't normally require any adjustment to one's normal working schedule; if you work business hours in CET, then you wouldn't shift your hours on clinic duty for another time zone.
  • This should result in regularly having ops clinic duty coverage in most overlapping working timezones.
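As a rough illustration of the Monday-to-Monday rotation, the sketch below computes the start and hand-over date of the duty week that contains a given day. The function name duty_week is hypothetical and is not part of any existing tooling.

```python
from datetime import date, timedelta


def duty_week(day: date) -> tuple:
    """Return (start, end) of the Monday-to-Monday duty week containing `day`."""
    # date.weekday() is 0 for Monday, so subtracting it lands on that week's Monday.
    start = day - timedelta(days=day.weekday())
    # The duty is handed over on the following Monday.
    end = start + timedelta(days=7)
    return start, end


if __name__ == "__main__":
    start, end = duty_week(date(2017, 10, 18))
    print(f"Duty week runs from {start} to {end}")
```

Because the calculation only depends on the calendar date, it gives the same answer regardless of the time zone the person on duty works in.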

Hand-off / Takeover

  • Ideally all Phabricator tasks are replied to or commented on in the process of reviewing and triaging, so no actual handoff of duties is required between weeks.
  • Update the topic in IRC channel #wikimedia-operations, section 'On Ops duty:' with the person's name for that week.
  • Update the list in the Duty desk rotation - who is next? section below.
  • This is currently the public facing method of determining who is on duty.

Responsibilities

  • If Clinic Duty is a relaxing week for you, you are doing it wrong.
  • All incoming Clinic Duty tasks in phabricator can be viewed on the Ops Clinic Duty Dashboard
  • Most folks have their own dashboard, which is fine when they are NOT on clinic duty. When you take clinic duty, you can install this dashboard on your home screen for that week, and swap back to your own when finished.
  • Please refrain from editing the ops clinic dashboard to reflect non-clinic duties. Most of clinic duty is triaging and knocking down tasks rather than working on long-running personal tasks, but you still need to see your own tasks while on duty, so there is a 'tasks assigned to myself' panel at the bottom.

Review incoming tasks

  • Review all incoming tasks to the #ops-access-requests, #blocked-on-operations, #operations, #patch-for-review (when it's also #operations), and #wmf-nda-requests (when it's also #operations) projects.
  • These are all included on the Ops Clinic Duty Dashboard.
  • Escalate, update, and follow up as needed for any incoming tasks to ensure they are worked upon.
  • Assign a priority to tasks that come in.
  • Ask for more data from requester if needed in order to confirm the request, such as date it must be completed by, additional details, etc.
  • Assign to the proper person.
  • Communicate ETA to requester, based on the workload of the person working on it.
  • If the request is relatively quick, just do it yourself.

Maintain the 'maint-announce' mails and calendar

Until recently we used RT for this, but it has now been switched to a Google Group. Here is how to use it.

  • Go to the Google group 'maint-announce'. [1]
  • Go to "Filters", click the radio button next to "All unresolved" and then "Apply filter". (screenshot)
  • Your task is to process all messages you see now until this screen is empty. [2]
  • Open the gcal shared with all WMF named 'Ops maintenance & contracts' in a second tab. [3][4]
  • Read each message and determine if it needs an action or not. "Add to Google calendar" is the only possible action besides "no action needed". [5]
  • If appropriate, add an entry to the calendar. [6] From the calendar entry, link back to the individual post in the group. You get the link from the context menu. (screenshot)
  • Go to "Actions" and select "Mark no action needed". (screenshot) [7]
  • Repeat until there are no mails left that are shown with the filter "unresolved". You are done.

[1]: You should have access either through individual membership or inherited permissions from being a member of the "technicaloperations" group. If not, ask an existing member to add you; they should have the permissions to do so even if they are not owner/manager of the group. (Only add other Ops people.) Being a member gives you permissions to do things; it does _not_ necessarily mean you are also receiving emails to your personal inbox. It's entirely up to you whether you like to receive those mails in your personal inbox or just use the web interface while you're on duty.

[2]: Sometimes this doesn't seem to refresh and marked posts are not disappearing from your view immediately. If this happens, removing the filter and applying it again helps.

[3]: If you are not able to create events, ask an opsen to add you (calendar settings => share this calendar).

[4]: You probably want to add the GMT (not daylight) timezone to your calendar (calendar settings => general => add a timezone). In this way you'll be able to specify the correct timezone when creating events for planned maintenance (usually they are announced with UTC dates).

[5]: This is the case if it's a duplicate/reminder for an event that has already been added to calendar, if it's just an "FYI" kind of mail like "reason for outage", simple spam or anything else that doesn't warrant a calendar entry.

[6]: Copy the important part of the subject line or the summary and use it as the event title. If the mail contains important information like a circuit ID or details on what is affected, paste them into the body part of the calendar event. It's usually good enough to just use "all day" accuracy instead of taking the time to add exact start and end dates and converting timezones, because we are adding the link back from the calendar to the full post with all the details. You don't need to worry about changing subjects or date formats anymore since posts will be sorted by date anyway. You also don't need to reply with an "added to calendar" message anymore, and there are no other status changes, just "action needed" or not (done).

[7]: It doesn't matter whether you added it to the calendar or determined it can be skipped; in either case _now_ there is "no action needed" (after you're done). We do it this way and don't use the "completed" status because the way Google Groups works, it forces you to actually _reply_ to a mail before it can be completed. We don't need that; it would just add unnecessary clicks and mail. Since both "no action needed" and "completed" are just different kinds of "resolution status" and the filter is based on "not resolved", the end result is the same and it is much simpler for us to just use that button.
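Footnote [4] notes that planned maintenance is usually announced with UTC dates. If you want to double-check what such a window means in your own working hours before creating the calendar entry, a minimal sketch follows. The function name, the "YYYY-MM-DD HH:MM" input format and the fixed CET offset are illustrative assumptions only; real announcements vary in format, and a fixed offset ignores daylight saving time.

```python
from datetime import datetime, timedelta, timezone, tzinfo

# Assumed input format for this sketch; vendor announcements differ in practice.
ANNOUNCE_FORMAT = "%Y-%m-%d %H:%M"


def utc_window_to_local(start_utc: str, end_utc: str, local_tz: tzinfo):
    """Convert a maintenance window given in UTC to the supplied local timezone."""
    start = datetime.strptime(start_utc, ANNOUNCE_FORMAT).replace(tzinfo=timezone.utc)
    end = datetime.strptime(end_utc, ANNOUNCE_FORMAT).replace(tzinfo=timezone.utc)
    return start.astimezone(local_tz), end.astimezone(local_tz)


if __name__ == "__main__":
    # Fixed UTC+1 offset standing in for CET (no daylight saving handling).
    cet = timezone(timedelta(hours=1), "CET")
    start, end = utc_window_to_local("2017-11-06 23:00", "2017-11-07 03:00", cet)
    print(start.strftime(ANNOUNCE_FORMAT), "to", end.strftime(ANNOUNCE_FORMAT), "CET")
```

For "all day" calendar entries this level of detail is not needed; the conversion mainly helps when a window crosses midnight and would otherwise end up on the wrong day.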

Be a first contact

  • Follow up with ticket owners and requestors as needed on old tickets to resolve, re-assign, or escalate as needed.
  • Be a person of first contact, including on IRC (timezone/availability permitting).
  • Triage any mailing list requests for operations lists.

Read mail to root@

  • Triage emails sent to root@ (if you don't receive them, you need to add your alias in the private repo). If you see a recurrent issue, please open a sub-task to T132324 and try to notify whoever you think can contribute to the task. Review the outstanding sub-tasks and follow up as needed.

misc

Tips

  • RT sometimes won't display the full ticket body. If you see an empty body, click "show" at the top right to show the message verbatim.
  • There is a clinic duty dashboard for Phabricator
  • You can search "to:alerts@wikimedia.org" in gmail to see all things that have paged people, independent of timezones and individual settings. This is used to fill the "pages for awareness"-section in Etherpad.

Manual

This is a manual for the current "op on duty" in charge of triaging the Phabricator #Operations project.

How to handle IRC requests

If somebody asks you to do something via IRC, then where reasonable politely ask the requestor to turn their request into a Phabricator ticket and add the "Operations" tag to it.

Common, small "#Operations" tickets

Mail aliases

Note: ops handles only role/group mail aliases; individual mail aliases are handled by OIT as outlined here [1].

Note 2: More recently many aliases have been moved from ops to OIT, and the goal is definitely NOT to add any new ones on our side unless they are strictly ops-internal, like monitoring. You can help by moving even more over to OIT; see T122144.

In the past these were handled manually on mchenry, but now they are puppetized. So please don't just ssh there and edit files in /etc/exim4/aliases/ anymore.

Instead, go to the private puppet repo on the puppet master (puppetmaster1001), cd to /root/private/modules/privateexim/files/, edit the relevant file (usually wikimedia.org) and git commit. You have to do this as root. This will create a mail to ops about the commit. Please leave your username in the commit message, such as '(fnord) adding alias foobar@', because we are all root.

You can then run puppet on mx1001 and mx2001 to confirm your changes have been applied.

There are 3 types of domains:

a) Domains that have their own alias file (wikimedia.org, wikipedia.org and a few others). You will find these files in ./modules/privateexim/files; just edit them there and git commit, as with any other change in the private repo, and presto!

b) Domains that just link to wikimediafoundation.org. These are just symlinks and puppet generates them. If you need to add a new one or change links, go to ./manifests/mail.pp. You will find it in class exim::aliases::private, and it should be self-explanatory.

c) Domains that link to another domain. Currently this is just wikivoyage.de to .org; it works the same as in b) but uses a separate definition in the puppet class.

It is nice to add the corresponding Phab ticket number in a comment near changed aliases. Experience shows that it can be quite handy to be able to quickly answer questions like when exactly something has been changed and who requested it. There is one file or symlink per domain name. 95% of the time the requests are just regarding the "wikimedia.org" file. In other cases make sure you check for possible symlinks and realize which domains you are actually changing when editing a specific file.
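The advice about checking for symlinks before editing an alias file can be sketched as follows. This is only an illustration, assuming one file or symlink per domain in a single directory as described above; the function name domains_affected_by is hypothetical, and the real layout on the mail hosts may differ.

```python
from pathlib import Path


def domains_affected_by(alias_dir: Path, edited_file: str) -> list:
    """List every domain entry in alias_dir that resolves to edited_file."""
    target = (alias_dir / edited_file).resolve()
    affected = []
    for entry in sorted(alias_dir.iterdir()):
        # resolve() follows symlinks, so a linked domain maps to the same real path.
        if entry.resolve() == target:
            affected.append(entry.name)
    return affected


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python check_aliases.py <alias directory> <file you plan to edit>
    if len(sys.argv) == 3:
        for name in domains_affected_by(Path(sys.argv[1]), sys.argv[2]):
            print(name)
```

If the output lists more than the file you intended to edit, the change will also apply to every other domain shown.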

Mailman mailing lists

Public mailing lists should typically be requested through Phabricator, tagged with "Wikimedia-Mailing-lists". Please look there for requests: they are not tagged with "Operations", but few people besides Ops can handle them. Google mailing lists are managed by OIT. You know it's a mailman list if it's @lists.wikimedia.org. To check if an email address exists in Google you can run "exim4 -bt foo@wikimedia.org" on an MX server.

create a list

Follow the normal procedure to create a Mailman mailing list and if the requested list is a private list, ensure you complete all of step 7.

password reset

Another common task is requests for password resets, see the docs on Mailman#Reset_the_admin_password_of_a_list.

disable a list

When you get a request to disable a mailman list, you just have to run a shell script on the list server; see Mailman#Disable_or_re-enable_a_mailing_list. In addition, it's nice if you log in once using the master password and remove the former admins' email addresses from the "list run by" field.

LDAP group changes

Access to a range of mostly web-based services is granted via the "wmf" and "nda" groups. The specific permissions are listed here: https://wikitech.wikimedia.org/wiki/LDAP_Groups The change should be tracked in a ticket.

  • WMF staff can be added to the "wmf" group on request (not everyone needs that kind of access)
  • Volunteers and researchers can be added to the "nda" group (this needs a valid NDA, everyone who's WMF staff is covered by the work contract NDA)

Before adding someone to LDAP, check whether there's an existing entry in puppet.git:modules/admin/data/data.yaml:

  • If the user already has shell access, no further change is needed. You can proceed with the LDAP change below.
  • If not, add the user to the ldap_only_users table at the end of the file:
    • Add the realname of the user (most Cloud VPS accounts don't have a real name set)
    • Add the email address of the users:
      • If the user is WMF staff, use the email address of their Google account (usually the first letter of the first name and the surname; you can double-check the account name in the Gmail interface). Some users have aliases, e.g. for their nickname; don't use these, use the official Google account (this allows cross-checking data against corp LDAP)
      • If the user is a volunteer, a researcher or contractor without access to a wikimedia.org account, ask for a contact email address (to have a reliable contact e.g. in case of an account compromise)
    • If the user to be added has time-limited access (e.g. interns, researchers with time-limited MOUs, or short-term contractors), add the estimated account end date as expiry_date (format is YYYY-MM-DD) and add a staff contact as expiry_contact
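The fields described above can be checked before they go into data.yaml. The sketch below is a hedged illustration rather than part of the admin module: expiry_date and expiry_contact are the names used in the text, while the realname and email keys, the function name and the rule that an expiry date needs a contact are assumptions made for this example. Check the admin module's README for the authoritative schema.

```python
import re
from datetime import date
from typing import Optional


def build_ldap_only_entry(realname: str, email: str,
                          expiry_date: Optional[str] = None,
                          expiry_contact: Optional[str] = None) -> dict:
    """Assemble the fields for one ldap_only_users entry and validate them."""
    if not realname or not email:
        raise ValueError("realname and email are both required")
    entry = {"realname": realname, "email": email}  # key names are assumed
    if expiry_date is not None:
        # Enforce the YYYY-MM-DD shape first, then check it is a real calendar date.
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", expiry_date):
            raise ValueError("expiry_date must use the format YYYY-MM-DD")
        date.fromisoformat(expiry_date)  # raises ValueError for e.g. 2018-02-30
        if not expiry_contact:
            raise ValueError("time-limited access needs a staff expiry_contact")
        entry["expiry_date"] = expiry_date
        entry["expiry_contact"] = expiry_contact
    return entry


if __name__ == "__main__":
    # All values below are made up for illustration.
    print(build_ldap_only_entry("Jane Doe", "jdoe@example.org",
                                expiry_date="2018-06-30",
                                expiry_contact="staffcontact"))
```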

After having added the user to data.yaml, the change in LDAP can be done (this will be automated in a subsequent step):

  • Check if they are a member of the group from the Cloud VPS LDAP server: ldaplist -l group grpname | grep username
  • Add them if they are not there: modify-ldap-group --addmembers=username grpname
  • To remove someone from an ldap group you can modify-ldap-group --deletemembers=username grpname
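To avoid typos when substituting the username and group into the commands above, they can be generated from one place. The sketch below only builds the command strings shown in the list; it does not run anything, and the function name ldap_group_commands is hypothetical.

```python
import shlex


def ldap_group_commands(username: str, group: str) -> dict:
    """Return the check/add/remove commands for one user and one LDAP group."""
    # shlex.quote leaves simple names untouched and quotes anything unusual.
    user, grp = shlex.quote(username), shlex.quote(group)
    return {
        "check": f"ldaplist -l group {grp} | grep {user}",
        "add": f"modify-ldap-group --addmembers={user} {grp}",
        "remove": f"modify-ldap-group --deletemembers={user} {grp}",
    }


if __name__ == "__main__":
    # "jdoe" and "nda" are example values only.
    for step, command in ldap_group_commands("jdoe", "nda").items():
        print(f"{step}: {command}")
```

Printing the commands rather than executing them keeps the person on duty in the loop: run the check first, and only paste the add command if the user is not already a member.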

For further instructions see Help:Access, LDAP and LDAP Groups.

Access requests

Access and reasoning for requesting it are documented on Requesting shell access. Please read and understand entirely before processing any access requests, as this very brief summary documentation may not cover all required points in the linked page.

If a request asks for things like new shell accounts, access to additional servers, log files, personal data, admin roles in systems like Mailman, Bugzilla, data center access, opening a firewall rule etc, then it is an access request and should be moved into the Ops-Access-Requests project. Once the initial request is made, a number of follow up steps must be confirmed:

  • User's direct supervisor has approved of access request via comment on phabricator task.
  • Approval from project lead where user's access will be granted via comment on phabricator task.
  • Confirmation that the user has read, understood, and signed the Acknowledgement of Wikimedia Server Access Responsibilities document.
  • ALL ACCESS REQUESTS REQUIRE AN NDA. An NDA must be on file for ALL users requesting shell access. This NDA has to be confirmed with the legal department. Currently, you can assign tasks to @RStallman-legalteam in phabricator for confirmation.
  • If non-sudo, a 3 day waiting period for security review must pass AFTER the task is moved into the Ops-Access-Requests project.
  • If sudo, restrict it down as much as possible and put it on the agenda for the following week's operations meeting for team review.
  • Approvals must be on the phabricator ticket.
  • SSH public key has to be submitted via gerrit patchset by user, or by some confirmed (non-email) method (suggestion: wiki user page).
  • Please update the task in Phabricator, as the requestor will then get the update.
  • Please raise any security concerns on ticket via comments.
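The follow-up steps above amount to a checklist, which can be sketched as a small data structure. Everything named here (AccessRequest, outstanding_items and the field names) is hypothetical and only mirrors the bullet points; in particular, treating the waiting period as over once three full days have passed since the task was moved is an assumed reading of the rule.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List

WAITING_PERIOD = timedelta(days=3)  # applies to non-sudo requests


@dataclass
class AccessRequest:
    moved_to_access_requests: date  # day the task entered Ops-Access-Requests
    sudo: bool = False
    supervisor_approved: bool = False
    project_lead_approved: bool = False
    responsibilities_signed: bool = False
    nda_confirmed: bool = False
    ssh_key_submitted: bool = False
    reviewed_in_ops_meeting: bool = False  # only relevant for sudo requests


def outstanding_items(req: AccessRequest, today: date) -> List[str]:
    """Return the checklist items that still block this request."""
    checks = [
        (req.supervisor_approved, "supervisor approval on the task"),
        (req.project_lead_approved, "project lead approval on the task"),
        (req.responsibilities_signed, "signed server access responsibilities document"),
        (req.nda_confirmed, "NDA confirmed with legal"),
        (req.ssh_key_submitted, "SSH public key submitted via a confirmed method"),
    ]
    missing = [label for done, label in checks if not done]
    if req.sudo:
        if not req.reviewed_in_ops_meeting:
            missing.append("team review at the operations meeting")
    elif today < req.moved_to_access_requests + WAITING_PERIOD:
        missing.append("3 day waiting period")
    return missing
```

An empty list means nothing on the checklist is outstanding; it does not replace reading the linked Requesting shell access page.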

Analytics Groups

  • There are multiple potential groups. They have been detailed on Analytics/Data_access#Access_Groups.
    • The clinic duty person can often link to this page for the person requesting access, and require the requestor to define which of the groups are required.

Creating new shell users

Please see instructions in the puppet admin module's README.

Some notable changes since February 2017:

  • Add the realname of the user (most Cloud VPS accounts don't have a real name set)
  • Add the email address of the users:
    • If the user is WMF staff, use the email address of their Google account (usually the first letter of the first name and the surname; you can double-check the account name in the Gmail interface). Some users have aliases, e.g. for their nickname; don't use these, use the official Google account (this allows cross-checking data against corp LDAP)
    • If the user is a volunteer, a researcher or contractor without access to a wikimedia.org account, ask for a contact email address (to have a reliable contact e.g. in case of an account compromise)
  • If the user to be added has time-limited access (e.g. interns, researchers with time-limited MOUs, or short-term contractors), add the estimated account end date as expiry_date (format is YYYY-MM-DD) and add a staff contact as expiry_contact

Renaming shell users

Sometimes we have to rename a shell user. This is typically when their shell name doesn't match their login name, and they have issues logging into items requiring LDAP credentials.

Renaming a user requires a few things to happen in a very specific order. Since many users keep data in their home directories, backups can sometimes be made, but not always. (Private data that isn't allowed to be copied off the cluster should not be backed up to laptops.) The existing username has to be removed from the host, since the new username will use the old username's UID.

  • Patchset is prepared, but not merged.
  • Using cumin for these batch commands, all hosts that have the existing (to be replaced) username should have puppet halted.
  • Affected hosts should have the user (to be replaced) deleted. DO NOT DELETE THE USER'S HOME DIRECTORY.
  • Merge patchset with username change (UID remains the same).
  • Run puppet on affected hosts, and they will create the new user (using the same UID.)
  • Batch move the contents of the old user home into the new user home.
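The last step, moving the contents of the old home directory into the new one, can be sketched for a single host as follows. This is an assumption-laden illustration, not an established procedure: the function name is hypothetical, and the choice to skip anything that already exists in the new home (rather than overwrite what puppet just created) is a design decision made for this example.

```python
import shutil
from pathlib import Path


def move_home_contents(old_home: Path, new_home: Path):
    """Move every entry from old_home into new_home; return (moved, skipped) names."""
    moved, skipped = [], []
    for item in sorted(old_home.iterdir()):
        destination = new_home / item.name
        if destination.exists():
            # Never overwrite files already present in the new home directory.
            skipped.append(item.name)
            continue
        shutil.move(str(item), str(destination))
        moved.append(item.name)
    return moved, skipped
```

Anything reported as skipped stays in the old home directory for manual review. Since the new user keeps the same UID, the moved files should normally not need an ownership change, but it is worth verifying on one host before running this in a batch.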

IRC channel access

/query chanserv
help access
access #channel list
access #channel add *!*@wikimedia/cloak 
14:07 -ChanServ(ChanServ@services.)- Flags +Aiortv were set on ...

For people wanting to be a channel operator for #wikimedia-operations, first check that they have nick protection enabled

  /msg nickserv info <nick>
  ...
  <nick> has enabled nick protection

and then

  /msg chanserv flags #wikimedia-operations <nick> +Aiotv

Removing access

Please check this section for accuracy.

Disabling an ssh key only

The most common case is that someone's laptop is gone and their ssh key must be disabled.

In admins.pp in puppet, change their ssh key from

ensure => present

to

ensure => absent

Because puppet is not always running on all hosts, you can also remove the user's authorized_keys file in a batch, for example with cumin:

cumin 'A:all' 'rm -f /home/<username>/.ssh/authorized_keys'

And because salt doesn't run on hardy, you can check and clean up those hosts manually if puppet is not running on one or more of them.

If the user had root access, their key will also be in root-authorized-keys in the private repo for puppet and you'll need to make the corresponding change there as well.

Removing the account

Make the puppet changes for all keys as described above.

Then in the user's account class in admins.pp in puppet change

$enabled = true

to

$enabled = false

This doesn't actually do anything more than the previous step: the user account and home directory will be untouched, though the ssh keys will be gone.

The previous caveats about checking hosts without puppet or salt apply.

Note: there is an optional parameter that is sometimes passed to unixaccount, which would cause the user account to be removed, i.e. enabled => $enabled. Do we want to start using this regularly or get rid of it everywhere we have it? Current use is inconsistent.

Powercycling / reboots

RT duty paging for reboots is usually due to hardware failure, or immediate concerns of exploits. Anything outside those issues would be handled by normal operations workflow, and would not necessarily fall to the RT triage duty person.

Powercycling requires a passing familiarity with the different out of band management options we use (based on vendor). Hardware type can be determined by looking up the hardware in question in Racktables; then you can determine the instructions from Platform-specific_documentation.

Duty desk rotation - who is next?

Currently we assign this to folks within the team who volunteer for the duty. Please keep this list sorted with the newest date at the top and the oldest at the bottom (this is easier to maintain and puts the most relevant information first).

  • 2017-10-16: Rob
  • 2017-09-11: Rob
  • 2017-09-04: Ema
  • ...
  • 2017-07-31: Arzhel
  • ...
  • 2017-07-10: Rob
  • 2017-06-19: Rob
  • 2017-06-12: Manuel
  • 2017-03-06: Andrew Otto (ottomata)
  • 2017-02-27: Emanuele (ema)
  • 2017-02-20: Rob
  • 2017-02-13: Rob
  • 2017-02-06: Rob
  • 2017-01-16: Moritz
  • 2017-01-09(week of wikidevsummit 2017+AllHands): Alex
  • 2017-01-02: RobH
  • 2016-12-26: Ariel
  • 2016-11-21: Riccardo (volans) (first time) with Giuseppe's help
  • 2016-11-14: Manuel (marostegui) (first time) + Daniel's and others help
  • 2016-11-07: Rob
  • ...
  • 2016-10-18: Luca (elukey)
  • 2016-10-11: Rob
  • 2016-10-03: Rob
  • ...
  • 2016-08-22: Chris Johnson (cmjohnson1)
  • 2016-08-22: Andrew Bogott (andrewbogott)
  • ...
  • 2016-08-01: Emanuele (ema)
  • 2016-07-19: Luca (elukey) and Guillaume (gehel)
  • ...
  • 2016-05-30: Jcrespo (jynus)
  • 2016-05-23: Robh
  • 2016-05-16: Giuseppe (_joe_)
  • ...
  • 2016-04-25: Filippo (godog)
  • 2016-04-18: Alex (akosiaris)
  • 2016-04-11: Andrew (andrewbogott)
  • ...
  • 2016-02-22: Emanuele (ema) harassing moritzm and robh for help
  • 2016-02-01: Elukey (First time doing it, so please be patient :) with Daniel
  • 2016-01-25: RobH
  • 2015-11-30: Filippo
  • 2015-11-23: Ariel
  • 2015-11-16: Andrew Bogott
  • 2015-11-09: Robh
  • ...
  • 2015-09-07: Jcrespo (jynus)
  • 2015-08-31: Rob (robh)
  • ...
  • 2015-05-11: Ariel
  • 2015-05-04: Marc-André (Coren)
  • 2015-04-27: Brandon
  • 2015-04-20: Rob (robh)
  • 2015-04-13: Gage (jgage)
  • 2015-04-06: Andrew Bogott
  • 2015-03-30: Filippo
  • 2015-03-23: Rob
  • 2015-03-16: Yuvi
  • 2015-03-02: Marc-André (Coren)
  • 2015-02-02: Alexandros Kosiaris
  • 2015-02-02: Andrew Bogott
  • 2015-01-26: Rob
  • 2015-01-19: Rob
  • 2015-01-12: Jeff Green
  • 2015-01-05: Chase
  • 2014-10-13: Faidon
  • 2014-10-06: Andrew Bogott
  • 2014-09-29: Marc-Andre
  • 2014-09-22: Daniel
  • 2014-09-15: Filippo
  • 2014-09-08: Ariel
  • 2014-09-01: Marc-Andre
  • 2014-08-25: Giuseppe
  • 2014-08-18: Andrew Otto
  • 2014-08-11: ?
  • 2014-08-04: Alex
  • 2014-07-28: Andrew Bogott
  • 2014-07-21: Chris
  • 2014-07-14: Ariel
  • 2014-07-07: Gage
  • 2014-06-30: Rob
  • 2014-06-23: Otto
  • 2014-06-16: Sean
  • 2014-06-09: Giuseppe (harassing Faidon for help)
  • 2014-05-19: Filippo (harassing Alex for help)
  • 2014-05-12: Gage
  • 2014-05-05: Jeff Green
  • 2014-04-28: Andrew Bogott
  • 2014-04-21: Marc-Andre
  • 2014-04-14: Ariel
  • 2014-04-07: Andrew O
  • 2014-03-31: Jeff Green
  • 2014-03-24: Rob
  • 2014-03-17: Alex
  • 2014-03-10: bblack
  • 2014-03-03: ottomata
  • 2014-02-24: apergos
  • 2014-02-17: Jeff Green
  • 2014-02-10: apergos
  • 2014-02-03:
  • 2014-01-27:
  • 2014-01-20: ottomata
  • 2014-01-13: Jeff_Green
  • 2014-01-06: LeslieCarr
  • 2013-12-30: akosiaris
  • 2013-12-23: andrewbogott
  • 2013-12-16: mutante
  • 2013-12-09: apergos
  • 2013-12-02: akosiaris
  • 2013-11-25: RobH
  • 2013-11-18: ottomata <-- Three weeks for real? The horror.
  • 2013-11-11: ottomata
  • 2013-11-04: Andrew Otto
  • 2013-10-28: Chris
  • 2013-10-20: Daniel
  • 2013-10-14: Andrew Bogott
  • 2013-10-07: Ariel Glenn
  • 2013-09-30: Leslie Carr
  • 2013-09-23: Andrew Otto
  • 2013-09-16: Chris
  • 2013-09-09: Rob
  • 2013-09-02: Chris
  • 2013-08-26: Marc-Andre
  • 2013-08-19: Asher Feldman
  • 2013-08-12 (Wikimania): Andrew Otto
  • 2013-08-05 (Wikimania): Rob
  • 2013-07-29: Faidon Liambotis
  • 2013-07-22: Ryan Lane
  • 2013-07-15: Ariel Glenn
  • 2013-04-22: Daniel Zahn
  • 2013-04-15: Ariel
  • 2013-04-08: Rob
  • 2013-04-01: Andrew B
  • 2013-03-07: Jeff
  • 2013-02-11: Peter
  • 2013-01-08: Andrew Otto
  • 2012-12-31: Faidon
  • 2012-12-24: Leslie
  • 2012-12-17: Ariel
  • 2012-10-29: Faidon
  • 2012-10-26: Asher
  • 2012-10-15: Peter