Wikimedia Cloud Services team/Clinic duties

The WMCS team practices a clinic duty rotation that runs from one weekly team meeting to the next. Each team member takes a turn in sequence. Your shift begins after one weekly meeting and ends with the next.

In a similar fashion, we have two oncall duty rotations that also run for one week (see the calendar).

Clinic duty

Start of clinic duty

Archive the weekly meeting etherpad:
- Copy the entire content of the etherpad to https://office.wikimedia.org/wiki/Wikimedia_Cloud_Services/Meeting_Notes/YYYY-MM-DD
- Create next week's etherpad at https://etherpad.wikimedia.org/p/WMCS-YYYY-MM-DD
- Copy-paste the content of the old Etherpad to the new Etherpad
- In the new Etherpad:
  - delete anything outdated, such as the "Round the table" updates from the previous meeting
  - modify the dates in the archive links at the top
  - update the "starts"/"ends" dates in the Clinic Duty section
  - remove the color highlight by clicking the "eye" icon in the Etherpad toolbar (Clear Authorship Colors)
- In the old (archived) Etherpad:
  - delete all the content except the first four lines with the links to the archive
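
The date-stamped URLs used in the archive steps above can be derived from the meeting date. The following is a minimal sketch, not an official tool: the function name `meeting_urls` is hypothetical, and it assumes the next meeting falls exactly seven days after the one being archived, which may not hold around holidays.

```python
from datetime import date, timedelta

# URL prefixes taken from the steps above; the date suffix is YYYY-MM-DD.
ARCHIVE_BASE = "https://office.wikimedia.org/wiki/Wikimedia_Cloud_Services/Meeting_Notes/"
ETHERPAD_BASE = "https://etherpad.wikimedia.org/p/WMCS-"


def meeting_urls(meeting_day: date) -> dict[str, str]:
    """Return the archive, old etherpad, and next etherpad URLs (hypothetical helper)."""
    # Assumption: the next meeting is exactly one week later.
    next_meeting = meeting_day + timedelta(days=7)
    return {
        "archive": ARCHIVE_BASE + meeting_day.isoformat(),
        "old_etherpad": ETHERPAD_BASE + meeting_day.isoformat(),
        "next_etherpad": ETHERPAD_BASE + next_meeting.isoformat(),
    }


if __name__ == "__main__":
    for name, url in meeting_urls(date.today()).items():
        print(f"{name}: {url}")
```

The script only builds the links; copying and trimming the etherpad content remains a manual step.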
Update the topic (title) of the IRC channel #wikimedia-cloud-admin:
- You need to first get op status on the channel: /msg chanserv op #wikimedia-cloud-admin (If that gives an error, you might not have the necessary permissions on the channel. Ask for help! :) )
- Edit the channel topic:
  - if you are using IRCCloud, click on the "cog" icon on the top right, then on "Set topic..."
  - you can also use /topic New topic... or /msg chanserv topic #wikimedia-cloud-admin New topic...
- The topic should point to the etherpad for the next meeting and indicate the nick of the person on clinic duty (usually it will be your nick if you're setting this message)
  - Example topic: https://etherpad.wikimedia.org/p/WMCS-XXXX-XX-XX | Channel is logged at https://wm-bot.wmcloud.org/logs/%23wikimedia-cloud-admin/ | ping cteam | clinic duty: <irc nick>
- Remove your op status when you're done: /msg chanserv deop #wikimedia-cloud-admin
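
To avoid typos when editing the topic by hand, the string can be assembled from its parts. This is a minimal sketch that follows the example topic above; the function name `build_topic` and the sample date and nick are hypothetical placeholders.

```python
def build_topic(next_meeting: str, nick: str) -> str:
    """Assemble the channel topic in the format of the example above.

    next_meeting: date of the next meeting as YYYY-MM-DD.
    nick: IRC nick of the person on clinic duty.
    """
    parts = [
        f"https://etherpad.wikimedia.org/p/WMCS-{next_meeting}",
        "Channel is logged at https://wm-bot.wmcloud.org/logs/%23wikimedia-cloud-admin/",
        "ping cteam",
        f"clinic duty: {nick}",
    ]
    return " | ".join(parts)


if __name__ == "__main__":
    # Placeholder values; substitute the real meeting date and your own nick.
    print("/topic " + build_topic("2024-05-09", "example_nick"))
```

The printed line can be pasted into your IRC client once you have op status on the channel.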

Meetings

Monday: attend the "SRE Staff Meeting" (if scheduled for that week)
- Report any WMCS updates that might be of interest to the wider SRE group:
  - Copy the "outgoing updates" from the WMCS Etherpad to the "SRE Staff Meeting" document (linked in the calendar event)
  - Feel free to add more items, or ask in IRC if you need suggestions
  - If you want to speak during the SRE meeting, put one or more items in bold, otherwise people will still see the items, but you will not be asked to speak about them
- Listen to anything that could be relevant for the WMCS team:
  - Take notes during the meeting and add them to the WMCS etherpad under "Report from SRE meeting"
Thursday (at the end of the clinic duty shift): facilitate the weekly WMCS team meeting

Phabricator

Complete tasks under "Clinic Duty" on the Phabricator board
Help triage new / incoming tasks on Phabricator
- TODO: is there a way to filter this list to not include WMCS team members?

Community Requests

Check for and respond to incoming requests. For new project requests or quota requests, obtain at least one other person's approval before approving and granting the request, and make sure that approval is explicitly documented on the Phabricator ticket. Bring all floating IP requests, and any request you are unsure about, to the weekly meeting. Requests that represent an increase of more than double the quota or more than 300GB of storage should also be reviewed at the weekly meeting.
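
The escalation rule can be written down as a small decision function. This is only a sketch of one reading of the rule: it assumes "more than double the quota" means the requested total exceeds twice the current quota, and that the 300GB limit applies to the additional storage requested. The `QuotaRequest` type, its field names, and the `kind` labels are hypothetical and do not correspond to any Phabricator form.

```python
from dataclasses import dataclass

STORAGE_REVIEW_THRESHOLD_GB = 300  # from the rule above


@dataclass
class QuotaRequest:
    """Hypothetical summary of a community request."""
    kind: str                          # e.g. "floating_ip", "quota", "project"
    current: float = 0.0               # current quota, in the resource's own unit
    requested: float = 0.0             # requested total, in the same unit
    storage_increase_gb: float = 0.0   # additional storage requested, in GB
    unsure: bool = False               # set when you are not sure about the request


def needs_weekly_review(req: QuotaRequest) -> bool:
    """Return True if the request should go to the weekly meeting."""
    # All floating IP requests, and anything you are unsure about.
    if req.kind == "floating_ip" or req.unsure:
        return True
    # Assumed reading: requested total is more than twice the current quota.
    if req.current > 0 and req.requested > 2 * req.current:
        return True
    # Assumed reading: more than 300GB of additional storage.
    return req.storage_increase_gb > STORAGE_REVIEW_THRESHOLD_GB
```

Requests that do not need the weekly meeting still require a second person's approval documented on the ticket.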

VPS Project Requests
- Related docs: Portal:Cloud VPS/Admin/Projects lifecycle#Creating a new project
- Related docs: Portal:Cloud VPS/Admin/Projects lifecycle#Deleting_a_project

DB Requests
- Related docs: Add_a_wiki#Cloud_Services

VPS Quota Requests
- Related docs: Portal:Cloud_VPS/Admin/Projects_lifecycle#Modifying_project_quotas
- Related docs: Portal:Cloud_VPS/Admin/Trove#Adjusting_per-project_Trove_quotas
- Related docs: Portal:Cloud_VPS/Admin/OpenTofu (adding/removing flavors)

Toolforge Quota Requests
- Related docs: Portal:Toolforge/Admin/Kubernetes#Quota management
- Related docs: Portal:Toolforge/Admin/Harbor#Quota management (for Harbor specifically)
- Related docs: Portal:Data_Services/Admin/Wiki_Replicas#User_connection_limits

Toolforge account requests
- Related docs: Portal:Toolforge/Admin#Users_and_community

IRC

#wikimedia-cloud monitoring
- Respond to help requests
- Watch for pings to other team members and intercept if appropriate
- Watch for pings to !help
- Call people out for poor behavior in the channel
- Praise people for helping constructively

Oncall duty

Oncall duty is a different shift from clinic duty. During your shift, you will be paged for critical issues, and you are expected to monitor and react to alerts and to give high priority to tasks that improve the alerting, monitoring, and stability of the platforms. See Decision Record for more information.

Alerts

phaultfinder (this bot automatically opens tasks for non-paging alerts). Please ensure all open requests get assigned and worked on (whether by yourself or someone else).
- Open tasks
- All tasks
alertmanager (team=wmcs)
- You might also find this dashboard useful to browse alert history
Icinga for WMCS hardware.
Watch for WMCS-related emails (cron, Puppet failing on our projects, etc.) and fix the underlying issues

Cloud VPS alerts

These include things WMCS isn't directly responsible for. For this reason, most of these alerts aren't critical and aren't WMCS's problem to solve. However, projects for which WMCS is the owner / admin, like tools, admin, etc, are important and we should respond as the responsible party.

Alertmanager for Cloud VPS projects

Dashboards

This list isn't exhaustive. Dashboards can be used to debug or confirm issues within WMCS platforms.

Platform Health

Improvements

If nothing currently requires immediate attention, you should work on improving tooling in this area. Consider:

Moving alerts from Icinga to Alertmanager (e.g. novafullstack, ceph)
Adding new alerts or removing stale alerts (e.g. Adding neutron alert, Adding ceph alerts, Adding novafullstack alerts)
Improving runbooks and documentation
Writing cookbooks to automate tasks (e.g. remove grid errors, remove grid node, ceph_reboot, increase quotas)
Cleaning up Puppet code and adding tests
Improving, fixing, or upgrading the team's dashboards in Grafana