Wikimedia Cloud Services team/Clinic duties
The WMCS team practices a clinic duty rotation that runs from one weekly team meeting to the next. Team members take turns performing these duties in sequence. Your shift begins after the weekly meeting and ends with the next one.
Similarly, we have two oncall duty rotations that also run for one week (see the calendar).
Start of clinic duty
- Change the clinic duty name in the topic of #wikimedia-cloud-admin to yourself. If this is your first time:
- You need to first get ops on the channel, see https://meta.wikimedia.org/wiki/IRC/Instructions#ChanServ_commands
- For the above to work, you need the necessary permissions on the channel. Ask for help :)
- Alternatively, this should work:
/cs topic #wikimedia-cloud-admin The whole topic message goes here...
- Archive the weekly meeting etherpad to https://office.wikimedia.org/wiki/Wikimedia_Cloud_Services/Meeting_Notes/YYYY-MM-DD
- Create next week's etherpad by copying over the content of the archived one and deleting anything outdated, such as meeting-specific agenda items
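The archive page name in the list above follows a date pattern, which can be generated rather than typed by hand. This is a minimal Python sketch, assuming the date in the URL is the date of the meeting being archived (the page does not say which date to use). The helper name `archive_url` is hypothetical.

```python
from datetime import date

# Base of the archive URL, taken from the step above.
ARCHIVE_BASE = (
    "https://office.wikimedia.org/wiki/"
    "Wikimedia_Cloud_Services/Meeting_Notes/"
)


def archive_url(meeting_day: date) -> str:
    """Return the archive page URL for the meeting held on meeting_day.

    Assumption: the YYYY-MM-DD suffix is the meeting date itself.
    """
    # date.isoformat() yields YYYY-MM-DD, matching the pattern in the URL.
    return ARCHIVE_BASE + meeting_day.isoformat()


if __name__ == "__main__":
    # Print the URL for a meeting held today.
    print(archive_url(date.today()))
```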
🦄 of the week duties
Meetings
- Monday: attend the "SRE Staff Meeting" (if scheduled for that week)
- Copy "outgoing updates" from our Etherpad to the "SRE Staff Meeting" document (linked in the calendar event)
- Consider if anything else is worth adding to what is already listed in the "outgoing updates" section (anything happening in WMCS that might be of interest to the wider SRE group)
- Wednesday (at the end of the clinic duty shift): Facilitate the Weekly WMCS team meeting
Phabricator
- Complete tasks under "Clinic Duty" on the Phabricator board
- Help triage new / incoming tasks on Phabricator
Community
IRC
- Monitor #wikimedia-cloud
- Respond to help requests
- Watch for pings to other team members and intercept if appropriate
- Watch for pings to !help
- Call people out for poor behavior in the channel
- Praise people for helping constructively
Community Requests
Check for and respond to incoming requests. For new project requests or quota requests, obtain at least one other person's approval before approving and granting the request, and make sure that approval is explicitly documented on the Phabricator ticket. Bring all floating IP requests, and any request you are unsure about, to the weekly meeting. Requests that represent an increase of more than double the quota or more than 300GB of storage should also be reviewed at the weekly meeting.
- DB Requests
- Related docs: Add_a_wiki#Cloud_Services
- Toolforge Quota Requests
- Related docs: Portal:Toolforge/Admin/Kubernetes#Quota management
- Related docs: Portal:Data_Services/Admin/Wiki_Replicas#User_connection_limits
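The review rules for community requests in this section can be written down as a small decision helper. This is a rough Python sketch, not an existing WMCS tool. The `QuotaRequest` type, its field names, and the `kind` labels are hypothetical. It also assumes that "more than double the quota" means the requested value exceeds twice the current one, and that the 300GB threshold applies to the size of the storage increase.

```python
from dataclasses import dataclass

# Threshold from the request policy: storage increases above this
# size go to the weekly meeting.
STORAGE_REVIEW_THRESHOLD_GB = 300


@dataclass
class QuotaRequest:
    # Hypothetical labels: "new_project", "quota", "floating_ip", "db".
    kind: str
    current: float = 0.0              # current quota value, any unit
    requested: float = 0.0            # requested quota value, same unit
    storage_increase_gb: float = 0.0  # extra storage requested, in GB
    unsure: bool = False              # the person on duty has doubts


def needs_second_approval(req: QuotaRequest) -> bool:
    """New project and quota requests need one other person's approval."""
    return req.kind in {"new_project", "quota"}


def needs_weekly_meeting(req: QuotaRequest) -> bool:
    """Return True if the request should be raised at the weekly meeting."""
    if req.kind == "floating_ip" or req.unsure:
        return True
    if req.storage_increase_gb > STORAGE_REVIEW_THRESHOLD_GB:
        return True
    # Assumption: "more than double" means requested > 2 * current.
    return req.current > 0 and req.requested > 2 * req.current
```

For example, `needs_weekly_meeting(QuotaRequest(kind="quota", current=8, requested=17))` returns `True`, while a request going from 8 to 16 does not cross the threshold under this reading.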
Maintenance tasks (probably not all weeks)
Oncall Duty
During your shift, you are expected to monitor and react to alerts, and to give high priority to tasks that improve the alerting, monitoring, and stability of the platforms. See Decision Record for more information.
Monitoring
Alerts
- phaultfinder (this bot automatically opens tasks for non-paging alerts). Please ensure every open task gets assigned and worked on (by yourself or someone else).
- alertmanager (team=wmcs)
- You might also find this dashboard useful for browsing alert history
- Icinga for WMCS hardware.
- Watch for WMCS-related emails (cron, Puppet failing on our projects, etc.) and fix the underlying problems
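The `team=wmcs` filter mentioned in the list above can also be applied when querying Alertmanager over HTTP. This is a minimal Python sketch that only builds the query URL. It assumes the instance exposes the standard Alertmanager v2 API, whose alerts endpoint accepts a `filter` matcher. The base URL in the usage line is a placeholder, not the real WMCS host, and the function name is hypothetical.

```python
from urllib.parse import urlencode


def team_alerts_url(base_url: str, team: str = "wmcs") -> str:
    """Build an Alertmanager v2 URL listing active alerts for one team.

    base_url is a placeholder argument; pass the address of the
    Alertmanager instance you actually use.
    """
    # Label matcher in Alertmanager syntax, e.g. team="wmcs".
    matcher = f'team="{team}"'
    # urlencode percent-encodes the '=' and the quotes inside the matcher.
    query = urlencode({"filter": matcher, "active": "true"})
    return f"{base_url.rstrip('/')}/api/v2/alerts?{query}"


if __name__ == "__main__":
    print(team_alerts_url("https://alertmanager.example.org"))
```

Building the URL separately from fetching it keeps the sketch free of network access; the resulting URL can be opened with any HTTP client.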
Cloud VPS alerts
These include things WMCS isn't directly responsible for. For this reason, most of these alerts aren't critical and aren't WMCS's problem to solve. However, projects that WMCS owns or administers, such as tools and admin, are important, and we should respond as the responsible party.
Dashboards
This list isn't exhaustive. Use dashboards to debug or confirm issues within WMCS platforms.
- Platform Health
Improvements
If nothing currently requires immediate attention, you should work on improving tooling in this area. Consider:
- Moving alerts from Icinga to Alertmanager (e.g. novafullstack, ceph)
- Adding new alerts or removing stale alerts (e.g. Adding neutron alert, Adding ceph alerts, Adding novafullstack alerts)
- Improving runbooks and documentation
- Writing cookbooks to automate tasks (e.g. remove grid errors, remove grid node, ceph_reboot, increase quotas)
- Cleaning up Puppet code and adding tests
- Improving, fixing, or upgrading the team's dashboards in Grafana