Data Platform Engineering/Ops week

Ops week duty rotation gives the Data Platform Engineering team a designated person responsible for monitoring the applications and jobs that run on our infrastructure, and for handling unplanned maintenance work that arises during a sprint. Having a single person responsible for much of this work allows the rest of the team to focus on planned work.

Ops week rotation is important because it:

Ensures that the team equally participates in the ad hoc reactive work. Previously, engineers in the earlier time zones were the first ones to observe the alerts and were disproportionately impacted
Is an excellent way to share knowledge across the various sub-teams within Data Platform Engineering
Helps to ensure that there is clear ownership for all the systems we maintain
Ensures that there is clear responsibility for certain tasks
Helps the rest of the team focus on planned work

This page can be a checklist of things to do during ops week. Please keep it concise and actionable, and for longer explanations open a new page and link to it from here.

Primary responsibilities are:

Monitoring of the data pipelines that run on the Analytics Cluster.
Monitoring of the pipelines that generate the publicly available Dumps.

Note that the monitoring of the dumps pipelines is a relatively new addition to the Ops Week responsibilities, so the documentation is still in the process of being updated accordingly.

Ops week assignment

The ops week rotation assignment is managed by the program manager and/or the engineering manager via a spreadsheet and the DPE team calendar.

The calendar is updated for a couple of months in advance and should take into account team members' availability in any given week.

Hand off

Ops week ends and begins on the Thursday of every week. The 'Ops Week' column on the Data Engineering phabricator board contains actionable tasks for those on Ops Week duty.

An automatic slack alert has been set up to remind the team about the handoff. Please provide any relevant handoff information in the slack thread, and if necessary follow up in person with the next person who is taking over the ops duty.

All phabricator tasks should be updated with details of the work that was done. Any assigned tasks that have not been started should be unassigned. Likewise, any work in progress should either be completed or handed over to the next person.

Part of the hand off should be to document in this google doc the alerts received during the week, and how they were treated.

Communication

Email

Any decisions taken with a subset of the team that all team members should know about need to be communicated to the rest of the team via e-mail to:

data-engineering-alerts@wikimedia.org

Examples: deployments happening or not happening, and issues we have decided not to fix right away and will instead schedule through regular sprint work.

If you're not subscribed to the list, you can do so here.

IRC and the Server Admin Log (SAL)

Log any job re-runs to the Analytics SAL by going to #wikimedia-analytics on IRC and prefixing an entry with !log

If you're responding to Airflow alert emails or refine job failure emails or something similar, double check the SAL to see whether work has already been done to fix the problem.

Working in off hours

The developer ops rotation does not include working off hours, as its focus is on "routine" ops. The bulk of analytics infra is tier-2, and jobs that fail can, for example, be restarted the next day. If there are system issues (like the NameNode going down), Icinga alarms should fire and go to the SRE team.

The SRE team needs to make sure that at least 1 person is available for ops at all times, which in practice means taking turns when taking vacations.

Ops week duties

If it's your ops week, you should prioritize ops week tasks over regular work. If you are working on a time critical task or you are taking vacation you might need to swap ops weeks with someone else.

The ops duty is separate from the critical alerts that the SRE team receives and is responsible for.

Remember you're a first responder. This means you don't have to fix whatever is broken during your ops week, but you have to alert the team, check the status of failed stuff, and report what's happened. Alerts need to be acknowledged promptly so other team members know they are being taken care of.

Ops week is an opportunity to learn; do not be shy about asking for help from others.

In summary the ops week duties include:

Checking #wikimedia-analytics on IRC if there is anything relevant ongoing that requires your attention
Removing and adding new users
Managing the deployment train (see Deployment trains)
Monitoring alerts
Attending to manual tasks that need intervention and review
Monitoring the dashboards listed below
Monitoring the #data-engineering-collab slack channel to ensure that there is a response within 24 hours. It is up to the whole team to monitor this channel, but if someone does not respond the ops week person should acknowledge the question/message and assess what followup is needed if any

Things to do at some point during the week

Monitor Dashboards

Here is a list of relevant dashboards to bookmark and monitor:

https://yarn.wikimedia.org/cluster/scheduler - Stats about the Yarn production instance, including the state of the Yarn queues.
https://yarn.wikimedia.org/cluster/apps/RUNNING - All running applications on Yarn.
Grafana Dashboard with General Hadoop Cluster stats
Grafana Dashboard with General Cassandra Cluster stats
Grafana Dashboard with General MySQL servers stats
Grafana Dashboard with General Kafka cluster stats
Grafana Dashboard with General Gobblin stats (lands on stats for webrequest)
Grafana Dashboard for Flink Apps
Grafana Dashboard for Flink Cluster
Email Alerts Dashboard (Contact User:TChin_(WMF) if it seems broken)

Have any users left the Foundation?

We should check for users who are no longer employees of the foundation, and offboard them from the various places in our infrastructure where user content is stored. Ideally a Phabricator task should be open notifying the user and/or their team manager so that they have a chance to respond and move things. Generally, as part of the SRE offboarding script we will be notified when a user leaves and the task will be added to the on call column. We should remove the following:

Home directories in the stat machines.
Home directory in HDFS.
User hive tables (check if there is a hive database named after the user).
Ownership of hive tables needed by others (find tables with scripts below and ask what the new user should be, preferably a system account like analytics or analytics-search).

Use the check-user-leftovers script from your local shell (it uses your ssh keys), and copy the output into the Phabricator task. The tricky part is dropping data from the Hive warehouse. Before starting, let's remember that:

Under /user/hive/warehouse we can find directories related to databases (usually suffixed with .db) that in turn contain directories for tables.
If the table is internal, then the data files will be found in a subdir of the HDFS Hive warehouse directory.
If the table is external, then the data files could potentially be anywhere in HDFS.
A Hive drop database command only deletes metadata in the Hive Metastore. The actual HDFS files need to be cleaned up manually.

The following command is useful for finding the HDFS location of the files for each table in a Hive database:

DATABASE_TO_CHECK=elukey
for t in $(hive -S -e "show tables from $DATABASE_TO_CHECK;" 2>/dev/null | grep -v tab_name); do echo "checking table: $DATABASE_TO_CHECK.$t"; hive -S -e "describe extended $DATABASE_TO_CHECK.$t" 2>/dev/null | grep -Eo 'location:hdfs://[0-9A-Za-z/._-]+'; done
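Once the locations are listed, it helps to see at a glance which ones sit under the Hive warehouse directory and which ones live elsewhere in HDFS. The following is a minimal sketch, assuming the location strings look like the ones printed by the loop above. classify_location is a hypothetical helper, not part of any existing tooling, and it looks only at the path, not at the table type recorded in the metastore.

```shell
#!/usr/bin/env bash
# Warehouse path as named in this section.
WAREHOUSE_DIR="/user/hive/warehouse"

# classify_location (hypothetical helper): report whether a table location
# is under the Hive warehouse directory or somewhere else in HDFS.
classify_location() {
    local path="$1"
    # Strip the "location:" prefix that the grep above leaves in place.
    path="${path#location:}"
    case "$path" in
        hdfs://*)
            path="${path#hdfs://}"   # drop the scheme
            path="/${path#*/}"       # drop the name service, keep the absolute path
            ;;
    esac
    case "$path" in
        "$WAREHOUSE_DIR"/*) echo "warehouse $path" ;;
        *)                  echo "outside-warehouse $path" ;;
    esac
}
```

Anything reported as outside-warehouse will not be removed by deleting the database's .db directory, so it needs its own review.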

Full removal of files and Hive databases and tables

Once the list of things to remove has been reviewed, and it has been determined it is okay to remove, the following commands will help you do so:

Drop Hive database. From an-launcher1002 (or somewhere with the hdfs user account):

sudo -u hdfs kerberos-run-command hdfs hive
> DROP DATABASE <user_database_name_here> CASCADE;

Remove user's Hive warehouse database directory. From an-launcher1002:

sudo -u hdfs kerberos-run-command hdfs hdfs dfs -rm -r /user/hive/warehouse/<user_database_name_here>.db

Remove HDFS homedir. From an-launcher1002:

sudo -u hdfs kerberos-run-command hdfs hdfs dfs -rm -r /user/<user_name>

Remove homedirs from regular filesystem on all nodes. From cumin1001:

sudo cumin 'C:profile::analytics::cluster::client or C:profile::hadoop::master or C:profile::hadoop::master::standby' 'rm -rf /home/<user_name>'
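Because these commands are destructive, it can be useful to print them for a specific user and paste the output into the Phabricator task for review before running anything. The following is a dry-run sketch only: print_removal_commands is a hypothetical helper, it never executes the commands, and it uses hive -e in place of the interactive Hive prompt shown above.

```shell
#!/usr/bin/env bash
# print_removal_commands (hypothetical helper): print, but never run, the
# removal commands from this section for one user and one Hive database.
print_removal_commands() {
    local user_name="$1" database_name="$2" name
    # Refuse empty or suspicious names that could point outside the intended directories.
    for name in "$user_name" "$database_name"; do
        case "$name" in
            ""|*/*|*..*|*' '*)
                echo "usage: print_removal_commands <user_name> <user_database_name>" >&2
                return 1
                ;;
        esac
    done
    local prefix="sudo -u hdfs kerberos-run-command hdfs"
    local hosts="C:profile::analytics::cluster::client or C:profile::hadoop::master or C:profile::hadoop::master::standby"
    echo "# From an-launcher1002:"
    printf '%s hive -e "DROP DATABASE %s CASCADE;"\n' "$prefix" "$database_name"
    printf '%s hdfs dfs -rm -r /user/hive/warehouse/%s.db\n' "$prefix" "$database_name"
    printf '%s hdfs dfs -rm -r /user/%s\n' "$prefix" "$user_name"
    echo "# From cumin1001:"
    printf "sudo cumin '%s' 'rm -rf /home/%s'\n" "$hosts" "$user_name"
}
```

For example, print_removal_commands elvis elvis prints the four commands with the placeholders filled in.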

Archival of user files

If necessary, we archive user files in HDFS in /wmf/data/archive/user/<user_name>. Make a directory with the user's name and copy any files there as needed. Set appropriate permissions if the archive should be shared with other users. For example:

sudo -u hdfs kerberos-run-command hdfs hdfs dfs -mv /user/elvis /wmf/data/archive/user/
sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chown -R hdfs:analytics-privatedata-users /wmf/data/archive/user/elvis

Things to do every Tuesday

The Data Engineering deployment train 🚂

After the Data Engineering standup, around 5-6pm UTC, we deploy anything that has changes pending. Deployments normally include restarting jobs, updating jar versions, and so on.

What to deploy:

Read the Etherpad where someone could write down the deployment plan. You should also use it to document the deployment of the week. http://etherpad.wikimedia.org/p/analytics-weekly-train
Check for patches (possibly already merged) that still need to be merged and deployed on Wikistats. You should find the last deployment with a commit named "Release X.X.X".

Here are instructions on each project's deployment:

Notify the team of each deployment (using !log on IRC).

Things to check every day

(We're centralizing all scheduled manual work at Analytics/Systems/Manual_maintenance both as a reminder and as a TODO list for automation)

Have new wikis been created?

If a new wiki receives pageviews and it's not on our allowlist, the team will start receiving alerts about pageviews from an unregistered Wikimedia site. The new wikis must be added in two places: the pageview allowlist and the sqoop groups list.

Adding new wikis to the sqoop list

First check that the wiki already has replicas.

In production DB:

# connect to the analytics production replica for the project to check
analytics-mysql azywiki
# list tables available for that project - An empty list means the project is not yet synchronized and shouldn't be added to sqoop list
show tables;

In labs DB:

# connect to the analytics labs replica for the project to check (needs the password for user s53272)
# you need to know which port to use, so first you check for that in production DB:
analytics-mysql --print-target azywiki
dbstore1003.eqiad.wmnet:3315

# Keep only the port (3315 in this example) and use it when connecting to the cloud DB:
mysql --database azywiki_p -h an-redacteddb1001.eqiad.wmnet -P 3315 -u s53272 -p
# list available tables for the project - An empty list means the project is not yet synchronized and shouldn't be added to sqoop list
show tables;
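The "keep only the port" step above can be sketched as a small shell function. port_from_target is a hypothetical helper, and it assumes the target is printed as host:port, as in the example output above.

```shell
#!/usr/bin/env bash
# port_from_target (hypothetical helper): keep only the port from a
# "host:port" string such as the output of `analytics-mysql --print-target`.
port_from_target() {
    local target="$1"
    local port="${target##*:}"   # everything after the last colon
    # Fail loudly if there is no colon or the suffix is not a number.
    case "$target" in
        *:*) ;;
        *) echo "no port found in '$target'" >&2; return 1 ;;
    esac
    case "$port" in
        ""|*[!0-9]*) echo "no port found in '$target'" >&2; return 1 ;;
    esac
    echo "$port"
}
```

The result can then be passed to the -P option of the mysql command shown above.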

For wikis to be included in the next mediawiki history snapshot, they need to be added to the labs sqoop list. Add them to the last group (the one that contains the smallest wikis) in the grouped_wikis.csv. Once the cluster is deployed (see section above) the wikis will be included in the next snapshot.

Adding new wikis to the pageview allow list

First list in hive all the current pageview allow list exceptions:

select * from wmf.pageview_unexpected_values where year=... and month=...;

That will tell you what you need to add to the list. Take a look at it, make sure it makes sense, and make a Gerrit patch for it (example in gerrit) in analytics-refinery/static-data.

To stop the Airflow alarms faster, you can merge your patch and sync the new file up to HDFS:

scp static_data/pageview/allowlist/allowlist.tsv an-launcher1002.eqiad.wmnet:
ssh an-launcher1002.eqiad.wmnet
sudo -u hdfs kerberos-run-command hdfs hdfs dfs -put -f allowlist.tsv /wmf/refinery/current/static_data/pageview/allowlist/allowlist.tsv
sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod +r /wmf/refinery/current/static_data/pageview/allowlist/allowlist.tsv

Note: beware when you are editing this file. It's a TSV. Your editor may replace tabs with spaces without telling you.
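A quick check before committing can catch the tabs-to-spaces problem. The following is a minimal sketch, assuming every non-empty row of the allowlist has at least two tab-separated fields. check_tsv_tabs is a hypothetical helper, not an existing script.

```shell
#!/usr/bin/env bash
# check_tsv_tabs (hypothetical helper): flag non-empty rows that contain no
# tab character, which usually means an editor replaced tabs with spaces.
# Assumption: every valid row has at least two tab-separated fields.
check_tsv_tabs() {
    local file="$1"
    awk -F'\t' '
        NF == 0 { next }    # skip empty lines
        NF < 2  { printf "%s:%d: no tab separator found\n", FILENAME, NR; bad = 1 }
        END     { exit bad }
    ' "$file"
}
```

Running check_tsv_tabs static_data/pageview/allowlist/allowlist.tsv prints the offending line numbers and returns a non-zero status if any row lacks a tab.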

Have any alarms been triggered?

Airflow

Airflow jobs run on one of our Airflow instances. If you receive an email like <Taskinstance: ... [failed]> or SLA miss on DAG, the best way to investigate is the Airflow UI (see Airflow Instance-specific instructions). Once in the Airflow UI, you can check logs and retry or pause the DAG accordingly. If the error was transient or you can deploy a simple fix, you can retry by clearing the failed task instances. Otherwise, pausing the DAG will stop further alert emails. The Airflow documentation is excellent and worth learning; you can find links to it in the UI header.

Canary Event Alerts

The canary_events airflow DAG produces canary events for every Event Platform stream multiple times each hour. It is important that at least one canary event is produced for each stream for each hour.

If you see an Airflow alert in your email with a subject like the following, a canary event task has failed:

Airflow alert: <TaskInstance: canary_events.produce_canary_event ...

The canary_events DAG makes use of Airflow dynamic mapped tasks to generate the tasks to run. To rerun an individual mapped task, you must navigate through the Airflow UI to find the tasks to rerun.

You should try to rerun the failed airflow mapped tasks manually.

To rerun all failed mapped tasks in a task run: select the task with failures and click the blue 'Clear task' button. Select the "Only Failed" option. You should be shown the number of mapped tasks that will be rerun. Click the blue 'Clear' button to rerun all failed tasks.

To rerun a single mapped task: Go to the canary_events DAG in the Airflow UI, click on the failed task run, click on the task's '[] Mapped Tasks' tab, scroll to and click on the failed mapped task, then select the blue 'Clear task' button. This should rerun the failed task.

NOTE: there is never harm in rerunning canary events tasks. What matters is that at least one canary event is produced for each stream topic every hour.

If a rerun still fails, there might be a deeper problem. E.g. if an event is invalid, you'll see an error message like:

Jan 12 19:15:04 an-launcher1002 produce_canary_events[15437]: HttpResult(failure) status: 207 message: Multi-Status. body: {"invalid":[{"status":"invalid","event":{"event":{"is_mobile":true,"user_editcount":123,"user_id":456,"impact_module_state":"activated","start_email_state":"noemail","homepage_pageview_token":"example token"},"meta":{"id":"b0caf18d-6c7f-4403-947d-2712bbe28610","stream":"eventlogging_HomepageVisit","domain":"canary","dt":"2021-01-12T19:15:04.339Z","request_id":"54df8880-61ce-4cf9-86fa-342c917ea622"},"dt":"2020-04-02T19:11:20.942Z","client_dt":"2020-04-02T19:11:20.942Z","$schema":"/analytics/legacy/homepagevisit/1.0.0","schema":"HomepageVisit","http":{"request_headers":{"user-agent":"Apache-HttpClient/4.5.12 (Java/1.8.0_272)"}}},"context":{"errors":[{"keyword":"required","dataPath":".event","schemaPath":"#/properties/event/required","params":{"missingProperty":"start_tutorial_state"},"message":"should have required property 'start_tutorial_state'"}],"errorsText":"'.event' should have required property 'start_tutorial_state'"}}],"error":[]}

The canary event is constructed from the schema's examples. In this error message, the examples in the schema at /analytics/legacy/homepagevisit/1.0.0 were missing a required field, so the event failed validation.

The fix will be dependent on the problem. In this specific case, we modified what should have been an immutable schema after eventgate had cached the schema, so eventgate needed a restart to flush caches.
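The validation message is buried deep in the log line above. The following is a minimal sketch for pulling it out, assuming the "errorsText" field appears as in the example and its value contains no escaped double quotes. It is a plain-text match rather than a JSON parser, and extract_errors_text is a hypothetical helper.

```shell
#!/usr/bin/env bash
# extract_errors_text (hypothetical helper): read log lines on stdin and
# print the value of every "errorsText" field found in them.
# Assumption: the value contains no escaped double quotes.
extract_errors_text() {
    grep -o '"errorsText":"[^"]*"' \
        | sed -e 's/^"errorsText":"//' -e 's/"$//'
}
```

Piping the example log line through extract_errors_text leaves only the human-readable part of the error, which is usually enough to tell which schema field is at fault.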

Some common reasons the canary_events DAG might fail:

A stream has been declared in EventStreamConfig, but its schema (indicated by the schema_title setting) has not been merged/deployed to schema.wikimedia.org.
A new schema or schema version for a stream has been merged, but the destination_event_service (eventgate cluster) does not use dynamic schema loading. In this case follow instructions at Event Platform/EventGate/Administration#eventgate-wikimedia schema repository change to resolve.
A stream has been declared in EventStreamConfig, but the destination_event_service (eventgate cluster) only loads stream config on service startup. In this case the eventgate cluster will need a restart. See Event Platform/EventGate/Administration#EventStreamConfig change for more info, and Event Platform/EventGate/Administration#Roll restart all pods for instructions on how to bounce all pods in the cluster. See Event Platform/EventGate#EventGate clusters to determine if the eventgate cluster in question will need this.

Ask #data-engineering-team for help with this if you encounter issues.

Anomaly detection alarms

There are three types of alarms: outage/censorship alarms, alarms for general oddities in the refine system (measured by calculating the entropy of user agents), and mobile pageview alarms.

Superset dashboard: https://superset.wikimedia.org/superset/dashboard/315/

Outage/censorship alarms. These alarms measure changes in traffic per city. When one is raised, check whether the overall volume of pageviews is OK. If it is, the issue might be due to an outage in a particular country. The Traffic team will troubleshoot further.

Eventlogging refine alarms on navigationtiming data. Variations here might indicate a problem in the refine pipeline (like all user agents now being null) or an update of the UA parser.

Mobile pageview alarms. An alarm might indicate a drop in mobile app or mobile web pageviews; check the numbers in Turnilo. These alarms are not set up for desktop pageviews, as the nature of that time series is quite different.

A thorough description of the system can be found at Analytics/Data_quality/Traffic_per_city_entropy.

Failed systemd/journald based jobs

Follow the page on managing systemd timers and see what you can do. Notify an SRE if you don't feel confident in what you're doing, or do not have the required rights.

Example: sqoop jobs, reportupdater jobs, and everything not yet migrated to Airflow

HDFS alarms

HdfsCorruptBlocks. To check if it's still a problem, run the following from an-launcher1002:

sudo -u hdfs kerberos-run-command hdfs hdfs fsck / -list-corruptfileblocks

Sqoop failures

Sqoop is run on systemd timers right now (may be Airflow soon or we may migrate it to spark). If it fails, look at Analytics/Systems/Cluster/Edit_history_administration.

Data loss alarms

Follow the steps in Analytics/Systems/Dealing with data loss alarms

Mediawiki Denormalize checker alarms

If a mediawiki snapshot fails its Checker step, you will get an alarm via e-mail. What to do is described at Analytics/Systems/Cluster/Edit_history_administration#QA:_Assessing_quality_of_a_snapshot

Reportupdater failures

Check the time at which the alarm was triggered and look for the cause of the problem in the logs with:

sudo journalctl -u reportupdater-interlanguage

(Replace interlanguage with the failed unit)
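To narrow the logs to the time of the alarm, journalctl accepts --since and --until. The following sketch prints the command for a given unit and time window. reportupdater_log_command is a hypothetical helper, and the timestamps in the example are made up.

```shell
#!/usr/bin/env bash
# reportupdater_log_command (hypothetical helper): print, but do not run, a
# journalctl command limited to a time window around the alarm.
reportupdater_log_command() {
    local unit="$1" start="$2" end="$3"
    if [ -z "$unit" ] || [ -z "$start" ] || [ -z "$end" ]; then
        echo "usage: reportupdater_log_command <unit> <since> <until>" >&2
        return 1
    fi
    printf 'sudo journalctl -u reportupdater-%s --since "%s" --until "%s"\n' \
        "$unit" "$start" "$end"
}

# Example with made-up timestamps:
reportupdater_log_command interlanguage "2024-09-01 10:00" "2024-09-01 12:00"
```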

Druid indexation job fails

Take a look at the Admin console and the indexation logs. The logs show the detailed errors, but the console can be easier to look at and spot obvious problems.

Deletion script alarms

When data directories and hive partitions are out of sync the deletion scripts can fail. To resync directories and partitions execute msck repair table <table_name>; in Hive.

Refine failure report

An error like Refine failure report for refine_eventlogging_legacy will usually have some actual exception information in the email itself. Take a look and most likely rerun the hour(s) that failed: Analytics/Systems/Refine#Rerunning_jobs

As of 2024-09, there are 4 different refine_* jobs, each of which has a wrapper script used to launch the job.

Refine's failure report email should contain the flags you need to rerun whatever might have failed. The name of the script to run is in the subject line. E.g. if the subject line mentions 'refine_eventlogging_legacy', that is the command you will need to run. You can copy and paste the flags from the failure email to this command, like:

sudo -u analytics kerberos-run-command analytics refine_eventlogging_legacy --ignore_failure_flag=true --table_include_regex='contenttranslation' --since='2021-11-29T22:00:00.000Z' --until='2021-11-30T23:00:00.000Z'

Wait for the job to finish, and don't forget to check the logs to see that all went well:

sudo -u analytics yarn logs -applicationId <app_id_here> | grep Refine:
...
21/12/02 19:30:53 INFO Refine: Finished refinement of dataset hdfs://analytics-hadoop/wmf/data/raw/eventlogging_legacy/eventlogging_ContentTranslation/year=2021/month=11/day=30/hour=21 -> `event`.`contenttranslation` /wmf/data/event/contenttranslation/year=2021/month=11/day=30/hour=21. (# refined records: 66)
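The step of turning a failure email into a rerun command, as described above, can be sketched as follows. Both functions are hypothetical helpers. They assume the subject line contains the job name in the form refine_..., and that none of the flags contain a single quote. The command is only printed, never run.

```shell
#!/usr/bin/env bash
# refine_script_from_subject (hypothetical helper): pull the first refine_*
# job name out of a failure email subject line.
refine_script_from_subject() {
    printf '%s\n' "$1" | grep -Eo 'refine_[a-z0-9_]+' | head -n 1
}

# print_refine_rerun (hypothetical helper): print, but do not run, the rerun
# command in the form shown above. The first argument is the subject line;
# the remaining arguments are the flags copied from the failure email.
# Assumption: no flag contains a single quote.
print_refine_rerun() {
    local script
    script="$(refine_script_from_subject "$1")"
    if [ -z "$script" ]; then
        echo "no refine_* job name found in subject" >&2
        return 1
    fi
    shift
    printf 'sudo -u analytics kerberos-run-command analytics %s' "$script"
    if [ "$#" -gt 0 ]; then
        printf " '%s'" "$@"   # wrap each flag in single quotes for pasting
    fi
    printf '\n'
}
```

This is meant for "Refine failure report" subjects only. Review the printed command against the email before running it.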

Note: The refine_eventlogging_analytics job is being decommissioned. As of 2024-09, there is only 1 remaining event stream that this job processes that we care about: MediaWikiPingback. You can ignore all email alerts from refine_eventlogging_analytics unless they are about MediaWikiPingback.

For errors like RefineMonitor problem report for job monitor_refine_sanitize_eventlogging_analytics, visit Backfilling sanitization.

Mediawiki page content change enrich alarms

A real time data processing application that consumes the mediawiki.page_change.v1 topic, performs a lookup join (HTTP) with the Action API to retrieve raw page content, and produces an enriched event into the mediawiki.page_content_change.v1 topic. The Event Platform team is responsible for this application.

Alerts will be triggered when SLI performance degrades, or the application stops running in k8s main.

Follow-up steps and escalation are described in the application SLO MediaWiki Event Enrichment/SLO/Mediawiki Page Content Change Enrichment#Troubleshooting.

Flink kubernetes operator alarms

Infrastructure for running Flink clusters on k8s. The Event Platform and Search teams are responsible for this application. Alerts are raised if the operator is not running in k8s main.

HdfsRpcQueueLength alarms

HdfsRpcQueueLength alerts can be triggered when a particular job is creating an RPC storm, as in the case of a big non-local rewrite of an Iceberg table. Such a situation can move multiple terabytes of data across the network, which can cause latency for other jobs. A transient queue length alert is not a problem, but a consistent spike over multiple days should definitely be investigated. To find the culprits of a long-running issue, look at long-running jobs at yarn.wikimedia.org.

Debug Java applications in trouble

When a java daemon misbehaves (like in https://phabricator.wikimedia.org/T226035) it is absolutely vital to get some info from it before a restart, otherwise it will be difficult to report a problem upstream. The jstack utility seems to be the best candidate for this job, and plenty of guides can be found on the internet. For a quick copy/paste tutorial:

use ps -auxff | grep $something to correctly identify the PID of the process
then run sudo -u $user jstack -l $PID > /tmp/thread_dump

The $user referenced in the last command is the user running the Java daemon. You should have the rights to sudo as that user, but if not please ping Luca or Andrew.

For a more verbose explanation, this guide is concise and neat: https://helpx.adobe.com/it/experience-manager/kb/TakeThreadDump.html

Useful administration resources

Useful Kerberos commands

Since Kerberos was introduced, proper authentication has been enforced for all access to HDFS. Sometimes Analytics admins need to access resources of another user, for example to debug or fix something, so more powerful credentials are needed. In order to be like root on HDFS, you need to ssh to any of an-master1001, an-master1002 or an-coord1001 and run:

sudo -u hdfs kerberos-run-command hdfs command-that-you-want-to-execute

Special example to debug Yarn application logs:

sudo -u hdfs kerberos-run-command hdfs yarn logs -applicationId application_1576512674871_263229 -appOwner elukey

In the above case, Analytics admins will be able to pull logs for user elukey and app-id application_1576512674871_263229 without running into permission errors.

Restart daemons

On most of the Analytics infrastructure the following commands are available for Analytics admins:

sudo systemctl restart name-of-the-daemon
sudo systemctl start name-of-the-daemon
sudo systemctl stop name-of-the-daemon
sudo systemctl reset-failed name-of-the-daemon
sudo systemctl status name-of-the-daemon
sudo journalctl -u name-of-the-daemon

For example, let's say a restart for the Hive server is needed and no SREs are around:

sudo systemctl restart hive-server2

Airflow - backfilling and rerunning tasks

It is easy enough to find tasks in the Airflow UI and clear them so that the scheduler reruns them. However, if you have many tasks to re-run, using the CLI is much more efficient.

In Airflow terminology, 'backfilling' is distinct from 'rerunning'.

Backfilling - launching tasks for times before a DAG's start_date. Backfilling is not done by the Airflow scheduler.
Rerunning - Clearing task state, so that the Airflow scheduler reruns them.

The airflow CLI has an airflow dags backfill command. This command should only be used to schedule tasks that are not already managed by the Airflow scheduler. If you try to backfill a task that the scheduler manages, it will end up in a strange state. If this happens, delete the dagrun, and explicitly use airflow tasks run to re-create the dag run and run the task.

If you need to re-run already run tasks, use airflow tasks clear.
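As an illustration of rerunning from the CLI, the following sketch prints an airflow tasks clear command for the failed tasks of one DAG in a date range. --only-failed, --start-date and --end-date are options of the Airflow 2 CLI. print_airflow_clear is a hypothetical helper, and how to invoke the airflow CLI on a given instance (user, environment) is not covered here, so the command is only printed.

```shell
#!/usr/bin/env bash
# print_airflow_clear (hypothetical helper): print, but do not run, a command
# that clears only the failed task instances of a DAG between two dates, so
# that the Airflow scheduler reruns them.
print_airflow_clear() {
    local dag_id="$1" start_date="$2" end_date="$3"
    if [ -z "$dag_id" ] || [ -z "$start_date" ] || [ -z "$end_date" ]; then
        echo "usage: print_airflow_clear <dag_id> <start_date> <end_date>" >&2
        return 1
    fi
    printf 'airflow tasks clear --only-failed --start-date %s --end-date %s %s\n' \
        "$start_date" "$end_date" "$dag_id"
}
```

For example, print_airflow_clear canary_events 2024-09-01 2024-09-02 prints the command for the canary_events DAG. The dates here are made up.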