Data Engineering/Ops week


Ops week duty rotation is a way for the data engineering team to have a designated responsible party for monitoring the applications and jobs that run on our infrastructure, as well as for handling unplanned maintenance work that arises during a sprint. Having a single person responsible for much of this work allows the rest of the team to focus on planned work.

Ops week rotation is important because it:

  • Ensures that the team participates equally in ad hoc reactive work. Previously, engineers in the earlier time zones were the first ones to observe the alerts and were disproportionately impacted
  • Is a way to share knowledge
  • Is orthogonal across value streams, ensuring that there is ownership for all the systems
  • Ensures that there is clear responsibility for certain tasks
  • Helps the rest of the team focus on planned work

This page serves as a checklist of things to do during ops week. Please keep it concise and actionable; for longer explanations, open a new page and link to it from here.

Ops week assignment

The ops week rotation assignment is managed by the program manager and/or the engineering manager via the team calendar. The calendar is updated a couple of months in advance and should take into account team members' availability in that week.

We rotate ops duty in pairs. On any given week, a member of the team is the primary on-duty ops person. This entails looking after the list of tasks detailed here. At the end of the week, the primary person stays on as secondary and another member becomes primary. The primary person is expected to monitor alarms and begin reacting to problems. The secondary person is there to pair, help with tasks that take more time, and, importantly, to decide which problems should be addressed immediately and which can be turned into a task that gets prioritized as normal.

Hand off

Ops week begins and ends on Thursday of every week. The 'Ops Week' column on the Data Engineering Phabricator board contains actionable tasks for those on Ops Week duty.

An automatic Slack alert has been set up to remind the team about the handoff. Please provide any relevant handoff information in the Slack thread, and if necessary follow up in person with the next person taking over ops duty.

All Phabricator tasks should be updated with details of the work that was done. Any assigned tasks that have not been started should be unassigned. Likewise, any work in progress should either be completed or handed over to the next person.

Part of the handoff should be to document, in this Google doc, the alerts received during the week and how they were handled.

Communication

Email

Any decisions taken with a subset of the team that all team members should know about need to be communicated to the rest of the team via e-mail to:

data-engineering-alerts at lists.wikimedia.org

Examples: deployments happening or not happening, and issues we have decided not to fix immediately but to schedule through regular sprint work.

If you're not subscribed to the list, you can do so here: https://lists.wikimedia.org/postorius/lists/data-engineering-alerts.lists.wikimedia.org/

IRC and the Server Admin Log (SAL)

Log any job re-runs to the Analytics SAL by going to #wikimedia-analytics on IRC and prefixing an entry with !log.
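For example, a SAL entry for a hypothetical re-run (the job and hour are made up for illustration) might look like:

!log Rerun refine_eventlogging_analytics for 2024-01-10T14:00 after transient failure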

If you're responding to Airflow alert emails or refine job failure emails or something similar, double check the SAL to see whether work has already been done to fix the problem.

Working in off hours

The ops week rotation does not include working off hours, as its focus is on "routine" ops. The bulk of the analytics infrastructure is tier 2, and failed jobs, for example, can be restarted the next day. If there are system issues (like the NameNode going down), Icinga alarms should fire and go to the SRE team.

The SRE team needs to make sure that at all times there is one person available for ops, which in practice means taking turns to take vacations.

Ops week duties

If it's your ops week, you should prioritize ops week tasks over regular work. If you are working on a time-critical task or are taking vacation, you might need to swap ops weeks with someone else.

The ops duty is separate from the critical alerts that the SRE team receives and is responsible for.

Remember you're a first responder. This means you don't have to fix whatever is broken during your ops week, but you do have to alert the team, check the status of failed jobs, and report what happened. Alerts need to be acknowledged promptly so other team members know they are being taken care of.

Ops week is an opportunity to learn; do not be shy about asking others for help.

In summary the ops week duties include:

  • Checking #wikimedia-analytics on IRC for anything ongoing that requires your attention
  • Removing and adding users
  • Managing the deployment train: Deployment trains
  • Monitoring alerts
  • Attending to manual tasks that need intervention and review
  • Monitoring the dashboards listed below
  • Monitoring the #data-engineering-collab Slack channel to ensure that there is a response within 24 hours. It is up to the whole team to monitor this channel, but if no one responds, the ops week person should acknowledge the question/message and assess what followup is needed, if any

Things to do at some point during the week

Monitor Dashboards

Here is a list of relevant dashboards to bookmark and monitor:

Have any users left the Foundation?

We should check for users who are no longer employees of the Foundation and offboard them from the various places in our infrastructure where user content is stored. Ideally, a Phabricator task should be opened notifying the user and/or their team manager so that they have a chance to respond and move things. Generally, as part of the SRE offboarding script, we will be notified when a user leaves and the task will be added to the on-call column. We should remove the following:

  • Home directories on the stat machines.
  • Home directory in HDFS.
  • User hive tables (check if there is a hive database named after the user).
  • Ownership of hive tables needed by others (find tables with the script below and ask what the new owner should be, preferably a system account like analytics or analytics-search).

Use the check-user-leftovers script from your local shell (it uses your ssh keys), and copy the output into the Phabricator task. The tricky part is dropping data from the Hive warehouse. Before starting, let's remember that:

  • Under /user/hive/warehouse we can find directories related to databases (usually suffixed with .db) that in turn contain directories for tables.
  • If the table is internal, then the data files will be found in a subdir of the HDFS Hive warehouse directory.
  • If the table is external, then the data files could potentially be anywhere in HDFS.
  • A Hive drop database command only deletes metadata in the Hive Metastore; the actual HDFS files need to be cleaned up manually.

The following command is useful for finding the location of the HDFS files of the tables belonging to a Hive database:

DATABASE_TO_CHECK=elukey
for t in $(hive -S -e "show tables from $DATABASE_TO_CHECK;" 2>/dev/null | grep -v tab_name); do
    echo "checking table: $DATABASE_TO_CHECK.$t"
    hive -S -e "describe extended $DATABASE_TO_CHECK.$t" 2>/dev/null | egrep -o 'location:hdfs://[0-9a-z/_-]+'
done

Full removal of files and Hive databases and tables

Once the list of things to remove has been reviewed and it has been determined that removal is okay, the following commands will help you do so:

Drop Hive database. From an-launcher1002 (or somewhere with the hdfs user account):

sudo -u hdfs kerberos-run-command hdfs hive
> DROP DATABASE <user_database_name_here> CASCADE;

Remove user's Hive warehouse database directory. From an-launcher1002:

sudo -u hdfs kerberos-run-command hdfs hdfs dfs -rm -r /user/hive/warehouse/<user_database_name_here>.db

Remove HDFS homedir. From an-launcher1002:

sudo -u hdfs kerberos-run-command hdfs hdfs dfs -rm -r /user/<user_name>

Remove homedirs from regular filesystem on all nodes. From cumin1001:

sudo cumin 'C:profile::analytics::cluster::client or C:profile::hadoop::master or C:profile::hadoop::master::standby' 'rm -rf /home/<user_name>'
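Before running a destructive rm -rf across many hosts, it can help to first confirm which hosts still have the directory by running a harmless command with the same host query (a sketch; the user name is a placeholder):

# Preview which hosts still have the home directory before deleting it
sudo cumin 'C:profile::analytics::cluster::client or C:profile::hadoop::master or C:profile::hadoop::master::standby' 'ls -ld /home/<user_name>'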

Archival of user files

If necessary, we archive user files in HDFS in /wmf/data/archive/user/<user_name>. Make a directory with the user's name and copy any files there as needed. Set appropriate permissions if the archive should be shared with other users. For example:

sudo -u hdfs kerberos-run-command hdfs hdfs dfs -mv /user/elvis /wmf/data/archive/user/
sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chown -R hdfs:analytics-privatedata-users /wmf/data/archive/user/elvis
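To confirm the move and the new ownership afterwards (same example user):

sudo -u hdfs kerberos-run-command hdfs hdfs dfs -ls /wmf/data/archive/user/elvis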

Things to do every Tuesday

The Data Engineering deployment train 🚂

After the Data Engineering standup, around 5-6pm UTC, we deploy anything that has pending changes. Deployments normally include restarting jobs, updating jar versions, etc.

What to deploy:

Here are instructions on each project's deployment:

Notify the team of each deployment (using !log on IRC).

Things to check every day

(We're centralizing all scheduled manual work at Analytics/Systems/Manual_maintenance, both as a reminder and as a TODO list for automation.)

Is there a new Mediawiki History snapshot ready? (beginning of the month)

Usually around the second day of the month, an email will arrive on the Data Engineering Alerts list with a subject line like this:

[Data-engineering-alerts] Mediawiki_history for 2023-12 now available

There are two AQS 2.0 endpoints that need to be updated whenever a new mediawiki_history_reduced snapshot is available, so that the services read from the latest data. These are:

  • Edit Analytics
  • Editor Analytics

All AQS 2.0 services now run under Kubernetes, specifically on the WikiKube clusters.

The process of updating these two AQS 2.0 services that use the Mediawiki History Snapshot is a matter of following the standard Kubernetes deployment procedures.

Here is a reference patch, updating the two services at the start of January 2024 to use the December 2023 snapshot:

A quick reference for the commands to execute is as follows, but please make sure that you understand the commands and run them sequentially, checking for error output at each stage:

ssh deployment.eqiad.wmnet
cd /srv/deployment-charts/helmfile.d/services/edit-analytics
git log -n 1 # check that the change has been merged
# Deploy to staging first, then to each production cluster
helmfile -e staging -i apply --context=5
helmfile -e codfw -i apply --context=5
helmfile -e eqiad -i apply --context=5
cd ../editor-analytics
helmfile -e staging -i apply --context=5
helmfile -e codfw -i apply --context=5
helmfile -e eqiad -i apply --context=5
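After applying, a quick look at the pods can catch obvious failures. A sketch, assuming the kube_env helper available on the deployment host (service and cluster names as above):

# Point kubectl at the service's namespace in a given cluster, then check pod health
kube_env edit-analytics eqiad
kubectl get pods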

The mediawiki history process is documented here: Analytics/Systems/Data Lake/Administration/Edit/Pipeline

Notice there are 3 sqoop jobs: one for mediawiki-history from labs, one for cu_changes (geoeditors), and one for sqooping tables like actor and comment from the production replicas.

Have new wikis been created?

If a new wiki receives pageviews and it's not on our allowlist, the team will start receiving alerts about pageviews from an unregistered Wikimedia site. New wikis must be added in two places: the pageview allowlist and the sqoop groups list.

Adding new wikis to the sqoop list

First, check that the wiki already has replicas:

  • In production DB:
# connect to the analytics production replica for the project to check
analytics-mysql azywiki
# list tables available for that project - An empty list means the project is not yet synchronized and shouldn't be added to sqoop list
show tables;
  • In labs DB:
# connect to the analytics labs replica for the project to check (needs the password for user s53272)
# you need to know which port to use, so first you check for that in production DB:
analytics-mysql --print-target azywiki
dbstore1003.eqiad.wmnet:3315

# Keep only the port (3315 in this example) and use it when connecting to cloudDB:
mysql --database azywiki_p -h clouddb1021.eqiad.wmnet -P 3315 -u s53272 -p
# list available tables for the project - An empty list means the project is not yet synchronized and shouldn't be added to sqoop list
show tables;

For wikis to be included in the next mediawiki history snapshot, they need to be added to the labs sqoop list. Add them to the last group (the one that contains the smallest wikis) in grouped_wikis.csv, as sketched below. Once the change is deployed to the cluster (see section above), the wikis will be included in the next snapshot.
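A sketch of the edit workflow, assuming grouped_wikis.csv lives under static_data in the analytics/refinery repository (verify the actual path, and match the format of the surrounding lines):

git clone ssh://gerrit.wikimedia.org:29418/analytics/refinery
cd refinery
# Append the new wiki to the last group (the smallest wikis)
$EDITOR static_data/mediawiki/grouped_wikis/grouped_wikis.csv
git commit -a -m "Add azywiki to sqoop grouped wikis"
git review  # assumes git-review is set up for Gerrit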

Adding new wikis to the pageview allow list

First, list in Hive all the current pageview allowlist exceptions:

select * from wmf.pageview_unexpected_values where year=... and month=...;

That will tell you what you need to add to the list. Take a look at it, make sure it makes sense, and make a Gerrit patch for it (example in Gerrit) in analytics-refinery/static-data.

To stop the Airflow alarms faster, you can merge your patch and sync the new file up to HDFS:

scp static_data/pageview/allowlist/allowlist.tsv an-launcher1002.eqiad.wmnet:
ssh an-launcher1002.eqiad.wmnet
sudo -u hdfs kerberos-run-command hdfs hdfs dfs -put -f allowlist.tsv /wmf/refinery/current/static_data/pageview/allowlist/allowlist.tsv
sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod +r /wmf/refinery/current/static_data/pageview/allowlist/allowlist.tsv

Note: beware when editing this file. It's a TSV, and your editor may replace tabs with spaces without telling you.
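A quick sanity check before syncing, to make sure real tab characters survived your edit (assumes GNU grep):

# Count lines that still contain a literal tab; 0 means something went wrong
grep -cP '\t' static_data/pageview/allowlist/allowlist.tsv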

Have any alarms been triggered?

Airflow

Airflow jobs run on one of our Airflow instances. If you receive an email like <Taskinstance: ... [failed]> or SLA miss on DAG, the best way to investigate is the Airflow UI (see Airflow Instance-specific instructions). Once in the Airflow UI, you can check logs and retry or pause the DAG accordingly. If the error was transient or you can deploy a simple fix, you can retry by clearing the failed task instances. Otherwise, pausing the DAG will stop further alert emails. Airflow documentation is excellent and worth learning; you can find links in the UI header.
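If you prefer a shell over the UI, the Airflow CLI has equivalents. A sketch, assuming shell access to the relevant Airflow instance, Airflow 2.x, and a made-up DAG id and date range:

# Pause a noisy DAG to stop further alert emails
airflow dags pause example_dag_id
# Clear failed task instances so the scheduler retries them
airflow tasks clear example_dag_id --start-date 2024-01-10 --end-date 2024-01-11 --only-failed --yes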

Canary alarms

The produce_canary_events job will fail if any event fails to be produced. EventGate returns the reason for failure in its response. This will be logged in the job's log file in /var/log/refinery/produce_canary_events/produce_canary_events.log on an-launcher1002 (as of 2021-01).

The produce_canary_events job runs every 15 minutes (4 times an hour). It is important that at least one of these runs is successful. For streams that are mostly idle, we use the presence of a canary event to differentiate between an idle stream and a broken stream. Each stream needs at least one event per hour when we ingest that stream into the Hive event database, in order to know that the partition for that hour is complete.

As long as produce_canary_events recovers within an hour, you can probably ignore this alert. If it does not, users that depend on the presence of Hive hourly partitions may encounter timeouts waiting for a partition to be created. See this issue for an example.
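To check whether canary events are arriving for a given stream, you can count them in the corresponding Hive event table. A sketch with a made-up table and hour; canary events carry "canary" in their meta.domain field (visible in the error example below):

hive -e "SELECT COUNT(*) FROM event.navigationtiming WHERE year=2024 AND month=1 AND day=10 AND hour=14 AND meta.domain='canary';"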


If an event is invalid, you'll see an error message in /var/log/refinery/produce_canary_events/produce_canary_events.log like:

Jan 12 19:15:04 an-launcher1002 produce_canary_events[15437]: HttpResult(failure) status: 207 message: Multi-Status. body: {"invalid":[{"status":"invalid","event":{"event":{"is_mobile":true,"user_editcount":123,"user_id":456,"impact_module_state":"activated","start_email_state":"noemail","homepage_pageview_token":"example token"},"meta":{"id":"b0caf18d-6c7f-4403-947d-2712bbe28610","stream":"eventlogging_HomepageVisit","domain":"canary","dt":"2021-01-12T19:15:04.339Z","request_id":"54df8880-61ce-4cf9-86fa-342c917ea622"},"dt":"2020-04-02T19:11:20.942Z","client_dt":"2020-04-02T19:11:20.942Z","$schema":"/analytics/legacy/homepagevisit/1.0.0","schema":"HomepageVisit","http":{"request_headers":{"user-agent":"Apache-HttpClient/4.5.12 (Java/1.8.0_272)"}}},"context":{"errors":[{"keyword":"required","dataPath":".event","schemaPath":"#/properties/event/required","params":{"missingProperty":"start_tutorial_state"},"message":"should have required property 'start_tutorial_state'"}],"errorsText":"'.event' should have required property 'start_tutorial_state'"}}],"error":[]}

The canary event is constructed from the schema's examples. In this case, the examples in the schema at /analytics/legacy/homepagevisit/1.0.0 were missing a required field, and the generated event failed validation.

The fix will depend on the problem. In this specific case, we modified what should have been an immutable schema after eventgate had cached it, so eventgate needed a restart to flush its caches.

Ask ottomata for help with this if you encounter issues.

Anomaly detection alarms

There are three different types of alarms: outage/censorship alarms, alarms for general oddities in the refine system (measured by calculating the entropy of user agents), and mobile pageview alarms.

Superset dashboard: https://superset.wikimedia.org/superset/dashboard/315/

  • Outage/censorship alarms. These alarms measure changes in traffic per city. When they fire, look at whether the overall volume of pageviews is OK; if it is, the issue might be due to an outage in a particular country. The Traffic team will troubleshoot further.
  • Eventlogging refine alarms on navigationtiming data. Variations here might indicate a problem in the refine pipeline (like all user agents suddenly being null) or an update of the UA parser.
  • Mobile pageview alarms. An alarm might indicate a drop in mobile app or mobile web pageviews; do check the numbers in Turnilo. These alarms are not set up for desktop pageviews, as the nature of that time series is quite different.


A thorough description of the system can be found at Analytics/Data_quality/Traffic_per_city_entropy.

Failed systemd/journald based jobs

Follow the page on managing systemd timers and see what you can do. Notify an SRE if you don't feel confident in what you're doing, or do not have the required rights.

Examples: sqoop jobs, reportupdater jobs, and everything not yet migrated to Airflow.
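A minimal triage sketch (the unit name here is just an example; substitute the one that alerted):

# List failed units on the host
systemctl --failed
# Inspect recent logs for the failing unit
sudo journalctl -u reportupdater-interlanguage.service -n 100
# Once the underlying issue is fixed, clear the failed state
sudo systemctl reset-failed reportupdater-interlanguage.service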

HDFS alarms

HdfsCorruptBlocks. To check if it's still a problem, from an-launcher1002:

sudo -u hdfs kerberos-run-command hdfs hdfs fsck / -list-corruptfileblocks

Sqoop failures

Sqoop is run on systemd timers right now (it may move to Airflow soon, or we may migrate it to Spark). If it fails, look at Analytics/Systems/Cluster/Edit_history_administration.

Data loss alarms

Follow the steps in Analytics/Systems/Dealing with data loss alarms

Mediawiki Denormalize checker alarms

If a mediawiki snapshot fails its Checker step, you will get an alarm via e-mail. This is what to do: Analytics/Systems/Cluster/Edit_history_administration#QA:_Assessing_quality_of_a_snapshot

Reportupdater failures

Check the time the alarm was triggered and look for the cause of the problem in the logs:

sudo journalctl -u reportupdater-interlanguage

(Replace interlanguage with the failed unit)

Druid indexation job fails

Take a look at the Admin console and the indexation logs. The logs show the detailed errors, but the console can make it easier to spot obvious problems.

Deletion script alarms

When data directories and Hive partitions are out of sync, the deletion scripts can fail. To resync directories and partitions, execute msck repair table <table_name>; in Hive.
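For example, with a hypothetical table, and following the kerberos-run-command pattern used elsewhere on this page:

# Re-sync Hive partition metadata with the directories actually present in HDFS
sudo -u analytics kerberos-run-command analytics hive -e "msck repair table wmf.example_table;"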

Refine failure report

An error like Refine failure report for refine_eventlogging_analytics will usually have some actual exception information in the email itself. Take a look, and most likely rerun the hour(s) that failed: Analytics/Systems/Refine#Rerunning_jobs


Refine's failure report email should contain the flags you need to rerun whatever might have failed. The name of the script to run is in the subject line; e.g., if the subject line says 'refine_eventlogging_analytics', that is the command you will need to run. You can copy and paste the flags from the failure email into this command, like:

sudo -u analytics kerberos-run-command analytics refine_eventlogging_analytics --ignore_failure_flag=true --table_include_regex='contenttranslation' --since='2021-11-29T22:00:00.000Z' --until='2021-11-30T23:00:00.000Z'

Wait for the job to finish, and don't forget to check the logs to see that all went well:

sudo -u analytics yarn logs -applicationId <app_id_here> | grep Refine:
...
21/12/02 19:30:53 INFO Refine: Finished refinement of dataset hdfs://analytics-hadoop/wmf/data/raw/eventlogging_legacy/eventlogging_ContentTranslation/year=2021/month=11/day=30/hour=21 -> `event`.`contenttranslation` /wmf/data/event/contenttranslation/year=2021/month=11/day=30/hour=21. (# refined records: 66)

For errors like RefineMonitor problem report for job monitor_refine_sanitize_eventlogging_analytics, visit Backfilling sanitization.

Mediawiki page content change enrich alarms

This is a real-time data processing application that consumes the mediawiki.page_change.v1 topic, performs a lookup join (HTTP) with the Action API to retrieve raw page content, and produces an enriched event into the mediawiki.page_content_change.v1 topic. The Event Platform team is responsible for this application.

Alerts will be triggered when SLI performance degrades, or when the application stops running in k8s main.

Follow-up steps and escalation are described in the application SLO: MediaWiki Event Enrichment/SLO/Mediawiki Page Content Change Enrichment#Troubleshooting.
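One quick way to verify that enriched events are still flowing is to consume the most recent message from the output topic. A sketch using kafkacat; the broker host and the datacenter topic prefix are assumptions to verify:

# Read the single most recent message from the enriched topic, then exit
kafkacat -C -b kafka-jumbo1001.eqiad.wmnet:9092 -t eqiad.mediawiki.page_content_change.v1 -o -1 -c 1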

Flink kubernetes operator alarms

Infrastructure for running Flink clusters on k8s. The Event Platform and Search teams are responsible for this infrastructure. Alerts are raised if the operator is not running in k8s main.

Debug Java applications in trouble

When a Java daemon misbehaves (like in https://phabricator.wikimedia.org/T226035), it is absolutely vital to get some info from it before a restart; otherwise it will be difficult to report a problem upstream. The jstack utility seems to be the best candidate for this job, and plenty of guides can be found on the internet. For a quick copy/paste tutorial:

  • use ps auxff | grep $something to correctly identify the PID of the process
  • then run sudo -u $user jstack -l $PID > /tmp/thread_dump

The $user referenced in the last command is the user running the Java daemon. You should have the rights to sudo as that user, but if not please ping Luca or Andrew.

For a more verbose explanation, this guide is concise and neat: https://helpx.adobe.com/it/experience-manager/kb/TakeThreadDump.html

Useful administration resources

Useful Kerberos commands

Since Kerberos was introduced, proper authentication has been enforced for all access to HDFS. Sometimes Analytics admins need to access another user's resources, for example to debug or fix something, so more powerful credentials are needed. In order to act as root on HDFS, ssh to any of an-master1001, an-master1002 or an-coord1001 and run:

sudo -u hdfs kerberos-run-command hdfs command-that-you-want-to-execute

Special example to debug Yarn application logs:

sudo -u hdfs kerberos-run-command hdfs yarn logs -applicationId application_1576512674871_263229 -appOwner elukey

In the above case, Analytics admins will be able to pull logs for user elukey and app-id application_1576512674871_263229 without running into permission errors.

Restart daemons

On most of the Analytics infrastructure, the following commands are available to Analytics admins:

sudo systemctl restart name-of-the-daemon
sudo systemctl start name-of-the-daemon
sudo systemctl stop name-of-the-daemon
sudo systemctl reset-failed name-of-the-daemon
sudo systemctl status name-of-the-daemon
sudo journalctl -u name-of-the-daemon

For example, let's say a restart for the Hive server is needed and no SREs are around:

sudo systemctl restart hive-server2