Analytics/Ops week


We rotate ops duty in pairs. On any given week, a member of the team is the primary on-duty ops person. This entails looking after the list of tasks detailed here. At the end of the week, the primary person stays on as a secondary and another member becomes primary.

This page can be a checklist of things to do during ops week. Please keep it concise and actionable; for longer explanations, open a new page and link to it from here.

The primary person is expected to monitor alarms and begin reacting to problems. The secondary person is there to pair, help with tasks that take more time, and, importantly, to decide which problems should be addressed immediately and which can be turned into a task that gets prioritized as normal. This last duty will be somewhat temporary until the team has established formal SLOs and error budgets.

If it's your ops week, there will be a one hour timebox in your Google Calendar to fulfill your duties, which by default is set to 3pm UTC every weekday. Obviously not everyone's schedule will be able to accommodate this time, but make sure that you have time during your ops week to dedicate 100% to ops if needed. This means that if you are working on a time critical task or you are taking vacation you might need to swap ops weeks with someone else.

Remember you're a first responder. This means you don't have to fix whatever is broken during your ops week, but you do have to alert the team, check the status of whatever failed, and report what happened. Alerts need to be acknowledged promptly so other team members know they are being taken care of.

Ops week is an opportunity to learn, so don't be shy about asking others for help.

Hand off

Ops week ends and begins every Thursday. We have a task grooming session scheduled for every Thursday, and the beginning of this meeting is used to quickly groom tasks in the 'Ops Week' column on the Analytics Phabricator board. The Ops Week column should then only contain actionable tasks for those on ops week duty the following week.


Any decisions taken with a subset of the team that all team members should know about need to be communicated to the rest of the team via e-mail to analytics-alerts@. Examples: deployments happening or not happening, issues we have decided not to fix but to schedule through regular kanban work, and new, ahem, "fun discoveries" that you think might help the next opsen.

Working in off hours

The developer ops rotation does not include working off hours, as its focus is on "routine" ops. The bulk of the analytics infrastructure is tier-2, and jobs that fail, for example, can be restarted the next day. If there are system issues (like the NameNode going down), Icinga alarms should fire and go to the SRE team.

The SRE team needs to make sure that at all times there is one person available for ops, which, in practice, means taking turns to take vacations.

Things to do at some point during the week

Ops week tasks

Check the Ops Week column in the Analytics Phabricator board to see if the SRE people have left anything for you to do.

Have any users left the Foundation?

We should check for users who are no longer employees of the Foundation, and offboard them from the various places in our infrastructure where user content is stored. Ideally a Phabricator task should be opened notifying the user and/or their team manager so that they have a chance to respond and move things. Generally, as part of the SRE offboarding script, we will be notified when a user leaves and the task will be added to the on-call column. We should remove the following:

  • Home directories in the stat machines.
  • Home directory in HDFS.
  • User hive tables (check if there is a hive database named after the user).
  • Ownership of hive tables needed by others (find tables with scripts below and ask what the new user should be, preferably a system account like analytics or analytics-search).

Use the following script from your local shell (it uses your ssh keys), and copy the output into the Phabricator task:


if [ -z "$1" ]; then
    echo "You need to input the username to check"
    exit 1
fi

for hostname in stat1004 stat1005 stat1006 stat1007 stat1008; do
  echo -e "\n====== $hostname ======"
  ssh $hostname.eqiad.wmnet "ls -l /srv/home/${1}"
  ssh $hostname.eqiad.wmnet "ls -l /var/userarchive | grep ${1}"
done

echo -e "\n======= HDFS ========"
ssh an-launcher1002.eqiad.wmnet "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -ls /user/${1}"

echo -e "\n====== Hive ========="
ssh an-launcher1002.eqiad.wmnet "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -ls /user/hive/warehouse/*" | grep ${1}

The tricky part is dropping data from the Hive warehouse. Before starting, let's remember that:

  • Under /user/hive/warehouse we can find directories related to databases (usually suffixed with .db) that in turn contain directories for tables.
  • If the table is internal, then the data files will be found in a subdir of the HDFS Hive warehouse directory.
  • If the table is external, then the data files could potentially be anywhere in HDFS.
  • A Hive DROP DATABASE command deletes metadata in the Hive Metastore; the actual HDFS files need to be cleaned up manually.

The following command is useful for finding the HDFS location of the files of the tables belonging to a Hive database:

for t in $(hive -S -e "show tables from $DATABASE_TO_CHECK;" 2>/dev/null | grep -v tab_name); do
  echo "checking table: $DATABASE_TO_CHECK.$t"
  hive -S -e "describe extended $DATABASE_TO_CHECK.$t" 2>/dev/null | egrep -o 'location:hdfs://[0-9a-z/\_-]+'
done

Suppose that we have a "elukey.db" database in the Hive warehouse, containing only internal tables. To clean it up:

  • Log into Hive and execute DROP DATABASE elukey CASCADE
  • Execute sudo -u hdfs hdfs dfs -rm -r /user/hive/warehouse/elukey.db
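
The two cleanup steps above can be sketched as a small dry-run script. This is just a sketch: the database name is the example from above, and the commands are printed rather than executed until you flip DRY_RUN after reviewing them:

```shell
#!/bin/bash
# Sketch of the Hive database cleanup above, for internal tables only.
# DRY_RUN=1 prints each command instead of executing it.
DB="elukey"
DRY_RUN=1

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# 1. Drop the database metadata in the Hive Metastore.
run hive -e "DROP DATABASE ${DB} CASCADE;"
# 2. Remove the underlying warehouse files; the DROP above does not delete them.
run sudo -u hdfs hdfs dfs -rm -r "/user/hive/warehouse/${DB}.db"
```

Remember that external tables can have data anywhere in HDFS, so their locations must be found (with the describe-extended loop above) and removed separately.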

Things to do every Tuesday

The Analytics deployment train 🚂

After the Analytics standup, around 5-6pm UTC, we deploy anything that has changes pending. Take a look at the Ready to Deploy column in the Kanban to identify what needs to be released, and notify the team of each deployment (using !log on IRC).

Deployments normally include restarting jobs and updating jar versions. Some of us have super powers and can keep all of that in their heads; others need the etherpad to write down the deployment plan. Please use it and document the deployment of the week here:

Here are instructions on each project's deployment:

Deploy Eventlogging

Deploy Wikistats UI

Deploy AQS

Deploy refinery source

Deploy refinery

Deploy Event Streams (more involved, ask Otto, but this is part of it)

Things to check every day

Is there a new Mediawiki History snapshot ready? (beginning of the month)

We need to tell AQS manually that a new snapshot is available so that the service reads from the latest data. This will happen around the middle of the month. Follow these instructions to do so.

The mediawiki history process is documented here: Analytics/Systems/Data Lake/Administration/Edit/Pipeline. Notice there are 3 sqoop jobs: one for mediawiki-history from labs, one for cu_changes (geoeditors), and one for sqooping tables like actor and comment from the production replicas.

Have new wikis been created?

This one will be easy to spot. If a new wiki receives pageviews and it's not on our whitelist, the team will start receiving alerts about pageviews from an unregistered Wikimedia site. The new wikis must be added in two places: the pageview whitelist and the sqoop groups list.

Adding new wikis to the sqoop list

First check that the wiki already has replicas...

  • In production DB:
# connect to the analytics production replica for the project to check
analytics-mysql azywiki
# list tables available for that project - an empty list means the project is not yet synchronized and shouldn't be added to the sqoop list
show tables;
  • In labs DB:
# connect to the analytics labs replica for the project to check (needs the password for user s53272)
mysql --database azywiki_p -h labsdb1012.eqiad.wmnet -u s53272 -p
# list available tables for the project - an empty list means the project is not yet synchronized and shouldn't be added to the sqoop list
show tables;

For wikis to be included in the next mediawiki history snapshot, they need to be added to the labs sqoop list. Add them to the last group (the one that contains the smallest wikis) here. Once the cluster is deployed (see section above) the wikis will be included in the next snapshot.

Adding new wikis to the pageview whitelist

First, list in Hive all the current pageview whitelist exceptions:

select * from wmf.pageview_unexpected_values where year=... and month=...;

That will tell you what you need to add to the list. Take a look at it, make sure it makes sense, and make a patch for it (example in gerrit). To stop the oozie alarms faster, you can merge your patch and sync the new file up to HDFS:

scp static_data/pageview/whitelist/whitelist.tsv an-launcher1002.eqiad.wmnet:
ssh an-launcher1002.eqiad.wmnet
sudo -u hdfs kerberos-run-command hdfs hdfs dfs -put -f whitelist.tsv /wmf/refinery/current/static_data/pageview/whitelist/whitelist.tsv

Have any alarms been triggered?

Canary alarms

The produce_canary_events job will fail if any event fails to be produced. EventGate returns the reason for failure in its response. This will be logged in the job's log file in /var/log/refinery/produce_canary_events/produce_canary_events.log on an-launcher1002 (as of 2021-01).

If an event is invalid, you'll see an error message like: Jan 12 19:15:04 an-launcher1002 produce_canary_events[15437]: HttpResult(failure) status: 207 message: Multi-Status. body: {"invalid":[{"status":"invalid","event":{"event":{"is_mobile":true,"user_editcount":123,"user_id":456,"impact_module_state":"activated","start_email_state":"noemail","homepage_pageview_token":"example token"},"meta":{"id":"b0caf18d-6c7f-4403-947d-2712bbe28610","stream":"eventlogging_HomepageVisit","domain":"canary","dt":"2021-01-12T19:15:04.339Z","request_id":"54df8880-61ce-4cf9-86fa-342c917ea622"},"dt":"2020-04-02T19:11:20.942Z","client_dt":"2020-04-02T19:11:20.942Z","$schema":"/analytics/legacy/homepagevisit/1.0.0","schema":"HomepageVisit","http":{"request_headers":{"user-agent":"Apache-HttpClient/4.5.12 (Java/1.8.0_272)"}}},"context":{"errors":[{"keyword":"required","dataPath":".event","schemaPath":"#/properties/event/required","params":{"missingProperty":"start_tutorial_state"},"message":"should have required property 'start_tutorial_state'"}],"errorsText":"'.event' should have required property 'start_tutorial_state'"}}],"error":[]}

The canary event is constructed from the schema's examples. In this error message, the schema at /analytics/legacy/homepagevisit/1.0.0 examples was missing a required field, and the event failed validation.
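
When the response body is long, a quick way to pull out the offending field names is to grep for the missingProperty entries. A minimal sketch using standard tools (the function name is ours, not part of any tooling):

```shell
# Extract the values of "missingProperty" entries from an EventGate
# multi-status response body read from stdin.
missing_props() {
  grep -o '"missingProperty":"[^"]*"' | cut -d'"' -f4 | sort -u
}

# Example usage against the job log:
#   grep 'HttpResult(failure)' produce_canary_events.log | missing_props
```

For the error body shown above, this prints start_tutorial_state.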

The fix will depend on the problem. In this specific case, what should have been an immutable schema was modified after EventGate had cached it, so EventGate needed a restart to flush its cache.

Ask ottomata for help with this if you encounter issues.

Entropy alarms

There are three different types of alarms: outage/censorship alarms, alarms for general oddities in the refine system (measured by calculating the entropy of user agents), and mobile pageview alarms.

Superset dashboard:

  • Outage/censorship alarms. These alarms measure changes in traffic per city. When they fire, check whether the overall volume of pageviews is OK; if it is, the issue might be due to an outage in a particular country. Traffic will troubleshoot further.
  • Eventlogging refine alarms on navigationtiming data. Variations here might indicate a problem in the refine pipeline (like all user agents now being null) or an update of the UA parser.
  • Mobile pageview alarms. An alarm might indicate a drop in mobile app or mobile web pageviews; do check the numbers in Turnilo. These alarms are not set up for desktop pageviews, as the nature of that timeseries is quite different.

A thorough description of the system can be found here.

Failed oozie jobs

Check if any work is already being done in IRC and the analytics SAL. Follow the steps in Analytics/Systems/Cluster/Oozie/Administration to restart a job.

Log any job restart. If you're going through oozie alert emails or something similar, double-check the SAL to see whether work has already been done to fix the problem.

Failed systemd/journald based jobs

Follow the page on managing systemd timers and see what you can do. Notify an ops engineer if you don't know what you're doing.

Data loss alarms

Follow the steps in Analytics/Systems/Dealing with data loss alarms

Mediawiki Denormalize checker alarms

If a mediawiki snapshot fails its Checker step, you will get an alarm via e-mail. This is what to do: Analytics/Systems/Cluster/Edit_history_administration#QA:_Assessing_quality_of_a_snapshot

Reportupdater failures

Check the time the alarm was triggered and look for the cause of the problem in the logs:

sudo journalctl -u reportupdater-interlanguage

(Replace interlanguage with the failed unit)

Druid indexation job fails

Take a look at the Admin console and the indexation logs. The logs show the detailed errors, but the console can be easier to look at and spot obvious problems.

Deletion script alarms

When data directories and Hive partitions are out of sync, the deletion scripts can fail. To resync directories and partitions, execute msck repair table <table_name>; in Hive.
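
If several tables are affected, a small loop saves typing. This is a sketch: the table names are hypothetical examples, and the hive call is left commented out so the plan can be reviewed before executing it:

```shell
# Re-sync Hive partitions with their HDFS directories for each affected table.
# Table names here are hypothetical; take the real ones from the alert.
for table in wmf.webrequest wmf.pageview_hourly; do
  stmt="MSCK REPAIR TABLE ${table};"
  echo "would run: hive -e \"${stmt}\""
  # To actually execute it:
  #   hive -e "${stmt}"
done
```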

Camus failure report

An error like Error on topic mediawiki_ApiAction - Latest offset time is either missing or not after the previous run's offset time probably means that data is no longer produced and the job needs to be turned off. If that's not the case, then investigate why the job is not working. For the example here, we updated puppet to tell Camus not to import that topic any more.

Refine failure report

An error like Refine failure report for /wmf/data/raw/eventlogging -> /wmf/data/event will usually have some actual exception information in the email itself. Take a look and most likely rerun the hour(s) that failed: Analytics/Systems/Refine#Rerunning_jobs

For errors like Refine failure report for /wmf/data/event -> /wmf/data/event_sanitized, visit Backfilling sanitization.

There are several systemd timers on an-launcher1001 that check daily whether any REFINE_FAILED flags have been left to fix, raising an Icinga alarm if so. You can find the timers' logs in the usual places (/var/log, etc.) or you can use the tool directly like this:

spark2-submit --class /srv/deployment/analytics/refinery/artifacts/refinery-job.jar --config_file=/etc/refinery/refine/ --since 24

Please note that the config_file is not unique; it depends on what kind of Spark refine you want to check (mediawiki-jobs, eventlogging, etc.). All the available ones are under /etc/refinery/refine/refine_failed_flags_* on an-launcher1001, but you can create your own config file like this:

elukey@stat1004:~$ cat /home/elukey/

elukey@stat1004:~$ spark2-submit --class /srv/deployment/analytics/refinery/artifacts/refinery-job.jar --config_file=/home/elukey/ --since 300

Debug Java applications in trouble

When a Java daemon misbehaves, it is absolutely vital to get some info from it before a restart; otherwise it will be difficult to report the problem upstream. The jstack utility seems to be the best candidate for this job, and plenty of guides can be found on the internet. For a quick copy/paste tutorial:

  • use ps auxff | grep $something to correctly identify the PID of the process
  • then run sudo -u $user jstack -l $PID > /tmp/thread_dump

The $user referenced in the last command is the user running the Java daemon. You should have the rights to sudo as that user, but if not please ping Luca or Andrew.
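
The two steps above can be wrapped in a small helper. This is just a sketch: the service user and process pattern in the example are hypothetical, and pgrep is used in place of the ps | grep step:

```shell
# Capture a thread dump from a misbehaving Java daemon before restarting it.
# Usage: thread_dump <run_user> <process_pattern> <output_file>
thread_dump() {
  run_user="$1"; pattern="$2"; out_file="$3"
  # Identify the PID of the daemon (equivalent to the ps | grep step above).
  pid="$(pgrep -u "$run_user" -f "$pattern" | head -n 1)"
  if [ -z "$pid" ]; then
    echo "no process matching '${pattern}' for user '${run_user}'" >&2
    return 1
  fi
  # Dump all thread stacks, including lock information, for the upstream report.
  sudo -u "$run_user" jstack -l "$pid" > "$out_file"
}

# Hypothetical example:
#   thread_dump hive HiveServer2 /tmp/thread_dump
```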

For a more verbose explanation, this guide is concise and neat:

Useful administration resources

Useful Kerberos commands

Since Kerberos was introduced, proper authentication has been enforced for all access to HDFS. Sometimes Analytics admins need to access resources of another user, for example to debug or fix something, so more powerful credentials are needed. In order to be like root on HDFS, you need to ssh to any of an-master1001, an-master1002 or an-coord1001 and run:

sudo -u hdfs kerberos-run-command hdfs command-that-you-want-to-execute

Special example to debug Yarn application logs:

sudo -u hdfs kerberos-run-command hdfs yarn logs -applicationId application_1576512674871_263229 -appOwner elukey

In the above case, Analytics admins will be able to pull logs for user elukey and app-id application_1576512674871_263229 without running into permission errors.

Restart daemons

On most of the Analytics infrastructure the following commands are available for Analytics admins:

sudo systemctl restart name-of-the-daemon
sudo systemctl start name-of-the-daemon
sudo systemctl stop name-of-the-daemon
sudo systemctl reset-failed name-of-the-daemon
sudo systemctl status name-of-the-daemon
sudo journalctl -u name-of-the-daemon

For example, let's say a restart for the Hive server is needed and no SREs are around:

sudo systemctl restart hive-server2