Every member of the team (excluding ops engineers) is part of a weekly rotation of a series of tasks detailed here. You may use this page as a checklist of things to do when it's your ops week. Please keep it concise and actionable, and for longer explanations open a new page and link to it from here.
If it's your ops week, there will be a one hour timebox in your Google Calendar to fulfill your duties, which by default is set to 3pm UTC every weekday. Obviously not everyone's schedule will be able to accommodate this time, but make sure to allocate some time each day of your ops week to do your tasks.
Remember you're a first responder. This means you don't have to fix whatever is broken during your ops week, but you have to alert the team, check the status of failed stuff, and report what's happened.
Things to do at some point during the week
Ops week tasks
Check the ops week column in the Analytics Phabricator board to see whether the SRE people have left anything for you to do.
Have any users left the Foundation?
We should check for users who are no longer employees of the Foundation, and offboard them from the various places in our infrastructure where user content is stored. Ideally a Phabricator task should be opened to notify the user and/or their team manager, so that they have a chance to respond and move things. Generally, as part of the SRE offboarding script we will be notified when a user leaves and the task will be added to the on call column. We should remove the following:
- Home directories in the stat machines.
- Home directory in HDFS.
- Home directories in the notebook machines.
- User hive tables (check if there is a hive database named after the user).
- Ownership of hive tables needed by others (find tables with scripts below and ask what the new user should be, preferably a system account like analytics or analytics-search).
Use the following script from your local shell (it uses your ssh keys), and copy the output into the Phabricator task:
#!/bin/bash
if [ -z "$1" ]
then
    echo "You need to input the username to check"
    exit 1
fi
for hostname in stat1004 stat1006 stat1007 notebook1003 notebook1004
do
    echo -e "\n====== $hostname ======"
    ssh $hostname.eqiad.wmnet ls -l /srv/home/$1
done
echo -e "\n======= HDFS ========"
ssh stat1004.eqiad.wmnet hdfs dfs -ls /user/$1
echo -e "\n====== Hive ========="
ssh stat1004.eqiad.wmnet hdfs dfs -ls /user/hive/warehouse/* | grep $1
The tricky part is dropping data from the Hive warehouse. Before starting, let's remember that:
- In /user/hive/warehouse we can find directories related to databases (usually suffixed with .db) that in turn contain directories for tables.
- If the table is internal, then the data files will be found in a subdir of the HDFS Hive warehouse directory.
- If the table is external, then the data files could potentially be anywhere in HDFS.
- A Hive drop database command deletes metadata in the Hive Metastore; the actual HDFS files need to be cleaned up manually.
The following command is useful for finding the HDFS location of the files of each table belonging to a Hive database:
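DATABASE_TO_CHECK=elukey
# List the tables of the database, skipping the "tab_name" header line.
for t in $(hive -S -e "show tables from $DATABASE_TO_CHECK;" 2>/dev/null | grep -v tab_name); do
    echo "checking table: $DATABASE_TO_CHECK.$t"
    # The character class includes "." so that paths containing ".db" are not cut short.
    hive -S -e "describe extended $DATABASE_TO_CHECK.$t" 2>/dev/null | grep -Eo 'location:hdfs://[0-9A-Za-z/._-]+'
done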
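As a rough aid for reading that output, the sketch below sorts each reported location into "inside the warehouse directory" or "elsewhere in HDFS". The function name classify_location and the analytics-hadoop authority in the usage example are hypothetical placeholders, not part of the existing tooling. A path inside the warehouse only hints at an internal table; it does not prove it.

```shell
#!/bin/bash
# Hypothetical helper (not part of the existing tooling): classify one
# "location:hdfs://..." line as printed by the loop above.
WAREHOUSE_DIR="/user/hive/warehouse"

classify_location() {
    local line="$1"
    # Drop the "location:" label and the "hdfs://<authority>" prefix,
    # keeping only the absolute HDFS path.
    local uri="${line#location:}"
    local rest="${uri#hdfs://}"
    local path="/${rest#*/}"
    case "$path" in
        "$WAREHOUSE_DIR"/*) echo "inside-warehouse $path" ;;
        *)                  echo "outside-warehouse $path" ;;
    esac
}
```

For example, classify_location 'location:hdfs://analytics-hadoop/user/hive/warehouse/elukey.db/t1' reports the path as inside the warehouse, while a location under any other directory is reported as outside and needs a separate look before anything is deleted.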
Suppose that we have a "elukey.db" database in the Hive warehouse, containing only internal tables. To clean it up:
- Log into Hive and execute:
DROP DATABASE elukey CASCADE;
- Then remove the database directory from HDFS:
sudo -u hdfs hdfs dfs -rm -r /user/hive/warehouse/elukey.db
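Because the last step is a recursive delete, it can help to print the commands for review before running anything. The following is a minimal dry-run sketch, not an established script: print_cleanup_commands is a hypothetical name, the restriction on database names is an assumption, and hive -e is used here only as a non-interactive stand-in for logging into Hive.

```shell
#!/bin/bash
# Hypothetical dry-run helper: prints the cleanup commands for a database
# instead of executing them, so they can be reviewed first.
print_cleanup_commands() {
    local db="$1"
    # Assumption: database names use only lowercase letters, digits and
    # underscores. Anything else (including an empty name) is refused, so a
    # typo can never produce a path such as /user/hive/warehouse/.db
    if ! [[ "$db" =~ ^[a-z0-9_]+$ ]]; then
        echo "Refusing suspicious database name: '$db'" >&2
        return 1
    fi
    echo "hive -e 'DROP DATABASE $db CASCADE;'"
    echo "sudo -u hdfs hdfs dfs -rm -r /user/hive/warehouse/$db.db"
}
```

Running print_cleanup_commands elukey prints the two commands from the steps above; copy them into the shell only after checking the table locations.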
Things to do every Wednesday
The Analytics deployment train 🚂
After the Scrum of Scrums meeting and the Analytics standup, around 5-6pm UTC, we deploy anything that has changes pending. Take a look at the Ready to Deploy column in the Kanban to identify what needs to be released, and notify the team of each deployment (using !log on IRC).
Deployments normally include restarting jobs and updating jar versions. While some of us have super powers and can keep all that in their heads, others need the etherpad to write down the deployment plan. Please use it and document the deployment of the week here: http://etherpad.wikimedia.org/p/analytics-weekly-train
Here are instructions on each project's deployment:
Things to check every day
Is there a new Mediawiki History snapshot ready? (beginning of the month)
We need to tell AQS manually that a new snapshot is available so that the service reads from the latest data. This will happen around the middle of the month. Follow these instructions to do so.
The MediaWiki history process is documented here: Analytics/Systems/Data Lake/Administration/Edit/Pipeline. Notice that there are 3 sqoop jobs: one for mediawiki-history from labs, one for cu_changes (geoeditors), and one for sqooping tables like actor and comment from the production replicas.
Have new wikis been created?
This one will be easy to spot. If a new wiki receives pageviews and it's not on our whitelist, the team will start receiving alerts about pageviews from an unregistered Wikimedia site. The new wikis must be added in two places: the pageview whitelist and the sqoop groups list.
Adding new wikis to the sqoop list
First check that the wiki already has a replica in labs:
analytics-mysql azywiki --print-target
For wikis to be included in the next mediawiki history snapshot, they need to be added to the labs sqoop list. Add them to the last group (the one that contains the smallest wikis) here. Once the cluster is deployed (see section above) the wikis will be included in the next snapshot.
Adding new wikis to the pageview whitelist
First list in hive all the current pageview whitelist exceptions:
select * from wmf.pageview_unexpected_values where year=... and month=...;
That will tell you what you need to add to the list. Take a look at it, make sure it makes sense, and make a patch for it (example in gerrit). To stop the oozie alarms faster, you can merge your patch and sync the new file up to HDFS:
scp static_data/pageview/whitelist/whitelist.tsv stat1007.eqiad.wmnet:
ssh stat1007.eqiad.wmnet
sudo -u hdfs hdfs dfs -put -f whitelist.tsv /wmf/refinery/current/static_data/pageview/whitelist/whitelist.tsv
Have any alarms been triggered?
Failed oozie jobs
Check if any work is already being done in IRC and the analytics SAL. Follow the steps in Analytics/Systems/Cluster/Oozie/Administration to restart a job.
Log any job restart: https://tools.wmflabs.org/sal/analytics. If you're going through oozie alert emails or something similar, double check the SAL to see whether work has already been done to fix the problem.
Failed systemd/journald based jobs
Follow the page on managing systemd timers and see what you can do. Notify an ops engineer if you don't know what you're doing.
Data loss alarms
Follow the steps in Analytics/Systems/Dealing with data loss alarms
Mediawiki Denormalize checker alarms
If a mediawiki snapshot fails its Checker step you will get an alarm via e-mail. This is what to do: Analytics/Systems/Cluster/Edit_history_administration#QA:_Assessing_quality_of_a_snapshot
Check the time at which the alarm was triggered and look for the cause of the problem in the logs:
sudo journalctl -u reportupdater-interlanguage
(Replace interlanguage with the failed unit)
Deletion script alarms
When data directories and hive partitions are out of sync the deletion scripts can fail.
To resync directories and partitions execute
msck repair table <table_name>; in Hive.
Camus failure report
An error like
Error on topic mediawiki_ApiAction - Latest offset time is either missing or not after the previous run's offset time
probably means that data is no longer produced and the job needs to be turned off. If that's not the case, then investigate why the job is not working. For the example here, we updated puppet to tell Camus not to import that topic any more: https://github.com/wikimedia/puppet/commit/bddc53
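When a report contains many such lines, listing the affected topics once each makes triage quicker. The sketch below assumes that the report is available as plain text with one error per line in the format quoted above. The function name camus_failed_topics is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read a Camus failure report on stdin and print each
# affected topic once. Assumes one "Error on topic <name> - ..." per line.
camus_failed_topics() {
    # Keep only the lines that match, print the captured topic name,
    # then de-duplicate.
    sed -n 's/^.*Error on topic \([^ ]*\) - .*$/\1/p' | sort -u
}
```

Usage: save the body of the alert email to a file and run camus_failed_topics < report.txt (report.txt is a placeholder file name).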
Refine failure report
An error like
Refine failure report for /wmf/data/raw/eventlogging -> /wmf/data/event
will usually have some actual exception information in the email itself. Take a look and most likely rerun the hour(s) that failed: Analytics/Systems/Refine#Rerunning_jobs
Debug Java applications in trouble
When a java daemon misbehaves (like in https://phabricator.wikimedia.org/T226035) it is absolutely vital to get some info from it before a restart, otherwise it will be difficult to report a problem upstream. The jstack utility seems to be the best candidate for this job, and plenty of guides can be found on the internet. For a quick copy/paste tutorial:
- Run ps -auxff | grep $something to correctly identify the PID of the process.
- Then run:
sudo -u $user jstack -l $PID > /tmp/thread_dump
The $user referenced in the last command is the user running the Java daemon. You should have the rights to sudo as that user, but if not please ping Luca or Andrew.
For a more detailed explanation, this guide is concise and neat: https://helpx.adobe.com/it/experience-manager/kb/TakeThreadDump.html
Useful administration resources
Useful Kerberos commands
Since Kerberos was introduced, proper authentication has been enforced for all access to HDFS. Sometimes Analytics admins need to access resources of another user, for example to debug or fix something, so more powerful credentials are needed. In order to be like root on HDFS, you need to ssh to any of an-master1001, an-master1002 or an-coord1001 and run:
sudo -u hdfs kerberos-run-command hdfs command-that-you-want-to-execute
Special example to debug Yarn application logs:
sudo -u hdfs kerberos-run-command hdfs yarn logs -applicationId application_1576512674871_263229 -appOwner elukey
In the above case, Analytics admins will be able to pull logs for user elukey and app-id application_1576512674871_263229 without running into permission errors.
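If you type this prefix often, a small shell function can save keystrokes. This is only a sketch: the function name run_as_hdfs and the DRY_RUN variable are assumptions introduced here, not existing tooling. With DRY_RUN set, the function prints the full command instead of executing it, which is handy for double-checking before acting as the hdfs user.

```shell
#!/bin/bash
# Hypothetical wrapper around the kerberos-run-command invocation shown above.
# DRY_RUN is an assumption of this sketch: when set, the command is only printed.
run_as_hdfs() {
    if [ "$#" -eq 0 ]; then
        echo "usage: run_as_hdfs command-that-you-want-to-execute" >&2
        return 1
    fi
    if [ -n "${DRY_RUN:-}" ]; then
        echo "sudo -u hdfs kerberos-run-command hdfs $*"
    else
        sudo -u hdfs kerberos-run-command hdfs "$@"
    fi
}
```

For example, DRY_RUN=1 run_as_hdfs yarn logs -applicationId application_1576512674871_263229 -appOwner elukey prints the same command as in the example above.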
On most of the Analytics infrastructure the following commands are available for Analytics admins:
sudo systemctl restart name-of-the-daemon
sudo systemctl start name-of-the-daemon
sudo systemctl stop name-of-the-daemon
sudo systemctl reset-failed name-of-the-daemon
sudo systemctl status name-of-the-daemon
sudo journalctl -u name-of-the-daemon
For example, let's say a restart for the Hive server is needed and no SREs are around:
sudo systemctl restart hive-server2