Data Platform/Systems/Managing systemd timers
In phab:T172532 we decided to move away from cron jobs as much as possible, since we were struggling to alert when a job failed. Most of our recurrent jobs run on an-coord1001 (the analytics cluster coordinator), and the majority of them are systemd timers.
Anatomy of a Systemd timer
A systemd timer is composed of two parts: a unit and the timer itself. Let's pick an example:
elukey@an-coord1001:~$ sudo systemctl cat hdfs-balancer.timer
# /lib/systemd/system/hdfs-balancer.timer
[Unit]
Description=Periodic execution of hdfs-balancer.service
[Timer]
Unit=hdfs-balancer.service
# Accuracy sets the maximum time interval around the execution time we want to allow
AccuracySec=15sec
OnCalendar=*-*-* 06:00:00
RandomizedDelaySec=0
[Install]
WantedBy=multi-user.target
elukey@an-coord1001:~$ sudo systemctl cat hdfs-balancer.service
# /lib/systemd/system/hdfs-balancer.service
[Unit]
Description=Run the HDFS balancer script to keep HDFS blocks replicated in the most redundant and efficient way.
[Service]
User=hdfs
ExecStart=/usr/local/bin/hdfs-balancer
In this case, we are checking how the HDFS balancer timer is configured. The .timer file references a systemd unit (the .service file), which is the one doing the actual work, and a pattern of execution (in this case, every day at 6 AM UTC). The .service file defines a command, together with other info like environment variables and the user that needs to execute it. The command that the .service points to can be generated by puppet, so peek into it if you need to see the details, for example: sudo -u hdfs cat /usr/local/bin/refinery-sqoop-mediawiki.
We alert on the return code of the .service systemd unit: if it gets executed and the script referenced by ExecStart returns a non-zero exit code, Nagios will alert us.
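If you want to double check the exit code the alerting keys on, systemctl can show it directly. A quick check, using the hdfs-balancer unit from the example above:
# ExecMainStatus is the exit code of the last run; Result is 'success' or e.g. 'exit-code'
sudo systemctl show -p ExecMainStatus,Result hdfs-balancer.service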
All these configuration files may seem a bit confusing, but they are convenient from the operations point of view. A systemd .service can be started/restarted at any time, basically forcing the execution of its script. In this case:
sudo systemctl start hdfs-balancer.service
forces the hdfs-balancer to execute. Where do we find logs? At the moment the majority of them are only in journald, but they will be logged to files as well as soon as possible. If you want to check journald for a particular .service:
# start from the first log registered
sudo journalctl -u hdfs-balancer
# start from a specific moment of time
sudo journalctl -u hdfs-balancer --since '3 days ago'
# show only a specific timeframe
sudo journalctl --since "2018-10-21 23:15:00" --until "2018-10-21 23:20:00"
# tail -f equivalent
sudo journalctl -u hdfs-balancer --follow
An example of debugging for db1107: https://phabricator.wikimedia.org/T207165
ExecStart and differences from bash
Please note that the command executed by the service unit (the one assigned to the ExecStart field) is parsed following a systemd-specific format that is similar to what bash offers but very different in many ways. If you don't keep this rule in mind you'll end up writing units that either fail due to weird argument parsing or, even worse, execute with the wrong parameters. The fix in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/530555/ is a good example: a shell redirection (command > file.log) is not parsed by systemd the way we expect, since systemd 1) does not implement shell redirection (so '>' is not treated as a redirection operator), and 2) splits the command line on whitespace, treating every unquoted token as a separate argument.
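A minimal sketch of the pitfall and one possible fix (my-script and its log path are made up for illustration):
# BROKEN: systemd does not implement shell redirection, so '>' and
# '/var/log/my-script.log' are passed to my-script as two literal arguments.
[Service]
ExecStart=/usr/local/bin/my-script > /var/log/my-script.log
# One possible fix: wrap the command in an explicit shell, so that the
# redirection is interpreted by /bin/sh rather than by systemd.
[Service]
ExecStart=/bin/sh -c '/usr/local/bin/my-script > /var/log/my-script.log'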
Calculating OnCalendar interval
The OnCalendar interval syntax is powerful, but it isn't always the most straightforward to figure out. There is a command to help: systemd-analyze.
To check the syntax of OnCalendar, you can run e.g.
$ systemd-analyze calendar '*-*-* *:05,35:00'
Normalized form: *-*-* *:05,35:00
Next elapse: Thu 2021-07-08 14:35:00 UTC
From now: 24min left
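On recent systemd versions, systemd-analyze can also print more than one future run, which helps sanity-checking weekly or monthly expressions (the expression below is just an illustration):
# show the next 3 elapse times instead of only the first one
systemd-analyze calendar --iterations=3 'Mon *-*-* 00:00:00'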
Operations
List all timers running on the host
systemctl list-timers
Example of output from stat1006:
elukey@stat1006:~$ systemctl list-timers
NEXT LEFT LAST PASSED UNIT ACTIVATES
Wed 2019-03-27 06:43:00 UTC 2min 54s left Wed 2019-03-27 06:28:00 UTC 12min ago prometheus-node-exporter-apt.timer prometheus-node-exporter-apt.service
Wed 2019-03-27 07:00:00 UTC 19min left Wed 2019-03-27 06:00:00 UTC 40min ago reportupdater-cx.timer reportupdater-cx.service
Wed 2019-03-27 07:00:00 UTC 19min left Wed 2019-03-27 06:00:00 UTC 40min ago reportupdater-edit-beta-features.timer reportupdater-edit-beta-features.service
Wed 2019-03-27 07:00:00 UTC 19min left Wed 2019-03-27 06:00:00 UTC 40min ago reportupdater-ee-beta-features.timer reportupdater-ee-beta-features.service
Wed 2019-03-27 07:00:00 UTC 19min left Wed 2019-03-27 06:00:00 UTC 40min ago reportupdater-ee.timer reportupdater-ee.service
Wed 2019-03-27 07:00:00 UTC 19min left Wed 2019-03-27 06:00:00 UTC 40min ago reportupdater-flow-beta-features.timer reportupdater-flow-beta-features.service
Wed 2019-03-27 07:00:00 UTC 19min left Wed 2019-03-27 06:00:00 UTC 40min ago reportupdater-flow.timer reportupdater-flow.service
Wed 2019-03-27 07:00:00 UTC 19min left Wed 2019-03-27 06:00:00 UTC 40min ago reportupdater-language.timer reportupdater-language.service
To check on a specific timer, use status: systemctl status refinery-sqoop-whole-mediawiki
It is easy to see all the timings for a certain systemd timer and the units it runs. For example, the reportupdater-cx job has two units:
reportupdater-cx.timer -> this is basically the equivalent of the cron definition (you can inspect it via systemctl cat reportupdater-cx.timer and see its config)
reportupdater-cx.service -> this is the command to run, together with its environment etc., that is called/managed by the above .timer unit.
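The timings can also be queried directly from the timer unit, for example (sticking to the reportupdater-cx example):
# print the next scheduled run and the last trigger time
systemctl show reportupdater-cx.timer -p NextElapseUSecRealtime,LastTriggerUSec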
List the command a timer actually runs
nuria@an-launcher1002:~$ systemctl cat drop-el-unsanitized-events.service
# /lib/systemd/system/drop-el-unsanitized-events.service
[Unit]
Description=Drop unsanitized EventLogging data from the event database after retention period.
[Service]
User=analytics
Environment=PYTHONPATH=${PYTHONPATH}:/srv/deployment/analytics/refinery/python
ExecStart=/usr/local/bin/kerberos-run-command analytics /srv/deployment/analytics/refinery/bin/refinery-drop-older-than --some
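If you only need the command line, without the rest of the unit file, systemctl show can extract just that property:
systemctl show -p ExecStart drop-el-unsanitized-events.service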
Check which user runs a timer
Everything always starts with systemctl list-timers (as explained above) to find the .timer unit to check. If we keep the stat1006 example above, let's say we want to check which user runs reportupdater-cx.timer: we'll need to inspect the related .service unit (since that one is the "real" command executed; the .timer one is only the cron-like definition and we don't care who runs it):
elukey@stat1006:~$ systemctl cat reportupdater-cx
# /lib/systemd/system/reportupdater-cx.service
[Unit]
Description=Report Updater job for cx
[Service]
User=stats
SyslogIdentifier=reportupdater-cx
ExecStart=/usr/bin/python /srv/reportupdater/reportupdater/update_reports.py -l info /srv/reportupdater/jobs/limn-language-data/cx /srv/reportupdater/output/metrics/cx
You need to look for the User=something definition. In this case, the unit is run by stats.
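A shortcut, if you only want the user and not the whole unit file:
systemctl show -p User reportupdater-cx.service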
Disable a Systemd timer
sudo systemctl disable name-of-the-timer.timer
sudo systemctl stop name-of-the-timer.timer
The former prevents the timer from being started after boot, the latter removes it from the regular scheduling. After you have done the above steps, verify with the following command that the timer is not listed anymore:
systemctl list-timers | grep name-of-the-timer
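If you also want to prevent the unit from being started by mistake (manually or by another unit), it can additionally be masked; only do this when you are sure nothing should run it:
# any start attempt will now fail; undo with 'systemctl unmask'
sudo systemctl mask name-of-the-timer.timer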
Finally, don't forget to check whether an Icinga alert was fired, and follow the next section accordingly if needed.
Shut off / Reset Icinga alert already fired
It might happen that an alert has fired for a timer but you just want to shut it off, because you have already taken care of the problem (for example, by executing a manual run). This is what needs to be done:
sudo systemctl reset-failed name-of-the-timer.service
It is important to reset the service unit, not the timer one, since it is only the former that causes the alarm (the timer unit's state is not important at this stage).
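To double check which units are currently in the failed state (and hence keep the alert firing), you can list them first:
# list all units currently marked as failed on the host
systemctl --failed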
In the case of timers that run multiple commands, like refinery-sqoop-whole-mediawiki, see if you can finish running the commands manually and reset the alert as above. There's nothing special about these timers that can help resume operation or anything like that.
Check logs for a systemd timer
The first step is always to use systemctl list-timers, as explained above, to gather the name of the timer to check. In the above stat1006 example, let's pick reportupdater-cx.timer. The logs will then be available by simply running:
# Note that we are not using the .timer unit, but the .service one!
# The .service suffix is assumed if none is specified.
sudo journalctl -u reportupdater-cx
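journalctl's filters can also be combined to narrow the search down, for example to recent error-level messages only:
# only messages of priority 'err' or worse, logged today
sudo journalctl -u reportupdater-cx --since today -p err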
Usually for important jobs we log to disk as well, since journald by default logs to a tmpfs partition (that stays in memory and gets wiped on each reboot). In order to see what the log configuration is, two things can be done:
- Check the puppet definition of the systemd timer; it will list the log directory.
- Check rsyslog.d configuration on the host itself.
Example for the former:
systemd::timer::job { "reportupdater-${title}":
ensure => $ensure,
description => "Report Updater job for ${title}",
command => "/usr/bin/python ${::reportupdater::path}/update_reports.py -l info ${query_path} ${output_path}",
interval => {
'start' => 'OnCalendar',
'interval' => $interval
},
user => $::reportupdater::user,
monitoring_enabled => $monitoring_enabled,
monitoring_contact_groups => 'analytics',
logging_enabled => true,
logfile_basedir => $::reportupdater::log_path, <==================
logfile_name => 'syslog.log',
logfile_owner => $::reportupdater::user,
logfile_group => $::reportupdater::user,
logfile_perms => 'all',
syslog_force_stop => true,
syslog_identifier => "reportupdater-${title}",
}
About the latter: why rsyslog? Wasn't it journald? We have rsyslog hooked into journald so we can add specific rules to create log files. For example, let's check the reportupdater-cx job:
elukey@stat1006:~$ ls /etc/rsyslog.d/
10-exporter.conf 20-reportupdater-edit-beta-features.conf 20-reportupdater-flow-beta-features.conf 20-reportupdater-mt-engines.conf 20-reportupdater-published-cx2-translations.conf
10-puppet-agent.conf 20-reportupdater-ee-beta-features.conf 20-reportupdater-flow.conf 20-reportupdater-page-creation.conf 30-remote-syslog.conf
20-reportupdater-cx.conf 20-reportupdater-ee.conf 20-reportupdater-language.conf 20-reportupdater-pingback.conf README
elukey@stat1006:~$ cat /etc/rsyslog.d/20-reportupdater-cx.conf
# rsyslogd(8) configuration file for services.
# This file is managed by Puppet.
:programname, startswith, "reportupdater-cx" /srv/reportupdater/log/reportupdater-cx/syslog.log
& stop
elukey@stat1006:~$ ls -l /srv/reportupdater/log/reportupdater-cx/syslog.log
-rw-r--r-- 1 stats stats 647027 Mar 27 06:00 /srv/reportupdater/log/reportupdater-cx/syslog.log
In this case, the journald logs are also replicated to /srv/reportupdater/log/reportupdater-cx/syslog.log
Failed service procedure (ops week)
What to do when it's your ops week and an email has been sent to analytics-alert about a failed timer?
Checking logs
Depending on the job, logs are stored only in journald (see above), or split between journald and a log file defined in the job's command line.
Useful command:
sudo journalctl -u SERVICE_NAME
Resetting service status to normal
There are two options to reset a failed service to normal status (after investigation, potential correction, etc.):
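As covered in the sections above, the two options are:
# re-run the job manually; a successful run clears the failed state
sudo systemctl start SERVICE_NAME
# or clear the failed state without re-running the job
sudo systemctl reset-failed SERVICE_NAME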