Data Platform/Systems/Managing systemd timers
In phab:T172532 we decided to move away from cron jobs as much as possible, since we were struggling to alert when a job failed. Most of our recurrent jobs run on an-coord1001 (the analytics cluster coordinator), and the majority of them are systemd timers.
Anatomy of a Systemd timer
A systemd timer is composed of two parts: a unit and the timer itself. Let's pick an example:
elukey@an-coord1001:~$ sudo systemctl cat hdfs-balancer.timer
# /lib/systemd/system/hdfs-balancer.timer
[Unit]
Description=Periodic execution of hdfs-balancer.service
[Timer]
Unit=hdfs-balancer.service
# Accuracy sets the maximum time interval around the execution time we want to allow
AccuracySec=15sec
OnCalendar=*-*-* 06:00:00
RandomizedDelaySec=0
[Install]
WantedBy=multi-user.target
elukey@an-coord1001:~$ sudo systemctl cat hdfs-balancer.service
# /lib/systemd/system/hdfs-balancer.service
[Unit]
Description=Run the HDFS balancer script to keep HDFS blocks replicated in the most redundant and efficient way.
[Service]
User=hdfs
ExecStart=/usr/local/bin/hdfs-balancer
In this case, we are checking how the HDFS balancer timer is configured. The .timer file references a systemd unit (the .service file), which is the one doing the actual work, and a pattern of execution (in this case, every day at 6 AM UTC). The .service file defines a command, together with other info like environment variables and the user that needs to execute it. The command that the .service points to can be generated by puppet, so peek into it if you need to see the details, for example: sudo -u hdfs cat /usr/local/bin/refinery-sqoop-mediawiki.
We alert on the return code of the .service systemd unit: if it gets executed and the script referenced by ExecStart returns a non-zero exit code, Nagios will alert us.
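If you want to double check the exit code the alerting keys on, systemctl can show it directly. A quick check, using the hdfs-balancer unit from the example above:
# ExecMainStatus is the exit code of the last run; Result is 'success' or e.g. 'exit-code'
sudo systemctl show -p ExecMainStatus,Result hdfs-balancer.service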
All these configuration files may seem a bit confusing, but they are convenient from the operations point of view. A systemd .service can be started/restarted at any time, basically forcing the execution of its script. In this case:
sudo systemctl start hdfs-balancer.service
forces the hdfs-balancer to execute. Where do we find logs? At the moment the majority of them are only in journald, but they will be logged to files as well as soon as possible. If you want to check journald for a particular .service:
# start from the first log registered
sudo journalctl -u hdfs-balancer
# start from a specific moment of time
sudo journalctl -u hdfs-balancer --since '3 days ago'
# show only a specific timeframe
sudo journalctl --since "2018-10-21 23:15:00" --until "2018-10-21 23:20:00"
# tail -f equivalent
sudo journalctl -u hdfs-balancer --follow
An example of debugging for db1107: https://phabricator.wikimedia.org/T207165
ExecStart and differences from bash
Please note that the command executed by the service unit (the one assigned to the ExecStart field) is parsed following a systemd-specific format that is similar to what bash offers but very different in many ways. If you don't keep this rule in mind you'll end up writing units that either fail due to weird argument parsing or, even worse, execute with the wrong parameters. The fix in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/530555/ is a good example: a shell redirection (command > file.log) is not parsed by systemd the way we expect, since systemd 1) does not implement shell redirection (so '>' is not treated as a redirection operator), and 2) splits the command line on whitespace, treating every unquoted token as a separate argument.
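A minimal sketch of the pitfall and one possible fix (my-script and its log path are made up for illustration):
# BROKEN: systemd does not implement shell redirection, so '>' and
# '/var/log/my-script.log' are passed to my-script as two literal arguments.
[Service]
ExecStart=/usr/local/bin/my-script > /var/log/my-script.log
# One possible fix: wrap the command in an explicit shell, so that the
# redirection is interpreted by /bin/sh rather than by systemd.
[Service]
ExecStart=/bin/sh -c '/usr/local/bin/my-script > /var/log/my-script.log'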
Calculating OnCalendar interval
The OnCalendar interval syntax is powerful, but it isn't always the most straightforward to figure out. There is a command to help: systemd-analyze.
To check the syntax of OnCalendar, you can run e.g.
$ systemd-analyze calendar '*-*-* *:05,35:00'
Normalized form: *-*-* *:05,35:00
Next elapse: Thu 2021-07-08 14:35:00 UTC
From now: 24min left
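On recent systemd versions, systemd-analyze can also print more than one future run, which helps sanity-checking weekly or monthly expressions (the expression below is just an illustration):
# show the next 3 elapse times instead of only the first one
systemd-analyze calendar --iterations=3 'Mon *-*-* 00:00:00'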
Operations
List all timers running on the host
systemctl list-timers
Example of output from stat1006:
elukey@stat1006:~$ systemctl list-timers
NEXT LEFT LAST PASSED UNIT ACTIVATES
Wed 2019-03-27 06:43:00 UTC 2min 54s left Wed 2019-03-27 06:28:00 UTC 12min ago prometheus-node-exporter-apt.timer prometheus-node-exporter-apt.service
Wed 2019-03-27 07:00:00 UTC 19min left Wed 2019-03-27 06:00:00 UTC 40min ago reportupdater-cx.timer reportupdater-cx.service
Wed 2019-03-27 07:00:00 UTC 19min left Wed 2019-03-27 06:00:00 UTC 40min ago reportupdater-edit-beta-features.timer reportupdater-edit-beta-features.service
Wed 2019-03-27 07:00:00 UTC 19min left Wed 2019-03-27 06:00:00 UTC 40min ago reportupdater-ee-beta-features.timer reportupdater-ee-beta-features.service
Wed 2019-03-27 07:00:00 UTC 19min left Wed 2019-03-27 06:00:00 UTC 40min ago reportupdater-ee.timer reportupdater-ee.service
Wed 2019-03-27 07:00:00 UTC 19min left Wed 2019-03-27 06:00:00 UTC 40min ago reportupdater-flow-beta-features.timer reportupdater-flow-beta-features.service
Wed 2019-03-27 07:00:00 UTC 19min left Wed 2019-03-27 06:00:00 UTC 40min ago reportupdater-flow.timer reportupdater-flow.service
Wed 2019-03-27 07:00:00 UTC 19min left Wed 2019-03-27 06:00:00 UTC 40min ago reportupdater-language.timer reportupdater-language.service
To check on a specific timer, use status: systemctl status refinery-sqoop-whole-mediawiki
It is easy to see all the timings for a certain systemd timer and the units it runs. For example, the reportupdater-cx job has two units:
reportupdater-cx.timer -> this is basically the equivalent of the cron definition (you can inspect it via systemctl cat reportupdater-cx.timer and see its config)
reportupdater-cx.service -> this is the command to run, together with its environment etc., that is called/managed by the above .timer unit.
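The timings can also be queried directly from the timer unit, for example (sticking to the reportupdater-cx example):
# print the next scheduled run and the last trigger time
systemctl show reportupdater-cx.timer -p NextElapseUSecRealtime,LastTriggerUSec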
List the command a timer actually runs
nuria@an-launcher1002:~$ systemctl cat drop-el-unsanitized-events.service
# /lib/systemd/system/drop-el-unsanitized-events.service
[Unit]
Description=Drop unsanitized EventLogging data from the event database after retention period.
[Service]
User=analytics
Environment=PYTHONPATH=${PYTHONPATH}:/srv/deployment/analytics/refinery/python
ExecStart=/usr/local/bin/kerberos-run-command analytics /srv/deployment/analytics/refinery/bin/refinery-drop-older-than --some
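If you only need the command line, without the rest of the unit file, systemctl show can extract just that property:
systemctl show -p ExecStart drop-el-unsanitized-events.service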
Check which user runs a timer
Everything always starts with systemctl list-timers (as explained above) to find the .timer unit to check. If we keep the stat1006 example above, let's say we want to check which user runs reportupdater-cx.timer: we'll need to inspect the related .service unit (since that one is the "real" command executed; the .timer one is only the cron-like definition and we don't care who runs it):
elukey@stat1006:~$ systemctl cat reportupdater-cx
# /lib/systemd/system/reportupdater-cx.service
[Unit]
Description=Report Updater job for cx
[Service]
User=stats
SyslogIdentifier=reportupdater-cx
ExecStart=/usr/bin/python /srv/reportupdater/reportupdater/update_reports.py -l info /srv/reportupdater/jobs/limn-language-data/cx /srv/reportupdater/output/metrics/cx
You need to look for the User=something definition. In this case, the unit is run by stats.
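A shortcut, if you only want the user and not the whole unit file:
systemctl show -p User reportupdater-cx.service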
Disable a Systemd timer
sudo systemctl disable name-of-the-timer.timer
sudo systemctl stop name-of-the-timer.timer
The former prevents the timer from being started after boot, the latter removes it from the regular scheduling. After you have done the above steps, verify with the following command that the timer is not listed anymore:
systemctl list-timers | grep name-of-the-timer
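If you also want to prevent the unit from being started by mistake (manually or by another unit), it can additionally be masked; only do this when you are sure nothing should run it:
# any start attempt will now fail; undo with 'systemctl unmask'
sudo systemctl mask name-of-the-timer.timer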
Finally, don't forget to check whether an Icinga alert was fired, and follow the next section accordingly if needed.
Shut off / Reset Icinga alert already fired
It might happen that an alert has fired for a timer but you just want to shut it off, because you have already taken care of the problem (for example, by executing a manual run). This is what needs to be done:
sudo systemctl reset-failed name-of-the-timer.service
It is important to reset the service unit, not the timer one, since it is only the former that causes the alarm (the timer unit's state is not important at this stage).
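To double check which units are currently in the failed state (and hence keep the alert firing), you can list them first:
# list all units currently marked as failed on the host
systemctl --failed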
In the case of timers that run multiple commands, like refinery-sqoop-whole-mediawiki, see if you can finish running the commands manually and reset the alert as above. There's nothing special about these timers that can help resume operation or anything like that.
Check logs for a systemd timer
The first step is always to use systemctl list-timers, as explained above, to gather the name of the timer to check. In the above stat1006 example, let's pick reportupdater-cx.timer. The logs will then be available by simply running:
# Note that we are not using the .timer unit, but the .service one!
# The .service suffix is assumed if none is specified.
sudo journalctl -u reportupdater-cx
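journalctl's filters can also be combined to narrow the search down, for example to recent error-level messages only:
# only messages of priority 'err' or worse, logged today
sudo journalctl -u reportupdater-cx --since today -p err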
Usually for important jobs we log to disk as well, since journald by default logs to a tmpfs partition (that stays in memory and gets wiped on each reboot). In order to see what the log configuration is, two things can be done:
- Check the puppet definition of the systemd timer; it will list the log directory.
- Check rsyslog.d configuration on the host itself.
Example for the former:
systemd::timer::job { "reportupdater-${title}":
ensure => $ensure,
description => "Report Updater job for ${title}",
command => "/usr/bin/python ${::reportupdater::path}/update_reports.py -l info ${query_path} ${output_path}",
interval => {
'start' => 'OnCalendar',
'interval' => $interval
},
user => $::reportupdater::user,
monitoring_enabled => $monitoring_enabled,
monitoring_contact_groups => 'analytics',
logging_enabled => true,
logfile_basedir => $::reportupdater::log_path, <==================
logfile_name => 'syslog.log',
logfile_owner => $::reportupdater::user,
logfile_group => $::reportupdater::user,
logfile_perms => 'all',
syslog_force_stop => true,
syslog_identifier => "reportupdater-${title}",
}
About the latter: why rsyslog? Wasn't it journald? We have rsyslog hooked into journald so we can add specific rules to create log files. For example, let's check the reportupdater-cx job:
elukey@stat1006:~$ ls /etc/rsyslog.d/
10-exporter.conf 20-reportupdater-edit-beta-features.conf 20-reportupdater-flow-beta-features.conf 20-reportupdater-mt-engines.conf 20-reportupdater-published-cx2-translations.conf
10-puppet-agent.conf 20-reportupdater-ee-beta-features.conf 20-reportupdater-flow.conf 20-reportupdater-page-creation.conf 30-remote-syslog.conf
20-reportupdater-cx.conf 20-reportupdater-ee.conf 20-reportupdater-language.conf 20-reportupdater-pingback.conf README
elukey@stat1006:~$ cat /etc/rsyslog.d/20-reportupdater-cx.conf
# rsyslogd(8) configuration file for services.
# This file is managed by Puppet.
:programname, startswith, "reportupdater-cx" /srv/reportupdater/log/reportupdater-cx/syslog.log
& stop
elukey@stat1006:~$ ls -l /srv/reportupdater/log/reportupdater-cx/syslog.log
-rw-r--r-- 1 stats stats 647027 Mar 27 06:00 /srv/reportupdater/log/reportupdater-cx/syslog.log
In this case, the journald logs are also replicated to /srv/reportupdater/log/reportupdater-cx/syslog.log
Failed service procedure (ops week)
What to do when it's your ops week and an email has been sent to analytics-alert about a failed timer?
Checking logs
Depending on the job, logs are stored only in journald (see above), or split between journald and a log file defined in the job's command line.
Useful command:
sudo journalctl -u SERVICE_NAME
Resetting service status to normal
There are two options to reset a failed service to normal status (after investigation, potential correction, etc.):
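As covered in the sections above, the two options are:
# re-run the job manually; a successful run clears the failed state
sudo systemctl start SERVICE_NAME
# or clear the failed state without re-running the job
sudo systemctl reset-failed SERVICE_NAME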