Analytics/Systems/Managing systemd timers

From Wikitech
Jump to navigation Jump to search

In phab:T172532 we decided to move away as much as possible from cron jobs since we were struggling to alarm when a job failed. Most of our recurrent jobs are running on an-coord1001 (analytics cluster coordinator), and the majority of them are systemd timers.

Anatomy of a Systemd timer

A systemd timer is composed by two parts: a unit and the timer itself. Let's pick an example:

elukey@an-coord1001:~$ sudo systemctl cat hdfs-balancer.timer
# /lib/systemd/system/hdfs-balancer.timer
[Unit]
Description=Periodic execution of hdfs-balancer.service

[Timer]
Unit=hdfs-balancer.service
# Accuracy sets the maximum time interval around the execution time we want to allow
AccuracySec=15sec
OnCalendar=*-*-* 06:00:00
RandomizedDelaySec=0

[Install]
WantedBy=multi-user.target


elukey@an-coord1001:~$ sudo systemctl cat hdfs-balancer.service
# /lib/systemd/system/hdfs-balancer.service
[Unit]
Description=Run the HDFS balancer script to keep HDFS blocks replicated in the most redundant and efficient way.

[Service]
User=hdfs
ExecStart=/usr/local/bin/hdfs-balancer

In this case, we are checking how it is configured the HDFS balancer timer. The .timer file references a systemd unit (the .service file), that it is the one doing the work, and a pattern of execution (in this case, every day at 6 AM UTC). The .service file defines a command, together with other info like environment variables and username that needs to execute it. This command that .service points to can be generated by puppet, so peek into it if you need to see the details, for example: sudo -u hdfs cat /usr/local/bin/refinery-sqoop-mediawiki.

We alarm on the return code of the .service Systemd unit: if it gets executed and the script mentioned by ExecStart returns a non-zero exit code, Nagios will alert us.

All these configuration files seem a bit confusing, but they are good from the operations point of view. A systemd .service can be started/restarted any time, basically forcing the execution of its script. In this case:

sudo systemctl start hdfs-balancer.service

forces the hdfs-balancer to execute. Where do we find logs? At the moment the majority of them are only on journald, but they will be logged on files as well as soon as possible. If you want to check journald for a particular .service:

# start from the first log registered
sudo journalctl -u hdfs-balancer

# start from a specific moment of time
sudo journalctl -u hdfs-balancer --since '3 days ago'

# show only a specific timeframe
sudo journalctl --since "2018-10-21 23:15:00" --until "2018-10-21 23:20:00"

# tail -f equivalent
sudo journalctl -u hdfs-balancer --follow

An example of debugging for db1107: https://phabricator.wikimedia.org/T207165

ExecStart and difference with bash

Please note that the command executed by the service unit (the one assigned to the ExecStart field) is parsed following a systemd specific format, that is similar to what bash offers but very different in may ways. If you don't keep in mind this rule you'll end up writing units that either fail due to weird argument parsing or even worse executing with wrong parameters. The fix in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/530555/ is a good example: a shell redirection ( command > file.log) is not parsed as we expect by systemd, that 1) considers '>' as special char and 2) considers everything split by a whitespace (and not quoted) as separate argument.

Operations

List all timers running on the host

systemctl list-timers

Example of output from stat1006:

elukey@stat1006:~$ systemctl list-timers
NEXT                         LEFT          LAST                         PASSED       UNIT                                           ACTIVATES
Wed 2019-03-27 06:43:00 UTC  2min 54s left Wed 2019-03-27 06:28:00 UTC  12min ago    prometheus-node-exporter-apt.timer             prometheus-node-exporter-apt.service
Wed 2019-03-27 07:00:00 UTC  19min left    Wed 2019-03-27 06:00:00 UTC  40min ago    reportupdater-cx.timer                         reportupdater-cx.service
Wed 2019-03-27 07:00:00 UTC  19min left    Wed 2019-03-27 06:00:00 UTC  40min ago    reportupdater-edit-beta-features.timer         reportupdater-edit-beta-features.service
Wed 2019-03-27 07:00:00 UTC  19min left    Wed 2019-03-27 06:00:00 UTC  40min ago    reportupdater-ee-beta-features.timer           reportupdater-ee-beta-features.service
Wed 2019-03-27 07:00:00 UTC  19min left    Wed 2019-03-27 06:00:00 UTC  40min ago    reportupdater-ee.timer                         reportupdater-ee.service
Wed 2019-03-27 07:00:00 UTC  19min left    Wed 2019-03-27 06:00:00 UTC  40min ago    reportupdater-flow-beta-features.timer         reportupdater-flow-beta-features.service
Wed 2019-03-27 07:00:00 UTC  19min left    Wed 2019-03-27 06:00:00 UTC  40min ago    reportupdater-flow.timer                       reportupdater-flow.service
Wed 2019-03-27 07:00:00 UTC  19min left    Wed 2019-03-27 06:00:00 UTC  40min ago    reportupdater-language.timer                   reportupdater-language.service
Wed 2019-03-27 07:00:00 UTC  19min left    Wed 2019-03-27 06:00:00 UTC  40min ago    reportupdater-mt_engines.timer                 reportupdater-mt_engines.service
Wed 2019-03-27 07:00:00 UTC  19min left    Wed 2019-03-27 06:00:00 UTC  40min ago    reportupdater-page-creation.timer              reportupdater-page-creation.service
Wed 2019-03-27 07:00:00 UTC  19min left    Wed 2019-03-27 06:00:00 UTC  40min ago    reportupdater-pingback.timer                   reportupdater-pingback.service
Wed 2019-03-27 07:00:00 UTC  19min left    Wed 2019-03-27 06:00:00 UTC  40min ago    reportupdater-published_cx2_translations.timer reportupdater-published_cx2_translations.service
Wed 2019-03-27 07:09:00 UTC  28min left    Wed 2019-03-27 06:39:00 UTC  1min 6s ago  phpsessionclean.timer                          phpsessionclean.service
Wed 2019-03-27 07:11:26 UTC  31min left    Tue 2019-03-26 07:11:26 UTC  23h ago      systemd-tmpfiles-clean.timer                   systemd-tmpfiles-clean.service
Wed 2019-03-27 07:49:31 UTC  1h 9min left  Wed 2019-03-27 02:59:18 UTC  3h 40min ago apt-daily.timer                                apt-daily.service
Thu 2019-03-28 06:37:18 UTC  23h left      Wed 2019-03-27 06:20:06 UTC  19min ago    apt-daily-upgrade.timer                        apt-daily-upgrade.service

It is easy to see all the timings for a certain systemd timer and the units it runs. For example, the reportupdater-cx job has two units:

  • reportupdater-cx.timer -> this is basically the equivalent of the cron definition (you can inspect it via systemctl cat reportupdater-cx.timer and see its config)
  • reportupdater-cx.service -> this is the command to run together with its environment etc.., that is called/managed by the above .timer unit.

Check which user runs a timer

Everything always starts with a systemctl list-timers (as explained above) to find the .timer unit to check. If we keep the stat1006 example above, let's say we want to check what user runs reportupdater-cx.timer: we'll need to inspect the related .service unit (since that one is the "real" command executed, the .timer one is only the cron-like definition and we don't care who runs it):

elukey@stat1006:~$ systemctl cat reportupdater-cx
# /lib/systemd/system/reportupdater-cx.service
[Unit]
Description=Report Updater job for cx

[Service]
User=stats
SyslogIdentifier=reportupdater-cx
ExecStart=/usr/bin/python /srv/reportupdater/reportupdater/update_reports.py -l info /srv/reportupdater/jobs/limn-language-data/cx /srv/reportupdater/output/metrics/cx

You need to look for the User=something definition. In this case, the unit is run by stats.

Disable a Systemd timer

sudo systemctl disable name-of-the-timer.timer
sudo systemctl stop name-of-the-timer.timer

The former prevents the timer to be started after boot, the latter removes it from the regular scheduling. After you have done the above steps, verify with the following command that the timer is not listed anymore:

systemctl list-timers | grep name-of-the-timer

Finally don't forget to check if an icinga alarm was fired, and follow the next section accordingly if needed.

Shut off / Reset Icinga alert already fired

It might happen that an alert has fired for a timer but you just want to shut it off, because you have already been taken care of the problem (for example, executing a manual run, etc..). This is what is needed to do:

sudo systemctl reset-failed name-of-the-timer.service

It is important to reset the service unit, not the timer one, since it is only the former that causes the alarm (the timer unit's state is not important at this stage).

In the case of timers that run multiple commands, like refinery-sqoop-whole-mediawiki, see if you can finish running the commands manually and reset the alert as above. There's nothing special about these timers that can help resume operation or anything like that.

Check logs for a systemd timer

The first step is always to use systemctl list-timers, as explained above, to gather the name of the timer to check. In the above stat1006 example, let's pick reportupdater-cx.timer. The logs will be available simply doing:

# Note that we are not using the .timer unit, but the .service one!
# The .service suffix is assumed if no one is specified.
sudo journalctl -u reportupdater-cx

Usually for important jobs we log stuff to disk, since journald by default logs to a tmpfs partition (that stays in memory and gets flushed on each reboot). In order to see what is the log configuration, two things can be done:

  • Check the puppet definition of the systemd timer, it will list the log directory.
  • Check rsyslog.d configuration on the host itself.

Example for the former:

    systemd::timer::job { "reportupdater-${title}":
        ensure                    => $ensure,
        description               => "Report Updater job for ${title}",
        command                   => "/usr/bin/python ${::reportupdater::path}/update_reports.py -l info ${query_path} ${output_path}",
        interval                  => {
            'start'    => 'OnCalendar',
            'interval' => $interval
        },
        user                      => $::reportupdater::user,
        monitoring_enabled        => $monitoring_enabled,
        monitoring_contact_groups => 'analytics',
        logging_enabled           => true,
        logfile_basedir           => $::reportupdater::log_path,      <==================
        logfile_name              => 'syslog.log',
        logfile_owner             => $::reportupdater::user,
        logfile_group             => $::reportupdater::user,
        logfile_perms             => 'all',
        syslog_force_stop         => true,
        syslog_identifier         => "reportupdater-${title}",
    }

About the latter: why rsyslog? Wasn't it journald? We have a rsyslog hooked in into journald so we can add specific rules to create log files. For example, let's check the reportupdater-cx job:

elukey@stat1006:~$ cat /etc/rsyslog.d/
10-exporter.conf                                  20-reportupdater-edit-beta-features.conf          20-reportupdater-flow-beta-features.conf          20-reportupdater-mt-engines.conf                  20-reportupdater-published-cx2-translations.conf
10-puppet-agent.conf                              20-reportupdater-ee-beta-features.conf            20-reportupdater-flow.conf                        20-reportupdater-page-creation.conf               30-remote-syslog.conf
20-reportupdater-cx.conf                          20-reportupdater-ee.conf                          20-reportupdater-language.conf                    20-reportupdater-pingback.conf                    README
elukey@stat1006:~$ cat /etc/rsyslog.d/20-reportupdater-cx.conf
# rsyslogd(8) configuration file for services.

# This file is managed by Puppet.
:programname, startswith, "reportupdater-cx" /srv/reportupdater/log/reportupdater-cx/syslog.log
& stop
elukey@stat1006:~$ ls -l /srv/reportupdater/log/reportupdater-cx/syslog.log
-rw-r--r-- 1 stats stats 647027 Mar 27 06:00 /srv/reportupdater/log/reportupdater-cx/syslog.log

In this case, the journald logs are also replicated to /srv/reportupdater/log/reportupdater-cx/syslog.log