OTRS

OTRS is installed on mendelevium.wikimedia.org.

  • URL is https://ticket.wikimedia.org/otrs/index.pl
  • The root user/pass is in the ops password repo
  • We use mod_perl with ModPerl::Registry, so whenever you change a file, you need to run /etc/init.d/apache2 reload
  • The "News" messages on the main OTRS login screen can be edited by modifying /opt/otrs/Kernel/Output/HTML/Templates/Standard/Motd.dtl.

There is no need to update config files to add email addresses to OTRS; Inbound MX servers will automatically see that the queue exists or has disappeared. However it is possible (due to negative caching at the secondary mail exchangers) that new addresses will take up to two hours to begin working.

Configuration

The OTRS source is in /opt/otrs

The primary configuration file is /opt/otrs/Kernel/Config.pm

Configuration can be done by examining Kernel/Config/Files/ZZZAuto.pm and Kernel/Config/Files/ZZZAAuto.pm for default values, and then making a corresponding addition to Kernel/Config.pm. When you change the config, reload apache to clear the mod_perl cache.

OTRS configuration is meant to be primarily done via the Kernel/Config/Files/*.xml files. These are XML files edited by the SysConfig module. Viewing the package manager or running bin/otrs.RebuildConfig.pl will read the XML files and then regenerate the ZZZ*.pm files. Then at runtime, the ZZZ*.pm files are read, followed by the Config.pm overrides. So far so good.

The problem is that the XML files are a mixture of user-level configuration, technical data such as core module registrations, and interface text. They are distributed with OTRS, and trying to use old XML files with new OTRS versions will break horribly, because there is no other registry of defaults (except for a few special cases in the code), so vital interface text and module registrations will be missing.

Diffing and merging is difficult due to the lack of comments and the spurious changes introduced when changing things in the web interface.

So save yourself the hassle and edit Config.pm.

Local Patches

The codebase has some local customizations, which were applied as OTRS packages. The packages are stored in the database in otrs.package_repository and can be reapplied after update via the web UI. The packages can also be downloaded from within the OTRS web interface, when logged in as an admin user:

Admin-->Package Manager-->{packages listed under Local Repository}-->Download

Database backend

The primary database is on the m2 shard, database named 'otrs'.

Troubleshooting

Mail delivery

  • When the user/group for the exim pipe were incorrect, otrs.PostMaster.pl logged permission errors to /var/log/mail.log about failed attempts to write to /opt/otrs/var/tmp/CacheFileStorable. We fixed this by configuring the exim pipe to use group=www-data.
  • Mail hosts need mysql access to the otrs database. If MX IP addresses change or the database is inaccessible, mail defers on whichever MX is trying to do an address lookup. When we saw this happen, exim wasn't very informative about why; when in doubt, double-check mysql access from the MX's command line.
  • Spamassassin runs locally, and logs in /var/log/mail.log.

Apache permissions errors

  • as user otrs run:
/opt/otrs/bin/otrs.SetPermissions.pl --otrs-user=otrs --otrs-group=otrs --web-user=www-data --web-group=www-data /opt/otrs

SpamAssassin stops reporting Bayes results

  • This happened on 2014-04-24, 2016-08-06, and 2016-12-21; each time we discovered SpamAssassin was unhappy about the Bayes database.
  • /var/log/syslog was full of this:
bayes db version 0 is not able to be used, aborting! at /usr/share/perl5/Mail/SpamAssassin/BayesStore/DBM.pm line 203, <GEN88>
  • We tried backing up and restoring the database (the verify step failed); the database shrank from ~24M to ~14M and SA stopped complaining. But SA continued to pass mail through with no Bayes results (the BAYES_XX header normally added to the message was missing).
  • So we moved the old database aside, modified otrs.TicketExport2Mbox.pl not to skip previously-seen messages, and created one-time GenericAgent jobs [within OTRS] to re-export a couple of days' worth of ham/spam. Then we ran train_spamassassin manually to train on all this data. Note that otrs.TicketExport2Mbox now has a --rebuild mode to support this process.
  • During the 2016-08-06 incident, the log statement below was found in the logs:
  Aug  6 09:56:59.752 [1619] dbg: bayes: not available for scanning, only 126 ham(s) in bayes DB < 200

Running

 sudo -u debian-spamd spamassassin -D bayes < /tmp/sample_email.eml

and

 sudo -u debian-spamd sa-learn --dump magic

confirmed it.

The fix was re-exporting and training SpamAssassin as described above. Take extra care to ensure that the exported spam and ham messages each number above 200.

  • During the 2016-12-21 incident, both the hams and the spams in the database were below 200. That condition was not logged like the ham message above, which led the investigation off track for a while. Exporting quite a few messages and training SpamAssassin on them as above fixed the issue. In this case the database was NOT marked as corrupted by db_verify, but it was manually truncated in the end for good measure.
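The 200-message floor mentioned above is SpamAssassin's stock behaviour: Bayes classification stays disabled until both the learned-ham and learned-spam counts reach 200 (the default bayes_min_ham_num / bayes_min_spam_num). A minimal sketch of that check:

```shell
# Sketch of SpamAssassin's Bayes-availability rule: scanning is skipped
# until both ham and spam counts reach 200 (the stock
# bayes_min_ham_num / bayes_min_spam_num defaults).
bayes_ready() {
    nham="$1"; nspam="$2"
    if [ "$nham" -ge 200 ] && [ "$nspam" -ge 200 ]; then
        echo "bayes available"
    else
        echo "not available for scanning"
    fi
}
bayes_ready 126 500    # the 2016-08-06 state: hams below 200
bayes_ready 300 300
```

Both counts appear in the output of `sa-learn --dump magic` (the nham/nspam lines), which is how the incidents above were confirmed.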

Mail setup

E-mail is sent and received through a special Exim instance on the hosting server. Its configuration follows the setup described in Mail; OTRS-specific configuration is listed below.

Spam and Malware scanning

SpamAssassin and ClamAV are used for spam/malware scanning, in an Exim ACL run at the DATA phase of the SMTP transaction. Should SpamAssassin fail for some reason, mail is let through.

acl_check_data:
    # skip spam-check for locally-submitted messages
    accept hosts = +relay_from_hosts
        set acl_m0 = trusted relay

    # skip if message is too large (>4M)
    accept condition = ${if >{$message_size}{4M}}
        set acl_m0 = n/a
        set acl_m1 = skipped, message too large

    # skip if whitelisted in exim
    accept condition = ${if eq{$acl_m2}{skip_spamd}}
        set acl_m0 = n/a
        set acl_m1 = skipped, exim whitelist

    # add spam headers...
    warn spam = nonexistent:true
        set acl_m0 = $spam_score ($spam_bar)
        set acl_m1 = $spam_report
        set acl_m3 = $spam_score_int

    # silently drop spam at high scores (> 12)
    discard log_message = spam detected ($spam_score)
        condition = ${if >{$spam_score_int}{120}{1}{0}}

    # silently discard messages with malware attached
    discard log_message = malware detected ($malware_name)
        demime = *
        malware = *

    accept
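In the discard condition above, $spam_score_int is Exim's integer representation of the SpamAssassin score, i.e. the score multiplied by ten; that is why the 12-point cutoff appears as 120. A quick illustration of the arithmetic:

```shell
# $spam_score_int is the SpamAssassin score times ten, so comparing it
# against 120 implements a 12.0 cutoff.
score_int() { awk -v s="$1" 'BEGIN { printf "%d\n", s * 10 }'; }
score_int 12.3    # 123 -> above 120, message is discarded
score_int 3.5     # 35  -> kept (though tagged as spam per required_score)
```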

Message tagging

We use Exim filters to tag messages with headers that OTRS can match for automatic queue routing. The Exim filters are in /etc/exim4/system_filter (see the inline comments):

# Exim filter

if first_delivery then
    # Remove headers that control OTRS - we don't want these
    headers remove X-OTRS-Priority:X-OTRS-Queue:X-OTRS-Lock:X-OTRS-Ignore:X-OTRS-State
    if $acl_m0 is not "trusted relay" then
        # Remove any SpamAssassin headers and add local ones
        headers remove X-Spam-Score:X-Spam-Report:X-Spam-Checker-Version:X-Spam-Status:X-Spam-Level:X-Spam-Flag
    endif
    if $acl_m0 is not "" and $acl_m0 is not "trusted relay" then
        headers add "X-Spam-Score: $acl_m0"
        headers add "X-Spam-Report: $acl_m1"
        # Add header for OTRS filters
        if $acl_m1 is not "" and $acl_m1 begins "yes" then
            headers add "X-Spam-Flag: YES"
        # overload X-Spam-Flag since OTRS doesn't do numeric comparison
        elif $acl_m3 is not "" and $acl_m3 is above 20 then
            headers add "X-Spam-Flag: MAYBE"
        else
            headers add "X-Spam-Flag: NO"
        endif
        # add a hook for OTRS to filter list mail
        if
            ($message_headers contains "\nList-Id:" or
            $message_headers contains "\nList-Help:" or
            $message_headers contains "\nList-Subscribe:" or
            $message_headers contains "\nList-Unsubscribe:" or
            $message_headers contains "\nList-Post:" or
            $message_headers contains "\nList-Owner:" or
            $message_headers contains "\nList-Archive:") and
            $header_precedence: does not match "^(bulk|junk|list)"
        then
            headers remove Precedence
            headers add "Precedence: bulk"
        endif
    endif
endif
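The X-Spam-Flag decision in the filter above can be restated compactly. This is a hedged shell re-statement, assuming $acl_m1 begins with the Yes/No of the SpamAssassin report (per the local.cf template later on this page) and $acl_m3 holds the score times ten:

```shell
# Re-statement of the filter's X-Spam-Flag logic: YES when the SA
# report says yes, MAYBE when the integer score exceeds 20 (i.e. a
# score above 2.0, below the 3.5 spam threshold), NO otherwise.
spam_flag() {
    report="$1"; score_int="$2"
    case "$report" in
        [Yy]es*) echo "YES"; return ;;
    esac
    if [ -n "$score_int" ] && [ "$score_int" -gt 20 ]; then
        echo "MAYBE"
    else
        echo "NO"
    fi
}
spam_flag "Yes, score=5.1" 51    # YES
spam_flag "No, score=2.4" 24     # MAYBE: not spam, but above 2.0
spam_flag "No, score=1.0" 10     # NO
```

The MAYBE overload exists, as the comment in the filter says, because OTRS filters cannot do numeric comparisons on headers.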

OTRS mail routing

Mail destined for OTRS is served by a simple accept router otrs, which does a MySQL database query to determine the validity of the recipient address being routed, similar to the check done earlier by mchenry.

# Mail destined for OTRS

otrs:
        driver = accept
        condition = ${lookup mysql{SELECT value0 FROM system_address WHERE value0='${quote_mysql:$local_part@$domain}'}{true}fail}
        transport = otrs
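The condition above means a recipient is only routable if its full address exists in OTRS's system_address table; when the lookup finds nothing, the fail branch forces the expansion to fail and the router declines. A toy simulation, with a hard-coded list standing in for the MySQL lookup (addresses are examples):

```shell
# Simulated router condition: accept a recipient only when the full
# address appears in the (here hard-coded) system_address data.
otrs_addresses="info-en@wikimedia.org press@wikimedia.org"    # example rows
route() {
    for addr in $otrs_addresses; do
        [ "$addr" = "$1" ] && { echo "accepted: otrs transport"; return; }
    done
    echo "declined: unknown recipient"
}
route "info-en@wikimedia.org"
route "nosuch@wikimedia.org"
```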

On success, the message is handed over to the otrs pipe transport:

# OTRS pipe transport

otrs:
        driver = pipe
        command = OTRS_POSTMASTER
        current_directory = OTRS_HOME
        home_directory = OTRS_HOME
        user = OTRS_USER
        group = OTRS_GROUP
        freeze_exec_fail
        log_fail_output
        timeout = 1m
        timeout_defer

This transport pipes the full contents of the message to the command/path specified in the macro OTRS_POSTMASTER (defined at the top of the file). A current and home directory will be set as specified, and the command will be run as the otrs user and group. If the actual execution/invocation fails for some reason, the message will be frozen on the queue with a warning message sent to root. If the command invocation succeeds, but the return code is EX_TEMPFAIL (e.g. when OTRS cannot access the database), the message is deferred/queued, and will be retried later. Any output will be logged.
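The deferral behaviour hinges on the pipe command's exit code: EX_TEMPFAIL is 75 (from sysexits.h), and a command exiting with it is retried rather than bounced. A toy stand-in for the OTRS_POSTMASTER command illustrating the two outcomes:

```shell
# Toy pipe command: exit 0 on success, 75 (EX_TEMPFAIL, from
# sysexits.h) when e.g. the database is unreachable; exit code 75
# makes exim defer the message and retry later instead of bouncing.
fake_postmaster() {
    if [ "$1" = "db-up" ]; then
        return 0       # message delivered to OTRS
    fi
    return 75          # EX_TEMPFAIL -> message deferred/queued
}
rc=0; fake_postmaster db-down || rc=$?
echo "exit=$rc"        # exit=75
```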

Outbound mail

Any mail destined for an address that is not an OTRS address, e.g. mail submitted by OTRS itself, will be forwarded to an outbound MX.

ClamAV

The server runs its own ClamAV instance, using the stock clamav-daemon package. The daemon runs as user clamav which has read access to the mail queue via membership in group Debian-exim. Per the stock config, the freshclam daemon is used to update virus definitions.

Exim accesses ClamAV via unix socket at /var/run/clamav/clamd.ctl and silently drops and logs messages containing an infected attachment.

SpamAssassin

The server runs its own SpamAssassin instance. The stock spamassassin package is used, with daily updates enabled. Stock rules/scores are kept and we make a few local modifications which are listed below.

Multiple user profiles are not used; SpamAssassin reads the global configuration settings and runs as user otrs. Training databases are stored in that user's home directory.

/etc/default/spamassassin

# Change to one to enable spamd
ENABLED=1

# Options
# See man spamd for possible options. The -d option is automatically added.

# SpamAssassin uses a preforking model, so be careful! You need to
# make sure --max-children is not set to anything higher than 5,
# unless you know what you're doing.

OPTIONS="--max-children 8 --nouser-config --listen-ip=127.0.0.1 -u otrs -g otrs"

# Pid file
# Where should spamd write its PID to file? If you use the -u or
# --username option above, this needs to be writable by that user.
# Otherwise, the init script will not be able to shut spamd down.
PIDFILE="/var/run/spamd.pid"

# Set nice level of spamd
NICE="--nicelevel 10"

# Cronjob
# Set to anything but 0 to enable the cron job to automatically update
# spamassassin's rules on a nightly basis
CRON=1

/etc/spamassassin/local.cf

Non-stock sections are shown here:

#   Set which networks or hosts are considered 'trusted' by your mail
#   server (i.e. not spammers)
#
trusted_networks 91.198.174.0/24 208.80.152.0/22 2620:0:860::/46 10.0.0.0/8

# short-format report template, starting with Yes/No, used for OTRS filters
clear_report_template
report _YESNO_, score=_SCORE_ | host: _HOSTNAME_ | scores: _TESTSSCORES(,)_ | autolearn=_AUTOLEARN_

#   Set file-locking method (flock is not safe over NFS, but is faster)
#
lock_method flock

#   Set the threshold at which a message is considered spam (default: 5.0)
#
required_score 3.5
score RP_MATCHES_RCVD -0.500
score RCVD_IN_RP_SAFE 2.000
score RCVD_IN_RP_CERTIFIED 2.000
score SPF_SOFTFAIL 2.000
score SUSPICIOUS_RECIPS 2.000
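For reference, the short-format report template above yields a one-line header of the following shape (values here are illustrative); the leading Yes/No is what the Exim filter's X-Spam-Flag check keys on:

```shell
# Illustrative rendering of the one-line report template; the real
# values come from SpamAssassin's _YESNO_, _SCORE_, _HOSTNAME_,
# _TESTSSCORES_ and _AUTOLEARN_ macros.
yesno="No"; score="2.4"; host="mendelevium"
tests="BAYES_00=-1.9,SPF_PASS=-0.0"; autolearn="ham"    # example data
printf '%s, score=%s | host: %s | scores: %s | autolearn=%s\n' \
    "$yesno" "$score" "$host" "$tests" "$autolearn"
```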

SpamAssassin Training

There are a few steps to spam training:

  1. user moves spammy messages to the Junk queue
  2. the OTRS Generic Agent job "Export_Spam" runs nightly, filtering for tickets which are not in state "Closed successful" and passing their MessageIDs to otrs.TicketExport2Mbox.pl
  3. otrs.TicketExport2Mbox.pl writes the messages to /var/spool/spam/spam, and changes the ticket's state to "Closed successful"
  4. /usr/local/bin/train_spamassassin picks up /var/spool/spam/spam and feeds it to sa-learn as spam

Ham training is similar:

  1. the OTRS Generic Agent job "Export_Ham" runs nightly, filtering for tickets in non-Junk queues which are in state "Open" or "Closed successful", and feeding those TicketIDs to otrs.TicketExport2Mbox.pl
  2. otrs.TicketExport2Mbox.pl writes the messages to /var/spool/spam/ham
  3. /usr/local/bin/train_spamassassin picks up /var/spool/spam/ham and feeds it to sa-learn as ham

The two scripts mentioned above are custom and are installed by puppet from operations/puppet/files/otrs/* in the git repository.
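The nightly flow above boils down to: append exported messages to an mbox spool, then hand that spool to sa-learn (presumably via sa-learn's standard --mbox mode, as the custom train_spamassassin script does). A toy version that just builds a spool and counts its messages:

```shell
# Toy spool: mbox messages are delimited by "From " lines, so counting
# those approximates how many messages the trainer would learn from.
spool="$(mktemp -d)"
printf 'From a@example.org Mon Jan  1 00:00:00 2024\nSubject: one\n\nbody\n\n' >> "$spool/spam"
printf 'From b@example.org Mon Jan  1 00:00:00 2024\nSubject: two\n\nbody\n\n' >> "$spool/spam"
grep -c '^From ' "$spool/spam"    # 2 messages queued for "sa-learn --spam"
```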

Upgrading

This section is general guidance for a patchlevel update only. Upgrades can be complicated by database schema changes and other issues. There's really no way around reading the upgrade documentation, and testing the updates on a real system based on our existing configuration and database.

  1. fetch new otrs, install to /opt/otrs-X.Y.Z
  2. stop puppet, apache, exim, and otrs cronjobs
    •  :~# puppet agent --disable
    •  :~# service apache2 stop
    •  :~# service exim4 stop
    •  :~# service cron stop
  3. create backups of existing code and database
    •  :/opt# tar czf backup/otrs-PREVIOUS_VERSION.tgz otrs-PREVIOUS_VERSION
    • stop a db slave and start dumping the otrs db there
  4. switch symlink, copy config into new code tree, fix permissions
    •  :/opt# rm otrs && ln -s otrs-VERSION otrs
    •  :/opt# cp otrs-PREVIOUS_VERSION/Kernel/Config.pm otrs/Kernel/
    •  :/opt# ./otrs/bin/otrs.SetPermissions.pl --otrs-user=otrs --otrs-group=otrs --web-user=www-data --web-group=www-data /opt/otrs
  5. check DB schema and upgrade as necessary
    •  :/opt# ./otrs/bin/otrs.CheckDB.pl
    • follow database upgrade instructions in UPGRADING.md
  6. restart apache, log in as an admin to the web interface and reinstall all addon packages
    •  :/opt# service apache2 start
    • ADMIN -> System Administration -> Package Manager
    • Reinstall will be under the ACTION column for each package
  7. recover old sysconfig settings
    •  :/opt# cp otrs-PREVIOUS_VERSION/Kernel/Config/Files/ZZZA* otrs/Kernel/Config/Files/
    •  :/opt# ./otrs/bin/otrs.SetPermissions.pl --otrs-user=otrs --otrs-group=otrs --web-user=www-data --web-group=www-data /opt/otrs
    •  :/opt# service apache2 restart
  8. test functionality
  9. restart exim, manually run puppet (which installs some scripts and reenables cron jobs)
    •  :~# service exim4 start
    •  :~# service cron start
    •  :~# puppetd -tv
    • send a mail to e.g. info-en and check that it shows up in OTRS
  10. restart slave database
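The code swap in step 4 is just a symlink flip; here it is rehearsed in a scratch directory (version numbers are examples):

```shell
# Rehearse the step-4 symlink switch in a throwaway directory.
cd "$(mktemp -d)"
mkdir otrs-3.2.9 otrs-3.2.10
ln -s otrs-3.2.9 otrs                # the currently-deployed tree
rm otrs && ln -s otrs-3.2.10 otrs    # step 4's switch to the new tree
readlink otrs                        # otrs-3.2.10
```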

Backups

Our OTRS installation is almost entirely database-backed: all data is stored in a mysql database (m2 shard), and only configuration and code are stored locally on the hosting server. OTRS is open source, so the code itself is effectively impossible to lose, and most of the configuration is stored in puppet. A few configuration items are still stored locally on the server, but that is temporary; they will eventually be moved into puppet as well. Hence we only care about keeping the database data safe and backed up.

The database is backed up regularly, once per week (currently on Wednesday). The infrastructure used is Bacula and most documentation from that page applies. The code doing the pre-dump is in https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/role/templates/mariadb/backups/dumps-otrs.sh.erb, and bacula just backs up the resulting file.

Restoring at a previous point in time is quite easy and all it takes is restore the dump from bacula (covered in Bacula) and applying it to the db server via the mysql command. Restoring individual items (like an article being deleted) is possible but quite complicated and difficult and requires a DBA to help isolate the specific transaction and avoid replaying it while replaying logs between the last backup and the time of the incident. It has never been done, nor required up to now.