Puppet/Performance investigation

From Wikitech
Jump to navigation Jump to search

This page serves to chronicle my investigation into why puppet is so slow in our environment. Bhartshorne 17:01, 9 August 2011 (UTC)

Potential problems to check

  • the built in ruby web server is dog slow. Are we using apache / passenger?
    • yes we are, as evidence by ps on sockpupppet
  • do we have any recursive chmods? apparently, they can be very slow.

Hypotheses about why we're slow

  • we're being too clever - things like auto-generating the nagios config based on the content of other classes is too much work. We should switch to maintaining more config files by hand and just using puppet to distribute them.
    • aka too many virtual and external resources
  • there's too much in there - we should pull out all the 'default' stuff and only actually manage what we want done differently than it would without intervention.
  • While a client puppet update was seemingly hanging for several minutes, I ran an strace on sockpuppet and saw a lot of time being spent on mysql queries to db9. This is for the stored config mechanism.
  • comments from mark:
What I've found slows down Puppet the most, is exported resources. The implementation of that in Puppet is a complete clusterfuck; a very unoptimized ActiveRecord setup with very inefficient db queries, and internally Puppet appears to be doing some silly things, too. The main uses of exported resources in our setup are Nagios and ssh host keys. That's why Spence especially is slow.
The implementation of the Nagios types in particular is also quite retarded indeed, with the full file being scanned every time. Still, even so, it shouldn't be *that* bad, and Ruby is probably to be blamed for that. We could fix that easily by splitting to one file per host, I guess.
The SSH host keys syncing I've already reduced to once every several runs (with a hack), which helped a lot.
  • I've heard the claim taht the puppet client leaks memory, and that switching to cron-based puppet runs will behave better overall. I'm not sure it'll change performance, but it's worth at try.
    • the puppet client is getting kicked every night, which should clear out any memory leaks.

Avenues of investigation

server side

  • look at apache/passenger config and see if there are easy tunables
  • find out how long it takes to generate the catalogs
  • find out what happens after checking in a file and how long that takes
  • read the catalogs for a few hosts and see what's there
  • read through how we generate nagios configs and understand it
    • is there a speed difference between complete generation and generation from snippets?

client side

  • run puppetd --test --debug on a bunch of hosts, both fast and slow, and see what it's doing

Tasks

  • build a puppet server in eqiad for testing RT-1279
    • when the tests are complete, this server can be used as a failover / slave / hot spare for the puppetmaster

Tests to run

  • move a class to a module and see if there's any performance difference

collected data

Data as stashed in ~ben/puppet/ on fenari.

puppetmaster

Creating the catalog takes 8s for most hosts, 2m for the nagios host (from /var/log/daemon.log).

Puppet catalog generation time graph.png

sockpuppet seems to pathalogically hit 100% CPU for long periods. Finding out what's causing this and squashing it will go a long way towards making the puppetmaster responsive. This is likely related to the variance in document retrieval times shown below, though I haven't confirmed that yet.

Sockpuppet load.png

aluminium

The puppet run on aluminium went relatively quickly. I ran $ puppetd --test --debug to get detailed output and tailed the logs on the puppetmaster at the same time. Looking at the apache logs for subsequent runs indicates massive variation in the query duration.

fast run

  • puppetmaster clock time: 5.5 seconds in the apache log to deliver the catalog
    • client log shows "info: Caching catalog for aluminium.wikimedia.org" at the end of retrieving the catalog from the server
  • client clock time: 28 seconds, starting at 17:51:29 and finishing at 17:51:57
    • 29 - start the run
      • loading state: 0.04s
        • puppetmaster: GET /production/file_metadatas/plugins?...: 4.530ms
      • storing state: 0.31s
    • 31 - getting catalog from the puppetmaster: 5.5s
    • 37 - finished downloading catalog
      • loading state: 0.04s
      • staging config, figuring out dependencies
      • puppetmaster: GET /production/file_metadata/files...: 5 * 2.2ms
    • 39 - still staging config
    • 41 - finished staging and ordering components. running apt-get update
      • apt-get update: ~10s
    • 51 - checking package versions
      • running apt-cache policy on a bunch of packages, presumably to find out if they need to be upgraded
      • packages: nagios-plugins-standard, jfsutils, lldpd, rsyslog, gdb, wikimedia-raid-utils, amanda-common, screen, ntp, zsh-beta, nagios-plugins, wipe, openssl, nagios-nrpe-server, tzdata
      • puppetmaster: GET /produciton/file_metadata/files/misc/97-last-puppet-run: 2.2ms
    • 53 - doing a bunch of is-running checks
      • running ps -ef for each of the packages that should be running
      • ntp, rsyslog
      • "debug: file_metadata supports formats: b64_zlib_yaml marshal pson raw yaml; using pson" repeated many times, not sure why.
      • puppetmaster: GET /production/file_metadata/files/...: 9*2.2ms
      • puppetmaster: GET /production/file_metadata/files/misc/vimrc.local: 106.3ms
    • 55 - still doing is-running checks based on the output of ps
      • running some apt sources checks for jenkins
      • doing more apt-cache policy checks
      • checking in with nagios via snmptrap?
      • finished
    • 57

apache logs comparing actions in later runs

Sadly I don't have client logs from the run at 18:07:34. Clearly though, something is going on on the server that is causing incredibly slow responses to the client queries.

Note: the fifth field is apache query duration in microseconds. the last numerical field is transmission size

 sockpuppet.pmtpa.wmnet:8140 208.80.154.6 - - 4530 [09/Aug/2011:17:51:29 +0000] "GET /production/file_metadatas/plugins?recurse=true&links=manage&&checksum_type=md5&ignore=---+%0A++-+.svn%0A++-+CVS%0A++-+.git HTTP/1.1" 200 3962 "-" "-" 
 vs.
 sockpuppet.pmtpa.wmnet:8140 208.80.154.6 - - 257059555 [09/Aug/2011:18:07:34 +0000] "GET /production/file_metadatas/plugins?recurse=true&links=manage&&checksum_type=md5&ignore=---+%0A++-+.svn%0A++-+CVS%0A++-+.git HTTP/1.1" 200 3962 "-" "-" 
 sockpuppet.pmtpa.wmnet:8140 208.80.154.6 - - 2571 [09/Aug/2011:17:51:37 +0000] "GET /production/file_metadata/files/misc/50-wikimedia-base.conf.sysctl HTTP/1.1" 200 2458 "-" "-" 
 sockpuppet.pmtpa.wmnet:8140 208.80.154.6 - - 2242 [09/Aug/2011:17:51:37 +0000] "GET /production/file_metadata/files/misc/50-wikimedia-base.conf.sysctl HTTP/1.1" 200 2458 "-" "-" 
 sockpuppet.pmtpa.wmnet:8140 208.80.154.6 - - 2223 [09/Aug/2011:17:51:37 +0000] "GET /production/file_metadata/files/misc/50-wikimedia-base.conf.sysctl HTTP/1.1" 200 2458 "-" "-" 
 sockpuppet.pmtpa.wmnet:8140 208.80.154.6 - - 2292 [09/Aug/2011:17:51:38 +0000] "GET /production/file_metadata/files/misc/50-wikimedia-base.conf.sysctl HTTP/1.1" 200 2458 "-" "-" 
 sockpuppet.pmtpa.wmnet:8140 208.80.154.6 - - 3369 [09/Aug/2011:17:51:38 +0000] "GET /production/file_metadata/files/misc/50-wikimedia-base.conf.sysctl HTTP/1.1" 200 2458 "-" "-" 
 vs.
 sockpuppet.pmtpa.wmnet:8140 208.80.154.6 - - 52877621 [09/Aug/2011:18:14:35 +0000] "GET /production/file_metadata/files/misc/50-wikimedia-base.conf.sysctl HTTP/1.1" 200 2458 "-" "-" 
 sockpuppet.pmtpa.wmnet:8140 208.80.154.6 - - 2846352 [09/Aug/2011:18:15:33 +0000] "GET /production/file_metadata/files/misc/50-wikimedia-base.conf.sysctl HTTP/1.1" 200 2458 "-" "-" 
 sockpuppet.pmtpa.wmnet:8140 208.80.154.6 - - 1061784 [09/Aug/2011:18:15:42 +0000] "GET /production/file_metadata/files/misc/50-wikimedia-base.conf.sysctl HTTP/1.1" 200 2458 "-" "-" 
 sockpuppet.pmtpa.wmnet:8140 208.80.154.6 - - 1084383 [09/Aug/2011:18:15:48 +0000] "GET /production/file_metadata/files/misc/50-wikimedia-base.conf.sysctl HTTP/1.1" 200 2458 "-" "-" 
 sockpuppet.pmtpa.wmnet:8140 208.80.154.6 - - 368451 [09/Aug/2011:18:15:54 +0000] "GET /production/file_metadata/files/misc/50-wikimedia-base.conf.sysctl HTTP/1.1" 200 2458 "-" "-" 

This one was actually faster the second time around:

 sockpuppet.pmtpa.wmnet:8140 208.80.154.6 - - 106329 [09/Aug/2011:17:51:54 +0000] "GET /production/file_metadata/files/misc/vimrc.local HTTP/1.1" 200 2442 "-" "-" 
 vs.
 sockpuppet.pmtpa.wmnet:8140 208.80.154.6 - - 5782 [09/Aug/2011:18:16:41 +0000] "GET /production/file_metadata/files/misc/vimrc.local HTTP/1.1" 200 2442 "-" "-"

spence

Spence is reliably slow. In addition to the request to the puppetmaster for the catalog taking a long time, the requests appear to wait longer between requests.

  • Whereas Al waits 2 seconds between requesting the file_metadatas and the catalog, spence waits 40 to 60 seconds.
    • this may be an artifact of apache log timestamping combined with the transfer time for the catalog. Don't trust this yet.

Reported puppet client run duration from spence (/var/log/syslog)

 Aug  9 07:26:35 spence puppet-agent[13908]: Finished catalog run in 8785.23 seconds
 Aug  9 09:50:46 spence puppet-agent[13908]: Finished catalog run in 6166.45 seconds
 Aug  9 12:03:13 spence puppet-agent[13908]: Finished catalog run in 5558.38 seconds
 Aug  9 14:23:30 spence puppet-agent[13908]: Finished catalog run in 5559.32 seconds
 Aug  9 16:54:18 spence puppet-agent[13908]: Finished catalog run in 4495.31 seconds

Looking at the output of puppetd --test --debug, 110 minutes elapse after "Prefetching naginator resources for nagios_service". Presumably, it's crunching the naginator bits this entire time, generating the 2800 service declarations that wind up in /etc/nagios/puppet_services.cfg.

Things to try

  • move service declarations into directories instead of one big file. Read http://pieter.barrezeele.be/2009/05/11/puppet-and-nagios/ and do as it says.
    • config change - add target => "/etc/nagios/nagios_services.d/ping-$fqdn.cfg" in the service definition (assuming $fqdn exists, which it should)
  • shift around the nagios config generation style such that rather than creating one service definition per host/service combo, it creates one service definition for each basic service and uses host groups to collect all hosts that want to test that service.
  • upgrade ruby, see if the newest version has any effect

Other resources that look interesting

Resolved issues and their description

The problems with puppet broke down to two major issues:

  • the puppetmaster periodically goes into a death spiral where apache requests gradually take more and more time until they're all lasting 500 seconds. It then recovers on its own. During this time, all puppet clients perform poorly.
  • spence, the nagios host, takes an excessively long time to run puppet. (around 2 hours.) This time is broken down to about 2 minutes to retrieve the puppet catalog from the master and around 110 minutes to process the catalog

The second has been resolved, full description below.

Spence

Spence was taking about 2 hours to run the puppet client:

  • 2 minutes to get the catalog
  • 110 minutes to parse the nagios stuff

A blog post by Pieter Barrezeele explains what's going on: with many tests, the file containing all the tests grows very large. Puppet must parse the entire file for every check to determine if the check needs to be changed. On Spence, that file was about 2MB in size and we had about 2800 checks. Parsing a 2MB file 2800 times takes a long time.

The solution is to break up the checks into many smaller files. Pieter shows off how to do it if each service is actually defined separately, but our config allows the check to be created by a variable passed into the exported @@nagios_service. Instead I chose to store all the checks for a given host in a single file, rather than having one file per service/host pair. This decreased the number of files in the directory from 2800 to 385. In terms of the puppet config, I added a segment to create the directory /etc/nagios/puppet_checks.d/, then changed the target parameter of monitor_service from ${nagios_config_dir}/puppet_checks.cfg to ${nagios_config_dir}/puppet_checks.d/${hostname}.cfg.

# set up the nagios server
class nagios::monitor {
    ...
    # make sure the directory for individual service checks exists 
    file { "/etc/nagios/puppet_checks.d":
        ensure => directory,
        owner => root,
        group => root,
        mode => 0755;
    }
    ...
}
# pull this in from services you want monitored
define monitor_service ($description, $check_command, $host=$hostname, $retries=3, $group=$nagios_group, $ensure=present, $critical="false", $passive="false", $freshness=7200) {
    @@nagios_service { "$hostname $title":
        target => "${nagios_config_dir}/puppet_checks.d/${hostname}.cfg"
        ...

For some reason I haven't yet understood, puppet created each of those files inside puppet_checks.d root/root mode 600; i.e. owned by root and only readable by root. The nagios user was unable to read them, causing nagios to refuse to start. I tried to set each file's permissions when it was created, but when two services both landed inside one host's file, the declaration of file { "${nagios_config_dir}/puppet_checks.d/${hostname}.cfg then appeared twice, causing a conflict. Instead I changed the puppet_checks.d directory to apply its permissions recursively. Note that though the directory mode argument is set to 0600, puppet understands that when applying this recursively, directories are also supposed to get the execute bit and sets them 0755.

# set up the nagios server
class nagios::monitor {
    ...
    # make sure the directory for individual service checks exists 
    file { "/etc/nagios/puppet_checks.d":
        ensure => directory,
        owner => root,
        group => root,
        recurse => true,
        mode => 0644;
    }
    ...
}

With these changes in place, after running a few times to get all the files settled into their correct places, the puppet run on spence now takes on average 5 minutes total.