Obsolete:Ganglia

This page contains historical information. Wikimedia stopped using Ganglia in December 2017. It was removed in task T177225.

2017

Ganglia is a scalable distributed system monitoring tool. Each server runs a small agent, gmond, which collects metrics locally. Metadata servers (gmetad) poll the agents (which can themselves be other metadata servers), collect an XML document, and update a time-series database (rrdtool).

By chaining gmetad metadata servers, the infrastructure is very scalable.
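As an illustration of the polling step described above, here is a minimal Python sketch of how a poller might turn the XML document from an agent into a per-host metric table. The sample XML is hand-written and heavily simplified (real gmond output carries many more attributes), and the function name parse_cluster_state and the cluster and host names are hypothetical. This is not how gmetad itself is implemented.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified sample; cluster and host names are made up.
SAMPLE_XML = """<GANGLIA_XML>
  <CLUSTER NAME="example">
    <HOST NAME="host1.example">
      <METRIC NAME="load_one" VAL="0.25" TYPE="float" UNITS=""/>
      <METRIC NAME="cpu_num" VAL="8" TYPE="uint16" UNITS="CPUs"/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>"""


def parse_cluster_state(xml_text):
    """Return {cluster: {host: {metric_name: raw_value_string}}}."""
    root = ET.fromstring(xml_text)
    state = {}
    for cluster in root.iter("CLUSTER"):
        hosts = state.setdefault(cluster.get("NAME"), {})
        for host in cluster.iter("HOST"):
            # Values are kept as strings; TYPE says how to interpret them.
            hosts[host.get("NAME")] = {
                metric.get("NAME"): metric.get("VAL")
                for metric in host.iter("METRIC")
            }
    return state


state = parse_cluster_state(SAMPLE_XML)
print(state["example"]["host1.example"]["load_one"])
```

A real poller would read the XML from the agent over the network and then write each value into the matching rrd file; both steps are omitted here.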

Installing

On (almost) all servers the installation is handled by Puppet. Every node sets the $cluster variable, which determines which Ganglia group the server belongs to. An

 include ganglia

statement then ensures that Ganglia is installed with the right configuration file.

Configuration background

gmond

Configuration of gmond is done via the master file /home/wikipedia/conf/gmond/gmond.conf.master and the configuration generator script /home/wikipedia/conf/gmond/conf.php. The whole /home/wikipedia/conf/gmond directory is copied to every server running gmond, as /etc/gmond. The ./sync script is a shortcut to the relevant rsync command. Each server then has a symlink from /etc/gmond.conf to the relevant specialised configuration file in /etc/gmond/*.conf.

Each cluster has its own multicast channel. Channels are allocated automatically by conf.php. The first cluster in the $clusters array is given the IP address 239.192.0.1, the second is given 239.192.0.2, and so on up. 239.192.0.0/24 should be considered reserved for this purpose, as documented at IP addresses.
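The allocation rule above can be sketched in a few lines. The real logic lives in conf.php, which is not reproduced here; the Python below is only an illustration of the stated rule (the Nth cluster gets 239.192.0.N), and the function name allocate_channels and the cluster names are hypothetical.

```python
import ipaddress

# Range reserved for Ganglia multicast channels, per the text above.
RESERVED = ipaddress.ip_network("239.192.0.0/24")


def allocate_channels(clusters):
    """Map each cluster name to a multicast address by list position."""
    channels = {}
    for index, name in enumerate(clusters, start=1):
        address = RESERVED.network_address + index
        if address not in RESERVED:
            raise ValueError("more clusters than addresses in " + str(RESERVED))
        channels[name] = str(address)
    return channels


# Hypothetical cluster names, in the order they would appear in $clusters.
channels = allocate_channels(["cluster_a", "cluster_b", "cluster_c"])
print(channels)
```

Because the address depends only on position in the list, inserting a cluster anywhere but the end would shift the channel of every cluster after it.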

The *_aggr.conf configuration files are "aggregator" configuration files. These configure ganglia in non-deaf mode, allowing it to listen to the multicast channel, aggregate the state of the cluster in memory and respond to XML requests from gmetad. The remaining configuration files use deaf mode, which saves a small amount of CPU time and memory for those servers.

gmetad

There is more than one instance of gmetad (the Ganglia aggregator). The instance for esams runs on Bast3002. It aggregates data from the misc hosts in the esams cloud (apparently). Its rrd files are written to /var/lib/ganglia/rrds/. Its config file lives in /etc/ganglia/gmetad.conf.

Another instance runs on streber. It appears to aggregate all the rest of the data. Data and config files live in the same locations as above.

(Really? Where is the data for misc pmtpa? Could someone fill in the missing bits please?)

Puppet

The Puppet recipes for Ganglia can be found in manifests/ganglia.pp.

Web frontend

We run a customised copy of the web frontend with a document root at /home/wikipedia/htdocs/ganglia. There is a symlink from pmtpa to . (the document root itself); conf.php detects the request URI and reads from either gmetad_aggr or gmetad_pmtpa as appropriate.

gmetricd

A Python daemon called gmetricd collects the diskio_* metrics. It should be running on every server that runs gmond. It should be possible to extend it with other metrics if desired. The code is in SVN in the ganglia_metrics directory, and there are some RPMs in /home/wikipedia/rpms/ganglia/ganglia_metrics/. It is also available in the APT repository as the package ganglia-metrics.
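To give a rough idea of what collecting a disk I/O metric involves, here is a minimal sketch. It assumes a Linux host and the standard /proc/diskstats field layout; this page does not say how gmetricd actually gathers its data, and the function names and sample lines below are hypothetical, so this should not be read as the daemon's real code.

```python
def parse_diskstats_line(line):
    """Pick a few counters out of one /proc/diskstats line."""
    fields = line.split()
    return {
        "device": fields[2],
        "reads_completed": int(fields[3]),
        "sectors_read": int(fields[5]),
        "writes_completed": int(fields[7]),
        "sectors_written": int(fields[9]),
    }


def per_second(previous, current, key, interval_seconds):
    """Turn two cumulative counter samples into a rate."""
    return (current[key] - previous[key]) / interval_seconds


# Two made-up samples of the same device, taken 10 seconds apart.
before = parse_diskstats_line("8 0 sda 100 0 2000 50 40 0 800 30 0 60 80")
after = parse_diskstats_line("8 0 sda 160 0 3000 70 50 0 900 35 0 75 105")
print(per_second(before, after, "reads_completed", 10))
```

The kernel counters are cumulative, which is why a daemon has to keep the previous sample and report the difference rather than the raw value.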

Issues

If Ganglia does not work, try rebooting the machine, or send the gmond process a HUP signal with kill -HUP $processID, and then run sudo puppet agent -tv.