Monitoring/check dsh groups

From Wikitech

This Icinga alert checks if a MediaWiki appserver is a member of the mediawiki-installation DSH group.

DSH groups control which hosts Scap deploys code to. In the Puppet repo, the list of servers now comes from Hiera: ./hieradata/common/scap/dsh.yaml defines each group by reference to conftool.

  mediawiki-installation:
    conftool:
      - {'cluster': 'appserver', 'service': 'apache2'}
      - {'cluster': 'api_appserver', 'service': 'apache2'}
      - {'cluster': 'jobrunner', 'service': 'apache2'}
      - {'cluster': 'testserver', 'service': 'apache2'}

The conftool data is also in the Puppet repo, under ./conftool-data/node. Check whether the affected host name appears there. If not, you can add it.
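The lookup described above can be sketched in Python. This is a minimal illustration, not the logic Puppet or Scap actually runs: the selectors are copied from the dsh.yaml excerpt, but the nested layout of NODE_DATA (datacenter, then cluster, then host, then a list of services), the host names, and the function name matching_selectors are all assumptions made for the example.

```python
# Sketch only: NODE_DATA's layout and host names are hypothetical.

# Selectors copied from the mediawiki-installation entry in dsh.yaml.
SELECTORS = [
    {"cluster": "appserver", "service": "apache2"},
    {"cluster": "api_appserver", "service": "apache2"},
    {"cluster": "jobrunner", "service": "apache2"},
    {"cluster": "testserver", "service": "apache2"},
]

# Assumed shape: datacenter -> cluster -> host -> list of services.
NODE_DATA = {
    "eqiad": {
        "appserver": {"mw0001.example.wmnet": ["apache2", "nginx"]},
        "jobrunner": {"mw0002.example.wmnet": ["apache2"]},
        "parsoid": {"wtp0001.example.wmnet": ["parsoid"]},
    },
}


def matching_selectors(host, node_data, selectors):
    """Return the selectors whose cluster and service both list the host."""
    matches = []
    for clusters in node_data.values():
        for selector in selectors:
            # Services declared for this host in the selector's cluster, if any.
            services = clusters.get(selector["cluster"], {}).get(host, [])
            if selector["service"] in services:
                matches.append(selector)
    return matches


if __name__ == "__main__":
    for name in ("mw0001.example.wmnet", "wtp0001.example.wmnet"):
        found = matching_selectors(name, NODE_DATA, SELECTORS)
        print(name, "matches" if found else "does not match", found)
```

An empty result means the host is not declared under any cluster and service that the group selects, which is the case where the text above suggests adding it.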

First, search Phabricator for the host name to make sure the server has no known hardware issue.

If the host is listed but the alert still fires, run scap pull on the host to fetch the latest code, then run pool to add it to the pool. The Icinga alert should recover a little while later.

Alternatively, you can pool the server from a management host such as cumin1001 using conftool commands.
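As a rough illustration of the conftool route, the sketch below only builds a confctl command line and prints it. It never runs anything. It assumes the usual confctl form of a select expression followed by a set/pooled= action, run with sudo; the host name and the helper name confctl_set_pooled are hypothetical, and the exact invocation should be checked against the conftool documentation before use.

```python
import shlex

# Pooled values discussed on this page.
VALID_STATES = ("yes", "no", "inactive")


def confctl_set_pooled(host, state):
    """Build (but do not run) the argv that sets one host's pooled state."""
    if state not in VALID_STATES:
        raise ValueError(f"unknown pooled state: {state!r}")
    # Assumed form: confctl select name=<host> set/pooled=<state>
    return ["sudo", "confctl", "select", f"name={host}", f"set/pooled={state}"]


if __name__ == "__main__":
    # Print the command for review instead of executing it.
    print(shlex.join(confctl_set_pooled("mw0001.example.wmnet", "yes")))
```

Building the argument list separately from execution keeps the command reviewable and rejects a mistyped state before anything touches the pool.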

Inactive servers

There are two levels of depooling. With pooled=no (which translates to enabled: False in pybal), the host receives no public traffic but still receives Scap deployments. With pooled=inactive, the host is also removed from the DSH group. The latter is generally used only when a host is unable to receive code updates, since it avoids Scap deployment errors caused by unreachable servers.
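The difference between the states can be summarised as a small lookup table. The sketch below encodes only what this page says about pooled=no and pooled=inactive; the pooled=yes row is an assumption (a fully pooled host gets both traffic and deployments), and the names PooledEffect, EFFECTS and alert_expected are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PooledEffect:
    receives_traffic: bool  # serves public traffic
    in_dsh_group: bool      # is a Scap deployment target


EFFECTS = {
    "yes": PooledEffect(receives_traffic=True, in_dsh_group=True),  # assumed
    "no": PooledEffect(receives_traffic=False, in_dsh_group=True),
    "inactive": PooledEffect(receives_traffic=False, in_dsh_group=False),
}


def alert_expected(state):
    """True if a host in this state is missing from the DSH group."""
    return not EFFECTS[state].in_dsh_group


if __name__ == "__main__":
    for state, effect in EFFECTS.items():
        print(state, effect, "alert expected:", alert_expected(state))
```

Under this model only pooled=inactive takes a host out of the DSH group, which is why moving a recovered host to pooled=no is enough to clear the alert without sending it traffic.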

If a server comes back online from maintenance or downtime and starts raising the "Host not in mediawiki-installation dsh group" alert, run scap pull so that it does not serve outdated code to monitoring requests in production (T310225), then set pooled=no so that it receives Scap deployments going forward.

Once the maintenance or repair ticket is updated or resolved, and any other verification has taken place, the server can be repooled.

History

Historically, before we had Salt or Cumin, we used DSH to run commands on multiple servers at once.

Server groups were text files in the Puppet repository, and mediawiki-installation was one of them. Taking a server out of the pool meant editing this text file. Pooling servers is now managed via conftool/confctl. See also pool/depool app servers.