Wikimedia DNS/Monitoring

Wikidough Basic Check

What does it mean?

If this check fails, a Wikidough host is not responding on port 443 (DoH) and/or port 853 (DoT) on its IPv4 and/or IPv6 address. This happens when the host is down or depooled, or when dnsdist.service is inactive or has failed.
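To reproduce the symptom by hand, a rough sketch like the one below can probe both ports on a host's own addresses. It only tests TCP reachability and does not send a real DoH or DoT query, so it is not a stand-in for the actual Icinga check. The function names and the example addresses (taken from the documentation ranges) are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: TCP reachability probe for ports 443 and 853.
# check_port and probe_host are hypothetical helper names.

check_port() {
    local addr="$1" port="$2"
    # bash's /dev/tcp pseudo-device opens a TCP connection;
    # timeout bounds the wait at 3 seconds.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$addr" "$port" 2>/dev/null
}

probe_host() {
    local addr port failed=0
    for addr in "$@"; do
        for port in 443 853; do
            if check_port "$addr" "$port"; then
                echo "OK   $addr port $port"
            else
                echo "FAIL $addr port $port"
                failed=1
            fi
        done
    done
    # Non-zero exit status if any address/port pair did not answer.
    return "$failed"
}

# Example with placeholder addresses (replace with the host's real IPv4/IPv6):
# probe_host 192.0.2.10 2001:db8::10
```

Passing the host's IPv4 and IPv6 addresses separately mirrors the way the check treats each address family on its own.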

Resolving this message

  • Head to Icinga and check if the host is up and if there are other checks that are failing (which may indicate a problem with the host itself).
  • Is the dnsdist.service active? Check with: systemctl status dnsdist.service.
    • If it has stopped or failed, try restarting it. If it fails again, check the journal output to see why it is failing.
  • Since this check queries the host IP and not the anycast IP, it is unlikely that there is an issue with anycast-healthchecker or the bird service.
    • Nevertheless, checking the status of the above two services might be worthwhile.
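The service checks in the list above could be run on the affected host with a small helper along these lines. This is a sketch, not an official procedure: the helper name triage_units is hypothetical, and the unit names for anycast-healthchecker and bird are assumed to follow the usual name.service pattern.

```shell
#!/usr/bin/env bash
# Sketch only: report each unit's state and show recent journal
# lines for any unit that is not active.

triage_units() {
    local unit state
    for unit in "$@"; do
        # is-active prints the state (active, inactive, failed, ...)
        # and exits non-zero when the unit is not active.
        state=$(systemctl is-active "$unit") || true
        echo "$unit: ${state:-unknown}"
        if [ "$state" != "active" ]; then
            # Last 50 journal lines for the unit, without a pager.
            journalctl -u "$unit" -n 50 --no-pager
        fi
    done
}

# Example (unit names other than dnsdist.service are assumptions):
# triage_units dnsdist.service anycast-healthchecker.service bird.service
```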

Service Restart Check

What does it mean?

A failure of this check indicates that the configuration file for either dnsdist or pdns-recursor was changed but the service itself was not restarted. A CRITICAL alert is raised if the time delta between the configuration file change and service restart exceeds 24 hours.

This check is meant as a warning and does not signify an error in the service itself.
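The 24-hour rule described above can be illustrated with a short sketch. This is one possible reading of the rule, not the actual check implementation: the function name, the PENDING label, and the choice to measure the delta from the configuration change to the current time are all assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: classify a service given three Unix timestamps (seconds):
#   $1 = mtime of the configuration file
#   $2 = time the service last started
#   $3 = current time
restart_status() {
    local config_mtime="$1" service_start="$2" now="$3"
    local threshold=$((24 * 60 * 60))
    if [ "$config_mtime" -le "$service_start" ]; then
        # Service was (re)started after the last config change.
        echo "OK"
    elif [ $((now - config_mtime)) -gt "$threshold" ]; then
        # Config change has gone unapplied for more than 24 hours.
        echo "CRITICAL"
    else
        # Config changed, restart still outstanding (hypothetical label).
        echo "PENDING"
    fi
}

# Example; the config path is an assumption, and GNU date must be able
# to parse the timestamp format that systemd prints:
# restart_status "$(stat -c %Y /etc/dnsdist/dnsdist.conf)" \
#     "$(date -d "$(systemctl show -p ActiveEnterTimestamp --value dnsdist.service)" +%s)" \
#     "$(date +%s)"
```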

Resolving this message

Please do not perform the steps below without contacting the Traffic team first, as restarting either of these services clears its cache.

From a cumin host, restart the service named in the alert on the Wikidough hosts. The -b 1 -s 5 options run the command on one host at a time, with a 5-second pause between hosts:

sudo cumin -b 1 -s 5 'A:wikidough' 'systemctl restart dnsdist.service'

or,

sudo cumin -b 1 -s 5 'A:wikidough' 'systemctl restart pdns-recursor.service'