
ncmonitor

From Wikitech
ncmonitor
Location: eqiad
Status
Overall: Active

ncmonitor keeps our registered non-canonical domains consistent with the services that depend on them, such as acme-chief and ncredir. Over the years, the manual upkeep this requires was neglected, and the services drifted out of sync with one another. ncmonitor keeps them in sync by automatically detecting drift and proposing patches. Ultimately, this means that only MarkMonitor needs to be maintained by hand.

Automation of registration is beyond the scope of this utility.

Configuration

Configuring and running the utility is documented in the project's manpages.

Our deployment configuration logic lives in the usual Puppet module and profile manifests.

Reviewing patches

Figure: dependency graph. Patches need to be merged in the proper order.

ncmonitor submits all of its patches to Gerrit at the same time; however, care must be taken to merge them in the right order. The services depend on each other in a linear fashion: ncredir requires acme-chief to have issued a certificate; acme-chief requires the domain to be configured in our DNS repository; the DNS repository requires properly-configured MarkMonitor domains.
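The merge order follows directly from that dependency chain. Here is a minimal sketch, assuming a simple lookup table; the service keys and the `merge_order` helper are hypothetical names for illustration and are not part of ncmonitor:

```python
# Each service maps to the one it depends on. The chain itself is taken from
# the paragraph above; the key names are illustrative assumptions.
DEPENDS_ON = {
    "ncredir": "acme-chief",
    "acme-chief": "dns",
    "dns": "markmonitor",
    "markmonitor": None,
}


def merge_order(target):
    """Return what must be in place, first to last, before `target` can merge."""
    chain = []
    current = target
    while current is not None:
        chain.append(current)
        current = DEPENDS_ON[current]
    # The walk goes from dependent to dependency, so reverse it.
    return list(reversed(chain))


print(merge_order("ncredir"))
# ['markmonitor', 'dns', 'acme-chief', 'ncredir']
```

MarkMonitor appears first because it is a prerequisite, not because ncmonitor patches it: the DNS patch is merged first, then acme-chief, then ncredir.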

ncmonitor will not submit patches for improperly-configured MarkMonitor domains.

Running

ncmonitor runs on its own Ganeti VM (ncmonitor1001.eqiad.wmnet), triggered by a systemd timer. The service is entirely stateless: the process works in temporary directories that are cleaned up after each run.
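One common way to get this kind of statelessness in Python is a context-managed temporary directory. The sketch below illustrates the pattern only; it is not ncmonitor's actual code, and `run_once` and the `work` callback are hypothetical names:

```python
import tempfile
from pathlib import Path


def run_once(work):
    """Run `work` inside a scratch directory that is deleted afterwards."""
    # TemporaryDirectory removes the directory and its contents on exit,
    # even if `work` raises, so nothing persists between runs.
    with tempfile.TemporaryDirectory(prefix="ncmonitor-") as tmp:
        return work(Path(tmp))


def example_work(scratch):
    # Hypothetical stand-in for cloning repositories and preparing patches.
    (scratch / "notes.txt").write_text("temporary state")
    return scratch


leftover = run_once(example_work)
print(leftover.exists())  # False
```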

ncmonitor can either simply print the required actions or automatically submit patches to Gerrit for human approval. The routine service run is configured to submit the patches automatically.

Wikimedia's MarkMonitor API usage is limited to the production IP range, so this utility must run in the production cluster.

Implementation details

MarkMonitor domain legitimacy

MarkMonitor domains exist in several states of legitimacy. ncmonitor checks for:

  • Whether or not WMF's DNS servers are used (it will ignore any domains that use different nameservers);
  • Whether or not any NS records are specified at all;
  • Whether domains are configured more than once in MarkMonitor (i.e. duplicate domains exist);
  • Whether or not the domain's registration is valid: domains can exist in the account without being valid, e.g. due to payment issues or expiration;
  • A hard-coded ignorelist (for e.g. WMCS servers which we don't handle).

If any of these checks fail, it is assumed that corrective action is in order. One exception applies, however: we don't want to remove domains that have duplicates from services, because one of the duplicates may be a valid domain. ncmonitor could be extended to check all duplicates for legitimacy and remove the domain only if they are all illegitimate.
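The checks and the duplicate exception can be sketched as a small classifier. This is an illustration under stated assumptions, not ncmonitor's implementation: the `Domain` record, the `classify` function, the nameserver set, the ignorelist entry, and the reading of "uses WMF's DNS servers" as a subset test are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

# Assumed values for illustration only.
WMF_NAMESERVERS = frozenset(
    {"ns0.wikimedia.org", "ns1.wikimedia.org", "ns2.wikimedia.org"}
)
IGNORELIST = frozenset({"ignored.example"})


@dataclass(frozen=True)
class Domain:
    name: str
    nameservers: tuple
    registration_valid: bool


def classify(domains):
    """Map each domain name to 'legitimate', 'illegitimate', or 'duplicate'."""
    counts = Counter(d.name for d in domains)
    result = {}
    for d in domains:
        if counts[d.name] > 1:
            # Exception from the text: duplicates are never removed
            # automatically, since one of the entries may be valid.
            result[d.name] = "duplicate"
        elif (
            d.name in IGNORELIST
            or not d.nameservers  # no NS records at all
            or not set(d.nameservers) <= WMF_NAMESERVERS  # foreign nameservers
            or not d.registration_valid  # e.g. expired or unpaid
        ):
            result[d.name] = "illegitimate"
        else:
            result[d.name] = "legitimate"
    return result
```

Checking for duplicates first is a design choice in this sketch: it guarantees that a duplicated name is flagged for human attention whatever the state of its individual entries.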

ncredir

ncredir's nc_redirects.dat file is kept alphanumerically sorted for organization. Humans aren't expected to edit the file very often, and because entries are appended automatically, any other organization of the file would be impossible to maintain.

ncredir differs from the other services in that its list contains subdomains: when ncmonitor proposes removing a domain, it also removes all related subdomains.
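The subdomain rule comes down to a suffix match on a label boundary. In the sketch below, a plain list of hostnames stands in for the real nc_redirects.dat format, which is not described here; `remove_domain` is a hypothetical helper, and Python's default string ordering is assumed to be close enough to the file's sort order for illustration:

```python
def remove_domain(hostnames, domain):
    """Drop `domain` and every subdomain of it, returning a sorted list."""
    # Matching on "." + domain keeps unrelated names that merely end with
    # the same characters, e.g. "notexample.org" when removing "example.org".
    suffix = "." + domain
    kept = [h for h in hostnames if h != domain and not h.endswith(suffix)]
    return sorted(kept)


hosts = [
    "www.example.org",
    "example.org",
    "notexample.org",
    "m.example.org",
    "example.com",
]
print(remove_domain(hosts, "example.org"))
# ['example.com', 'notexample.org']
```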

acme-chief

acme-chief's configuration is split out of Puppet's usual hieradata/common.yaml file and instead resides in its own file. This is because PyYAML mangles formatting and eats comments; it's easier to just let PyYAML take over formatting of a dedicated file.

acme-chief's hieradata (the certificates::acme_chief blocks) is grouped into certificates of at most 40 domains each. For more information on why, see acme-chief.
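Grouping with a fixed cap is a simple chunking operation. This is a minimal sketch, assuming the domains are sorted first and split into consecutive groups; `group_domains` is a hypothetical name, and the way ncmonitor actually assigns domains to certificates may differ:

```python
def group_domains(domains, max_per_cert=40):
    """Split domains into consecutive groups of at most `max_per_cert`."""
    ordered = sorted(domains)
    return [
        ordered[i:i + max_per_cert]
        for i in range(0, len(ordered), max_per_cert)
    ]


# 85 hypothetical domains end up in three groups.
groups = group_domains([f"domain{n:03d}.example" for n in range(85)])
print([len(g) for g in groups])
# [40, 40, 5]
```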

See also