ncmonitor

From Wikitech
ncmonitor1001
Location: eqiad
Status
Overall: Active

ncmonitor keeps consistency between our registered non-canonical domains and services such as acme-chief and ncredir.

Redirecting a domain requires:

  1. MarkMonitor's domain registration configured to Wikimedia's DNS servers (ns[0-2].wikimedia.org).
  2. operations/dns must have the domain symlinked to ncredir-parking (e.g. templates/example.com → templates/ncredir-parking).
  3. acme-chief must have issued a certificate covering the domain (and all its subdomains).
  4. ncredir must have the domain configured in its nginx configuration mapping.

ncmonitor handles all but the MarkMonitor bits: Automation of registration is beyond its scope.

Configuration

Configuring and running the utility is documented in the project's manpages.

Our deployment configuration logic lives in the usual Puppet module and profile manifests.

Reviewing patches

Figure: dependency graph with arrows and text. Patches need to be merged in a proper order.

ncmonitor patches are submitted to Gerrit at the same time; however, care must be taken to merge them in the right order. Services depend on each other in a linear fashion: ncredir requires acme-chief to have issued a certificate; acme-chief requires the domain to be configured in our DNS repository; the DNS repository requires properly-configured MarkMonitor domains.
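The linear dependency chain can be written down as "prerequisite dependent" pairs and fed to tsort to print the resulting merge order. This is only an illustration: the four labels are informal names chosen for this sketch, not repository or project names.

```shell
# Each pair reads "prerequisite dependent"; tsort prints an order in which
# every prerequisite comes before the things that depend on it.
# The labels are informal names for this sketch, not real repository names.
merge_order() {
  printf '%s\n' \
    'markmonitor operations-dns' \
    'operations-dns acme-chief' \
    'acme-chief ncredir' | tsort
}

merge_order
```

Because the chain is strictly linear, there is only one valid order: MarkMonitor first, then the DNS change, then acme-chief, and ncredir last.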

ncmonitor will not submit patches for improperly-configured MarkMonitor domains.

Manual entry or configuration in any of the repositories is still welcome and will not be overwritten by ncmonitor so long as the domains in question exist in MarkMonitor. Non-existent domains, even those with configuration or custom logic, will be proposed for removal like anything else.

Ignoring domains

ncmonitor ignores the existence of any domains included in its ignore list. Domains to ignore include "main" site domains (e.g. wikipedia.org) or domains we want to "dead-park" (i.e. serve no records at all but merely sit on the domain).

Domains with their own zone in operations/dns are automatically ignored: There's no need to add a newly-registered domain with grand plans to the ignore list.

DNS repo

  1. Verify whether we actually want to redirect each domain or "dead-park" it: to dead-park a domain, edit it out of any open CRs and add it to the ignore list.
  2. Verify the state of DNS in the real world[1] (ncmonitor will eventually do that with task T402960). For a quick check of a single domain, run:
    $ dig +trace example.com
    [...]
    example.com.	86400	IN	NS	ns0.wikimedia.org.
    example.com.	86400	IN	NS	ns1.wikimedia.org.
    example.com.	86400	IN	NS	ns2.wikimedia.org.
    
    This snippet can be used for manual verification of a bunch of domains in a CR.
  3. Run authdns-update on a DNS server after merging.
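When a CR touches many domains, the NS check can be looped. The sketch below is one possible way to do that and is not part of ncmonitor. It assumes a hypothetical file with one domain per line, and it uses dig +short NS, which asks the locally configured resolver rather than following the delegation from the root as dig +trace does.

```shell
# Succeeds only if stdin holds at least one NS name and every name is
# ns0, ns1 or ns2.wikimedia.org (with or without the trailing dot).
ns_all_wikimedia() {
  awk '
    NF { seen = 1; if ($1 !~ /^ns[0-2]\.wikimedia\.org\.?$/) bad = 1 }
    END { exit !(seen && !bad) }
  '
}

# Prints one "OK" or "CHECK" line per domain listed in the file given as $1.
# The file (one domain per line) is a hypothetical input for this sketch.
check_domains() {
  while IFS= read -r domain; do
    [ -n "$domain" ] || continue
    if dig +short NS "$domain" </dev/null | ns_all_wikimedia; then
      printf 'OK     %s\n' "$domain"
    else
      printf 'CHECK  %s\n' "$domain"
    fi
  done < "$1"
}

# Usage (domains.txt is a hypothetical file name):
#   check_domains domains.txt
```

Any domain reported as CHECK still needs a manual look with dig +trace before the patch is merged.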

acme-chief

  1. Verify the state of DNS in the real world[1] (ncmonitor will eventually do that with task T402960). For a quick check of a single domain, run:
    $ dig +short ns example.com
    
  2. Verify that DNSSEC is disabled as it would interfere with certificate issuance. A quick and dirty check:
    $ dig +short DNSKEY example.com
    
  3. Disable Puppet on A:ncredir from cumin to prevent ncredir from pulling snake oil certificates[2]:
    # cumin A:ncredir 'disable-puppet "rolling out new acme-chief certs"'
    
  4. Merge/puppet-merge in the acme-chief CR
  5. Use cumin to run the Puppet agent on all acme-chief hosts:
    # cumin A:acmechief run-puppet-agent
    
  6. Verify the certificates on the active acme-chief instance:
    # find -L /var/lib/acme-chief/certs/ -name ec-prime256v1.crt | grep live | grep non-canonical | while read -r crt; do printf '%s: ' "$crt"; openssl x509 -issuer -noout -in "$crt"; done
    
    All certs should show valid issuance fields such as issuer=C = US, O = Let's Encrypt, CN = E8. If any of them show snakeoil cert material, wait a few minutes for issuance to occur and then run the find command again.
  7. When snake oil certs are all absent, re-enable Puppet on A:ncredir to resume their automated certificate fetching:
    # cumin A:ncredir 'enable-puppet "rolling out new acme-chief certs"'
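To avoid reading a long list of issuer lines by eye, the output of the find loop in step 6 can be piped through a small filter. This is a sketch under one loud assumption: any certificate whose issuer line does not mention Let's Encrypt is treated as still being snake oil. The function name is made up for this example.

```shell
# Reads lines of the form "<path>: issuer=..." on stdin (the output format of
# the find/openssl loop in step 6). Succeeds only when every line names
# Let's Encrypt as issuer; otherwise lists the remaining lines and fails.
# Assumption: anything not issued by Let's Encrypt is still a snake oil cert.
certs_ready() {
  pending=$(grep -v "Let's Encrypt" || true)
  if [ -n "$pending" ]; then
    printf 'still waiting on:\n%s\n' "$pending"
    return 1
  fi
  echo 'all certificates issued'
}

# Usage: append "| certs_ready" to the find command from step 6.
```

Empty input also counts as ready, so first confirm that the find command actually matched some certificates.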
    

ncredir

Until ncmonitor is extended with a feature to automatically guess appropriate redirection domains (task T368692), domains are set to redirect to https://www.wikimedia.org by default. We can do better, so fine-tune the CR:

  1. Download the CR locally
  2. Update the default https://www.wikimedia.org redirects to more appropriate locations
  3. Rebase the changes and upload a new patch set
  4. Get a +1 from another user[3]
  5. Merge/puppet-merge the changes

Running

ncmonitor runs on its own Ganeti VM (ncmonitor1001.eqiad.wmnet) on a systemd timer. The service is entirely stateless: The process runs with temporary directories that are cleaned up after the service runs.

ncmonitor can be run to either simply print out required actions or to automatically submit patches to Gerrit for human approval. The routine service execution is set to automatically submit the patches.

Wikimedia's MarkMonitor API usage is limited to the production IP range, so this utility must run in the production cluster.

Implementation details

MarkMonitor domain legitimacy

MarkMonitor domains exist in several states of legitimacy. ncmonitor checks for:

  • Whether or not WMF's DNS servers are used (It will ignore any domains that use different NS servers);
  • Whether or not any NS records are specified at all;
  • Whether domains are configured more than once in MarkMonitor (i.e. duplicate domains exist);
  • Whether or not the domain's registration is valid: domains could exist in the account but not be valid due to e.g. payment issues or expiration;
  • A hard-coded ignorelist (for e.g. WMCS servers which we don't handle).

If any of these checks fail, it is assumed that corrective action is in order. One exception applies, however: we don't want to remove domains that have duplicates from any services, because it's possible that one of the duplicates is a valid domain. ncmonitor could be extended to check all duplicates for legitimacy and remove the domain only if they're all illegitimate.
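The checks above can be pictured as a classification pass over the domain list. The sketch below is not ncmonitor's implementation. The tab-separated input format (domain, registration status, comma-separated nameservers), the "active" status value, the verdict names and the order of the checks are all assumptions made for illustration.

```shell
# Reads hypothetical lines "domain<TAB>status<TAB>ns1,ns2,..." on stdin and
# prints "domain<TAB>verdict". $1 is an optional comma-separated ignore list.
# Input format, status values and verdict names are assumptions of this sketch.
classify_domains() {
  awk -F '\t' -v ignore="${1:-}" '
    function all_wmf(list,   parts, k, m) {
      m = split(list, parts, ",")
      for (k = 1; k <= m; k++)
        if (parts[k] !~ /^ns[0-2]\.wikimedia\.org\.?$/) return 0
      return 1
    }
    BEGIN {
      n_ign = split(ignore, ign, ",")
      for (i = 1; i <= n_ign; i++) ignored[ign[i]] = 1
    }
    NF {
      count[$1]++                          # more than one means duplicate
      if (count[$1] == 1) order[++n] = $1  # remember first-seen order
      status[$1] = $2
      ns[$1] = $3
    }
    END {
      for (i = 1; i <= n; i++) {
        d = order[i]
        if (d in ignored)               verdict = "ignored"
        else if (count[d] > 1)          verdict = "duplicate"
        else if (status[d] != "active") verdict = "invalid-registration"
        else if (ns[d] == "")           verdict = "no-ns"
        else if (!all_wmf(ns[d]))       verdict = "foreign-ns"
        else                            verdict = "ok"
        printf "%s\t%s\n", d, verdict
      }
    }
  '
}
```

In this sketch a "duplicate" verdict takes priority over the registration and nameserver checks, mirroring the exception described above: duplicated domains are flagged but not proposed for removal.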

ncredir

ncredir's nc_redirects.dat file is kept alphanumerically sorted for organization. Humans aren't expected to edit the file very often, and because entries are appended automatically, any other manual organization would be impossible to maintain.

ncredir differs from the other services in that its list contains subdomains: when ncmonitor proposes removing a domain, it also proposes removing all related subdomains. Detection of TLDs requires the use of a list, so the ncmonitor Puppet profile includes a timer to regularly update the list of TLDs from publicsuffix.org.
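To see why a suffix list is needed, consider www.example.co.uk: without knowing that co.uk is a public suffix, there is no way to tell that the registered domain is example.co.uk rather than co.uk. The following sketch finds the registered domain for a hostname given a suffix list file. It is a simplification and not ncmonitor's code: it handles plain suffix entries only and ignores the wildcard and exception rules that the real list also contains.

```shell
# Prints the registered domain of hostname $1, using the suffix list in file $2
# (one suffix per line, lines starting with // are comments).
# Simplified: wildcard (*.x) and exception (!x) rules are not interpreted.
registrable_domain() {
  awk -v host="$1" '
    !/^\/\// && NF { suffix[$1] = 1 }
    END {
      n = split(host, label, ".")
      # Try the longest candidate suffix first, dropping one label at a time.
      for (i = 1; i <= n; i++) {
        cand = label[i]
        for (j = i + 1; j <= n; j++) cand = cand "." label[j]
        if (cand in suffix) {
          # The registered domain is the suffix plus one more label.
          if (i > 1) print label[i - 1] "." cand
          exit
        }
      }
    }
  ' "$2"
}

# Usage (the file name is a hypothetical local copy of the list):
#   registrable_domain www.example.co.uk public_suffix_list.dat
```

Entries in the redirect list that share the same registered domain are the ones that would be removed together.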

For guidelines on where to redirect domains, see ncredir#Types of domains.

acme-chief

acme-chief's configuration is split off from Puppet's usual hieradata/common.yaml file and instead resides in its own file. This is because PyYAML mangles formatting and eats comments: it's easier to just let PyYAML take over formatting of the file.

acme-chief's hieradata (the certificates::acme_chief blocks) is grouped into a maximum of 40 domains per certificate. For more information on why, see acme-chief.
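The grouping itself is simple chunking. As a rough illustration, the sketch below splits a list of domains into groups of at most 40; the cert1, cert2, ... labels are placeholders invented for this example and are not the names used in the hieradata.

```shell
# Reads one domain per line on stdin and prints one line per certificate,
# each holding at most $1 domains (default 40).
# The "certN" labels are placeholders, not real certificate names.
group_domains() {
  awk -v max="${1:-40}" '
    NF {
      if (n % max == 0) {          # current group is full, or first domain
        if (n) printf "\n"
        printf "cert%d:", n / max + 1
      }
      printf " %s", $1
      n++
    }
    END { if (n) printf "\n" }
  '
}

# Usage (domains.txt is a hypothetical file with one domain per line):
#   group_domains < domains.txt
```

With the default cap, 85 domains would end up in three groups of 40, 40 and 5.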

Notes

  1. ncmonitor checks that NS records are correctly set to WMF's; however, a correct configuration in MarkMonitor does not mean that DNS queries are answered appropriately. Some TLDs require valid responses from the nameservers before the delegation registers as valid. In other words, setting up example.de (.de is one such TLD) means that the operations/dns patch must be merged before DNS queries start succeeding.
  2. acme-chief initially issues a snake-oil certificate so as to prevent any servers from failing to start from lack of having any certificate. While a handy feature in general, it's possible that ncredir would pick up the fake certificate and then validity alerts would start firing until the real certificate is issued/synced.
  3. dns and acme-chief changes can be merged without a second pair of eyes (but make sure we don't want to instead ignore the domain first). ncredir changes should get validation from another person.