Wikimedia DNS/Administration

Adding a new Wikidough host

This section describes the steps for setting up a new Wikidough host. Read it from start to end at least once before you begin, so that you understand how the steps fit together.

1. Puppet Role

  • Start by looking in operations/puppet: manifests/site.pp to check the existing Wikidough hosts:
# Wikidough (T252132)
node /^(doh[123456]00[12])\.wikimedia\.org$/ {
   role(wikidough)
}
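  • If you want to deploy in eqiad, the new hostname will be doh1003 (based on the above). Note that the regex above only matches host numbers 001 and 002, so it must be extended to cover the new host.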
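A quick way to see which hostnames the node regex quoted above covers is to test candidate FQDNs against it locally. The following is a minimal sketch, assuming the regex behaves the same under grep's extended syntax as it does in Puppet (true for this simple pattern); matches_node_regex is a hypothetical helper and not part of operations/puppet.

```shell
#!/usr/bin/env bash
# Node regex copied from the snippet quoted above.
NODE_REGEX='^(doh[123456]00[12])\.wikimedia\.org$'

# Hypothetical helper: returns 0 if the FQDN matches the node regex, 1 otherwise.
matches_node_regex() {
    printf '%s\n' "$1" | grep -Eq "$NODE_REGEX"
}

# Report which candidate hosts would receive the wikidough role.
for fqdn in doh1001.wikimedia.org doh1002.wikimedia.org doh1003.wikimedia.org; do
    if matches_node_regex "$fqdn"; then
        echo "$fqdn: matched"
    else
        echo "$fqdn: NOT matched"
    fi
done
```

With the regex as quoted, doh1003.wikimedia.org is reported as not matched, which is a reminder to update the node definition when adding a third host to a site.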

We will use the hostname doh1003FIXME going forward for the rest of the documentation (the "FIXME" helps prevent erroneous copy-pastes).

2. acme-chief

acme-chief is used to issue the TLS certificates for the Wikidough hosts. When you add a new host, you have to add the hostname to acme-chief's config.

  • In operations/puppet, under the wikidough section in hieradata/role/common/acme_chief.yaml, update the regex under authorized_regexes to add the new hosts.
  • Submit and merge the patch.

3. Ganeti VM

The next step is to create a Ganeti VM.

Specifications:

Hostname: doh1003FIXME (eqiad)
vCPUs: 2
Memory: 8G
Disk: 15G
Network: Public
  • Add the tags Traffic and SRE.
  • Create the task.

Now create the VM:

sudo cookbook sre.ganeti.makevm --vcpus 2 --memory 8 --disk 15 --network public eqiad_A doh1003FIXME
  • If you are deploying multiple VMs, make sure to spread them across different rows.
    • To know the rows of the existing hosts, run: sudo gnt-node list -o name,group from a Ganeti master node.
  • When the cookbook finishes, note the MAC address. If you missed that in the output, run: sudo gnt-instance show doh1003FIXME.wikimedia.org | grep -A 2 NIC
  • In operations/puppet, edit: modules/install_server/files/dhcpd/linux-host-entries.ttyS0-115200 and add the MAC address from above.
  • Submit and merge the patch, then finalize the change from the cumin host:
   sudo cumin A:installserver 'run-puppet-agent'
  • Ensure you have a working VM with the Wikidough role applied (such as having set the boot order to disk, signed Puppet certs, etc.) before proceeding with the next steps.
    • All checks being green in Icinga and no alerts in #wikimedia-operations are a good sign that everything is working as intended.
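For the MAC address step above, the address can be pulled out of the command output rather than copied by hand. This is a minimal sketch: extract_mac is a hypothetical helper that prints the first MAC-address-shaped token it finds on stdin. It makes no assumption about the exact layout of the gnt-instance output, so treat it as a heuristic and compare the result against the cookbook output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first token on stdin that looks like a
# MAC address (six colon-separated pairs of hex digits).
# -w restricts matches to whole words so that a longer colon-separated
# hex string is not partially matched.
extract_mac() {
    grep -Eiow '([0-9a-f]{2}:){5}[0-9a-f]{2}' | head -n 1
}

# Usage, on a Ganeti master node (same command as in the step above):
#   sudo gnt-instance show doh1003FIXME.wikimedia.org | grep -A 2 NIC | extract_mac
```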

Possible Issues

  • acme_chief errors and/or failures of dnsdist.service can be resolved by doing two consecutive Puppet agent runs so that the TLS certs are fetched and made available to the new host.

4. Homer and Anycast

  • Copy the IP of the doh1003FIXME host (not the VIP, so nothing in 185.71.138.0/24) and add it to config/sites.yaml under the relevant data center.
   sudo run-puppet-agent
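Because the address added to config/sites.yaml must be the host IP and not the VIP, it can help to sanity-check the value first. The sketch below assumes a well-formed dotted-quad IPv4 address; is_doh_vip is a hypothetical helper, and 192.0.2.10 is a documentation placeholder rather than a real host address.

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns 0 if the IPv4 address lies in 185.71.138.0/24
# (the VIP range mentioned above), 1 otherwise. For a /24, comparing the
# first three octets is sufficient.
is_doh_vip() {
    case "$1" in
        185.71.138.*) return 0 ;;
        *) return 1 ;;
    esac
}

host_ip="192.0.2.10"  # placeholder: replace with the IP of doh1003FIXME
if is_doh_vip "$host_ip"; then
    echo "$host_ip is in the VIP range; use the host IP instead"
else
    echo "$host_ip is outside the VIP range"
fi
```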

To finalize the homer changes (continuing on cumin), run the following, replacing "eqiad" in "cr*-eqiad*" and in the commit message with the name of the relevant data center:

   homer "cr*-eqiad*" commit "Gerrit <REPLACE WITH HOMER CHANGE ID FROM ABOVE>: Set up BGP peering to doh1003FIXME in eqiad, triggering DoH /24 announcement there."

If you are running this in ulsfo, the above command will be:

   homer "cr*-ulsfo*" commit "Gerrit <REPLACE WITH HOMER CHANGE ID FROM ABOVE>: Set up BGP peering to doh1003FIXME in ulsfo, triggering DoH /24 announcement there."
  • Also log the above message in #wikimedia-operations for transparency.
  • Review the output. Make sure that it matches the host IP you just added. Type "yes" to commit.
    • You will need to type "yes" repeatedly, once for each core router.
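Since the homer command differs between data centers only in the site name, the host, and the change ID, it can be generated instead of edited by hand. The following is a minimal sketch; homer_commit_cmd is a hypothetical helper that only prints the command for review and does not run homer itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (but do not execute) the homer commit command
# for a given site, host, and Gerrit change ID, using the message template
# shown above.
homer_commit_cmd() {
    local site="$1" host="$2" change_id="$3"
    printf 'homer "cr*-%s*" commit "Gerrit %s: Set up BGP peering to %s in %s, triggering DoH /24 announcement there."\n' \
        "$site" "$change_id" "$host" "$site"
}

# Example: print the eqiad command, keeping the change ID as a placeholder.
homer_commit_cmd eqiad doh1003FIXME '<REPLACE WITH HOMER CHANGE ID FROM ABOVE>'
```

The printed line can be reviewed, pasted into the cumin session, and reused as the message logged in #wikimedia-operations.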

If everything goes well, you should have a working Wikidough host.

Post-Setup Notes

  • Add the new host to Wikidough's integration test, knead-wikidough.
    • Clone the knead-wikidough repository. In tests/test_dns.py, add the new host IP to DOUGH_HOSTS.
    • Commit the change.
    • knead-wikidough checks the DoH and DoT settings against the new host; a successful CI check indicates that the new host is working as intended.

Restarting services

Multiple services make up the Wikimedia DNS stack. systemd ordering should automatically handle dependencies when restarting any of the services, but we still need to disable Puppet, initiate Icinga downtime, etc. To that end, we use a cookbook (sre.dns.roll-restart-wikimedia-dns) to restart the whole stack cleanly. The cookbook requires no custom parameters, so run it like any other cookbook.