DNS

This page describes Wikimedia's DNS setup. Wikimedia uses two separate kinds of DNS servers: authoritative nameservers (which respond to queries from third-party nameservers for our domains) and recursive resolvers (which resolve DNS queries when any of our servers needs to look up a name).

Need to make changes to Wikimedia zones? See the HOWTO section of this page.

Authoritative nameservers

Wikimedia has three authoritative-only DNS servers for public service:

  • ns0.wikimedia.org - 208.80.154.238 (currently hosted on authdns1001.wikimedia.org in eqiad)
  • ns1.wikimedia.org - 208.80.153.231 (currently hosted on authdns2001.wikimedia.org in codfw)
  • ns2.wikimedia.org - 91.198.174.239 (currently hosted on ganeti3003.wikimedia.org in esams)

Additionally, the same authdns setup is also running on all of the recursors dns[12345]00[12] for internal service.

The servers run gdnsd with the geoip plugin, which is responsible for geographic DNS.

Zonefiles and other configuration are replicated through the use of git fetch and git merge in a set of update scripts. In case of emergency the servers can be synced from any other as well.

All configuration files can be found in

/etc/gdnsd/

on all three hosts, with a separate conf file used just for syncing the zonefiles and such, in /etc/wikimedia-authdns.conf .

The main gdnsd configuration file is /etc/gdnsd/config. It is generated from files in our operations/dns git repo, as are the zone files.

Domain templates

The zone templates are (regular) files in the operations/dns git repo in the templates/ directory.

Each regular file in this directory corresponds to a zone with the same name. Each symbolic link to a regular file in this directory corresponds to a domain alias. So, in this example:

# ls -l templates/mediawiki*
lrwxrwxrwx    1 root     root           13 Jun 19 15:52 templates/mediawiki.com -> mediawiki.org
lrwxrwxrwx    1 root     root           13 Jun 19 15:52 templates/mediawiki.net -> mediawiki.org
-rw-r--r--    1 root     root         1500 Jun 19 15:12 templates/mediawiki.org

...one zone mediawiki.org is listed, with two alias zones, mediawiki.com and mediawiki.net.
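
The file-versus-symlink convention above can be checked mechanically. The following is only a minimal sketch, not part of the operations/dns tooling: the function name classify_templates is hypothetical, and it assumes that every regular file directly under templates/ is a zone and every symlink is an alias of its target.

```python
import os
import tempfile


def classify_templates(template_dir):
    """Split the entries of a templates/ directory into zones and aliases.

    Regular files are treated as zones; symbolic links are treated as
    aliases of the zone they point to. Subdirectories are ignored.
    """
    zones = []
    aliases = {}
    for name in sorted(os.listdir(template_dir)):
        path = os.path.join(template_dir, name)
        if os.path.islink(path):
            # Record which zone this alias points to.
            aliases[name] = os.path.basename(os.readlink(path))
        elif os.path.isfile(path):
            zones.append(name)
    return zones, aliases


if __name__ == "__main__":
    # Recreate the mediawiki example in a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        open(os.path.join(tmp, "mediawiki.org"), "w").close()
        os.symlink("mediawiki.org", os.path.join(tmp, "mediawiki.com"))
        os.symlink("mediawiki.org", os.path.join(tmp, "mediawiki.net"))
        print(classify_templates(tmp))
```

Run against a clone of the repository, classify_templates("templates") would list the zones and the alias-to-zone mapping under the assumptions stated above.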

The zonefiles generally follow the standard DNS Master File format (aka BIND zone files) specified by section 5 of RFC1035. gdnsd itself and our jinja-based templating add a few other capabilities on top, notably:

Variables and macros

Within the zone template, a few predefined variables and macros can be used; they are substituted when the actual zonefiles are generated from the template. These include:

$INCLUDE filename origin 
Includes another file, which should be located in a subdirectory
@Z 
Is replaced by the actual zone name (FQDN) of the zone (works for symlinks, and also include files)
@F 
Is replaced by the original origin at the start of the file, same as @Z in the case of the main zonefile, but can be different in an include file
{{ serial_num }} 
Causes an SOA serial number to be autogenerated in this place, approximating a datestamp of the last change, derived from git commit history
{{ serial_comment }} 
Emits a text suitable only for a comment line, showing the git hash and the first line of the commit message
{{ langlist(...) }} 
A list of language subdomain CNAMEs, i.e. a list of all language abbreviations for all languages any Wikimedia project has, generated from helpers/langlist.tmpl.

Other generic jinja2 templating constructs can be used as well, e.g.:

{% for i in range(1, 10) %}
asw{{ i }}-pmtpa	1H	IN A		10.0.1.{{ i }}
{%- endfor %}

Note that if the range is (a,b) then the first entry will be for a but the last entry will be for b-1.
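
The half-open behaviour of range can be seen in plain Python, whose range has the same semantics as the one used in the jinja2 loop. This is only an illustrative sketch: the helper name expand_records is hypothetical, and the record format simply copies the example loop above.

```python
def expand_records(start, stop):
    """Mimic the template loop: range(start, stop) includes start, excludes stop."""
    return [
        "asw{0}-pmtpa\t1H\tIN A\t\t10.0.1.{0}".format(i)
        for i in range(start, stop)
    ]


records = expand_records(1, 10)
print(len(records))   # 9 records, for i = 1 .. 9
print(records[0])     # the asw1-pmtpa record
print(records[-1])    # the asw9-pmtpa record; there is no asw10-pmtpa
```

To include the entry for 10 as well, the range would have to be written as range(1, 11).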

authdns-update

/usr/sbin/authdns-update is a simple shell script that automates the invocation of the individual update scripts. It goes through the following steps on each of ns0-2 (note: this list is known to be out of date):

  1. ssh to the host
  2. git pull the templates from the operations/dns repo via authdns-git-pull
  3. generate the zone files from the zone template files via authdns-gen-zones
  4. update the gdnsd config files from the local git repo just updated
  5. sanity checks and reload of the gdnsd daemon

Basically, authdns-update takes care of everything after you've edited and merged the zonefiles.

authdns-local-update

/usr/sbin/authdns-local-update is used on any of the servers for pulling in updates from any other (presumably up to date) dns server. It can be used to bring a server back up to date after e.g. downtime or a software install/update. It is also used in this way by puppet during initial setup.

Geographic DNS

Geographic DNS makes sure that clients end up using the Wikimedia cluster closest to them, by varying DNS responses based on the (country of the) IP address of the querying resolver. This is handled by the gdnsd geoip plugin. The config file is config-geo in the operations/dns repo. Our geoip setup makes use of /usr/share/GeoIP/GeoIPCity.dat (IPv4) and /usr/share/GeoIP/GeoIPv6.dat (IPv6). These are pulled from the volatile directory on the puppet master, which is updated regularly by cron. See the geoip module for more information.

HOWTO

This section briefly explains how to do the most common DNS changes.

Change GeoDNS

For example, when a certain cluster is down/unreachable, and you want to move all traffic to the others.

Edit the config-geo file in the operations/dns repo, commit, and run authdns-update from any of the dns servers.

Changing records in a zonefile

  • This is handled via the git repo operations/dns
  • Edit the template file templates/zonename locally, commit it to git, and run git review (for gerrit review)
  • Merge your change in gerrit, then log in to ns0.wikimedia.org and run sudo authdns-update. This will pull from operations/dns and generate zonefiles and gdnsd configs on each nameserver.
  • This no longer requires you to forward your own key; the systems are set up with their own trusted keys for the sync.
  • Once the script completes, it's a good habit to query all three DNS servers to make sure your change has been correctly deployed
  • for example: for i in 0 1 2 ; do dig @ns${i}.wikimedia.org -t any my-changed-record.wikimedia.org ; done
  • If any auth DNS server failed to respond, restart it with /etc/init.d/gdnsd restart (though this shouldn't happen anymore, unlike before with pdns)
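
The verification loop in the list above can also be scripted. The following is a rough sketch rather than an existing tool: the function names dig_commands and query_all are hypothetical, and it assumes dig is installed on the machine where it runs. It mirrors the shell loop by building one dig invocation per nameserver.

```python
import subprocess

# The three public authoritative nameservers listed on this page.
NAMESERVERS = ["ns0.wikimedia.org", "ns1.wikimedia.org", "ns2.wikimedia.org"]


def dig_commands(record, rtype="any"):
    """Return one dig argument vector per authoritative nameserver."""
    return [["dig", "@" + ns, "-t", rtype, record] for ns in NAMESERVERS]


def query_all(record):
    """Run dig against each nameserver and return {nameserver: exit code}."""
    results = {}
    for ns, argv in zip(NAMESERVERS, dig_commands(record)):
        results[ns] = subprocess.run(argv).returncode
    return results


if __name__ == "__main__":
    # Print the commands only; call query_all() to actually run them.
    for argv in dig_commands("my-changed-record.wikimedia.org"):
        print(" ".join(argv))
```

A non-zero exit code from dig for one nameserver is a hint that this server did not answer and should be looked at.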

Adding a new zone

  1. First, decide if this new zone will use a new, independent zonefile, or will be an alias of another zone
    independent zonefile 
    Create the new zone template in the operations/dns repository as templates/zonename. (Copy an existing, relatively clean zonefile like wiktionary.org to start with).
    zone alias 
    Make a symbolic link templates/aliasname for the alias to the zone being aliased.
  2. git add the file in, commit, and review on gerrit.
  3. Run authdns-update on ns0 (or any nameserver).
  4. Query all three ns servers to verify that your change took correctly.
  5. If any auth DNS server failed to respond, restart it with /etc/init.d/gdnsd restart

Removing a zone

  • git rm the appropriate file, and merge on gerrit.
  • Log in to ns0 and run authdns-update

Adding a new (language) wiki

  1. Add the language code to templates/helpers/langs.tmpl in the operations/dns repo and merge the commit
  2. Normal deployment process will not create the expected results. You must follow the workaround documented at https://phabricator.wikimedia.org/T97051#1994679 for now!

If a certain nameserver is unreachable

When a certain nameserver is unreachable, the others can still be updated from any of the other servers, by running authdns-update there. To skip the unreachable server in the update process, use:

# authdns-update -s "server list"

where server list is a space-separated list of FQDNs. Do not forget the quotes: the script accepts only one argument after -s.

  1. Query all three ns servers to verify that your change took correctly.
  2. If any auth DNS server failed to respond, restart it with /etc/init.d/gdnsd restart
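
The quoting requirement for -s matters when the command is built by a script. The sketch below is an illustration only: the helper name authdns_update_argv is hypothetical, and the FQDNs are placeholders taken from this page. It keeps the whole server list as a single argument, as the script expects.

```python
import shlex


def authdns_update_argv(servers):
    """Build the argument vector with the whole server list as ONE argument."""
    return ["authdns-update", "-s", " ".join(servers)]


argv = authdns_update_argv(["ns0.wikimedia.org", "ns1.wikimedia.org"])

# Render the command the way it would be typed in a shell; the list
# after -s comes out quoted because it contains a space.
print(" ".join(shlex.quote(arg) for arg in argv))
# authdns-update -s 'ns0.wikimedia.org ns1.wikimedia.org'
```

Passing argv directly to subprocess.run (without shell=True) avoids the quoting problem entirely, because the list is never re-split by a shell.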

Linting the zone files

  • Most scripts here expect to be run from the root directory of a clone of the operations/dns repository.
  • All of the python scripts require Python 3.5+ (e.g. stretch python3)

To run the same checks as CI locally:

  • Partial local CI: install the tox python module and run tox -- -n
  • Complete local CI: install tox python module and the latest WMF debian package of gdnsd and run: tox

For more detailed instructions see the utils/README file inside the repository.

For more details on the WMF-specific DNS zone files consistency check validator script for the internal zone files (wmnet and pointer zones), see the docstring at the top of the file.

Know to which DC a specific IP is redirected

bblack@authdns1001:~$ gdnsd_geoip_test
[starts an interactive shell, then input mapname followed by IP like:]
> generic-map 1.2.3.4
generic-map => 1.2.3.4/24 => eqiad, codfw, ulsfo, esams, eqsin

"If you just want to do a single lookup you can put the mapname and IP on the commandline too, but if you're doing a bunch the shell way doesn't have to expensively reload maps every time"

Know which IP the AuthDNS is seeing a query from

For scenarios where the network you're currently connected to is not being redirected to the proper DC, even though both your IP and your resolver's IP map to the proper DC.

Look up reflect.wikimedia.org; it will tell you how the authdns really sees the client or recursor IP.

Update DNS if gerrit or DNS are down (on an emergency only)

There may be times when Git is unavailable and changes are needed in DNS to unbreak it (e.g. for a failover), or when a change cannot wait for Git to come back up.

Important disclaimer: Try to contact someone on SRE/Traffic before trying this, as the workflow is not yet suited for doing manual changes. Also, soon we will have more than 3 authservers, making any manual "ssh to all the authservers and edit files manually" approach a bad idea. Make sure this is the last and only resort before doing it: dependencies between services make this an unsafe operation.

  1. ssh into every authserver (dns and ips are at DNS#Authoritative_nameservers)
  2. edit the zone data directly in /etc/gdnsd/zones to fix the problem
  3. execute gdnsdctl reload-zones
  4. Remember to repeat it for each authserver, so they are all in sync

Make sure to log and declare that nobody should touch the DNS gerrit or authdns-update, until things are made sane again, otherwise things will break (even more).

Recursive Resolvers

pdns_recursor is run on the following nodes:

  • dns[1001-1002] in eqiad
  • dns[2001-2002] in codfw
  • dns[3001-3002] in esams
  • dns[4001-4002] in ulsfo
  • dns[5001-5002] in eqsin

These use the role::dnsbox role.

The recursive cache software used here is PowerDNS recursor (aka pdns-recursor). These machines also host a local authdns instance and site NTP servers.

Statistics are available at: Grafana DNS Recursor Stats

How to Remove a record from the DNS resolver caches

If you have just added or updated a DNS record on the authoritative nameservers, it may still be cached on the (unrelated) DNS resolvers used by our servers. To clear a record from the cache, use:

# rec_control wipe-cache record-name

on all the DNS resolvers. This will also clear any negative cache records. If you need to clear a PTR record, be sure to use the actual record name, e.g.

 # rec_control wipe-cache 122.36.64.10.in-addr.arpa.

(with the trailing '.').
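
The PTR record name for an address can be derived with Python's standard ipaddress module instead of reversing the octets by hand. This is a small sketch; the helper name ptr_record_name is hypothetical, and the address is the one from the example above.

```python
import ipaddress


def ptr_record_name(ip):
    """Return the reverse-DNS record name for an IP, with the trailing dot."""
    return ipaddress.ip_address(ip).reverse_pointer + "."


print(ptr_record_name("10.64.36.122"))
# 122.36.64.10.in-addr.arpa.
```

The same helper works for IPv6 addresses, where the result is the nibble-reversed name under ip6.arpa.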

All of the recursors can be targeted via this cumin command:

 sudo cumin 'A:dns-rec' 'rec_control wipe-cache <FQDN.of.server>'
