Anycast recursive DNS

From Wikitech

In order to improve resiliency of recursive DNS, this setup leverages BGP and anycast.

Task: https://phabricator.wikimedia.org/T186550

CR: https://gerrit.wikimedia.org/r/c/operations/puppet/+/397723/

Limitations of a non-anycast setup

  • Some services don't fail over fast enough to the second server listed in resolv.conf when the first one fails
  • If both servers of a site (or the whole site) fail, servers relying on them will experience an outage
  • LVS/pybal depends on DNS, which creates a chicken-and-egg problem

Configuration

Figure: Anycast rec-dns diagram

Server side

modules/role/manifests/dnsrecursor.pp

include ::profile::bird::anycast

hieradata/role/common/recursor.yaml (global)

profile::bird::advertise_vips:
  recdns.anycast.wmnet:
    address: 10.3.0.1 # VIP to advertise
    check_cmd: '/usr/lib/nagios/plugins/check_dns -H www.wikipedia.org -s 10.3.0.1 -t 1 -c 1'

check_cmd

In this case we re-use an Icinga NRPE check, installed on all the servers:

/usr/lib/nagios/plugins/check_dns -H www.wikipedia.org -s 10.3.0.1 -t 1 -c 1
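
The exit status of check_cmd is what a health checker can act on. The following is a minimal sketch of that decision, assuming the usual Nagios plugin convention (exit status 0 means healthy, anything else means failed). The function name vip_decision is hypothetical and not part of the Puppet profile; it only illustrates the mapping from check result to VIP state.

```shell
# Hypothetical helper: run a check command and print what a health checker
# would be expected to do with the VIP.
# Assumption: exit status 0 means healthy, any other status means failed.
vip_decision() {
    if "$@" >/dev/null 2>&1; then
        echo "advertise"
    else
        echo "withdraw"
    fi
}

# Example invocation (needs the plugin installed and network access):
# vip_decision /usr/lib/nagios/plugins/check_dns -H www.wikipedia.org -s 10.3.0.1 -t 1 -c 1
```

Running the commented example by hand on a recursor is a quick way to see which state the check would currently report.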

Troubleshooting

Find out which server a client is routed to

$ dig @10.3.0.1 CHAOS TXT id.server. +short

Ensure all servers can reach the VIP

bblack@cumin1001:~$ sudo cumin '*' 'dig @10.3.0.1 CHAOS TXT id.server. +short'

Failure tests

Single local recursor failure

bblack@backup2001:~$ while [ 1 ]; do echo ======; date; dig @10.3.0.1 CHAOS TXT id.server. +short; sleep 1; done

"so I can see a result once a second, I've tried stopping just healthchecker, stopping or killing the recursor, etc"

Traffic is rerouted to the remaining working local node within one second.
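
When running a loop like the one above for a long time, it can be easier to log only the moments when the answering server changes. A minimal sketch, assuming one responder id per line on standard input; the function name print_responder_changes is hypothetical:

```shell
# Hypothetical filter: read one responder id per line on stdin and print a
# line only when the id differs from the previous one.
print_responder_changes() {
    prev="__none__"   # sentinel so the very first line is always printed
    while IFS= read -r id; do
        if [ "$id" != "$prev" ]; then
            echo "responder: ${id:-<empty>}"
            prev="$id"
        fi
    done
}

# Example invocation (needs network access to the VIP):
# while true; do dig @10.3.0.1 CHAOS TXT id.server. +short; sleep 1; done | print_responder_changes
```

With this filter, a failover shows up as a single new line instead of a change buried in a stream of identical answers.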

Double local recursor failure

E.g. take down dns2001 and dns2002.

Requests end up on dns1001/dns1002.

Limitations

  • If the DNS recursors have the anycast VIP as the only resolver in resolv.conf, then processes depending on DNS will fail until pdnsd starts
    • The workaround is either to hardcode the real recursor IPs or to have a daemon that removes the VIP loopback

Future evolution

  • Add Icinga monitoring to check that the local recursors work (e.g. an Icinga check on bastX hosts that verifies it is dnsX that replies and not dnsY)
  • Improve Icinga's check_dns
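
The first item could be sketched as a small Nagios-style check. This is only an illustration under stated assumptions: the function name check_local_responder is hypothetical, and it assumes the id.server answer is a hostname whose prefix identifies the site (for example "dns2" for the recursors local to a bast2 host). Exit codes follow the Nagios plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```shell
# Hypothetical Nagios-style check: compare the responder id against an
# expected hostname prefix (e.g. "dns2" for the local site).
check_local_responder() {
    expected_prefix="$1"
    responder="$2"
    # dig +short prints TXT answers wrapped in double quotes; strip them.
    responder=$(printf '%s' "$responder" | tr -d '"')
    if [ -z "$expected_prefix" ] || [ -z "$responder" ]; then
        echo "UNKNOWN: missing prefix or empty responder id"
        return 3
    fi
    case "$responder" in
        "$expected_prefix"*)
            echo "OK: answered by $responder"
            return 0 ;;
        *)
            echo "CRITICAL: answered by $responder, expected $expected_prefix*"
            return 2 ;;
    esac
}

# Example invocation (needs network access to the VIP):
# check_local_responder dns2 "$(dig @10.3.0.1 CHAOS TXT id.server. +short)"
```

Matching on a prefix rather than a full hostname keeps the check valid whichever of the local recursors answers.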