Anycast recursive DNS
To improve the resiliency of recursive DNS, this setup uses BGP and anycast.
Limitations of a non-anycast setup
- Some services don't fail over fast enough to the second server listed in resolv.conf when the first one fails
- If both servers of a site (or the whole site) fail, servers relying on them will experience an outage
- LVS/pybal depends on DNS, which makes it a chicken-and-egg problem
profile::bird::advertise_vips:
  recdns.anycast.wmnet:
    address: 10.3.0.1 # VIP to advertise (limited to a /32)
    check_cmd: '/usr/lib/nagios/plugins/check_dns_query -H 10.3.0.1 -l -d www.wikipedia.org -t 1'
    service_type: recdns
In this case we re-use an Icinga NRPE check, installed on all the servers:
/usr/lib/nagios/plugins/check_dns_query -H 10.3.0.1 -l -d www.wikipedia.org -t 1
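How a check command can drive the advertisement of the VIP can be sketched as follows. This is a minimal illustration, not the healthchecker actually deployed alongside Bird: the function name `vip_action` and the "advertise"/"withdraw" labels are hypothetical, and the only assumption about the check is the usual plugin convention that exit status 0 means OK.

```python
import shlex
import subprocess


def vip_action(check_cmd: str, timeout: float = 5.0) -> str:
    """Return 'advertise' if the check exits 0, otherwise 'withdraw'.

    Hypothetical helper: a real healthchecker would also have to
    reconfigure the BGP daemon, which is out of scope for this sketch.
    """
    try:
        result = subprocess.run(
            shlex.split(check_cmd),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # A check that cannot be executed, or that hangs, counts as a failure.
        return "withdraw"
    return "advertise" if result.returncode == 0 else "withdraw"
```

On a recursor, this would be called with the check command shown above, e.g. `vip_action("/usr/lib/nagios/plugins/check_dns_query -H 10.3.0.1 -l -d www.wikipedia.org -t 1")`. Treating a hung or missing check as a failure is a design choice of the sketch: it prefers withdrawing the VIP over advertising a server whose health is unknown.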
Find out which server a client is routed to
$ dig @10.3.0.1 CHAOS TXT id.server. +short
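For scripted use, the same query can be wrapped in a few lines of Python. This is a sketch under assumptions: the helper names (`id_server_cmd`, `parse_id_server`, `which_server`) are made up for illustration, `dig` must be installed, and the parser relies on `dig +short` printing the TXT answer as a quoted string while diagnostics start with `;`.

```python
import subprocess
from typing import List, Optional


def id_server_cmd(vip: str = "10.3.0.1") -> List[str]:
    """Build the dig command line used above."""
    return ["dig", f"@{vip}", "CHAOS", "TXT", "id.server.", "+short"]


def parse_id_server(output: str) -> Optional[str]:
    """Extract the responder name from `dig +short` output, or None."""
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and dig diagnostics such as ";; connection timed out".
        if line and not line.startswith(";"):
            return line.strip('"')
    return None


def which_server(vip: str = "10.3.0.1") -> Optional[str]:
    """Run dig against the VIP and return the answering recursor, if any."""
    result = subprocess.run(
        id_server_cmd(vip), capture_output=True, text=True, timeout=10
    )
    return parse_id_server(result.stdout)
```

`which_server()` returns `None` when no recursor answers, which makes it easy to tell "no answer" apart from "answered by an unexpected server".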
Ensure all servers can reach the VIP
bblack@cumin1001:~$ sudo cumin '*' 'dig @10.3.0.1 CHAOS TXT id.server. +short'
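With a fleet-wide run, it helps to summarize who answered whom. The sketch below assumes the per-host answers have already been collected into a dictionary; it does not parse cumin's own output format, and the function name `summarize_responders` is hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def summarize_responders(answers: Dict[str, str]) -> Tuple[Counter, List[str]]:
    """Count clients per answering recursor and list clients with no answer.

    `answers` maps a client hostname to the raw `dig +short` output it got.
    """
    per_recursor: Counter = Counter()
    unreachable: List[str] = []
    for host, raw in sorted(answers.items()):
        answer = raw.strip().strip('"')
        # Empty output or a dig diagnostic (";; ...") means the VIP was not reached.
        if not answer or answer.startswith(";"):
            unreachable.append(host)
        else:
            per_recursor[answer] += 1
    return per_recursor, unreachable
```

Any host that ends up in the second list could not reach the VIP and is worth investigating first.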
Single local recursor failure
bblack@backup2001:~$ while [ 1 ]; do echo ======; date; dig @10.3.0.1 CHAOS TXT id.server. +short; sleep 1; done
"so I can see a result once a second, I've tried stopping just healthchecker, stopping or killing the recursor, etc"
Traffic is routed to the remaining working local node within a second.
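To put a number on the failover, the once-per-second samples from such a loop can be post-processed. The following is a sketch with assumptions: samples are given as (timestamp, responder) pairs with `None` for a failed query, and the function name `failover_report` is hypothetical. With one-second sampling, the measured outage is only accurate to about a second.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, Optional[str]]  # (unix timestamp, responder or None)
Switch = Tuple[float, str, str]       # (timestamp, old responder, new responder)


def failover_report(samples: List[Sample]) -> Tuple[List[Switch], float]:
    """Return the responder switches and the longest run of failed samples (seconds)."""
    switches: List[Switch] = []
    longest_outage = 0.0
    previous: Optional[str] = None
    outage_start: Optional[float] = None
    last_ts: Optional[float] = None
    for ts, responder in samples:
        last_ts = ts
        if responder is None:
            # Remember when the first failed sample of an outage was seen.
            if outage_start is None:
                outage_start = ts
            continue
        if outage_start is not None:
            longest_outage = max(longest_outage, ts - outage_start)
            outage_start = None
        if previous is not None and responder != previous:
            switches.append((ts, previous, responder))
        previous = responder
    # An outage still open at the end is measured up to the last sample.
    if outage_start is not None and last_ts is not None:
        longest_outage = max(longest_outage, last_ts - outage_start)
    return switches, longest_outage
```

A clean failover shows up as one switch and an outage of 0.0, meaning no sample failed between the old and the new responder.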
Double local recursor failure
E.g. take down dns2001/dns2002.
Requests end up on dns1001/dns1002.
- If the DNS recursors have the anycast VIP as their only resolver in resolv.conf, processes that depend on DNS will fail until pdnsd starts, because they will try to connect to the local recdns service instead of being routed to the closest server.
- The workaround is to either hardcode the real recursor IPs or run a daemon that removes the VIP from the loopback interface
- Add Icinga monitoring to check that the local recursors work (e.g. an Icinga check on bastX hosts that verifies that dnsX replies, and not dnsY)
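The monitoring idea in the last item could look roughly like the sketch below. It is not an existing check: the function name `check_local_recursor` is hypothetical, and matching on a hostname prefix (for example "dns2" for dns2001/dns2002) is an assumption about how "local" would be decided. The exit codes follow the standard plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```python
from typing import Optional, Tuple

OK, CRITICAL, UNKNOWN = 0, 2, 3  # standard Nagios/Icinga plugin exit codes


def check_local_recursor(
    responder: Optional[str], expected_prefix: str
) -> Tuple[int, str]:
    """Compare the id.server answer with the expected local recursor prefix."""
    if not responder:
        # No answer at all: the VIP itself is unreachable from this host.
        return UNKNOWN, "UNKNOWN: no id.server answer from the VIP"
    if responder.startswith(expected_prefix):
        return OK, f"OK: answered by {responder}"
    return (
        CRITICAL,
        f"CRITICAL: answered by {responder}, expected {expected_prefix}*",
    )
```

A wrapper would pass in the output of the `id.server` query, print the message, and exit with the returned code. Whether a missing answer should be UNKNOWN or CRITICAL is a policy choice; the sketch picks UNKNOWN so that it can be told apart from an answer by a remote recursor.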