Anycast recursive DNS
To improve the resiliency of recursive DNS, this setup uses BGP and anycast.
- 1 Limitation of a non-anycast setup
- 2 Configuration
- 3 Troubleshooting
- 4 Failure tests
- 5 Limitations
- 6 Future evolution
Limitation of a non-anycast setup
- Some services don't fail over fast enough to the second server listed in resolv.conf when the first one fails
- If both servers of a site (or the whole site) fail, servers relying on them will experience an outage
- LVS/pybal depends on DNS, which makes it a chicken-and-egg problem
Configuration
profile::bird::advertise_vips:
  recdns.anycast.wmnet:
    address: 10.3.0.1 # VIP to advertise
    check_cmd: '/usr/lib/nagios/plugins/check_dns -H www.wikipedia.org -s 10.3.0.1 -t 1 -c 1'
In this case we re-use an Icinga NRPE check, installed on all the servers:
/usr/lib/nagios/plugins/check_dns -H www.wikipedia.org -s 10.3.0.1 -t 1 -c 1
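The check_cmd above follows the Nagios plugin convention, where exit status 0 means OK. As a rough sketch of how such a command could gate the advertisement of the VIP (this is not the actual health checker code; the function name vip_action and the rule "anything non-zero withdraws the VIP" are assumptions made for illustration):

```shell
#!/bin/sh
# Hypothetical helper: run a check command and print the action a
# health checker could take for the VIP.
# Assumption: exit status 0 means healthy, anything else means the
# VIP should no longer be advertised.
vip_action() {
    if "$@" >/dev/null 2>&1; then
        echo advertise
    else
        echo withdraw
    fi
}

# Usage with the check from the configuration above:
# vip_action /usr/lib/nagios/plugins/check_dns -H www.wikipedia.org -s 10.3.0.1 -t 1 -c 1
```

Running the helper by hand on a recursor is a quick way to see what the check would report before looking at BGP state.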
Troubleshooting
Know which server a client is redirected to
$ dig @10.3.0.1 CHAOS TXT id.server. +short
Ensure all servers can reach the VIP
bblack@cumin1001:~$ sudo cumin '*' 'dig @10.3.0.1 CHAOS TXT id.server. +short'
Failure tests
Single local recursor failure
bblack@backup2001:~$ while [ 1 ]; do echo ======; date; dig @10.3.0.1 CHAOS TXT id.server. +short; sleep 1; done
"so I can see a result once a second, I've tried stopping just healthchecker, stopping or killing the recursor, etc"
Traffic is rerouted to the remaining working local node within a second.
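When running such a test, the once-per-second output can be reduced to just the moments where the responder changes. The filter below is a minimal sketch of that idea; the function name responder_changes is hypothetical, and it assumes one answer per line on standard input, with an empty line standing for "no answer".

```shell
#!/bin/sh
# Hypothetical filter: read one id.server answer per line and print
# only the samples where the answer differs from the previous one.
# The number printed is the sample index, so with one sample per
# second it approximates the seconds elapsed since the loop started.
responder_changes() {
    n=0
    prev=
    while IFS= read -r cur; do
        n=$((n + 1))
        if [ "$n" -eq 1 ] || [ "$cur" != "$prev" ]; then
            printf '%s: %s\n' "$n" "${cur:-(no answer)}"
        fi
        prev=$cur
    done
}

# Usage sketch: always emit exactly one line per sample, even when
# dig prints nothing, so that sample numbers stay meaningful.
# while true; do
#     printf '%s\n' "$(dig @10.3.0.1 CHAOS TXT id.server. +short | head -n 1)"
#     sleep 1
# done | responder_changes
```

Two consecutive lines with different server names then bracket the failover, which makes it easier to confirm the "within a second" observation than scrolling through the raw loop output.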
Double local recursor failure
E.g. take down dns2001/dns2002.
Requests end up on dns1001/dns1002.
Limitations
- If the DNS recursors have the anycast VIP as their only resolver in resolv.conf, then processes depending on DNS will fail until pdnsd starts
- The workaround is to either hardcode the real recursors' IPs or have a daemon that removes the VIP loopback
Future evolution
- Add Icinga monitoring to check that local recursors work (e.g. an Icinga check on bastX hosts that checks it's dnsX that replies and not dnsY)
- Improve Icinga's
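The local-recursor check suggested in the first item could look roughly like the sketch below. It is only an illustration and rests on several assumptions: that the id.server answer contains the recursor's hostname, that the first digit in a hostname identifies the site (as in dns1001 versus dns2001), and that the usual Nagios exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN) apply. The function names site_digit and check_local_responder are hypothetical, as is the bast2001 hostname in the usage line.

```shell
#!/bin/sh
# Hypothetical helper: print the first digit found in a hostname,
# or nothing if the name contains no digit.
site_digit() {
    printf '%s\n' "$1" | sed -n 's/^[^0-9]*\([0-9]\).*/\1/p'
}

# Hypothetical Nagios-style check.
#   $1: name of the host running the check
#   $2: id.server answer received through the VIP
check_local_responder() {
    want=$(site_digit "$1")
    got=$(site_digit "$2")
    if [ -z "$want" ] || [ -z "$got" ]; then
        echo "UNKNOWN: cannot extract a site digit"
        return 3
    fi
    if [ "$want" = "$got" ]; then
        echo "OK: answered by local recursor $2"
        return 0
    fi
    echo "CRITICAL: answered by non-local recursor $2"
    return 2
}

# Usage sketch:
# check_local_responder bast2001 "$(dig @10.3.0.1 CHAOS TXT id.server. +short | head -n 1)"
```

A check like this would turn CRITICAL during the double local recursor failure test, which is the situation it is meant to surface.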