To improve the resiliency of recursive DNS, this setup uses BGP and anycast.



Limitations of a non-anycast setup

  • Some services don't fail over fast enough to the second server listed in resolv.conf when the first one fails
  • If both servers of a site (or the whole site) fail, servers relying on them will experience an outage
  • LVS/pybal depends on DNS, which makes this a chicken-and-egg problem


[Figure: Anycast rec-dns diagram]

Server side

modules/role/manifests/dnsrecursor.pp

include ::profile::bird::anycast

hieradata/role/common/recursor.yaml (global)

profile::bird::advertise_vips:
  recdns.anycast.wmnet:
    address: # VIP to advertise
    check_cmd: '/usr/lib/nagios/plugins/check_dns -H -s -t 1 -c 1'


In this case we re-use an Icinga NRPE check, installed on all the servers:

/usr/lib/nagios/plugins/check_dns -H -s -t 1 -c 1
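
The role of check_cmd can be pictured as a gate on the VIP advertisement. What follows is a minimal sketch and not the actual healthchecker. It assumes the usual Nagios plugin convention (exit status 0 means OK) and that a failing check should lead to the VIP being withdrawn. The function name vip_action is hypothetical.

```shell
# Hypothetical helper: run a health check command and print the action a
# healthchecker would take for the VIP, based only on the exit status.
# Usage: vip_action COMMAND [ARGS...]
vip_action() {
    if "$@" >/dev/null 2>&1; then
        echo advertise   # exit status 0: the check passed
    else
        echo withdraw    # any non-zero status: treat the recursor as down
    fi
}
```

Called as vip_action followed by the check_cmd and its arguments, it prints "advertise" while the recursor answers and "withdraw" otherwise.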


Find out which server a client is routed to

$ dig @ CHAOS TXT id.server. +short

Ensure all servers can reach the VIP

bblack@cumin1001:~$ sudo cumin '*' 'dig @ CHAOS TXT id.server. +short'
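
When many hosts are queried, a tally of the answers makes it easier to spot unexpected responders. The sketch below assumes the id.server answers have been collected one per line, in the quoted form that dig +short prints for TXT records (for example "dns1001"). The helper name tally_servers and that input format are assumptions, not something this setup provides.

```shell
# Hypothetical helper: read id.server answers (one per line) on stdin and
# print "SERVER COUNT" for each distinct responder, sorted by server name.
tally_servers() {
    tr -d '"' |                                   # drop the TXT quoting
        awk 'NF { count[$1]++ }                   # skip empty lines
             END { for (s in count) print s, count[s] }' |
        sort
}
```

An empty line (a host that got no answer) is ignored here, so comparing the total count with the number of hosts queried shows whether any host failed to reach the VIP.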

Failure tests

Single local recursor failure

bblack@backup2001:~$ while [ 1 ]; do echo ======; date; dig @ CHAOS TXT id.server. +short; sleep 1; done

"so I can see a result once a second, I've tried stopping just healthchecker, stopping or killing the recursor, etc"

Traffic is routed to the one remaining working local node within a second.
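
To read such a test run more easily, the samples can be reduced to the moments where the responder changes. This is a sketch under assumptions: it expects one sample per line in the form "TIMESTAMP ANSWER" (for instance an epoch second followed by the dig output, with nothing after the timestamp when the query failed), which is not the format of the loop above. The name report_changes is hypothetical.

```shell
# Hypothetical helper: read "TIMESTAMP ANSWER" lines on stdin and print one
# line each time the responding server changes. A missing answer is shown
# as NONE.
report_changes() {
    awk '{
        server = $2
        gsub(/"/, "", server)              # drop the TXT quoting
        if (server == "") server = "NONE"  # query got no answer
        if (NR == 1 || server != prev) print $1, server
        prev = server
    }'
}
```

A run where failover works as described shows the first node, then directly the second node, with at most a single NONE sample in between.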

Double local recursor failure

E.g. take down dns2001 and dns2002.

Requests end up on dns1001/dns1002.


  • If the DNS recursors have the anycast VIP as the only resolver in resolv.conf, then processes depending on DNS will fail until pdnsd starts
    • The workaround is to either hardcode the real recursors' IPs or have a daemon that removes the VIP from the loopback interface

Future evolution

  • Add Icinga monitoring to check that local recursors work (e.g. an Icinga check on bastX hosts that verifies that dnsX replies and not dnsY)
  • Improve Icinga's check_dns
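
The local-recursor check suggested in the first item could look roughly like the sketch below. It is an illustration only: the function name check_local_recursor, the idea of matching on a host name prefix, and the message texts are assumptions. The exit statuses follow the standard Nagios plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```shell
# Hypothetical Icinga/Nagios-style check: compare the id.server answer with
# the host name prefix expected for the local site.
# Usage: check_local_recursor EXPECTED_PREFIX ANSWER
check_local_recursor() {
    expected=$1
    answer=$(printf '%s' "$2" | tr -d '"')   # dig +short quotes TXT data
    if [ -z "$answer" ]; then
        echo "UNKNOWN: no id.server answer"
        return 3
    fi
    case $answer in
        "$expected"*)
            echo "OK: answered by $answer"
            return 0 ;;
        *)
            echo "CRITICAL: answered by $answer, expected $expected*"
            return 2 ;;
    esac
}
```

The second argument would be the output of the dig query shown earlier, run on the bastion host. With a prefix of dns2, an answer from dns2001 is OK and an answer from dns1002 is CRITICAL.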