Anycast authoritative DNS
Work in progress.
To improve the latency and resilience of our authoritative DNS, this setup uses BGP and anycast.
Tracking task: https://phabricator.wikimedia.org/T98006
Limitation of a non-anycast setup
By definition, GeoDNS can't be used to direct users to their closest nameserver (NS), as we do for websites.
When asked for a record (e.g. fr.wikipedia.org), the .org zone presents all 3 of our NS to the client, which then decides which one to use.
Client-side implementations of that selection are not great [citation needed], so anycast offloads the decision to BGP.
Configuration
Server side
The server side is a regular internal anycast setup.
modules/profile/manifests/dns/auth.pp and modules/profile/manifests/dns/recursor.pp include ::profile::bird::anycast
hieradata/role/common/dnsbox.yaml and hieradata/role/common/dns/auth.yaml
profile::bird::advertise_vips:
  nsa.wikimedia.org:
    address: 198.35.27.27 # VIP to advertise (limited to a /32)
    check_cmd: '/usr/lib/nagios/plugins/check_dns_query -H 198.35.27.27 -a -l -d www.wikipedia.org -t 1'
    ensure: present
    service_type: authdns
Router side
The policy below creates (and thus advertises) the /24 anycast prefix only if the router learns about it locally.
policy-options {
    policy-statement BGP_from_anycast {
        term BGP_local_anycast {
            from {
                protocol bgp;
                as-path local_anycast;
            }
            then accept;
        }
        then reject;
    }
    as-path local_anycast "^64605$";
}
routing-options {
    aggregate {
        route 198.35.27.0/24 policy BGP_from_anycast;
    }
}
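The effect of the policy above can be sketched in Python. This is only a simplified model, not how Junos evaluates policies: the route dictionaries and their field names are hypothetical, the AS path is represented as a plain string, and the remote AS number 65001 is made up for illustration. It shows that the /24 aggregate is active only while at least one more-specific BGP route with the local anycast AS path (64605) is present.

```python
import ipaddress
import re

# Aggregate prefix taken from the routing-options stanza.
AGGREGATE = ipaddress.ip_network("198.35.27.0/24")
# Simplified stand-in for: as-path local_anycast "^64605$"
LOCAL_ANYCAST = re.compile(r"64605")


def policy_accepts(route):
    """Mimic BGP_from_anycast: accept BGP routes whose AS path is exactly 64605."""
    return (
        route["protocol"] == "bgp"
        and LOCAL_ANYCAST.fullmatch(route["as_path"]) is not None
    )


def is_contributor(route):
    """A contributing route is a more-specific prefix accepted by the policy."""
    prefix = ipaddress.ip_network(route["prefix"])
    return (
        prefix.prefixlen > AGGREGATE.prefixlen
        and prefix.subnet_of(AGGREGATE)
        and policy_accepts(route)
    )


def aggregate_is_active(routes):
    """The aggregate exists (and can be advertised) only with a contributor."""
    return any(is_contributor(r) for r in routes)


# Hypothetical routes: one learned from a local server, one from elsewhere.
local_vip = {"prefix": "198.35.27.27/32", "protocol": "bgp", "as_path": "64605"}
remote_vip = {"prefix": "198.35.27.27/32", "protocol": "bgp", "as_path": "65001 64605"}

print(aggregate_is_active([local_vip]))   # local server healthy
print(aggregate_is_active([remote_vip]))  # only a non-local path is known
print(aggregate_is_active([]))            # all local servers withdrawn
```

In this model, withdrawing the local /32 is enough to deactivate the aggregate, which is the behaviour the failure test on this page relies on.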
Troubleshooting
Know which server a client is routed to
$ dig +nsid @nsa.wikimedia.org en.wikipedia.org A | grep NSID
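For scripted checks, the same lookup can be wrapped in a small Python helper. This is a minimal sketch that assumes dig is installed and that it prints the NSID as a line of hex bytes followed by a quoted string, as recent versions do. The function names and the sample output are hypothetical.

```python
import re
import subprocess

# Assumed dig format: '; NSID: 64 6e 73 31 ("dns1")'
NSID_LINE = re.compile(r"^; NSID:\s*(?P<hex>(?:[0-9a-fA-F]{2}\s*)+)")


def parse_nsid(dig_output):
    """Return the NSID found in dig output as text, or None if absent."""
    for line in dig_output.splitlines():
        match = NSID_LINE.match(line)
        if match:
            hex_bytes = "".join(match.group("hex").split())
            return bytes.fromhex(hex_bytes).decode("ascii", errors="replace")
    return None


def query_nsid(server, name):
    """Run dig +nsid against a server and return the NSID it reports."""
    result = subprocess.run(
        ["dig", "+nsid", f"@{server}", name, "A"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_nsid(result.stdout)


# Made-up sample output, used here instead of a live query.
sample = '; EDNS: version: 0, flags:; udp: 1232\n; NSID: 64 6e 73 31 ("dns1")\n'
print(parse_nsid(sample))
```

Calling query_nsid("nsa.wikimedia.org", "en.wikipedia.org") from several vantage points would then show which server each one is routed to.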
Failure tests
Total local AuthDNS failure
- Stop gdnsd on all ulsfo servers
- The anycast prefix stops being advertised to the routers
- The routers don't have any contributing routes to the less specific prefix
- The routers stop advertising the prefix to their peers
- Start gdnsd again
- The prefixes are re-advertised
Limitations
- L3 header LB: ICMP "packet too big" messages sent by routers along the path will not consistently be routed to the correct server
- Non-consistent hashing: if a routing change on the Internet causes a TCP packet to arrive through a different router, that router will not consistently route it to the proper server
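Both limitations can be illustrated with a toy hash-based load balancer. This is only a sketch: real routers use vendor-specific hash functions, and the server names, addresses, and per-router seed below are hypothetical. It shows that the server choice depends on the packet's source address (so an ICMP message from an intermediate router may land elsewhere) and on which router does the hashing.

```python
import hashlib

VIP = "198.35.27.27"
# Hypothetical backend names behind the anycast VIP.
SERVERS = ["server-a", "server-b", "server-c"]


def pick_server(src_ip, dst_ip, servers, router_seed=""):
    """Pick a server from a hash of L3 header fields, as a toy ECMP model."""
    key = f"{router_seed}|{src_ip}|{dst_ip}".encode()
    digest = hashlib.sha256(key).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


client = "192.0.2.10"          # documentation address standing in for a client
path_router = "203.0.113.1"    # intermediate router that emits the ICMP message

# The client's TCP flow and the ICMP "packet too big" message have different
# source addresses, so nothing guarantees they reach the same server.
print(pick_server(client, VIP, SERVERS))
print(pick_server(path_router, VIP, SERVERS))

# Two routers hashing independently may disagree on the same flow.
print(pick_server(client, VIP, SERVERS, router_seed="router-1"))
print(pick_server(client, VIP, SERVERS, router_seed="router-2"))
```

The outputs depend on the hash values, so a given pair may or may not match. The point is that no mechanism forces them to.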