Portal:Cloud VPS/Admin/Runbooks/Designate record leaks
Overview
The designate-sink service listens on RabbitMQ for server creation and deletion events. When a server is created it attempts to create the corresponding DNS records; when a server is deleted it attempts to delete those records.
Because designate-sink isn't tightly coupled to nova or neutron, an IP address can be allocated or released without the corresponding DNS update. When this happens, a stale DNS record is left behind indefinitely.
Error / Incident
Ideally you are here after responding to an alert about leaked DNS records. If the leaked records were not detected, other symptoms may appear:
- nova-fullstack tests fail with messages about DNS timeouts
- various complex VM orchestrations (e.g. magnum clusters) fail in the middle without good explanation
- (rarely) traffic intended for one VM is routed to a different one.
Debugging
The quickest way to find the leaked records is to inspect the systemd journal for prometheus-node-textfile-wmcs-dnsleaks.service on cloudcontrol1005:
cloudcontrol1005:~$ sudo journalctl -u prometheus-node-textfile-wmcs-dnsleaks.service -e
The same script that triggers the alerts on cloudservices nodes can also be run by hand on a cloudcontrol to list the problematic records (it takes a few minutes to check all the zones):
andrew@cloudcontrol2001-dev:~$ sudo wmcs-dnsleaks
checking zone: 16.172.in-addr.arpa.
Found 2 ptr recordsets for the same VM: canary2002-dev-2.cloudvirt-canary.codfw1dev.wikimedia.cloud. ['38.128.16.172.in-addr.arpa.', '170.128.16.172.in-addr.arpa.']
Found 2 ptr recordsets for the same VM: canary2003-dev-2.cloudvirt-canary.codfw1dev.wikimedia.cloud. ['134.128.16.172.in-addr.arpa.', '131.128.16.172.in-addr.arpa.']
checking zone: codfw1dev.wikimedia.cloud.
skipping public zone: codfw1dev.wmcloud.org.
skipping public zone: cloudinfra-codfw1dev.codfw1dev.wmcloud.org.
checking zone: 57.15.185.in-addr.arpa.
checking zone: 0-29.57.15.185.in-addr.arpa.
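The "Found 2 ptr recordsets for the same VM" lines above report a hostname that is the target of more than one PTR recordset. The following is a minimal Python sketch of that kind of check, not the actual wmcs-dnsleaks implementation. The function name find_duplicate_ptrs and the (ptr_name, target) input shape are assumptions made for illustration, and the third sample record is invented.

```python
from collections import defaultdict


def find_duplicate_ptrs(ptr_records):
    """Group PTR recordset names by the hostname they point to.

    ptr_records: iterable of (ptr_name, target_fqdn) pairs.
    Returns {target_fqdn: [ptr_name, ...]} for every target that has
    more than one PTR recordset, which suggests a leaked record.
    """
    by_target = defaultdict(list)
    for ptr_name, target in ptr_records:
        by_target[target].append(ptr_name)
    return {t: names for t, names in by_target.items() if len(names) > 1}


# The first two records mirror the sample output above; the third is a
# hypothetical healthy VM with a single PTR recordset.
records = [
    ("38.128.16.172.in-addr.arpa.",
     "canary2002-dev-2.cloudvirt-canary.codfw1dev.wikimedia.cloud."),
    ("170.128.16.172.in-addr.arpa.",
     "canary2002-dev-2.cloudvirt-canary.codfw1dev.wikimedia.cloud."),
    ("5.128.16.172.in-addr.arpa.",
     "example-vm.example-project.codfw1dev.wikimedia.cloud."),
]

for target, names in find_duplicate_ptrs(records).items():
    print(f"Found {len(names)} ptr recordsets for the same VM: {target} {names}")
```

A duplicate only shows that at least one of the recordsets is probably stale; it does not say which one, so the VM's current address still has to be checked.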
That same tool can be used to automatically clean up most kinds of bad records:
andrew@cloudcontrol2001-dev:~$ sudo wmcs-dnsleaks --delete
It's best to review the list of bad records BEFORE using --delete; setups change and what looks like a leak in 2024 might be an intentional, vital service record in 2025.
Some edge cases are too delicate to be deleted automatically. To clean up those records you will first need to determine the correct setup (typically by exploring with openstack server show or horizon) and then adjust the records accordingly using the openstack command line (openstack zone list, openstack recordset show, etc.).
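When comparing PTR recordsets against the addresses reported by openstack server show, it can help to turn the reverse names back into IP addresses. The following Python sketch does that and flags the PTR names that do not match any address the VM currently holds. The helper names ptr_name_to_ip and stale_ptr_names are hypothetical, the "live" address in the example is an assumption for illustration only, and only plain four-octet IPv4 reverse names are handled (names in classless zones such as 0-29.57.15.185.in-addr.arpa. are rejected).

```python
import ipaddress

REVERSE_SUFFIX = ".in-addr.arpa"


def ptr_name_to_ip(ptr_name):
    """Convert e.g. '38.128.16.172.in-addr.arpa.' to IPv4Address('172.16.128.38')."""
    name = ptr_name.rstrip(".")
    if not name.endswith(REVERSE_SUFFIX):
        raise ValueError(f"not an IPv4 reverse name: {ptr_name!r}")
    octets = name[: -len(REVERSE_SUFFIX)].split(".")
    if len(octets) != 4:
        raise ValueError(f"expected four octets in: {ptr_name!r}")
    # Reverse names list the octets least-significant first.
    return ipaddress.IPv4Address(".".join(reversed(octets)))


def stale_ptr_names(ptr_names, live_ips):
    """Return the PTR names whose address is not among the VM's current IPs."""
    live = {ipaddress.IPv4Address(ip) for ip in live_ips}
    return [name for name in ptr_names if ptr_name_to_ip(name) not in live]


# PTR names taken from the sample output earlier on this page. The live
# address is an assumption: in practice, read it from `openstack server show`.
ptrs = ["38.128.16.172.in-addr.arpa.", "170.128.16.172.in-addr.arpa."]
assumed_live_ips = ["172.16.128.170"]

for name in stale_ptr_names(ptrs, assumed_live_ips):
    print(f"candidate stale record: {name} ({ptr_name_to_ip(name)})")
```

This only narrows down candidates; confirm each one against the current setup before deleting anything.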
Be warned that some records are in 'noauth-project'. This project does not exist in keystone, but its zones can be accessed in Designate by passing '--sudo-project-id noauth-project':
root@cloudcontrol1005:~# openstack zone list --sudo-project-id noauth-project
+--------------------------------------+---------------------------------+---------+------------+--------+--------+
| id | name | type | serial | status | action |
+--------------------------------------+---------------------------------+---------+------------+--------+--------+
| 114f1333-c2c1-44d3-beb4-ebed1a91742b | eqiad.wmflabs. | PRIMARY | 1704253940 | ACTIVE | NONE |
| 8d114f3c-815b-466c-bdd4-9b91f704ea60 | 68.10.in-addr.arpa. | PRIMARY | 1695748792 | ACTIVE | NONE |
| d19aff2d-2d57-4d25-9e26-dab2b3a58be4 | svc.eqiad.wmflabs. | PRIMARY | 1695748815 | ACTIVE | NONE |
| df88fcb3-fbc2-42f1-bb12-2424c8b7117e | db.svc.eqiad.wmflabs. | PRIMARY | 1695748815 | ACTIVE | NONE |
| e81ea3bb-f8f0-49f4-906b-2d4c2e83cc7e | web.db.svc.eqiad.wmflabs. | PRIMARY | 1701181110 | ACTIVE | NONE |
| 04c45c1f-214d-450b-a733-028dcdc87a12 | analytics.db.svc.eqiad.wmflabs. | PRIMARY | 1701181101 | ACTIVE | NONE |
| 6990e139-49e6-466c-9421-46cf45f05842 | 16.172.in-addr.arpa. | PRIMARY | 1704413968 | ACTIVE | NONE |
+--------------------------------------+---------------------------------+---------+------------+--------+--------+
Related information
Support contacts
Communication and support
Support and administration of the WMCS resources is provided by the Wikimedia Foundation Cloud Services team and Wikimedia movement volunteers. Please reach out with questions and join the conversation:
- Chat in real time in the IRC channel #wikimedia-cloud or the bridged Telegram group
- Discuss via email after you have subscribed to the cloud@ mailing list
- Subscribe to the cloud-announce@ mailing list (all messages are also mirrored to the cloud@ list)
- Read the News wiki page
- Use a subproject of the #Cloud-Services Phabricator project to track confirmed bug reports and feature requests about the Cloud Services infrastructure itself
- Read the Cloud Services Blog (for the broader Wikimedia movement, see the Wikimedia Technical Blog)
Old incidents
Add your incident here: