
Portal:Cloud VPS/Admin/Runbooks/Designate record leaks

From Wikitech
The procedures in this runbook require admin permissions to complete.

The designate-sink service listens on RabbitMQ for server creation and deletion events. When a server is created, it attempts to create the corresponding DNS records; when a server is deleted, it attempts to delete those records.

Because designate-sink isn't tightly coupled to nova or neutron, it's possible for an IP address to be allocated or released without the corresponding DNS update. When this happens, a stale DNS record is left behind indefinitely.

Error / Incident

Ideally you are here after responding to an alert about leaked DNS records. If the leaked records went undetected, various other symptoms may appear:

  • nova-fullstack tests fail with messages about DNS timeouts
  • various complex VM orchestrations (e.g. magnum clusters) fail in the middle without good explanation
  • (rarely) traffic intended for one VM is routed to a different one.

Debugging

The quickest way to find the leaked records is to inspect the systemd journal for prometheus-node-textfile-wmcs-dnsleaks.service on cloudcontrol1006:

cloudcontrol1006:~$ sudo journalctl -u prometheus-node-textfile-wmcs-dnsleaks.service -e

The same script can also be run by hand to see the list of problematic records (it takes a few minutes to check all the zones):

andrew@cloudcontrol2001-dev:~$ sudo wmcs-dnsleaks
checking zone: 16.172.in-addr.arpa.
Found 2 ptr recordsets for the same VM: canary2002-dev-2.cloudvirt-canary.codfw1dev.wikimedia.cloud. ['38.128.16.172.in-addr.arpa.', '170.128.16.172.in-addr.arpa.']
Found 2 ptr recordsets for the same VM: canary2003-dev-2.cloudvirt-canary.codfw1dev.wikimedia.cloud. ['134.128.16.172.in-addr.arpa.', '131.128.16.172.in-addr.arpa.']
checking zone: codfw1dev.wikimedia.cloud.
skipping public zone: codfw1dev.wmcloud.org.
skipping public zone: cloudinfra-codfw1dev.codfw1dev.wmcloud.org.
checking zone: 57.15.185.in-addr.arpa.
checking zone: 0-29.57.15.185.in-addr.arpa.
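
On a large deployment the progress lines can bury the actual findings. The following is a minimal sketch of a filter that keeps only the problem reports. It assumes every finding starts with "Found", as in the sample output above; other kinds of findings may use a different prefix, so check the unfiltered output as well. The function name dnsleaks_findings is hypothetical and not part of wmcs-dnsleaks.

```shell
# Read wmcs-dnsleaks output on stdin and keep only the lines that report
# a problem, dropping the "checking zone" / "skipping public zone" lines.
# Assumption: findings start with "Found " (as in the sample output).
dnsleaks_findings() {
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep '^Found ' || true
}

# Usage on a cloudcontrol host (the scan takes a few minutes):
#   sudo wmcs-dnsleaks | dnsleaks_findings
```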

That same tool can be used to automatically clean up most kinds of bad records:

andrew@cloudcontrol2001-dev:~$ sudo wmcs-dnsleaks --delete

It's best to review the list of bad records BEFORE using --delete; setups change and what looks like a leak in 2024 might be an intentional, vital service record in 2025.
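
One way to enforce the review step is a small wrapper that shows the report first and only runs the destructive command after an explicit confirmation. This is a sketch, not part of wmcs-dnsleaks: the function name review_then_delete is hypothetical, and it assumes a POSIX sh or bash shell where unquoted variables are split into words. Note that the scan effectively runs twice, so expect it to take a few minutes each time.

```shell
# Show the output of a listing command, then run the delete command only
# if the operator types "yes". Both commands are passed as strings.
review_then_delete() {
    list_cmd=$1      # e.g. "sudo wmcs-dnsleaks"
    delete_cmd=$2    # e.g. "sudo wmcs-dnsleaks --delete"

    # Unquoted on purpose so that "sudo wmcs-dnsleaks" splits into words.
    $list_cmd

    printf 'Delete these records? Type "yes" to continue: ' >&2
    read -r answer || answer=""
    if [ "$answer" = "yes" ]; then
        $delete_cmd
    else
        echo "Aborted; nothing deleted."
    fi
}

# Usage on a cloudcontrol host:
#   review_then_delete "sudo wmcs-dnsleaks" "sudo wmcs-dnsleaks --delete"
```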

Some edge cases are too delicate to be deleted automatically. To clean up those records, first determine the correct setup (typically by exploring with openstack server show or Horizon) and then adjust the records accordingly using the openstack command line (openstack zone list, openstack recordset show, etc.).

Be warned that some records are in 'noauth-project'. This project does not exist in keystone, but its zones can be accessed in Designate by passing '--sudo-project-id noauth-project'.

root@cloudcontrol1006:~# wmcs-openstack zone list --sudo-project-id noauth-project
+--------------------------------------+---------------------------------+---------+------------+--------+--------+
| id                                   | name                            | type    |     serial | status | action |
+--------------------------------------+---------------------------------+---------+------------+--------+--------+
| 114f1333-c2c1-44d3-beb4-ebed1a91742b | eqiad.wmflabs.                  | PRIMARY | 1704253940 | ACTIVE | NONE   |
| 8d114f3c-815b-466c-bdd4-9b91f704ea60 | 68.10.in-addr.arpa.             | PRIMARY | 1695748792 | ACTIVE | NONE   |
| d19aff2d-2d57-4d25-9e26-dab2b3a58be4 | svc.eqiad.wmflabs.              | PRIMARY | 1695748815 | ACTIVE | NONE   |
| df88fcb3-fbc2-42f1-bb12-2424c8b7117e | db.svc.eqiad.wmflabs.           | PRIMARY | 1695748815 | ACTIVE | NONE   |
| e81ea3bb-f8f0-49f4-906b-2d4c2e83cc7e | web.db.svc.eqiad.wmflabs.       | PRIMARY | 1701181110 | ACTIVE | NONE   |
| 04c45c1f-214d-450b-a733-028dcdc87a12 | analytics.db.svc.eqiad.wmflabs. | PRIMARY | 1701181101 | ACTIVE | NONE   |
| 6990e139-49e6-466c-9421-46cf45f05842 | 16.172.in-addr.arpa.            | PRIMARY | 1704413968 | ACTIVE | NONE   |
+--------------------------------------+---------------------------------+---------+------------+--------+--------+
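
When investigating a duplicate PTR record by hand, it helps to turn the reverse name reported by wmcs-dnsleaks back into an IP address and check which VM, if any, currently holds it. The sketch below shows one way to do that. The helper ptr_to_ip is hypothetical, handles only plain IPv4 names (not classless zones such as 0-29.57.15.185.in-addr.arpa.), and the follow-up commands assume that wmcs-openstack accepts the same arguments as the upstream openstack client. The zone ID is the one listed for 16.172.in-addr.arpa. in the table above; the recordset ID is a placeholder.

```shell
# Convert an IPv4 PTR name such as "38.128.16.172.in-addr.arpa." into the
# address it describes by reversing the four octets.
ptr_to_ip() {
    name=${1%.}                 # drop the trailing dot, if any
    name=${name%.in-addr.arpa}  # drop the reverse-zone suffix
    echo "$name" | awk -F. '{ print $4 "." $3 "." $2 "." $1 }'
}

# Usage:
#   ptr_to_ip 38.128.16.172.in-addr.arpa.     # prints 172.16.128.38
#
# Possible follow-up on a cloudcontrol host (run as root):
#   # Which VM holds this address? --ip is matched as a regular expression.
#   wmcs-openstack server list --all-projects --ip "$(ptr_to_ip 38.128.16.172.in-addr.arpa.)"
#   # List the recordsets in the reverse zone owned by noauth-project.
#   wmcs-openstack recordset list --sudo-project-id noauth-project 6990e139-49e6-466c-9421-46cf45f05842
#   # After confirming that a recordset is stale, delete it.
#   wmcs-openstack recordset delete --sudo-project-id noauth-project 6990e139-49e6-466c-9421-46cf45f05842 <recordset-id>
```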

Old incidents

Add your incident here: