Portal:Cloud VPS/Admin/Runbooks/Designate record leaks


Overview

The procedures in this runbook require admin permissions to complete.

The designate-sink service listens on RabbitMQ for server creation and deletion events. When a server is created, it attempts to create the corresponding DNS records; when a server is deleted, it attempts to delete those records.

Because designate-sink isn't tightly coupled to nova or neutron, it's possible for an IP address to be allocated or released without the corresponding DNS update. When this happens, a stale DNS record is left behind and stays there until someone removes it.

Error / Incident

Ideally you are here after responding to an alert about leaked DNS records. If leaked records go undetected, various other symptoms may appear:

  • nova-fullstack tests fail with messages about DNS timeouts
  • various complex VM orchestrations (e.g. magnum clusters) fail in the middle without good explanation
  • (rarely) traffic intended for one VM is routed to a different one.

Debugging

The same script that triggers the alerts on cloudservices nodes can be run by hand on a cloudcontrol node to see the list of problematic records:

andrew@cloudcontrol2001-dev:~$ sudo wmcs-dnsleaks
checking zone: 16.172.in-addr.arpa.
Found 2 ptr recordsets for the same VM: canary2002-dev-2.cloudvirt-canary.codfw1dev.wikimedia.cloud. ['38.128.16.172.in-addr.arpa.', '170.128.16.172.in-addr.arpa.']
Found 2 ptr recordsets for the same VM: canary2003-dev-2.cloudvirt-canary.codfw1dev.wikimedia.cloud. ['134.128.16.172.in-addr.arpa.', '131.128.16.172.in-addr.arpa.']
checking zone: codfw1dev.wikimedia.cloud.
skipping public zone: codfw1dev.wmcloud.org.
skipping public zone: cloudinfra-codfw1dev.codfw1dev.wmcloud.org.
checking zone: 57.15.185.in-addr.arpa.
checking zone: 0-29.57.15.185.in-addr.arpa.
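The "Found 2 ptr recordsets for the same VM" lines above describe one hostname that is the target of more than one PTR recordset. As an illustration of that condition only (this is not how wmcs-dnsleaks is implemented, and the function name and input format are assumptions made for this sketch), the following groups "PTR name, target hostname" pairs by target and reports any target that appears more than once:

```shell
# Illustration only: wmcs-dnsleaks has its own logic, which may differ.
# Reads "PTR_NAME TARGET_FQDN" pairs on stdin (hypothetical input format)
# and prints each target that more than one PTR recordset points at,
# followed by the PTR names in input order.
find_duplicate_ptrs() {
    awk '
        { count[$2]++; ptrs[$2] = ptrs[$2] " " $1 }
        END {
            for (target in count)
                if (count[target] > 1)
                    print target ptrs[target]
        }
    ' | sort
}
```

At most one of the PTR names listed for a target can match the VM's current address; the others are candidates for cleanup once that has been confirmed.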

That same tool can be used to automatically clean up most kinds of bad records:

andrew@cloudcontrol2001-dev:~$ sudo wmcs-dnsleaks --delete

It's best to review the list of bad records BEFORE using --delete; setups change and what looks like a leak in 2024 might be an intentional, vital service record in 2025.
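One way to make that review easier is to save only the findings from a plain run before deciding whether to use --delete. The filter below is a minimal sketch: the function name is hypothetical, and it assumes the only progress lines are the "checking zone" and "skipping public zone" lines seen in the sample output above, so any other line is treated as a finding.

```shell
# Hypothetical helper: drop the progress lines and keep everything else.
# Assumes the progress prefixes shown in the sample output are the only ones.
dnsleaks_findings() {
    # "|| true" keeps the exit status clean when there are no findings.
    grep -v -E '^(checking zone|skipping public zone): ' || true
}

# Example use on a cloudcontrol (not run here):
#   sudo wmcs-dnsleaks | dnsleaks_findings > dnsleaks-review.txt
```

An empty review file suggests there is nothing for --delete to act on; a non-empty one is the list to check against the current setup.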

Some edge cases are too delicate to be automatically deleted. To clean up those records, first determine the correct setup (typically by exploring with openstack server show or Horizon), then adjust the records accordingly using the openstack command line (openstack zone list, openstack recordset show, etc.).
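When comparing the addresses reported by openstack server show with PTR recordset names, it can help to convert between the two forms. The helpers below are a small sketch with hypothetical names; they assume plain four-octet IPv4 PTR names such as those in the 16.172.in-addr.arpa. zone and will not handle classless delegations like 0-29.57.15.185.in-addr.arpa.

```shell
# Hypothetical helpers; assume ordinary four-octet IPv4 reverse names.

# 38.128.16.172.in-addr.arpa. -> 172.16.128.38
ptr_to_ip() {
    echo "$1" | awk -F. '{ print $4 "." $3 "." $2 "." $1 }'
}

# 172.16.128.38 -> 38.128.16.172.in-addr.arpa.
ip_to_ptr() {
    echo "$1" | awk -F. '{ print $4 "." $3 "." $2 "." $1 ".in-addr.arpa." }'
}
```

If a VM has two PTR recordsets, the one whose converted address matches the VM's current address is the one to keep.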

Be warned that some records are in 'noauth-project'. This project does not exist in Keystone, but its zones can be accessed in Designate by passing '--sudo-project-id noauth-project':


root@cloudcontrol1005:~# openstack zone list --sudo-project-id noauth-project
+--------------------------------------+---------------------------------+---------+------------+--------+--------+
| id                                   | name                            | type    |     serial | status | action |
+--------------------------------------+---------------------------------+---------+------------+--------+--------+
| 114f1333-c2c1-44d3-beb4-ebed1a91742b | eqiad.wmflabs.                  | PRIMARY | 1704253940 | ACTIVE | NONE   |
| 8d114f3c-815b-466c-bdd4-9b91f704ea60 | 68.10.in-addr.arpa.             | PRIMARY | 1695748792 | ACTIVE | NONE   |
| d19aff2d-2d57-4d25-9e26-dab2b3a58be4 | svc.eqiad.wmflabs.              | PRIMARY | 1695748815 | ACTIVE | NONE   |
| df88fcb3-fbc2-42f1-bb12-2424c8b7117e | db.svc.eqiad.wmflabs.           | PRIMARY | 1695748815 | ACTIVE | NONE   |
| e81ea3bb-f8f0-49f4-906b-2d4c2e83cc7e | web.db.svc.eqiad.wmflabs.       | PRIMARY | 1701181110 | ACTIVE | NONE   |
| 04c45c1f-214d-450b-a733-028dcdc87a12 | analytics.db.svc.eqiad.wmflabs. | PRIMARY | 1701181101 | ACTIVE | NONE   |
| 6990e139-49e6-466c-9421-46cf45f05842 | 16.172.in-addr.arpa.            | PRIMARY | 1704413968 | ACTIVE | NONE   |
+--------------------------------------+---------------------------------+---------+------------+--------+--------+
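For manual cleanup in one of these zones, the same --sudo-project-id option can be passed to the recordset commands. The sketch below prints the commands for review instead of running them, in keeping with the advice to check records before deleting anything. The function name is hypothetical, and the zone and recordset arguments are assumed to be identifiers taken from openstack zone list and openstack recordset list.

```shell
# Hypothetical helper: print, but do not run, the commands for inspecting
# and then deleting one recordset.
# Arguments: zone ID, recordset ID, optional project to impersonate.
recordset_cleanup_plan() {
    zone="$1"
    recordset="$2"
    project="${3:-}"
    opts=""
    if [ -n "$project" ]; then
        opts=" --sudo-project-id $project"
    fi
    echo "openstack recordset show$opts $zone $recordset"
    echo "openstack recordset delete$opts $zone $recordset"
}

# Example (RECORDSET_ID is a placeholder):
#   recordset_cleanup_plan 6990e139-49e6-466c-9421-46cf45f05842 RECORDSET_ID noauth-project
```

Run the printed show command first and confirm the record really is stale before running the delete.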

Related information

Support contacts

Communication and support

Support and administration of the WMCS resources is provided by the Wikimedia Foundation Cloud Services team and Wikimedia movement volunteers. Please reach out with questions and join the conversation:

Discuss and receive general support
Stay aware of critical changes and plans
Track work tasks and report bugs

Use a subproject of the #Cloud-Services Phabricator project to track confirmed bug reports and feature requests about the Cloud Services infrastructure itself

Read stories and WMCS blog posts

Read the Cloud Services Blog (for the broader Wikimedia movement, see the Wikimedia Technical Blog)

Old incidents

Add your incident here: