Revision as of 09:52, 16 September 2021

The procedures in this runbook require admin permissions to complete.

Note: some of these steps might require extra access to internal infrastructure systems. We are working on improving the runbooks; until then, take this as a guideline.

Error

Usually an email with the subject:

[Cloud VPS alert] Puppet failure on <hostname>

For example:

Subject: [Cloud VPS alert] Puppet failure on toolsbeta-sgeexec-1001.toolsbeta.eqiad1.wikimedia.cloud

Debugging

If there are only a few of those emails, the error is most probably on the client side and/or affects only a limited number of hosts. Otherwise it might indicate a wider issue.
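As a rough way to judge how widespread the failures are, the distinct hostnames in the alert subjects can be counted. This is only a sketch: it assumes you have the subject lines available as text, one per line (how you export them from your mail client is up to you), and the function name count_failing_hosts is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: reads alert subject lines on stdin and prints the
# number of distinct hostnames found in
# "[Cloud VPS alert] Puppet failure on <hostname>" subjects.
count_failing_hosts() {
    # Keep only the hostname of matching subjects, then deduplicate and count.
    sed -n 's/.*\[Cloud VPS alert] Puppet failure on \([^ ]*\).*/\1/p' \
        | sort -u \
        | wc -l \
        | tr -d ' '
}
```

For example, `count_failing_hosts < subjects.txt` (subjects.txt being whatever file you saved the subjects to) prints a single number; a small number suggests a per-host problem, a large one a wider issue.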

Usually you would want to retry the run on the failed machine, so ssh to it and run:

dcaro@vulcanus$ ssh toolsbeta-sgeexec-1001.toolsbeta.eqiad1.wikimedia.cloud
 Linux toolsbeta-sgeexec-1001 4.19.0-16-cloud-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64
 Debian GNU/Linux 10 (buster)
 The last Puppet run was at Sun Mar 28 15:16:07 UTC 2021 (26947 minutes ago). 
 Last puppet commit: 
 Last login: Thu Apr 15 08:23:23 2021 from 172.16.1.135
dcaro@toolsbeta-sgeexec-1001:~$ sudo run-puppet-agent

From there, a wide variety of Puppet issues might happen; some common ones follow.

Usual puppet errors

TODO: Fill in as you encounter errors.

Failed resources if any: Exec[create-volume-group]

If you see the above text in your e-mail alert, it means you have the legacy LVM Puppet role enabled project-wide. See Help:Adding_Disk_Space_to_Cloud_VPS_instances#With_LVM_(deprecated_as_of_February,_2021).

Solution

log in to https://horizon.wikimedia.org/;
go to Puppet > Project Puppet;
click on the Edit button below Puppet Classes;
delete the role::labs::lvm::srv line;
click on the Apply Changes button.

Done!

Function lookup() did not find a value for the name

This error usually means that a Puppet class parameter is missing a default value, or that a value is missing from hiera. An example of the error:

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Function lookup() did not find a value for the name 'profile::logstash::apifeatureusage::curator_actions' (file: /etc/puppet/modules/profile/manifests/logstash/apifeatureusage.pp, line: 7) on node deployment-logstash03.deployment-prep.eqiad.wmflabs

That gives you the missing parameter, profile::logstash::apifeatureusage::curator_actions, and the class that needs it, at /etc/puppet/modules/profile/manifests/logstash/apifeatureusage.pp, line 7. If the project uses the same Puppet code as production (as is the case in this example), you can look up that file and line in the production Puppet repository.
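When the error line is long, the key, file and line can be pulled out mechanically. The following is a minimal sketch, assuming the message keeps the exact wording shown in the example above; parse_lookup_error is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# Hypothetical helper: reads Puppet output on stdin and, for each
# "Function lookup() did not find a value" error, prints
# "<missing key> <manifest file>:<line>".
parse_lookup_error() {
    sed -n "s/.*did not find a value for the name '\([^']*\)' (file: \([^,]*\), line: \([0-9]*\)).*/\1 \2:\3/p"
}
```

For instance, piping the saved output of a failed run through `parse_lookup_error` prints one line per missing key, which you can then feed to git grep in your checkout of the Puppet code.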

You can see there that there's a parameter called curator_actions that does the lookup for that variable but has no default:

class profile::logstash::apifeatureusage(
   Array[Stdlib::Host] $targets         = lookup('profile::logstash::apifeatureusage::targets'),
   Hash                $curator_actions = lookup('profile::logstash::apifeatureusage::curator_actions'),
) {

One solution (not the correct one in this case) is to add a default value there:

class profile::logstash::apifeatureusage(
   Array[Stdlib::Host] $targets         = lookup('profile::logstash::apifeatureusage::targets'),
   Hash                $curator_actions = lookup('profile::logstash::apifeatureusage::curator_actions', {'default_value' => {}}),
) {

Another way of fixing the issue is to define that value in Horizon, under the project puppet page or under the specific prefix for that VPS.

If that's not possible, then we have to look deeper. In this case doing a quick git grep curator_actions shows that it's defined in the file hieradata/role/common/logstash.yaml:

10:50 AM ~/Work/wikimedia/operations-puppet  (production|✔) 
dcaro@vulcanus$ git grep curator_actions
 hieradata/role/common/logstash.yaml:profile::logstash::apifeatureusage::curator_actions:
 hieradata/role/common/logstash/elasticsearch7.yaml:profile::elasticsearch::logstash::curator_actions:
...

Let's see why it's not using that.

Checking up the puppetmaster

In order to do some extra debugging and find out which host is the puppetmaster, you can do the following:

dcaro@deployment-logstash03:~$ host 172.16.0.38
 38.0.16.172.in-addr.arpa domain name pointer cloud-puppetmaster-03.cloudinfra.eqiad1.wikimedia.cloud.
dcaro@deployment-logstash03:~$ sudo puppet config print ca_server
 puppet
dcaro@deployment-logstash03:~$ host puppet
 puppet has address 172.16.0.38
 puppet has address 172.16.0.38
 puppet has address 172.16.0.38
dcaro@deployment-logstash03:~$ host 172.16.0.38
 38.0.16.172.in-addr.arpa domain name pointer cloud-puppetmaster-03.cloudinfra.eqiad1.wikimedia.cloud.

So now we know that the puppetmaster for this instance is cloud-puppetmaster-03.cloudinfra.eqiad1.wikimedia.cloud. This is the common master (cloudinfra) that any VPS uses by default when it doesn't have a dedicated one.
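The manual steps above can be chained together. This is a sketch only: ptr_name and first_address are hypothetical helper names, and they assume the output format of the host command shown in the transcript above.

```shell
#!/bin/sh
# Hypothetical helper: prints the first address found in `host <name>`
# output read from stdin ("<name> has address <ip>").
first_address() {
    sed -n 's/.* has address \(.*\)$/\1/p' | head -n 1
}

# Hypothetical helper: prints the PTR target found in `host <ip>` output
# read from stdin, without the trailing dot.
ptr_name() {
    sed -n 's/.* domain name pointer \(.*\)\.$/\1/p' | head -n 1
}

# Usage on the instance (not executed here, as it needs sudo and DNS):
#   server=$(sudo puppet config print ca_server)
#   host "$(host "$server" | first_address)" | ptr_name
```

The usage lines are left as comments because they depend on the instance's DNS and on sudo access; the two helpers themselves only parse text.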

Hiera lookups work differently in cloud than in prod

Looking on that master, we can check the hiera config (/etc/puppet/hiera.yaml) for the order in which the data is loaded:

...
hierarchy:
 - name: 'Http Yaml'
   data_hash: cloudlib::httpyaml
   uri: "http://puppetmaster.cloudinfra.wmflabs.org:8100/v1/%{::labsproject}/node/%{facts.fqdn}"
 - name: "cloud hierarchy"
   paths:
     - "cloud/%{::wmcs_deployment}/%{::labsproject}/hosts/%{::hostname}.yaml"
     - "cloud/%{::wmcs_deployment}/%{::labsproject}/common.yaml"
     - "cloud/%{::wmcs_deployment}.yaml"
     - "cloud.yaml"
 - name: "Secret hierarchy"
   path: "%{::labsproject}.yaml"
   datadir: "/etc/puppet/secret/hieradata"
 - name: "Private hierarchy"
   paths:
     - "labs/%{::labsproject}/common.yaml"
     - "%{::labsproject}.yaml"
     - "labs.yaml"
   datadir: "/etc/puppet/private/hieradata"
 - name: "Common hierarchy"
   path: "common.yaml"
 - name: "Secret Common hierarchy"
   path: "common.yaml"
   datadir: "/etc/puppet/secret/hieradata"
 - name: "Private Common hierarchy"
   path: "common.yaml"
   datadir: "/etc/puppet/private/hieradata"

Comparing that with the production puppet hiera.yaml, we see that it's missing the role hierarchy:

 - name: "role"
   paths:
     - "role/%{::site}/%{::_role}.yaml"
     - "role/common/%{::_role}.yaml"

So another solution is to add the value to a YAML file that the cloud hierarchy does pick up. Searching the git repo for the other parameter of that class shows that it is also defined in:

12:14 PM ~/Work/wikimedia/operations-puppet  (production|✔) 
dcaro@vulcanus$ git grep apifeatureusage::targets
 hieradata/cloud/eqiad1/deployment-prep/common.yaml:profile::logstash::apifeatureusage::targets:
 ...

So adding the missing key to that file as well would solve the issue.
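To check which files of the "cloud hierarchy" already define a given key for an instance, the four paths from the hiera config can be expanded and searched in a checkout of the Puppet repo. This is a sketch under assumptions: the function names are invented for this example, the hieradata/ prefix is inferred from the git grep output above rather than from the hiera config itself, and it has to be run from the top of the repo checkout.

```shell
#!/bin/sh
# Hypothetical helper: prints the "cloud hierarchy" candidate files, in
# lookup order, for a deployment (e.g. eqiad1), project and short hostname.
# Assumption: the paths are relative to hieradata/ in the repo checkout.
cloud_hiera_candidates() {
    deployment=$1
    project=$2
    hostname=$3
    printf '%s\n' \
        "hieradata/cloud/$deployment/$project/hosts/$hostname.yaml" \
        "hieradata/cloud/$deployment/$project/common.yaml" \
        "hieradata/cloud/$deployment.yaml" \
        "hieradata/cloud.yaml"
}

# Hypothetical helper: prints every candidate file that exists and defines
# the given key at the top level.
# Arguments: <key> <deployment> <project> <hostname>
find_key_in_cloud_hiera() {
    key=$1
    shift
    cloud_hiera_candidates "$@" | while IFS= read -r candidate; do
        if [ -f "$candidate" ] && grep -q "^$key:" "$candidate"; then
            printf '%s\n' "$candidate"
        fi
    done
}
```

For the example in this section you would call `find_key_in_cloud_hiera profile::logstash::apifeatureusage::curator_actions eqiad1 deployment-prep deployment-logstash03`; empty output means none of those files defines the key. Note that this only covers the "cloud hierarchy" entry, not the Horizon-backed, secret or private ones.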

TODO: Maybe add roles to the cloud hiera lookup (T280324)

More info

Related tasks
