Portal:Cloud VPS/Admin/Runbooks/Cloud VPS alert Puppet failure on
Revision as of 09:52, 16 September 2021
Note: some of these steps might require extra access to internal infrastructure systems. We are working on improving the runbooks; until then, take this as a guideline.
Error
Usually an email with the subject:
[Cloud VPS alert] Puppet failure on <hostname>
For example:
Subject: [Cloud VPS alert] Puppet failure on toolsbeta-sgeexec-1001.toolsbeta.eqiad1.wikimedia.cloud
Debugging
If there are only a few of those emails, the error is most probably on the client side and/or affects only a limited number of hosts. Otherwise it might indicate a wider issue.
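As a rough way to apply this rule of thumb, the sketch below counts distinct failing hosts per project from a batch of alert subjects. It assumes the subject format shown in the example email and that the second DNS label of the hostname is the project name, as in that example; the function name and the sample list are hypothetical, and what counts as "a few" is left to the reader.

```python
import re
from collections import Counter

# Assumed subject format, taken from the example email in this runbook.
SUBJECT_RE = re.compile(r"\[Cloud VPS alert\] Puppet failure on (?P<fqdn>\S+)")


def failures_per_project(subjects):
    """Count distinct failing hosts per project from alert email subjects."""
    hosts = set()
    for subject in subjects:
        match = SUBJECT_RE.search(subject)
        if match:
            hosts.add(match.group("fqdn"))
    # Assumption: the second DNS label is the project name, as in
    # toolsbeta-sgeexec-1001.toolsbeta.eqiad1.wikimedia.cloud
    return Counter(fqdn.split(".")[1] for fqdn in hosts if "." in fqdn)


subjects = [
    "Subject: [Cloud VPS alert] Puppet failure on toolsbeta-sgeexec-1001.toolsbeta.eqiad1.wikimedia.cloud",
    "Subject: [Cloud VPS alert] Puppet failure on toolsbeta-sgeexec-1001.toolsbeta.eqiad1.wikimedia.cloud",
]
print(failures_per_project(subjects))
```

Repeated emails for the same host are counted once, so a single noisy instance does not look like a project-wide problem.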
Usually you would want to retry the run on the failed machine, so ssh to it and run:
    dcaro@vulcanus$ ssh toolsbeta-sgeexec-1001.toolsbeta.eqiad1.wikimedia.cloud
    Linux toolsbeta-sgeexec-1001 4.19.0-16-cloud-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64
    Debian GNU/Linux 10 (buster)
    The last Puppet run was at Sun Mar 28 15:16:07 UTC 2021 (26947 minutes ago).
    Last puppet commit:
    Last login: Thu Apr 15 08:23:23 2021 from 172.16.1.135
    dcaro@toolsbeta-sgeexec-1001:~$ sudo run-puppet-agent
From there, a wide variety of puppet issues might happen; some common ones follow.
Usual puppet errors
TODO: Fill up as you encounter errors.
Failed resources if any: Exec[create-volume-group]
If you see the above text in your e-mail alert, it means you have the legacy LVM Puppet role enabled project-wide. See Help:Adding_Disk_Space_to_Cloud_VPS_instances#With_LVM_(deprecated_as_of_February,_2021).
Solution
- log in to https://horizon.wikimedia.org/;
- go to Puppet > Project Puppet;
- click on the Edit button below Puppet Classes;
- delete the role::labs::lvm::srv line;
- click on the Apply Changes button.
Done!
Function lookup() did not find a value for the name
This error usually means that a puppet class parameter is missing a default value, or that a value is missing from hiera. An example of the error:
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Function lookup() did not find a value for the name 'profile::logstash::apifeatureusage::curator_actions' (file: /etc/puppet/modules/profile/manifests/logstash/apifeatureusage.pp, line: 7) on node deployment-logstash03.deployment-prep.eqiad.wmflabs
From there you have the missing parameter, profile::logstash::apifeatureusage::curator_actions, and the class that's missing it, /etc/puppet/modules/profile/manifests/logstash/apifeatureusage.pp, line 7. So, if the project is using the same puppet as production (as is the case for the example), you can look up that file and line in the production puppet code.
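When going through several of these alerts, it can help to pull the key, manifest, and line out of the message mechanically. The following is a minimal sketch, with a regular expression written only against the example error in this runbook; other Puppet versions may word the message differently, and the function name is hypothetical.

```python
import re

# Pattern written against the example error in this runbook; treat it as an
# assumption, not as a stable Puppet message format.
LOOKUP_ERROR_RE = re.compile(
    r"Function lookup\(\) did not find a value for the name '(?P<key>[^']+)' "
    r"\(file: (?P<file>[^,]+), line: (?P<line>\d+)\) on node (?P<node>\S+)"
)


def parse_lookup_error(message):
    """Return the missing key, manifest, line and node, or None if no match."""
    match = LOOKUP_ERROR_RE.search(message)
    if match is None:
        return None
    info = match.groupdict()
    info["line"] = int(info["line"])
    return info


error = (
    "Error: Could not retrieve catalog from remote server: Error 500 on SERVER: "
    "Server Error: Evaluation Error: Error while evaluating a Resource Statement, "
    "Function lookup() did not find a value for the name "
    "'profile::logstash::apifeatureusage::curator_actions' "
    "(file: /etc/puppet/modules/profile/manifests/logstash/apifeatureusage.pp, line: 7) "
    "on node deployment-logstash03.deployment-prep.eqiad.wmflabs"
)
print(parse_lookup_error(error))
```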
You can see there that there's a parameter called curator_actions that does the lookup for that variable but has no default:
    class profile::logstash::apifeatureusage(
        Array[Stdlib::Host] $targets         = lookup('profile::logstash::apifeatureusage::targets'),
        Hash                $curator_actions = lookup('profile::logstash::apifeatureusage::curator_actions'),
    ) {
One solution (not the correct one in this case) is to add a default value there:
    class profile::logstash::apifeatureusage(
        Array[Stdlib::Host] $targets         = lookup('profile::logstash::apifeatureusage::targets'),
        Hash                $curator_actions = lookup('profile::logstash::apifeatureusage::curator_actions', {'default_value' => {}}),
    ) {
Another way of fixing the issue is to define that value in Horizon, under the project puppet page, or for the specific prefix for that VPS.
If that's not possible, then we have to look deeper.
In this case, a quick git grep curator_actions shows that it's defined in the file hieradata/role/common/logstash.yaml:
    10:50 AM ~/Work/wikimedia/operations-puppet (production|✔)
    dcaro@vulcanus$ git grep curator_actions
    hieradata/role/common/logstash.yaml:profile::logstash::apifeatureusage::curator_actions:
    hieradata/role/common/logstash/elasticsearch7.yaml:profile::elasticsearch::logstash::curator_actions:
    ...
Let's see why it's not using that.
Checking up the puppetmaster
In order to do some extra debugging and find out which host is the puppetmaster, you can do the following:
    dcaro@deployment-logstash03:~$ host 172.16.0.38
    38.0.16.172.in-addr.arpa domain name pointer cloud-puppetmaster-03.cloudinfra.eqiad1.wikimedia.cloud.
    dcaro@deployment-logstash03:~$ sudo puppet config print ca_server
    puppet
    dcaro@deployment-logstash03:~$ host puppet
    puppet has address 172.16.0.38
    puppet has address 172.16.0.38
    puppet has address 172.16.0.38
    dcaro@deployment-logstash03:~$ host 172.16.0.38
    38.0.16.172.in-addr.arpa domain name pointer cloud-puppetmaster-03.cloudinfra.eqiad1.wikimedia.cloud.
So now we know that the puppetmaster for this instance is cloud-puppetmaster-03.cloudinfra.eqiad1.wikimedia.cloud. This is the common master (cloudinfra) that any VPS uses by default when it doesn't have a dedicated one.
Hiera lookups work differently in cloud than in prod
Looking at that master, we can check the hiera config (/etc/puppet/hiera.yaml) for the order in which the data is loaded:
    ...
    hierarchy:
      - name: 'Http Yaml'
        data_hash: cloudlib::httpyaml
        uri: "http://puppetmaster.cloudinfra.wmflabs.org:8100/v1/%{::labsproject}/node/%{facts.fqdn}"
      - name: "cloud hierarchy"
        paths:
          - "cloud/%{::wmcs_deployment}/%{::labsproject}/hosts/%{::hostname}.yaml"
          - "cloud/%{::wmcs_deployment}/%{::labsproject}/common.yaml"
          - "cloud/%{::wmcs_deployment}.yaml"
          - "cloud.yaml"
      - name: "Secret hierarchy"
        path: "%{::labsproject}.yaml"
        datadir: "/etc/puppet/secret/hieradata"
      - name: "Private hierarchy"
        paths:
          - "labs/%{::labsproject}/common.yaml"
          - "%{::labsproject}.yaml"
          - "labs.yaml"
        datadir: "/etc/puppet/private/hieradata"
      - name: "Common hierarchy"
        path: "common.yaml"
      - name: "Secret Common hierarchy"
        path: "common.yaml"
        datadir: "/etc/puppet/secret/hieradata"
      - name: "Private Common hierarchy"
        path: "common.yaml"
        datadir: "/etc/puppet/private/hieradata"
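To make the "cloud hierarchy" entry concrete, the sketch below expands its path templates for the example node. The variable values are assumptions: labsproject and hostname are read off the node name in the error, and wmcs_deployment is inferred from the hieradata/cloud/eqiad1/... path that appears in the repository, not read from the host's facts. The expand helper is hypothetical and only handles the %{::name} form used in these four paths.

```python
import re

# The "cloud hierarchy" paths, copied from the hiera.yaml shown in this runbook.
CLOUD_PATHS = [
    "cloud/%{::wmcs_deployment}/%{::labsproject}/hosts/%{::hostname}.yaml",
    "cloud/%{::wmcs_deployment}/%{::labsproject}/common.yaml",
    "cloud/%{::wmcs_deployment}.yaml",
    "cloud.yaml",
]


def expand(path, variables):
    """Replace each %{::name} token with its value from `variables`."""
    return re.sub(r"%\{::(\w+)\}", lambda m: variables[m.group(1)], path)


# Assumed values for the example node (see the lead-in for where they come from).
node = {
    "wmcs_deployment": "eqiad1",
    "labsproject": "deployment-prep",
    "hostname": "deployment-logstash03",
}
for path in CLOUD_PATHS:
    print(expand(path, node))
```

The expanded list is the set of files, in priority order, that this hierarchy level would consult for the node under those assumptions.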
Comparing that with the production puppet hiera.yaml, we see that it's missing the role hierarchy:
    - name: "role"
      paths:
        - "role/%{::site}/%{::_role}.yaml"
        - "role/common/%{::_role}.yaml"
So another solution would be to add the value to a yaml file that is picked up by puppet. Checking the git repo for the other parameter in that class, we find that it is also defined in:
    12:14 PM ~/Work/wikimedia/operations-puppet (production|✔)
    dcaro@vulcanus$ git grep apifeatureusage::targets
    hieradata/cloud/eqiad1/deployment-prep/common.yaml:profile::logstash::apifeatureusage::targets:
    ...
So adding it there as well would solve the issue.
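The reasoning can be illustrated with a toy model of a first-match lookup over ordered layers. This is a sketch, not how hiera is implemented: the layer contents are hypothetical and heavily trimmed, the value used for the key is a placeholder, and the role file is deliberately absent because the cloud hierarchy never consults it.

```python
def lookup(layers, key):
    """Return (layer name, value) from the first layer that defines `key`.

    `layers` is an ordered list of (name, data) pairs, highest priority first.
    """
    for name, data in layers:
        if key in data:
            return name, data[key]
    raise LookupError(f"Function lookup() did not find a value for the name '{key}'")


KEY = "profile::logstash::apifeatureusage::curator_actions"

# Hypothetical, trimmed layer contents for the example project.
cloud_layers = [
    ("cloud/eqiad1/deployment-prep/common.yaml",
     {"profile::logstash::apifeatureusage::targets": ["..."]}),
    ("cloud.yaml", {}),
    ("common.yaml", {}),
]

# Before the fix: no consulted layer defines the key, so the lookup fails.
try:
    lookup(cloud_layers, KEY)
except LookupError as err:
    print(err)

# After adding the key to the project's common.yaml (placeholder value):
cloud_layers[0][1][KEY] = {}
print(lookup(cloud_layers, KEY))
```

Defining the key in hieradata/role/common/logstash.yaml alone has no effect in this model, for the same reason it has none on the cloud puppetmaster: that file is not in the list of layers.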
TODO: Maybe add roles to the cloud hiera lookup (T280324)
More info
- General puppet info at wikimedia
- Cloud specific hiera/enc info
- Cloud specific testing
- Cloud per-project puppetmaster info