Portal:Cloud VPS/Admin/Network/Tests

From Wikitech

This page explains the network checklist/testing functions that we have in place to verify normal network operations for Cloud VPS / Toolforge.

The checklist is meant to test the network exactly as a user would use it:

  • in different directions (i.e., internet --> cloud and cloud --> internet)
  • verify correct operation of routing_source_ip NAT, dmz_cidr, floating IP NAT, etc.
  • verify interaction with NFS servers and other special services
  • verify some other basic network functions like DNS, LDAP, etc.

You may wonder: how is this different from Icinga or other monitoring methods? The answer is: it isn't. We could probably migrate all of this to Icinga or Prometheus with some kung fu. But this was quickly developed to fill a tooling gap, so here we are.

Components

  • cmd-checklist-runner.py: a simple Python script that reads a YAML file with a set of test definitions, runs them, and reports the results.
  • /etc/networktests/networktests.yaml: YAML file containing the test case definitions.
  • a systemd timer job that runs the checklist periodically (15 minutes?). Icinga can monitor systemd services and page if they fail. We can activate this if necessary.
  • In puppet, we have the openstack::monitor::networktests class, which should be declared for cloudcontrol nodes. This class deploys all of the above.
  • We have a cookbook to help run the test suite manually.
  • Tests use the srv-networktests user both locally (cloudcontrol) and in virtual machines (LDAP).

Adding new tests

To add new tests:

  • add the desired envvars to profile::openstack::XYZ::networktests::envvars
  • add the desired checks to modules/openstack/templates/monitor/networktests.yaml.erb
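
Before adding a check to the template, it can help to sanity-check its shape. The sketch below is a hypothetical helper, not part of the deployed tooling: lint_test_definition and ALLOWED_CHECK_KEYS are names made up for illustration, and the set of accepted keys is assumed from the example on this page (the real runner may accept more). It works on an already-parsed definition, i.e. a Python dict.

```python
# Hypothetical lint helper; the allowed keys are an assumption taken from
# the example checklist on this page, not from the runner's source code.
ALLOWED_CHECK_KEYS = {"cmd", "stdout", "stderr", "retcode"}


def lint_test_definition(definition):
    """Return a list of human-readable problems (empty list means OK)."""
    if not isinstance(definition, dict):
        return ["definition is not a mapping"]

    problems = []
    if not definition.get("name"):
        problems.append("missing 'name'")

    checks = definition.get("tests")
    if not isinstance(checks, list) or not checks:
        problems.append("'tests' must be a non-empty list")
        return problems

    for index, check in enumerate(checks):
        if not isinstance(check, dict) or not check.get("cmd"):
            problems.append(f"tests[{index}]: missing 'cmd'")
            continue
        unknown = sorted(set(check) - ALLOWED_CHECK_KEYS)
        if unknown:
            problems.append(f"tests[{index}]: unexpected keys {unknown}")
    return problems
```

An empty result only means the definition has the expected shape; it says nothing about whether the command itself is correct.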

Checklist

The checklist is a list of shell commands to run. The runner can optionally verify stdout, stderr, and the return code to decide whether the test passed.

Example:

---
- envvars:
  - SSH: /usr/bin/ssh -i /etc/networktests/sshkeyfile [..] -o Proxycommand="ssh -o StrictHostKeyChecking=no -i /etc/networktests/sshkeyfile -W %h:%p srv-networktests@eqiad1.bastion.wmcloud.org"
    CLOUDGW_A_IP: 185.15.56.245
    CLOUDGW_B_IP: 185.15.56.246
    TOOLFORGE_BASTION_LOGIN: login.toolforge.org
    TOOLFORGE_BASTION_DEV: dev.toolforge.org
---
- name: basic ping to cloudgw addresses (raw addresses) from outside the cloud network
  tests:
    - cmd: timeout -k5s 10s ping -c1 $CLOUDGW_A_IP >/dev/null
      stdout: ""
      retcode: 0
      stderr: ""
    - cmd: timeout -k5s 10s ping -c1 $CLOUDGW_B_IP >/dev/null
      stdout: ""
      retcode: 0
      stderr: ""
      
- name: VM (using floating IP) can connect to wikireplicas from Toolforge
  tests:
    - cmd: $SSH $TOOLFORGE_BASTION_LOGIN 'sudo -iu tools.arturo-test-tool sql enwiki "select * from page limit 2;" | grep page_id | wc -l'
      stdout: "1"
      retcode: 0
      stderr: ""
    - cmd: $SSH $TOOLFORGE_BASTION_DEV 'sudo -iu tools.arturo-test-tool sql enwiki "select * from page limit 2;" | grep page_id | wc -l'
      stdout: "1"
      retcode: 0
      stderr: ""
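
The pass/fail decision for a single check can be sketched in Python as follows. This is not the actual cmd-checklist-runner.py code, only a minimal illustration under some assumptions: run_check is a hypothetical name, the envvars are assumed to be exported to the shell that runs the command, and output is assumed to be compared after stripping surrounding whitespace.

```python
import os
import subprocess


def run_check(check, envvars=None):
    """Run one check; return True when every declared expectation matches."""
    # Assumption: envvars are exported to the shell that runs the command,
    # so that $SSH, $CLOUDGW_A_IP, etc. expand as in the example above.
    env = {**os.environ, **{k: str(v) for k, v in (envvars or {}).items()}}
    proc = subprocess.run(
        check["cmd"], shell=True, env=env, capture_output=True, text=True
    )
    observed = {
        "stdout": proc.stdout.strip(),
        "stderr": proc.stderr.strip(),
        "retcode": proc.returncode,
    }
    # Only the fields present in the check are verified (they are optional).
    return all(
        str(check[field]).strip() == str(value)
        for field, value in observed.items()
        if field in check
    )
```

With this sketch, a check that declares only retcode ignores whatever the command prints, while a check that declares stdout: "1" fails as soon as the command prints anything else.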

Example execution:

root@cloudcontrol1004:~# cmd-checklist-runner --config /etc/networktests/networktests.yaml 
[cmd-checklist-runner] INFO: running test: basic ping to cloudgw addresses (raw addresses) from outside the cloud network
[cmd-checklist-runner] INFO: running test: basic ping to cloudgw addresses (DNS names) from outside the cloud network
[cmd-checklist-runner] INFO: running test: basic ping to neutron WAN from outside the cloud network
[cmd-checklist-runner] INFO: running test: basic ping to neutron VIRT gateway from within the cloud virtual network, no floating IP
[cmd-checklist-runner] INFO: running test: basic ping to neutron VIRT gateway from within the cloud virtual network, with floating IP
[cmd-checklist-runner] INFO: running test: VM (no floating IP) contacting the internet gets NAT'd using routing_source_ip
[cmd-checklist-runner] INFO: running test: VM (no floating IP) contacting an address covered by dmz_cidr doesn't get NAT'd
[cmd-checklist-runner] INFO: running test: VM (using floating IP) isn't affected by either routing_source_ip or dmz_cidr
[cmd-checklist-runner] INFO: running test: VM (no floating IP) can contact auth DNS server
[cmd-checklist-runner] INFO: running test: VM (no floating IP) can contact recursor DNS server
[cmd-checklist-runner] INFO: running test: VM (using floating IP) can contact auth DNS server
[cmd-checklist-runner] INFO: running test: VM (using floating IP) can contact recursor DNS server
[cmd-checklist-runner] INFO: running test: VM (using floating IP) can contact LDAP server
[cmd-checklist-runner] INFO: running test: VM (not using floating IP) can contact LDAP server
[cmd-checklist-runner] INFO: running test: VM (using floating IP) can contact openstack API
[cmd-checklist-runner] INFO: running test: VM (no floating IP) can contact openstack API
[cmd-checklist-runner] INFO: running test: puppetmasters can sync git tree
[cmd-checklist-runner] INFO: running test: VM (using floating IP) can read dumps NFS
[cmd-checklist-runner] INFO: running test: VM (no floating IP) can read dumps NFS
[cmd-checklist-runner] INFO: running test: VM (using floating IP) can connect to wikireplicas from Toolforge
[cmd-checklist-runner] INFO: running test: Toolforge webservice can be accessed from the internet
[cmd-checklist-runner] INFO: running test: Toolforge bastions see herald file on project NFS
[cmd-checklist-runner] INFO: ---
[cmd-checklist-runner] INFO: --- passed tests: 22
[cmd-checklist-runner] INFO: --- failed tests: 0
[cmd-checklist-runner] INFO: --- total tests: 22

Cookbook

There is a handy cookbook to help leverage this testing suite for other purposes:

arturo@endurance:~ $ cookbook wmcs.openstack.network.tests --cluster-name codfw1dev
START - Cookbook wmcs.openstack.network.tests
----- OUTPUT of 'sudo -i cmd-chec...etworktests.yaml' -----
[cmd-checklist-runner] INFO: running test: basic ping to cloudgw addresses (raw addresses) from outside the cloud network
[cmd-checklist-runner] INFO: running test: basic ping to cloudgw addresses (DNS names) from outside the cloud network
[cmd-checklist-runner] INFO: running test: basic ping to neutron WAN from outside the cloud network
[cmd-checklist-runner] INFO: running test: basic ping to neutron VIRT gateway from within the cloud virtual network, no floating IP
[..]
[cmd-checklist-runner] INFO: running test: puppetmasters can sync git tree
[cmd-checklist-runner] INFO: running test: VM (using floating IP) can read dumps NFS
[cmd-checklist-runner] INFO: running test: VM (no floating IP) can read dumps NFS
[cmd-checklist-runner] INFO: ---
[cmd-checklist-runner] INFO: --- passed tests: 19
[cmd-checklist-runner] INFO: --- failed tests: 0
[cmd-checklist-runner] INFO: --- total tests: 19
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'sudo -i cmd-chec...etworktests.yaml'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
NetworkTestRunner: 19/19 passed tests.
END (PASS) - Cookbook wmcs.openstack.network.tests (exit_code=0)
arturo@endurance:~ $ cookbook wmcs.openstack.network.tests -d eqiad1
START - Cookbook wmcs.openstack.network.tests
----- OUTPUT of 'sudo -i cmd-chec...etworktests.yaml' -----
[cmd-checklist-runner] INFO: running test: basic ping to cloudgw addresses (raw addresses) from outside the cloud network
[cmd-checklist-runner] INFO: running test: basic ping to cloudgw addresses (DNS names) from outside the cloud network
[cmd-checklist-runner] INFO: running test: basic ping to neutron WAN from outside the cloud network
[..]
[cmd-checklist-runner] INFO: running test: VM (no floating IP) can read dumps NFS
[cmd-checklist-runner] INFO: running test: VM (using floating IP) can connect to wikireplicas from Toolforge
[cmd-checklist-runner] INFO: running test: Toolforge webservice can be accessed from the internet
[cmd-checklist-runner] INFO: running test: Toolforge bastions see herald file on project NFS
[cmd-checklist-runner] INFO: ---
[cmd-checklist-runner] INFO: --- passed tests: 22
[cmd-checklist-runner] INFO: --- failed tests: 0
[cmd-checklist-runner] INFO: --- total tests: 22
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'sudo -i cmd-chec...etworktests.yaml'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
NetworkTestRunner: 22/22 passed tests.
END (PASS) - Cookbook wmcs.openstack.network.tests (exit_code=0)

In the future we plan to develop other cookbooks that depend on the test suite results to decide on operations, for example:

  • only perform an operation if the network test suite passes
  • roll back a kernel upgrade if the network test suite doesn't pass
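
The control flow for such a cookbook could look roughly like the sketch below. Nothing here exists yet: guarded_operation and its three callables are hypothetical names, and how tests_pass would actually obtain the test suite result (for example, by running the cookbook above) is left open.

```python
from typing import Callable


def guarded_operation(
    tests_pass: Callable[[], bool],
    operation: Callable[[], None],
    rollback: Callable[[], None],
) -> str:
    """Gate an operation on the network tests, before and after it runs."""
    # Refuse to start if the network is already unhealthy.
    if not tests_pass():
        return "skipped"

    operation()

    # Re-run the tests; undo the operation if it broke the network.
    if tests_pass():
        return "done"
    rollback()
    return "rolled-back"
```

Passing the steps in as callables keeps the gating logic independent of what the operation is (a kernel upgrade, a config change, etc.).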

systemd timer execution

To see the history of how the network tests have been performing, check the logs of the cloud-vps-networktest systemd service:

root@cloudcontrol1004:~# journalctl -u cloud-vps-networktest.service -f
-- Logs begin at Thu 2021-11-11 10:41:53 UTC. --
Nov 11 16:15:26 cloudcontrol1004 cmd-checklist-runner[42756]: [cmd-checklist-runner] INFO: running test: VM (using floating IP) can read dumps NFS
Nov 11 16:15:29 cloudcontrol1004 cmd-checklist-runner[42756]: [cmd-checklist-runner] INFO: running test: VM (no floating IP) can read dumps NFS
Nov 11 16:15:31 cloudcontrol1004 cmd-checklist-runner[42756]: [cmd-checklist-runner] INFO: running test: VM (using floating IP) can connect to wikireplicas from Toolforge
Nov 11 16:15:40 cloudcontrol1004 cmd-checklist-runner[42756]: [cmd-checklist-runner] INFO: running test: Toolforge webservice can be accessed from the internet
Nov 11 16:15:40 cloudcontrol1004 cmd-checklist-runner[42756]: [cmd-checklist-runner] INFO: running test: Toolforge bastions see herald file on project NFS
Nov 11 16:15:44 cloudcontrol1004 cmd-checklist-runner[42756]: [cmd-checklist-runner] INFO: ---
Nov 11 16:15:44 cloudcontrol1004 cmd-checklist-runner[42756]: [cmd-checklist-runner] INFO: --- passed tests: 22
Nov 11 16:15:44 cloudcontrol1004 cmd-checklist-runner[42756]: [cmd-checklist-runner] INFO: --- failed tests: 0
Nov 11 16:15:44 cloudcontrol1004 cmd-checklist-runner[42756]: [cmd-checklist-runner] INFO: --- total tests: 22
Nov 11 16:15:44 cloudcontrol1004 systemd[1]: cloud-vps-networktest.service: Succeeded.
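
To get a per-run overview out of a longer journal, the summary lines can be collected with a small script. The sketch below is only an illustration: runs_from_journal is a hypothetical helper, it assumes the default journalctl line format shown above, and it groups lines by the runner PID (which could be reused by an unrelated run over a long enough period).

```python
import re

# Matches summary lines in the default journalctl format, e.g.:
# "Nov 11 16:15:44 host cmd-checklist-runner[42756]: [...] INFO: --- failed tests: 0"
LINE_RE = re.compile(
    r"^(?P<when>\w{3}\s+\d+\s[\d:]{8})\s\S+\scmd-checklist-runner\[(?P<pid>\d+)\]:"
    r".*--- (?P<kind>passed|failed|total) tests: (?P<count>\d+)$"
)


def runs_from_journal(lines):
    """Return one dict per run with its timestamp and passed/failed/total counts."""
    runs = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a summary line
        # Assumption: all lines sharing a PID belong to the same run.
        run = runs.setdefault(match["pid"], {"when": match["when"]})
        run[match["kind"]] = int(match["count"])
    return list(runs.values())
```

The function takes an iterable of lines, so it can be fed from a saved log file or from the output of journalctl -u cloud-vps-networktest.service --no-pager.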

As of this writing, per the puppet code, this runs on only one of the cloudcontrol nodes, usually the second one (to avoid the already overloaded first one).

TODO: as of this writing, Icinga won't alert if this service fails, because we disabled the check on cloudcontrol boxes (it was too noisy).

See also