Portal:Cloud VPS/Admin/Network/Tests
This page explains the network checklist/testing functions that we have in place to verify normal network operations for Cloud VPS / Toolforge.
The checklist is meant to test the network exactly as a user would use it:
- in different directions (i.e., internet --> cloud and cloud --> internet)
- verify correct operation of routing_source_ip NAT, dmz_cidr, floating IP NAT, etc
- verify interaction with NFS servers and other special services
- verify some other basic network functions like DNS, LDAP, etc.
You may wonder: how is this different from icinga or other monitoring methods? The answer is: it isn't. We could probably migrate all of this to icinga or prometheus with some kungfu, but this was developed quickly to fill a tooling gap, so here we are.
Components
- cmd-checklist-runner.py: a simple python script that reads a yaml file with a bunch of test definitions, runs them and reports the results.
- /etc/networktests/networktests.yaml: yaml file containing test case definitions.
- a systemd timer job that runs the checklist periodically (15 minutes?). Icinga can monitor systemd services and page if they fail. We can activate this if necessary.
- In puppet, we have the openstack::monitor::networktests class, which should be declared for cloudcontrol nodes. This class deploys all the above.
- We have a cookbook to help run the test suite manually.
- tests make use of the srv-networktests user both locally (cloudcontrol) and in virtual machines (LDAP).
Adding new tests
To add new tests:
- include desired envvars in profile::openstack::XYZ::networktests::envvars
- include desired checks in modules/openstack/templates/monitor/networktests.yaml.erb
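Before adding a check, it can help to confirm that every variable its command references is actually defined among the envvars. The following is a minimal sketch of such a lint, not part of cmd-checklist-runner: the function name undefined_envvars and the plain-dict input shapes are assumptions made for illustration.

```python
import re

# Hypothetical helper for illustration; it is NOT part of cmd-checklist-runner.
# Matches $NAME and ${NAME}. Positional parameters such as $1 are ignored
# because a name must start with a letter or an underscore.
VAR_PATTERN = re.compile(r"\$\{?([A-Za-z_][A-Za-z0-9_]*)\}?")


def undefined_envvars(envvars, checklist):
    """Return the sorted variable names used in any cmd but missing from envvars.

    Assumed shapes: envvars is a dict of NAME -> value, and checklist is a
    list of dicts with a "tests" list whose items carry a "cmd" string.
    """
    missing = set()
    for entry in checklist:
        for test in entry.get("tests", []):
            for name in VAR_PATTERN.findall(test["cmd"]):
                if name not in envvars:
                    missing.add(name)
    return sorted(missing)
```

Variables provided by the shell itself (for example $HOME) would also be reported by this sketch, so its output is a hint to review rather than a hard failure.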
Checklist
The checklist is a list of shell commands to run. The runner can optionally verify stdout/stderr/retcode to decide if the test passed or not.
Example:
---
- envvars:
    - SSH: /usr/bin/ssh -i /etc/networktests/sshkeyfile [..] -o Proxycommand="ssh -o StrictHostKeyChecking=no -i /etc/networktests/sshkeyfile -W %h:%p srv-networktests@eqiad1.bastion.wmcloud.org"
      CLOUDGW_A_IP: 185.15.56.245
      CLOUDGW_B_IP: 185.15.56.246
      TOOLFORGE_BASTION_LOGIN: login.toolforge.org
      TOOLFORGE_BASTION_DEV: dev.toolforge.org
---
- name: basic ping to cloudgw addresses (raw addresses) from outside the cloud network
  tests:
    - cmd: timeout -k5s 10s ping -c1 $CLOUDGW_A_IP >/dev/null
      stdout: ""
      retcode: 0
      stderr: ""
    - cmd: timeout -k5s 10s ping -c1 $CLOUDGW_B_IP >/dev/null
      stdout: ""
      retcode: 0
      stderr: ""
- name: VM (using floating IP) can connect to wikireplicas from Toolforge
  tests:
    - cmd: $SSH $TOOLFORGE_BASTION_LOGIN 'sudo -iu tools.arturo-test-tool sql enwiki "select * from page limit 2;" | grep page_id | wc -l'
      stdout: "1"
      retcode: 0
      stderr: ""
    - cmd: $SSH $TOOLFORGE_BASTION_DEV 'sudo -iu tools.arturo-test-tool sql enwiki "select * from page limit 2;" | grep page_id | wc -l'
      stdout: "1"
      retcode: 0
      stderr: ""
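To make the pass/fail rule concrete, here is a minimal sketch of how a single entry like the ones above could be evaluated. It is not the actual cmd-checklist-runner code: the function name run_check, passing envvars as a plain dict, and stripping surrounding whitespace before comparing are all assumptions for illustration.

```python
import os
import subprocess


def run_check(check, envvars):
    """Run one check and return True if every expectation it declares matches.

    Assumptions (not taken from the real runner): envvars is a plain dict
    exported into the command's environment so the shell expands $NAME,
    only the keys present in the check are compared, and surrounding
    whitespace is stripped from the captured output before comparing.
    """
    env = dict(os.environ)
    env.update({name: str(value) for name, value in envvars.items()})
    result = subprocess.run(
        check["cmd"],
        shell=True,          # the checklist entries are shell command lines
        env=env,
        capture_output=True,
        text=True,
    )
    if "retcode" in check and result.returncode != check["retcode"]:
        return False
    if "stdout" in check and result.stdout.strip() != check["stdout"]:
        return False
    if "stderr" in check and result.stderr.strip() != check["stderr"]:
        return False
    return True
```

Under these assumptions, a check that omits stdout, stderr or retcode simply skips that comparison, which matches the "optionally verify" wording above.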
Example execution:
root@cloudcontrol1004:~# cmd-checklist-runner --config /etc/networktests/networktests.yaml
[cmd-checklist-runner] INFO: running test: basic ping to cloudgw addresses (raw addresses) from outside the cloud network
[cmd-checklist-runner] INFO: running test: basic ping to cloudgw addresses (DNS names) from outside the cloud network
[cmd-checklist-runner] INFO: running test: basic ping to neutron WAN from outside the cloud network
[cmd-checklist-runner] INFO: running test: basic ping to neutron VIRT gateway from within the cloud virtual network, no floating IP
[cmd-checklist-runner] INFO: running test: basic ping to neutron VIRT gateway from within the cloud virtual network, with floating IP
[cmd-checklist-runner] INFO: running test: VM (no floating IP) contacting the internet gets NAT'd using routing_source_ip
[cmd-checklist-runner] INFO: running test: VM (no floating IP) contacting an address covered by dmz_cidr doesn't get NAT'd
[cmd-checklist-runner] INFO: running test: VM (using floating IP) isn't affected by either routing_source_ip or dmz_cidr
[cmd-checklist-runner] INFO: running test: VM (no floating IP) can contact auth DNS server
[cmd-checklist-runner] INFO: running test: VM (no floating IP) can contact recursor DNS server
[cmd-checklist-runner] INFO: running test: VM (using floating IP) can contact auth DNS server
[cmd-checklist-runner] INFO: running test: VM (using floating IP) can contact recursor DNS server
[cmd-checklist-runner] INFO: running test: VM (using floating IP) can contact LDAP server
[cmd-checklist-runner] INFO: running test: VM (not using floating IP) can contact LDAP server
[cmd-checklist-runner] INFO: running test: VM (using floating IP) can contact openstack API
[cmd-checklist-runner] INFO: running test: VM (no floating IP) can contact openstack API
[cmd-checklist-runner] INFO: running test: puppetmasters can sync git tree
[cmd-checklist-runner] INFO: running test: VM (using floating IP) can read dumps NFS
[cmd-checklist-runner] INFO: running test: VM (no floating IP) can read dumps NFS
[cmd-checklist-runner] INFO: running test: VM (using floating IP) can connect to wikireplicas from Toolforge
[cmd-checklist-runner] INFO: running test: Toolforge webservice can be accessed from the internet
[cmd-checklist-runner] INFO: running test: Toolforge bastions see herald file on project NFS
[cmd-checklist-runner] INFO: ---
[cmd-checklist-runner] INFO: --- passed tests: 22
[cmd-checklist-runner] INFO: --- failed tests: 0
[cmd-checklist-runner] INFO: --- total tests: 22
Cookbook
There is a handy cookbook to help leverage this testing suite for other purposes:
arturo@endurance:~ $ cookbook wmcs.openstack.network.tests --cluster-name codfw1dev
START - Cookbook wmcs.openstack.network.tests
----- OUTPUT of 'sudo -i cmd-chec...etworktests.yaml' -----
[cmd-checklist-runner] INFO: running test: basic ping to cloudgw addresses (raw addresses) from outside the cloud network
[cmd-checklist-runner] INFO: running test: basic ping to cloudgw addresses (DNS names) from outside the cloud network
[cmd-checklist-runner] INFO: running test: basic ping to neutron WAN from outside the cloud network
[cmd-checklist-runner] INFO: running test: basic ping to neutron VIRT gateway from within the cloud virtual network, no floating IP
[..]
[cmd-checklist-runner] INFO: running test: puppetmasters can sync git tree
[cmd-checklist-runner] INFO: running test: VM (using floating IP) can read dumps NFS
[cmd-checklist-runner] INFO: running test: VM (no floating IP) can read dumps NFS
[cmd-checklist-runner] INFO: ---
[cmd-checklist-runner] INFO: --- passed tests: 19
[cmd-checklist-runner] INFO: --- failed tests: 0
[cmd-checklist-runner] INFO: --- total tests: 19
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'sudo -i cmd-chec...etworktests.yaml'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
NetworkTestRunner: 19/19 passed tests.
END (PASS) - Cookbook wmcs.openstack.network.tests (exit_code=0)
arturo@endurance:~ $ cookbook wmcs.openstack.network.tests -d eqiad1
START - Cookbook wmcs.openstack.network.tests
----- OUTPUT of 'sudo -i cmd-chec...etworktests.yaml' -----
[cmd-checklist-runner] INFO: running test: basic ping to cloudgw addresses (raw addresses) from outside the cloud network
[cmd-checklist-runner] INFO: running test: basic ping to cloudgw addresses (DNS names) from outside the cloud network
[cmd-checklist-runner] INFO: running test: basic ping to neutron WAN from outside the cloud network
[..]
[cmd-checklist-runner] INFO: running test: VM (no floating IP) can read dumps NFS
[cmd-checklist-runner] INFO: running test: VM (using floating IP) can connect to wikireplicas from Toolforge
[cmd-checklist-runner] INFO: running test: Toolforge webservice can be accessed from the internet
[cmd-checklist-runner] INFO: running test: Toolforge bastions see herald file on project NFS
[cmd-checklist-runner] INFO: ---
[cmd-checklist-runner] INFO: --- passed tests: 22
[cmd-checklist-runner] INFO: --- failed tests: 0
[cmd-checklist-runner] INFO: --- total tests: 22
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'sudo -i cmd-chec...etworktests.yaml'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
NetworkTestRunner: 22/22 passed tests.
END (PASS) - Cookbook wmcs.openstack.network.tests (exit_code=0)
In the future we plan to develop other cookbooks that depend on the testsuite results to decide on operations, for example:
- only perform an operation if the network testsuite passes
- rollback a kernel upgrade if the network testsuite doesn't pass
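As a rough illustration of how such a gate could work, the sketch below parses the summary lines printed by the runner (as shown in the outputs above) and decides whether the suite passed. The function names parse_summary and suite_passed are hypothetical and do not correspond to any existing cookbook API.

```python
import re

# Matches the summary lines shown in the runner output, for example
# "[cmd-checklist-runner] INFO: --- passed tests: 22".
SUMMARY_RE = re.compile(r"--- (passed|failed|total) tests: (\d+)")


def parse_summary(output):
    """Return a dict such as {"passed": 22, "failed": 0, "total": 22}."""
    return {kind: int(number) for kind, number in SUMMARY_RE.findall(output)}


def suite_passed(output):
    """Return True only if all three counters are present and nothing failed."""
    counts = parse_summary(output)
    if not all(kind in counts for kind in ("passed", "failed", "total")):
        # Missing summary: treat as a failure rather than guessing.
        return False
    return counts["failed"] == 0 and counts["passed"] == counts["total"]
```

Treating a missing summary as a failure is a deliberately conservative choice: an operation gated on the test suite should not proceed when the runner output is truncated or absent.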
systemd timer execution
To see the history of how the network tests have been performing, check the logs of the systemd cloud-vps-networktest service:
root@cloudcontrol1004:~# journalctl -u cloud-vps-networktest.service -f
-- Logs begin at Thu 2021-11-11 10:41:53 UTC. --
Nov 11 16:15:26 cloudcontrol1004 cmd-checklist-runner[42756]: [cmd-checklist-runner] INFO: running test: VM (using floating IP) can read dumps NFS
Nov 11 16:15:29 cloudcontrol1004 cmd-checklist-runner[42756]: [cmd-checklist-runner] INFO: running test: VM (no floating IP) can read dumps NFS
Nov 11 16:15:31 cloudcontrol1004 cmd-checklist-runner[42756]: [cmd-checklist-runner] INFO: running test: VM (using floating IP) can connect to wikireplicas from Toolforge
Nov 11 16:15:40 cloudcontrol1004 cmd-checklist-runner[42756]: [cmd-checklist-runner] INFO: running test: Toolforge webservice can be accessed from the internet
Nov 11 16:15:40 cloudcontrol1004 cmd-checklist-runner[42756]: [cmd-checklist-runner] INFO: running test: Toolforge bastions see herald file on project NFS
Nov 11 16:15:44 cloudcontrol1004 cmd-checklist-runner[42756]: [cmd-checklist-runner] INFO: ---
Nov 11 16:15:44 cloudcontrol1004 cmd-checklist-runner[42756]: [cmd-checklist-runner] INFO: --- passed tests: 22
Nov 11 16:15:44 cloudcontrol1004 cmd-checklist-runner[42756]: [cmd-checklist-runner] INFO: --- failed tests: 0
Nov 11 16:15:44 cloudcontrol1004 cmd-checklist-runner[42756]: [cmd-checklist-runner] INFO: --- total tests: 22
Nov 11 16:15:44 cloudcontrol1004 systemd[1]: cloud-vps-networktest.service: Succeeded.
As of this writing, per the puppet code, this runs on only one of the cloudcontrol nodes, usually the second one (to avoid the already overloaded first one).
TODO: as of this writing, icinga won't alert if this service fails, because we disabled the check on cloudcontrol boxes (it was too noisy).
See also
- This was developed in T294955 - cloud network: improve automated testing & monitoring