Conftool

See also: Conftool/Load balanced services and dbctl

Conftool is a set of tools we use to sync and manage the dynamic state configuration for services (as of February 2018: the Varnish backends, the PyBal pools, the DNS discovery entries, and some variables in the MediaWiki configuration). This configuration is stored in Etcd, a distributed key/value store.

Overview

Conftool takes its input from a series of configuration files in the conftool-data/ directory of the Puppet repository. These files represent a static view of the configuration: information about the services we manage, which services are installed on which hosts, and so on, according to a schema. The other part of the equation is the dynamic state of that configuration (such as the weight of a server in its pool, and whether the server is pooled or not), which is left untouched by the sync, apart from setting default values in newly added objects.
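The split between static and dynamic data can be sketched as follows. This is a minimal illustration, not conftool's actual implementation: the function name, the default values, and the assumption that objects missing from the static files are dropped are all hypothetical.

```python
# Hypothetical defaults for newly added objects (assumption: new nodes
# start out inactive; the default weight of 0 is made up for this sketch).
DEFAULTS = {"pooled": "inactive", "weight": 0}


def sync(static_objects, stored_state, defaults=DEFAULTS):
    """Return the state after a sync.

    static_objects: object names found in the static data files.
    stored_state:   dynamic state currently in the key/value store.
    Existing objects keep their dynamic state; new ones get defaults;
    objects no longer in the static files are dropped (assumption).
    """
    return {
        name: stored_state.get(name, dict(defaults))
        for name in static_objects
    }


stored = {
    "old.example.net": {"pooled": "yes", "weight": 10},
    "gone.example.net": {"pooled": "no", "weight": 5},
}
print(sync(["old.example.net", "new.example.net"], stored))
```

Note that the weight and pooled state of old.example.net survive the sync unchanged, which is the property the paragraph above describes.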

Schema

Conftool has two builtin object types, <service> and <node>. It is possible to add any additional object one might want by defining an appropriate schema. For details on how to write a schema, please refer to the conftool README. We'll briefly describe here the production schema, found at /etc/conftool/schema.yaml.

It contains the definitions of the following object types:

discovery: objects containing data used for the DNS/Discovery system. Refer to the documentation there for details.
mwconfig: objects containing data used by MediaWiki and EtcdConfig. This is a bit different from the other object types, as it allows defining validation rules for individual keys. Refer to the documentation of that system for details.

The base data files

Relative to the conftool root (/etc/conftool/data in production), configuration files are organized as follows:

For each object type (the ones defined in the schema.yaml file), we have a directory containing YAML files that include object definitions as a hierarchical list of tags and the object name they refer to.

The node objects have a slightly peculiar directory structure, as the data structure is adapted for ease of editing. It is as follows:

datacenter:
  cluster_name:
    node_name.eqiad.wmnet:
      - service_name
      - another_service
    another_node.eqiad.wmnet:
      - service_name
...
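As a rough illustration of how this layout maps to individual objects, the sketch below flattens an already-parsed nodes file into one entry per node/service pair. It assumes each pair becomes an object tagged with dc, cluster and service, which matches the tags visible in confctl output elsewhere on this page; the sample data and the function name are hypothetical.

```python
# Hypothetical nodes file content, as it would look after YAML parsing.
NODES = {
    "eqiad": {
        "cache_text": {
            "cp1077.eqiad.wmnet": ["varnish-fe", "nginx"],
        },
    },
}


def iter_node_objects(nodes):
    """Yield one (dc, cluster, service, name) tuple per node/service pair."""
    for dc, clusters in nodes.items():
        for cluster, hosts in clusters.items():
            for name, services in hosts.items():
                for service in services:
                    yield dc, cluster, service, name


for dc, cluster, service, name in iter_node_objects(NODES):
    print(f"dc={dc},cluster={cluster},service={service},name={name}")
```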

The tools

Currently, we have two tools, both installed on the puppetmaster (and on any node declaring profile::conftool::client):

conftool-sync, which is used to sync what we write in the files described above to the distributed key/value store (as of February 2018 this is Etcd, but this may well change in the future). In most cases you will not call conftool-sync directly; you will just call conftool-merge. When you merge a change to the puppet repository, conftool-merge is invoked directly by our puppet-merge utility on the puppetmaster.
confctl is the tool to interact with the key/value store and set dynamic values; for the full details of how to use it, please see the README. A typical invocation could be:

confctl select dc=eqiad,cluster=cache_text,service=varnish-fe,name=cp1077.eqiad.wmnet get

{"cp1077.eqiad.wmnet": {"weight": 1, "pooled": "yes"}, "tags": "dc=eqiad,cluster=cache_text,service=varnish-fe"}

where the argument to select is a comma-separated list of tag=value pairs that identifies the objects you want to query. The example above selects the varnish-fe service of the cache_text cluster in the eqiad datacenter, on a single host.
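If you need to consume this output from a script, a minimal sketch could look like the following. It assumes every output line has the shape shown above (a single object name mapped to its state, plus a "tags" string); the function name is hypothetical.

```python
import json


def parse_get_line(line):
    """Split one line of 'confctl ... get' output into (name, state, tags)."""
    record = json.loads(line)
    # The "tags" value is a comma-separated list of key=value pairs.
    tags = dict(pair.split("=", 1) for pair in record.pop("tags").split(","))
    # After removing "tags", exactly one key is left: the object name.
    (name, state), = record.items()
    return name, state, tags


line = (
    '{"cp1077.eqiad.wmnet": {"weight": 1, "pooled": "yes"}, '
    '"tags": "dc=eqiad,cluster=cache_text,service=varnish-fe"}'
)
name, state, tags = parse_get_line(line)
print(name, state["pooled"], tags["service"])
```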

The required tags differ from one object type to another, but conftool will complain if you don't specify them correctly. You can work on any object type; you just need to specify the --object-type parameter. For example:

confctl --object-type discovery select 'dnsdisc=swift.*' get

or

confctl --object-type mwconfig select name=WMFMasterDatacenter get

will work as well.
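When building such invocations from a script, it helps to assemble the argument list rather than a shell string, so that regular expressions in selectors need no manual quoting. The helper below is a hypothetical sketch that only uses the options shown above (--object-type, select, get).

```python
import shlex


def confctl_get_argv(selector, object_type=None):
    """Build the argument list for a 'confctl select ... get' call.

    selector: mapping of tag name to value (or regular expression).
    object_type: optional value for --object-type.
    """
    argv = ["confctl"]
    if object_type is not None:
        argv += ["--object-type", object_type]
    selector_str = ",".join(f"{key}={value}" for key, value in selector.items())
    return argv + ["select", selector_str, "get"]


argv = confctl_get_argv({"dnsdisc": "swift.*"}, object_type="discovery")
# shlex.join renders the list as a copy-pasteable shell command.
print(shlex.join(argv))
```

The resulting list can be passed as-is to subprocess.run on a host where confctl is installed.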

In puppet

Conftool is installed by including the conftool class into your node manifest. It won't install the conftool-data directory, though, which is part of the puppet git repository. So it's pretty natural for the puppetmaster (puppetmaster1001) to be the standard machine where you should run conftool.

Operating

Add a server node to a service

confctl select 'service=(varnish-fe|nginx),name=<fqdn>' set/pooled=yes

If you need to add a server node to a pool, find the corresponding cluster in conftool-data/nodes/ and check whether the node stanza is present. If it is, just add the service to the list of services; if not, add the node's FQDN as a key under the cluster, with a list containing the service as its value.

After you have done that, you will need to merge the change in puppet and follow the steps outlined before for adding a service. Typically, though, new nodes will NOT be pooled, so if you want to pool your service you will need to modify the state of the node as follows:

# Pick a weight
confctl select name=<fqdn> set/weight=10
# Pool the host. Note that inactive hosts (e.g. when newly added) won't show up on config-master.w.o
confctl select name=<fqdn> set/pooled=yes

Modify the state of a server in a pool

Let's say we want to depool the server mw1018.eqiad.wmnet. We proceed as follows:

The server is in the eqiad datacenter, is part of the appserver cluster in puppet, and the service we want to change is apache2. We need all this information as we'll see next.
Run, from any host where conftool is installed:

confctl depool --hostname mw1018.eqiad.wmnet --service apache2

Verify that it worked with

confctl select name=mw1018.eqiad.wmnet get

The syntax for the set action is: set/key1=value1:key2=value2. A small note on the pooled value meaning:

yes means the server is pooled
no means the server is not pooled but (only in pybal) present in the config. For MediaWiki this also means that the server is receiving code updates via scap
inactive means the server is not in the config we write at all
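The set syntax and the allowed pooled values for node objects can be captured in a small helper. This is a hypothetical sketch for scripts that generate confctl commands; it is not part of conftool, and the validation only covers the three node values listed above (other object types, such as discovery, use different values).

```python
# Allowed values of "pooled" for node objects, as listed above.
POOLED_VALUES = {"yes", "no", "inactive"}


def set_action(**changes):
    """Build a 'set/key1=value1:key2=value2' action string."""
    if not changes:
        raise ValueError("at least one key=value pair is required")
    pooled = changes.get("pooled")
    if pooled is not None and pooled not in POOLED_VALUES:
        raise ValueError(f"invalid pooled value: {pooled!r}")
    return "set/" + ":".join(f"{key}={value}" for key, value in changes.items())


print(set_action(weight=10, pooled="yes"))
```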

Pooling/depooling a server from all the related services

When a server is in maintenance mode or needs to be depooled/repooled in all of its services, there are some useful shortcuts you can use. Specifically:

confctl pool, which pools all services configured on the current host (pooled=yes)
confctl depool, which depools all services on the current host (pooled=no)
confctl decommission, which decommissions all services on the current host (pooled=inactive)
confctl drain, which drains traffic from all services on the current host (weight=0)

You can modify the behaviour of those actions in the following ways:

if you specify a hostname with --hostname FQDN, actions will be performed on that host instead of the current one
if you specify a service with --service SERVICE, actions will be performed on the services that match the regular expression SERVICE instead of on every service.

Be careful on cache servers: without --service, this will depool the server not only from the load balancers, but also as a Varnish backend! If you just want to depool the server from PyBal, the best solution is to run:

confctl depool --service '(varnish-fe|nginx)'
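The effect of the shortcuts and of the --service filter can be modelled as below. This is a hedged sketch: the mapping mirrors the list above, but the assumption that the regular expression must match the whole service name is ours, and the service name varnish-be is made up for the example.

```python
import re

# State change applied by each shortcut, as described above.
SHORTCUTS = {
    "pool": {"pooled": "yes"},
    "depool": {"pooled": "no"},
    "decommission": {"pooled": "inactive"},
    "drain": {"weight": 0},
}


def affected_services(services, pattern=None):
    """Return the services a shortcut would touch.

    Without a pattern every service is affected; with one, only the
    services whose full name matches it (assumption: full match).
    """
    if pattern is None:
        return list(services)
    return [name for name in services if re.fullmatch(pattern, name)]


host_services = ["varnish-fe", "nginx", "varnish-be"]  # hypothetical host
print(affected_services(host_services, "(varnish-fe|nginx)"), SHORTCUTS["depool"])
```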

Depool all nodes in a specific datacenter

confctl --object-type discovery select 'dnsdisc=wdqs,name=codfw' set/pooled=false

Decommission a server

Decommissioning a server is as simple as:

Depool it from all services with confctl decommission
Remove its stanza from conftool-data, then sync the data exactly in the way you did for adding a node.

Show pool status

Per-pool status is available at all times at http://config-master.wikimedia.org/pybal/DATACENTER/POOL or available via confctl like so:

 # confctl --tags dc=DATACENTER,cluster=CLUSTER,service=POOL --action get all | jq .
 {
   "restbase1011.eqiad.wmnet": {
     "weight": 10,
     "pooled": "yes"
   },
 ...

Server changing IP address

At the moment, PyBal does not redo DNS resolution. When a server changes IP address, for example when it is moved to a different row, it is necessary to make PyBal completely forget about this server. This can be done by setting the server to pooled=inactive, waiting for PyBal to pick up the change, and then pooling it again:

 # confctl decommission --hostname foo.example.net
 # sleep 60
 # confctl pool --hostname foo.example.net
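The same sequence can be wrapped in a small function for use in automation. This is a sketch only: the function name is hypothetical, the 60-second wait is taken from the commands above, and the command runner and sleep function are injectable so the logic can be exercised without a real confctl.

```python
import subprocess
import time


def forget_and_repool(hostname, run=subprocess.check_call,
                      sleep=time.sleep, wait_seconds=60):
    """Set a host inactive, wait, then pool it again."""
    # Step 1: pooled=inactive removes the host from the generated config.
    run(["confctl", "decommission", "--hostname", hostname])
    # Step 2: give PyBal time to notice that the host is gone.
    sleep(wait_seconds)
    # Step 3: pool the host again so that it is re-added from scratch.
    run(["confctl", "pool", "--hostname", hostname])


# On a real host (with credentials): forget_and_repool("foo.example.net")
```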

Troubleshooting

Insufficient credentials

confctl needs to be run with proper etcd credentials, which are read from $HOME/.etcdrc; if the user running confctl doesn't have adequate permissions, you will receive the following error:

WARNING:etcd.client:etcd response did not contain a cluster ID
ERROR:conftool:Error when trying to set/pooled=no on name=mw1243.eqiad.wmnet
ERROR:conftool:Failure writing to the kvstore: Backend error: The request requires user authentication: Insufficient credentials

Ensure you have the correct credentials to access the objects you're trying to reach.

In practice, in production these credentials only exist in /root/.etcdrc, so any mutating operation must be run with sudo.
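A wrapper script can check for the credentials file up front and fail with a clearer message. The sketch below only tests that a readable .etcdrc exists in the given home directory; it does not verify that the credentials in it are valid, and the function name is hypothetical.

```python
import os
from pathlib import Path


def etcdrc_readable(home=None):
    """Return True if HOME/.etcdrc exists and is readable by this user."""
    base = Path(home) if home is not None else Path.home()
    path = base / ".etcdrc"
    return path.is_file() and os.access(path, os.R_OK)


if not etcdrc_readable():
    print("No readable ~/.etcdrc: mutating confctl calls will likely need sudo")
```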

Maintenance

Releasing conftool

To release conftool, start by reviewing the content of the release and choosing a version number (i.e., following semver). Conftool does not have a separate Debianization branch, so you can update setup.py and debian/changelog in a single patch, the latter with, e.g.

dch -v${CONFTOOL_VERSION}-1 -Dbullseye-wikimedia --force-distribution

Once your patch is merged in GitLab, the packages will be built automatically by the GitLab build pipeline.

Conftool is installed widely, so builds are likely needed for all Debian versions supported in production (policy). To confirm, check DebMonitor for the three conftool binary packages (python3-conftool, python3-conftool-dbctl, python3-conftool-requestctl).

Next, rebuild for the other Debian versions. For each version, this involves committing a new changelog entry locally, creating a merge request, and approving it. For example, to rebuild for buster and bookworm:

# Build for buster
git switch -c buster-rebuild
dch -v${CONFTOOL_VERSION}-1+deb10u1 -Dbuster-wikimedia --force-distribution
git commit -am 'Rebuild for buster'
git push
# Go to gitlab, merge the change. This will trigger the package building pipeline
# Build for bookworm
git switch -c bookworm-rebuild
dch -v${CONFTOOL_VERSION}-1+deb12u1 -Dbookworm-wikimedia --force-distribution
git commit -am 'Rebuild for bookworm'
git push
# Again, merge the change in gitlab
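The version strings above follow a regular pattern, which the sketch below spells out. It assumes, as in the examples, that bullseye is the base build with a plain -1 revision and that rebuilds append +debNu1; only the two release numbers shown above are included, and the function names and the version 1.2.3 are hypothetical.

```python
# Debian release numbers used in the rebuild suffix (from the examples above).
DEBIAN_RELEASE = {"buster": 10, "bookworm": 12}


def debian_version(upstream, distro, base_distro="bullseye"):
    """Return the package version for a given distribution."""
    if distro == base_distro:
        return f"{upstream}-1"
    return f"{upstream}-1+deb{DEBIAN_RELEASE[distro]}u1"


def dch_command(upstream, distro):
    """Return the dch invocation for a distribution, as an argument list."""
    return [
        "dch",
        f"-v{debian_version(upstream, distro)}",
        f"-D{distro}-wikimedia",
        "--force-distribution",
    ]


for distro in ("bullseye", "buster", "bookworm"):
    print(" ".join(dch_command("1.2.3", distro)))
```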

This might change soon, as we're in the process of automating Debian package uploads.

Once all packages are built, they can be downloaded from the GitLab pipeline outputs and uploaded to apt1002.wikimedia.org.

To download the built files, go to Pipelines in the GitLab UI and find the pipeline for the merge request. It should show 4 stages. On the far right, there is a dropdown menu to download artifacts: select build_ci_deb:archive and it will download a zip file containing the debs and the related changes files.
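Before uploading, it can be useful to check what the downloaded archive actually contains. The sketch below lists the .deb and .changes files in a zip archive; the internal layout of the real artifact is not documented here, so the file names in the demo archive are made up.

```python
import io
import zipfile


def list_package_files(archive_file):
    """Return the .deb and .changes entries of a zip archive.

    archive_file: a path or a binary file-like object.
    """
    with zipfile.ZipFile(archive_file) as archive:
        names = archive.namelist()
    return {
        "debs": sorted(n for n in names if n.endswith(".deb")),
        "changes": sorted(n for n in names if n.endswith(".changes")),
    }


# Build a tiny in-memory archive with hypothetical file names for the demo.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("python3-conftool_1.2.3-1_all.deb", b"")
    archive.writestr("conftool_1.2.3-1_amd64.changes", b"")
    archive.writestr("build.log", b"")
demo.seek(0)
print(list_package_files(demo))
```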

When you're ready to deploy conftool (no sooner), import the packages into the main component of the APT repository for each of the relevant distributions (e.g., bullseye-wikimedia) as described in Importing packages.

Deploying conftool

Conftool is released to production hosts using debdeploy as described in Software deployment. Use update type "tool" when generating the deployment spec, and remember to specify the source package name (conftool) rather than any of the binary packages.

Exactly how to sequence updates across the fleet depends on the content of the release. Consider:

Which CLI tools are affected and what are the affected use cases?
Are there additional manual actions that must be taken (e.g., due to an entity schema change)? See NOTES-FOR-NEXT-RELEASE.txt in the conftool repository.

At a minimum, it's recommended to pick a single host for each affected CLI tool / use case, update that host, and verify the expected behavior before moving on to bulk updates.