Conftool

Conftool is a set of tools we use to sync and manage the dynamic state configuration for services (as of February 2018: the varnish backends, the pybal pools, the DNS discovery entries, and some variables in the MediaWiki configuration). This configuration is stored in the distributed key/value store Etcd.

Overview

Conftool takes its input from a series of configuration files in the conftool-data/ directory of the Puppet repository. These files represent a static view of the configuration: which services we manage, which services are installed on which hosts, and so on, according to a schema. The other part of the equation is the dynamic state of that configuration (for example, the weight of a server in its pool, and whether the server is pooled or not). The sync leaves this dynamic state untouched, apart from setting default values on newly added objects.

Schema

Conftool has two builtin object types, <service> and <node>. It is possible to add any additional object one might want by defining an appropriate schema. For details on how to write a schema, please refer to the conftool README. We'll briefly describe here the production schema, found at /etc/conftool/schema.yaml.

It contains the definitions of the following object types:

  • discovery: objects containing data used for the DNS/Discovery system. Refer to the documentation there for details.
  • mwconfig: objects containing data used by MediaWiki_config_on_Etcd. This type differs a bit from the other object types in that it allows defining validation rules for individual keys. Refer to the documentation of that system for details.

The base data files

Relative to the conftool root (/etc/conftool/data in production), configuration files are organized as follows:

  • For each object type (node, service, and the other ones defined in the schema.yaml file), we have a directory containing yaml files that include object definitions as a hierarchical list of tags and the object name they're referring to.
  • The nodes directory is a bit peculiar, as the data structure is adapted for ease of editing, and is as follows:
datacenter:
  cluster_name:
    node_name.eqiad.wmnet:
      - service_name
      - another_service
    another_node.eqiad.wmnet:
      - service_name
...
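
To make the nesting above concrete, here is a minimal Python sketch that flattens such a structure into one entry per node/service pair. It assumes the YAML has already been parsed into nested dictionaries, and that the levels map to the dc, cluster, service and name tags used by the confctl selectors later on this page. The function name flatten_nodes and the host cp1053 are hypothetical, and this is not how conftool itself is implemented.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical parsed form of a nodes data file (datacenter -> cluster -> node -> services).
NODES: Dict[str, Dict[str, Dict[str, List[str]]]] = {
    "eqiad": {
        "cache_text": {
            "cp1052.eqiad.wmnet": ["varnish-be", "varnish-fe"],
            "cp1053.eqiad.wmnet": ["varnish-be"],  # made-up host, for illustration only
        }
    }
}


def flatten_nodes(
    tree: Dict[str, Dict[str, Dict[str, List[str]]]]
) -> Iterator[Tuple[str, str, str, str]]:
    """Yield one (dc, cluster, service, name) tuple per node/service pair."""
    for dc, clusters in tree.items():
        for cluster, nodes in clusters.items():
            for name, services in nodes.items():
                for service in services:
                    yield (dc, cluster, service, name)


for entry in flatten_nodes(NODES):
    print(entry)
```

Each tuple corresponds to one object whose dynamic state (pooled, weight) can be changed independently.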

The tools

Currently, we have two tools, both installed on the puppetmaster (and on any node declaring profile::conftool::client):

  • conftool-sync, which is used to sync what we write in the files described above to the distributed key/value store (as of February 2018 this is Etcd, but this may well change in the future). In most cases you will not call conftool-sync directly; you will just call conftool-merge. When you merge a change to the puppet repository, conftool-merge is invoked directly by our puppet-merge utility on the puppetmaster. Beware: you need to either have the etcd credentials for conftool added to your own .etcdrc, or use sudo -i when running puppet-merge; otherwise your changes to conftool-data will not be written.
  • confctl is the tool to interact with the key/value store and set dynamic values; for the full details of how to use it, please see the README. A typical invocation could be:
confctl select dc=eqiad,cluster=cache_text,service=varnish-be,name=cp1052.eqiad.wmnet get

{"cp1052": {"pooled": "no", "weight": 0}}

Here the selector is a comma-separated list of tag=value pairs that identifies the objects you want to query. For the varnish backend service of the cache_text cluster in the eqiad datacenter, it looks as shown above.

The required tags vary by object type, but conftool will complain if you don't specify them correctly. You can work on any object type; you just need to specify the --object-type parameter. For example:

confctl --object-type service select cluster=cache_text,name=varnish-be get

will work as well.
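
The way a selector picks objects can be illustrated with a short Python sketch. This is only a model of the idea, not conftool's code: it assumes each value is treated as a regular expression that must match the whole tag value, and it does not handle values that themselves contain commas. The function names parse_selector and matches are hypothetical.

```python
import re
from typing import Dict


def parse_selector(selector: str) -> Dict[str, str]:
    """Split 'tag=value,tag=value' into a dict of tag -> pattern."""
    pairs = (item.split("=", 1) for item in selector.split(","))
    return {key.strip(): value.strip() for key, value in pairs}


def matches(selector: Dict[str, str], tags: Dict[str, str]) -> bool:
    """Return True if every selector pattern fully matches the object's tag."""
    return all(
        key in tags and re.fullmatch(pattern, tags[key]) is not None
        for key, pattern in selector.items()
    )


selector = parse_selector(
    "dc=eqiad,cluster=cache_text,service=varnish-be,name=cp1052.eqiad.wmnet"
)
obj = {
    "dc": "eqiad",
    "cluster": "cache_text",
    "service": "varnish-be",
    "name": "cp1052.eqiad.wmnet",
}
print(matches(selector, obj))
```

Under these assumptions, a selector that omits a tag simply does not constrain it, which is why a partial selector can return several objects.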

In puppet

Conftool is installed by including the conftool class in your node manifest. This does not install the conftool-data directory, though, which is part of the puppet git repository. The puppetmaster (puppetmaster1001) is therefore the standard machine on which to run conftool.

Operating

Add a service

If you need to add a service to a cluster, just edit the relevant yaml file under conftool-data/services, adding a service entry, and then run conftool-sync.

So for now you typically:

  • Create a puppet change adding the service stanza
  • On puppetmaster1001, you run puppet-merge
  • Again on puppetmaster1001, you run conftool-merge without arguments (this is a wrapper script that "does the right thing")

Add a server node to a service

confctl select 'service=(varnish-fe|nginx),name=<fqdn>' set/pooled=yes

If you need to add a server node to a pool, find the corresponding cluster in conftool-data/nodes/, see if the node stanza is present. If it is, then just add the service to the list of services; if not, add the node with its fqdn, as a key to the cluster, and add a list containing the service as a value.
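
The edit described above can be sketched in Python as an operation on the parsed nodes data. This is illustrative only: in practice you edit the YAML file by hand and merge it through puppet. The function name add_service_to_node and the host cp1099 are hypothetical.

```python
from typing import Dict, List

NodeTree = Dict[str, Dict[str, Dict[str, List[str]]]]


def add_service_to_node(
    tree: NodeTree, dc: str, cluster: str, fqdn: str, service: str
) -> NodeTree:
    """Add a service to a node stanza, creating the stanza if it is missing."""
    # The cluster must already exist; a KeyError here means the wrong file or cluster.
    nodes = tree[dc][cluster]
    # If the node stanza is absent, start it with an empty service list.
    services = nodes.setdefault(fqdn, [])
    # Avoid listing the same service twice.
    if service not in services:
        services.append(service)
    return tree


tree: NodeTree = {"eqiad": {"cache_text": {"cp1052.eqiad.wmnet": ["varnish-be"]}}}
add_service_to_node(tree, "eqiad", "cache_text", "cp1052.eqiad.wmnet", "varnish-fe")
add_service_to_node(tree, "eqiad", "cache_text", "cp1099.eqiad.wmnet", "varnish-be")
print(tree)
```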

After you have done that, you will need to merge the change in puppet and follow the steps outlined before for adding a service. Typically, though, new nodes will NOT be pooled, so if you want to pool your service you will need to modify the state of the node as shown below.

Modify the state of a server in a pool

Let's say we want to depool the server mw1018.eqiad.wmnet. The steps are as follows:

  • The server is in the eqiad datacenter, is part of the appserver cluster in puppet, and the service we want to change is apache2. We need all this information as we'll see next.
  • Run, from any host where conftool is installed:
confctl depool --hostname mw1018.eqiad.wmnet --service apache2
  • Verify that it worked with
confctl select name=mw1018.eqiad.wmnet get

The syntax for the set action is: set/key1=value1:key2=value2. A small note on the pooled value meaning:

  • yes means the server is pooled
  • no means the server is not pooled but (only in pybal) present in the config
  • inactive means the server is not in the config we write at all
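
As a small illustration of the set syntax and the three pooled values, the following Python sketch builds the action argument from a mapping. The helper name build_set_action is hypothetical, and the validation shown here is only a convenience of the sketch, not a description of what confctl itself checks.

```python
from typing import Mapping

# The three pooled states described above.
POOLED_STATES = ("yes", "no", "inactive")


def build_set_action(changes: Mapping[str, object]) -> str:
    """Build a 'set/key1=value1:key2=value2' argument from a mapping."""
    if not changes:
        raise ValueError("at least one key=value pair is required")
    pooled = changes.get("pooled")
    if pooled is not None and pooled not in POOLED_STATES:
        raise ValueError(f"pooled must be one of {POOLED_STATES}, got {pooled!r}")
    # Pairs are joined with ':' in the order given.
    return "set/" + ":".join(f"{key}={value}" for key, value in changes.items())


print(build_set_action({"pooled": "no", "weight": 0}))
```

The resulting string would be passed as the last argument after a select expression, in the same position as set/pooled=yes in the earlier example.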

Pooling/depooling a server from all the related services

When a server is in maintenance mode or needs to be depooled/repooled in all of its services, you have some useful shortcuts you can use. Specifically, those are:

  • confctl pool, which pools all services configured on the current host (pooled=yes)
  • confctl depool, which depools all services on the current host (pooled=no)
  • confctl decommission, which decommissions all services on the current host (pooled=inactive)
  • confctl drain, which drains traffic from all services on the current host (weight=0)

You can modify the behaviour of those actions in the following ways:

  1. if you specify a hostname with --hostname FQDN, actions will be performed on that host instead of the current one
  2. if you specify a service with --service SERVICE, actions will be performed on the services that match the regular expression SERVICE instead of on every service.

Be careful on cache servers: a plain confctl depool will not only depool the server from the load balancers, but also as a backend varnish! If you just want to depool the server from pybal, the best solution is to run:

confctl depool --service '(varnish-fe|nginx)'
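
As a rough sketch of how the shortcuts in this section combine with --hostname and --service, the following Python helper builds the command line for each one. The helper name shortcut_command and the SHORTCUTS table are hypothetical; only the subcommands, flags and resulting states come from the text above.

```python
from typing import List, Optional

# Shortcut name -> the state change it applies, as listed in this section.
SHORTCUTS = {
    "pool": "pooled=yes",
    "depool": "pooled=no",
    "decommission": "pooled=inactive",
    "drain": "weight=0",
}


def shortcut_command(
    action: str, hostname: Optional[str] = None, service: Optional[str] = None
) -> List[str]:
    """Return the argv for a confctl shortcut, with optional host and service."""
    if action not in SHORTCUTS:
        raise ValueError(f"unknown shortcut: {action}")
    argv = ["confctl", action]
    if hostname:
        # Act on this host instead of the current one.
        argv += ["--hostname", hostname]
    if service:
        # Restrict the action to services matching this regular expression.
        argv += ["--service", service]
    return argv


print(shortcut_command("depool", service="(varnish-fe|nginx)"))
```

Passing the arguments as a list, rather than as a single shell string, avoids having to quote the parentheses and the pipe character in the service expression.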

Decommission a server

Decommissioning a server is as simple as:

  • Depool it from all services with confctl decommission
  • Remove its stanza from conftool-data, then sync the data exactly in the way you did for adding a node.

Show pool status

Per-pool status is available at all times at http://config-master.wikimedia.org/conftool/DATACENTER/POOL or available via confctl like so:

 # confctl --tags dc=DATACENTER,cluster=CLUSTER,service=POOL --action get all | jq .
 {
   "restbase1011.eqiad.wmnet": {
     "weight": 10,
     "pooled": "yes"
   },
 ...
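
If you want a quick count of pooled and depooled servers rather than the full listing, the output can be post-processed. The Python sketch below assumes the raw output is a sequence of JSON objects, each mapping a hostname to its state, as in the examples on this page; the exact output format may differ between conftool versions. The function names and the host restbase1012 are hypothetical.

```python
import json
from collections import Counter
from typing import Dict

# Made-up sample in the shape shown above; restbase1012 is hypothetical.
SAMPLE = (
    '{"restbase1011.eqiad.wmnet": {"weight": 10, "pooled": "yes"}}\n'
    '{"restbase1012.eqiad.wmnet": {"weight": 10, "pooled": "no"}}\n'
)


def parse_objects(text: str) -> Dict[str, dict]:
    """Merge a stream of concatenated JSON objects into one hostname -> state dict."""
    decoder = json.JSONDecoder()
    merged: Dict[str, dict] = {}
    pos = 0
    while True:
        # Skip whitespace between consecutive JSON documents.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        merged.update(obj)
    return merged


def pooled_summary(states: Dict[str, dict]) -> Counter:
    """Count servers by their pooled value."""
    return Counter(state.get("pooled") for state in states.values())


print(pooled_summary(parse_objects(SAMPLE)))
```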

Server changing IP address

At the moment, PyBal does not redo DNS resolution. When a server changes IP address, for example when it is moved to a different row, it is necessary to make PyBal completely forget about this server. This can be done by setting the server to pooled=inactive (which is what confctl decommission does), waiting, and then pooling it again:

 # confctl decommission --hostname foo.example.net
 # sleep 60
 # confctl pool --hostname foo.example.net
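
The same sequence can be wrapped in a small script. The following Python sketch is one way to do it, assuming confctl is on the PATH and that the 60 second wait used above is sufficient; the function name refresh_pybal_ip and its parameters are hypothetical. The run and sleep parameters exist only so that the sequence can be exercised without touching a real cluster.

```python
import subprocess
import time
from typing import Callable, List, Optional


def refresh_pybal_ip(
    hostname: str,
    wait_seconds: int = 60,
    run: Optional[Callable[[List[str]], object]] = None,
    sleep: Callable[[float], object] = time.sleep,
) -> None:
    """Decommission a host, wait, then pool it again so PyBal re-reads it."""
    if run is None:
        # Default runner: execute the command and fail loudly on a non-zero exit.
        def run(argv: List[str]) -> object:
            return subprocess.run(argv, check=True)

    # Sets pooled=inactive, removing the server from the generated config.
    run(["confctl", "decommission", "--hostname", hostname])
    # Give PyBal time to pick up the change and drop the server.
    sleep(wait_seconds)
    # Sets pooled=yes, adding the server back.
    run(["confctl", "pool", "--hostname", hostname])


# Dry run: record the steps instead of executing them.
steps: list = []
refresh_pybal_ip("foo.example.net", run=steps.append, sleep=steps.append)
print(steps)
```

Calling refresh_pybal_ip("foo.example.net") without the run and sleep overrides would execute the real commands.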