MediaWiki and EtcdConfig


This page details how we use Etcd with MediaWiki in production at the Wikimedia Foundation. For documentation of the EtcdConfig class in MediaWiki, refer to doc.wikimedia.org.

Value format

We manage Etcd keys for MediaWiki via Conftool, as do other applications at Wikimedia that use Etcd.

Values that MediaWiki reads from Etcd share the following structure:

{
    "val": <VALUE>
}

The <VALUE> can be any valid JSON value (null, boolean, number, string, array, object). MediaWiki's EtcdConfig class parses these values and unwraps them to their PHP equivalents.
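To make the wrapper concrete, here is a minimal Python sketch of the parse-and-unwrap step. It is only an illustration of the format described above, not the actual EtcdConfig code (which is PHP); the function name `unwrap` is hypothetical.

```python
import json


def unwrap(raw: str):
    """Parse a raw Etcd node value and return the payload stored under "val".

    Hypothetical helper for illustration; it is not part of MediaWiki or Conftool.
    """
    data = json.loads(raw)
    if not isinstance(data, dict) or "val" not in data:
        raise ValueError('value is not wrapped in a {"val": ...} object')
    return data["val"]


# Sample inputs are made up; each JSON value maps to its native equivalent.
print(unwrap('{"val": false}'))
print(unwrap('{"val": "some string"}'))
print(unwrap('{"val": [1, 2, 3]}'))
```

A value stored without the wrapper, such as a bare `false`, is rejected by this sketch with a `ValueError`.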

Internally, our Conftool schema specifies the value type "any", which allows any JSON value to be stored, even without the value wrapper shown above. We therefore also use separate validators based on JSON Schema. These make sure that MediaWiki keys use the value wrapper and that they only store values that are valid in MediaWiki. As a result, only safe changes are applied and propagated to app servers: Conftool rejects invalid edits before they can reach MediaWiki.
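The kind of check such a validator performs can be sketched in plain Python as follows. This is an assumption-laden illustration: the real validators are JSON Schema files, and both the function name `is_valid_mwconfig` and the example rule (a boolean or a string) are hypothetical rather than taken from the production schemas.

```python
import json


def is_valid_mwconfig(raw: str, allowed_types: tuple) -> bool:
    """Return True if raw is a {"val": ...} wrapper whose payload has an allowed type.

    Hypothetical stand-in for a JSON Schema validator; the allowed types are
    supplied by the caller and do not reflect the real production rules.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or "val" not in data:
        return False
    return isinstance(data["val"], allowed_types)


# Hypothetical rule: the payload must be a boolean or a string.
rule = (bool, str)
print(is_valid_mwconfig('{"val": false}', rule))  # wrapper present, type allowed
print(is_valid_mwconfig('false', rule))           # wrapper missing
print(is_valid_mwconfig('{"val": 3}', rule))      # type not allowed by this rule
```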

Conftool organizes its value objects in file-like paths, where each subdirectory is a "tag". In the case of the mwconfig object type, there is only one tag: scope.

As of writing, the following scopes are used:

  • common - variables shared across DCs.
  • eqiad - app servers in Eqiad.
  • codfw - app servers in Codfw.

The tree structure of mwconfig objects is: <basedir>/mediawiki-config/<scope>/<name>

Currently used keys:

common/WMFMasterDatacenter
eqiad/ReadOnly
codfw/ReadOnly
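As an illustration of the tree structure, the following Python sketch assembles the path for a key. The helper name `mwconfig_path` and the base directory `/example/basedir` are hypothetical; the real value of <basedir> is not given on this page.

```python
# Scopes listed above as being in use at the time of writing.
KNOWN_SCOPES = ("common", "eqiad", "codfw")


def mwconfig_path(basedir: str, scope: str, name: str) -> str:
    """Build <basedir>/mediawiki-config/<scope>/<name> (hypothetical helper)."""
    if scope not in KNOWN_SCOPES:
        raise ValueError(f"unknown scope: {scope}")
    return f"{basedir.rstrip('/')}/mediawiki-config/{scope}/{name}"


# "/example/basedir" is a placeholder, not the production base directory.
print(mwconfig_path("/example/basedir", "eqiad", "ReadOnly"))
# prints /example/basedir/mediawiki-config/eqiad/ReadOnly
```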

Get values

With conftool, you can list all Etcd values related to MediaWiki:

$ confctl --object-type mwconfig select 'name=.*' get

To see how each key is used by MediaWiki, see wmf-config/etcd.php.

Edit existing values

If you want to edit an existing value, you can use confctl to fetch a value, edit it in your preferred editor, and resubmit it.

For example, if you want to alter the value of $wgReadOnly on Eqiad app servers, do:

$ sudo -i confctl --object-type mwconfig select 'scope=eqiad,name=ReadOnly' edit

Your changes are expected to fully propagate to all MediaWiki clusters within 15 seconds.

Edit actions are automatically logged to the SAL, too.

Add a new value

Adding a new Etcd key and consuming it from MediaWiki is a rather cumbersome process, and that's a good thing! We don't want too much data to be stored in Etcd. When you're considering adding a key to Etcd, always ask yourself: does its value represent state or configuration?

The pooled/depooled state of a database in MediaWiki is "state". Enabling or disabling the VisualEditor feature on a wiki is configuration. Those two examples are pretty extreme and clear-cut, but you'll find out it's not always that clear. When in doubt, avoid moving keys to Etcd!

If you do have a key you want to add, follow this three-step process:

  1. Define a JSON schema for your key or group of keys and add it to the json-schemas in operations/puppet. Then add a rule to the main conftool schema file that matches the tags and names your validation applies to.
  2. Add an entry in conftool-data for your new object; during puppet-merge, an empty object will be created.
  3. Add code to the wmfEtcdConfig function in wmf-config that reads the value and uses it. Please avoid loading EtcdConfig multiple times: MediaWiki must only fetch data from Etcd once per request. Upon the first $etcdConfig->get call, all keys in the specified directory are loaded. Subsequent get calls read values from memory.
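The "load once, then read from memory" behaviour described in step 3 can be sketched in Python as follows. This is not the EtcdConfig implementation; the class `LazyDirectoryConfig`, the `fake_fetch` function and the sample values are all hypothetical and only mimic the pattern.

```python
class LazyDirectoryConfig:
    """Loads a whole directory of keys on the first get(), then serves from memory."""

    def __init__(self, fetch_directory):
        # fetch_directory: callable returning a dict of all keys in the directory.
        self._fetch_directory = fetch_directory
        self._values = None
        self.fetch_count = 0  # exposed only to make the behaviour observable

    def get(self, name):
        if self._values is None:
            # First access: fetch every key in the directory in one round trip.
            self._values = self._fetch_directory()
            self.fetch_count += 1
        return self._values[name]


def fake_fetch():
    # Stand-in for a network call to Etcd; the values here are made up.
    return {"common/WMFMasterDatacenter": "eqiad", "eqiad/ReadOnly": False}


config = LazyDirectoryConfig(fake_fetch)
config.get("eqiad/ReadOnly")
config.get("common/WMFMasterDatacenter")
print(config.fetch_count)  # prints 1: the second get() did not fetch again
```

Sharing a single instance like `config` across a request is what keeps the number of fetches at one; creating a second instance would trigger a second fetch.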

Operational guarantees and failure scenarios

[Figure: Flowchart for the loading of data from etcd within MediaWiki]

Whenever a MediaWiki process starts, MediaWiki tries to fetch the config data from the local cache (APC for fastcgi/web requests, a local hash on cli); if it's not there, or it's stale, it will try to fetch fresh data from the Etcd cluster.

A locking mechanism guarantees that at most one thread per application server requests the data. Once data is fetched, it is cached for 10 seconds, so EtcdConfig should result in about 6 reads per app server per minute. This is a small volume and should not become an issue for the Etcd servers. MediaWiki randomly picks one of the servers listed in an SRV record, and connects to the next one if the first is not available.
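The server selection and the read-rate estimate can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the server names, `fetch_with_failover` and `demo_fetch` are hypothetical, a shuffled list stands in for the SRV-based random pick, and the per-server lock is not modelled.

```python
import random


def fetch_with_failover(servers, fetch, rng=random):
    """Try servers in random order and return the first successful result."""
    candidates = list(servers)
    rng.shuffle(candidates)  # random starting server, then the remaining ones
    last_error = None
    for server in candidates:
        try:
            return fetch(server)
        except ConnectionError as exc:
            last_error = exc  # this server is unavailable, try the next one
    raise ConnectionError("no Etcd server reachable") from last_error


def demo_fetch(server):
    # Stand-in for a network call: only one hypothetical server is reachable.
    if server != "etcd-c.example":
        raise ConnectionError(f"{server} is down")
    return f"config from {server}"


servers = ["etcd-a.example", "etcd-b.example", "etcd-c.example"]
print(fetch_with_failover(servers, demo_fetch))  # prints config from etcd-c.example

# Read rate implied by the 10-second cache lifetime.
CACHE_TTL_SECONDS = 10
reads_per_minute = 60 // CACHE_TTL_SECONDS
print(reads_per_minute)  # prints 6
```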

If no server is available, the data from cache will be used, even if stale. This means that an appserver will continue to work as expected as long as it's not restarted, even in case of a complete failure of the Etcd cluster. Whenever such a failure happens, at most one of the concurrent requests will try to fetch the configuration anyway, so the overall slowdown of the user experience should be limited.
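The stale-read fallback can be sketched in Python as follows. This is a minimal model of the behaviour described above, not the EtcdConfig::load code: the class `StaleTolerantCache` and the `flaky_fetch` function are hypothetical, and the locking between concurrent requests is left out.

```python
import time


class StaleTolerantCache:
    """Caches a fetched value for ttl seconds and serves stale data if a refresh fails."""

    def __init__(self, fetch, ttl=10.0, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._value = None
        self._fetched_at = None  # None means nothing has been cached yet

    def get(self):
        now = self._clock()
        if self._fetched_at is not None and now - self._fetched_at < self._ttl:
            return self._value  # cache is still fresh
        try:
            self._value = self._fetch()
            self._fetched_at = now
        except ConnectionError:
            if self._fetched_at is None:
                raise  # empty cache (for example after a restart): nothing to fall back on
            # Otherwise keep serving the stale value.
        return self._value


# Demo with a fake clock and a backend that goes down after the first fetch.
now = [0.0]
backend = {"up": True}


def flaky_fetch():
    if not backend["up"]:
        raise ConnectionError("Etcd cluster unreachable")
    return {"eqiad/ReadOnly": False}  # made-up sample data


cache = StaleTolerantCache(flaky_fetch, ttl=10.0, clock=lambda: now[0])
first = cache.get()   # fetched from the backend
backend["up"] = False
now[0] = 30.0         # well past the TTL
second = cache.get()  # refresh fails, stale value is served
print(first == second)  # prints True
```

The `raise` branch mirrors the caveat in the text: a process that starts while the cluster is down has no cached data to serve.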

If you want to get into the full details of how this works, you can refer to the flowchart, or read the code for EtcdConfig::load. As the description above shows, the implementation favours availability over consistency, as it allows stale reads.

Alerting

While the above avoids the worst failure scenarios, there is still a possibility that an appserver ends up out of sync with the Etcd cluster. In such a case, an Icinga alert will fire for "MediaWiki EtcdConfig up-to-date". The general response should be to check the server first, then the Etcd cluster, and finally restart the FastCGI server (HHVM?).
