Dumps/Dump servers


XML Dump servers

Hardware

We have two hosts:

  • clouddumps1001 in eqiad, production, NFS server to WMF Cloud and the stats hosts:
    Hardware/OS: Dell PowerEdge R740xd2, Debian 11 (bullseye), 512GB RAM, two 12-core Intel Xeon Silver 4214 2.2GHz CPUs
    Disks: 2 internal 480 GB SSDs for the OS in RAID 1, 24 x 18TB drives in RAID 10 for dumps
  • clouddumps1002 in eqiad, production, web server and rsync to public mirrors:
    Hardware/OS: Dell PowerEdge R740xd2, Debian 11 (bullseye), 512GB RAM, two 12-core Intel Xeon Silver 4214 2.2GHz CPUs
    Disks: 2 internal 480 GB SSDs for the OS in RAID 1, 24 x 18TB drives in RAID 10 for dumps

Note that these hosts also serve other public datasets, such as some POTY (Picture of the Year) files, the pagecount stats, and so on.

Services

The web/rsync host serves dump files and other public datasets to the public using nginx, and also acts as an rsync server to our mirrors and to labs. The NFS host exports the same data to WMF Cloud and the stats hosts.
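For reference, this is roughly how consumers reach the data. The URL path and the rsync module name below are illustrative assumptions; list the modules on the server to see what is actually offered:

    # Fetch a dump file over the web (served by nginx)
    curl -O https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-stub-meta-history.xml.gz
    # List the rsync modules the server offers, then pull a tree from one
    rsync rsync://dumps.wikimedia.org/
    rsync -av rsync://dumps.wikimedia.org/dumps/enwiki/latest/ ./enwiki-latest/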

Deploying a new host

You'll need to set up the RAID arrays by hand. We typically split the data disks into two RAID 10 arrays and join them with LVM into one giant ext4 volume.
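A minimal sketch of that setup, assuming the 24 data disks appear as /dev/sdb through /dev/sdy (the device names, array numbers, and VG/LV names are all assumptions; verify with lsblk first):

    # Two 12-disk RAID 10 arrays out of the 24 data drives
    mdadm --create /dev/md1 --level=10 --raid-devices=12 /dev/sd[b-m]
    mdadm --create /dev/md2 --level=10 --raid-devices=12 /dev/sd[n-y]
    # Join both arrays into a single LVM volume and format it ext4
    pvcreate /dev/md1 /dev/md2
    vgcreate data /dev/md1 /dev/md2
    lvcreate -l 100%FREE -n dumps data
    mkfs.ext4 /dev/data/dumps
    # Persist the array definitions so they assemble at boot
    mdadm --detail --scan >> /etc/mdadm/mdadm.conf
    update-initramfs -u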

Install in the usual way: add the host to puppet, copying a pre-existing production labstorexxx host stanza, set everything up for PXE boot, and go. Depending on what the new box is going to do, choose the appropriate role (web/rsync, or NFS), or combine profiles to create a new role.
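On current WMF infrastructure the install itself is normally driven by the reimage cookbook from a cluster management (cumin) host; a sketch, where the hostname is a placeholder and the cookbook's options may have changed since this was written:

    # Run from a cumin host; triggers PXE boot, OS install, and initial puppet runs
    sudo cookbook sre.hosts.reimage --os bullseye clouddumps1003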

Space issues

If we run low on space, we can keep fewer rounds of XML dumps; this is controlled by /etc/dumps/xml_keeps.conf on each host, which is generated by puppet. The hosts to which dumps are written as they are generated keep only a few rounds; the web servers and the like keep many more.
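To check where a given host stands, something like the following works (the mount point is an assumption; the keeps file is the one described above):

    # How full is the dumps filesystem? (mount point is an assumption)
    df -h /srv/dumps
    # How many dump rounds does this host keep?
    cat /etc/dumps/xml_keeps.conf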

The class dumps::web::cleanups::xmldumps generates one list of how many dumps to keep for the 'replica' hosts (i.e. the web servers), with larger keep numbers, and another list for the generating hosts (the NFS servers to which dumps are written during each run). The $keep_replicas list is the one you want to tweak; the number of dumps kept can be adjusted separately for the huge wikis (enwiki, wikidatawiki), the big wikis (dewiki, commonswiki, etc.), and the rest.
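To locate the knob to tweak, grep a checkout of the operations/puppet repo (the paths below reflect a plausible module layout, not a verified one):

    # Find the keep lists and the cleanup class in the puppet tree
    grep -rn 'keep_replicas' modules/dumps/
    grep -rn 'xmldumps' modules/dumps/manifests/web/cleanups/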