Dumps/Dump servers
XML Dump servers
Hardware
We have two hosts:
- clouddumps1001 in eqiad, production, nfs server to WMF Cloud and stats hosts:
- Hardware/OS: Dell PowerEdge R740xd2, Debian 10 (bullseye), 512GB RAM, 2 12-core Intel Xeon Silver 4214 2.2G cpus
- Disks: 2 internal 480 GB SSD 1T drives for the OS in raid 1, 24 x 18TB drives in raid 10 for dumps
- clouddumps1002 in eqiad, production, web server, rsync to public mirrors:
- Hardware/OS: Dell PowerEdge R740xd2, Debian 10 (bullseye), 512GB RAM, 2 12-core Intel Xeon Silver 4214 2.2G cpus
- Disks: 2 internal 480 GB SSD 1T drives for the OS in raid 1, 24 x 18TB drives in raid 10 for dumps
Note that these hosts also serve other public datasets such as some POTY files, the pagecount stats, etc.
Services
The production host serves dump files and other public data sets to the public, using nginx. It also serves as an rsync server to our mirrors and to labs.
Deploying a new host
You'll need to set up the raid arrays by hand. We typically have two arrays so set up two raid 10 arrays with LVM to make one giant 64T volume, ext4.
Install in the usual way (add to puppet, copying a pre-existing production labstorexxx host stanza, set up everything for PXE boot and go). Depending on what the new box is going to do, you'll need to choose the appropriate role (web/rsync, or nfs work), or combine profiles to create a new role.
Space issues
If we run low on space, we can keep fewer rounds of XML dumps; this is controlled by /etc/dumps/xml_keeps.conf
on each host. This file is generated by puppet. The hosts where generated dumps are written as they are created, keep only a few dumps, and the web servers and such keep many more.
The class dumps::web::cleanups::xmldumps
generates one list of how many dumps to keep for hosts that are 'replicas', i.e. the web servers, with larger keep numbers, and one list for the generating hosts (the nfs servers where dumps are written during each run). The list $keep_replicas
is the one you want to tweak; the number of dumps can be adjusted for the huge wikis (enwiki, wikidatawiki), the big wikis (such as dewiki, commonswiki, etc) and then the rest separately.