Portal:Data Services/Admin/Shared storage

From Wikitech

Labstore (also cloudstore) is the naming prefix for a class of servers that fulfill different missions. The common thread is off-compute-host storage for use cases that serve the VPS instances and Tools. The majority of labstore clusters provide NFS shares for some purpose.

Clusters

The "Primary" and "Secondary" names should be changed whenever that is practical. They cause confusion with the DRBD naming convention of calling the active server a "primary" and the standby server the "secondary".

Primary Cluster

Servers: labstore1004, labstore1005

This was previously called the secondary cluster (the old name now survives only in the client mount paths).

  • Tools project share that is used operationally for deployment.
  • Tools home share that is used for /home in Toolforge.
  • Misc home and project shares that are used by all NFS enabled projects, except maps.

Components: NFS, nfs-manage, maintain-dbusers, nfs-exportd, BDSync

Secondary Cluster

Servers: cloudstore1008, cloudstore1009

  • An NFS share large enough to be used as general scratch space across projects
    • /data/scratch
  • Maps project(s) also have tile generation on a share here temporarily.
  • (proposed) Quota limited rsync backup service for Cloud VPS tenants (phab:T209530)
  • Uses DRBD to stay in sync similar to the primary cluster.

Components: NFS, nfs-exportd, nfs-manage

Dumps

Servers: clouddumps1001, clouddumps1002

  • Dumps customer facing storage
    • NFS exports to Cloud VPS projects including Toolforge
    • NFS exports to Analytics servers
    • Rsync origin server for dumps mirroring
    • https://dumps.wikimedia.org (nginx)
    • Analytics manages an HDFS client there, which means the servers are kerberized
  • Does NOT use nfs-exportd; NFS, rsync and nginx should remain active on both servers

Components: NFS, nginx, rsync (for mirrors and syncing to stats servers), kerberos, hdfs

Offsite backup

Servers: cloudbackup2001, cloudbackup2002

  • cloudbackup2001 acts as a backup server for the "tools-project" logical volume from labstore100[45]
  • cloudbackup2002 acts as a backup server for the "misc project" logical volume from labstore100[45]

The backup is an ssh-based bdsync job that happens once a week on a different day for each volume between the eqiad Primary Cluster and the backup servers in codfw.

Components: BDSync, Backups

Components

NFS

General Setup

The system strictly uses NFSv4 and, where possible (i.e. Debian Buster and newer clients), NFSv4.2 to take advantage of locking improvements in 4.1 and 4.2. Using a strict v4 system allows the firewall to accept only port 2049, which is nice and clean and doesn't require rpcbind/portmap on clients (see the port 111 vulnerabilities). However, exports are defined using the old NFSv3 style without using the virtual filesystem feature of v4. The reasons for that at this time are:

  • There is little value to the virtual filesystem unless you think your clients need to be able to discover other shares.
  • You can use v3 style exports in v4, despite this being rarely documented online.
  • Using the virtual filesystem requires mounting any volumes to be shared under a specific hierarchy on the server, which typically results in a lot of bind mounts. When trying to fail over DRBD, those bind mounts will refuse to unmount unless you can force all clients to stop writing files and holding locks for a bit. With gridengine, that's quite impossible since the database for the grid is on NFS by design (this enables failover). Turning the filesystem read-only would result in far more breakage than simply not using bind mounts and smoothing the failover. In the past, the solution was to not run the failover script and instead run the commands by hand and reboot the server when it refused to unmount the volumes. This is fixed by using v3 style exports, since a umount -f will eventually unmount the share in that setup.
    • It is worth noting here that you need to unmount the volumes to make the DRBD active server into a standby. If you do not do that before you fail over, the data will become inconsistent, split-brained or simply ruined from a user perspective. The recovery from such a mistake is to determine which server is more likely to be the good one (probably the one successfully holding the IP address at the time), make the other one a secondary/standby and resync it from scratch.

Writable NFS access is determined by the OpenStack project of the client (via nfs-exportd). Most OpenStack projects are on a single volume at this time, which is labeled "misc" and mounted at /srv/misc/ on the labstore1004/5 cluster. Two projects, maps and tools, have their very own volumes because they are very large. Tools is also prone to filling, which would cause a problem for everyone else if it shared a volume.
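
For illustration only, a v3-style entry generated under /etc/exports.d might look roughly like the following (the client network, share path and export options here are assumptions, not copied from a live server):

  /srv/misc 172.16.0.0/21(rw,sec=sys,no_subtree_check)

Because there is no v4 pseudo-root, clients mount the full server-side path directly rather than discovering shares under a single exported hierarchy.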

None of this DRBD stuff applies to dumps because they are read only and don't have DRBD replication. They are shared using the v3 style to be sure of precisely what is being shared and for consistency.

NFS volume cleanup

Because the Primary and Secondary NFS clusters lack user quotas, WMCS must occasionally create a task to remove large files and help users clean up their shares. If it has been six months and no clean-up has taken place, please at least check the NFS servers on Grafana to make sure one isn't needed. The tasks generally take a form similar to task T247315: an overall tracking task, with administrator work logged on it, and a tree of user tasks that we assign to end users to clean up their tool shares or project shares, with some advice and assistance where possible.

If a page has triggered a cleanup task, make sure you downtime the alert for a good long while.

Admin actions include, but are not limited to:

  • Checking Grafana for the list of the heaviest users.
  • Running ionice -c 3 nice -19 find /srv/tools -type f -size +100M -printf "%k KB %p\n" | sort -h > tools_large_files_$(date +%Y%m%d).txt to find the largest files. Often they are simply toolforge-created logs that can be truncated with truncate -s 0 $filename and a SAL log to tools.$toolname.
  • Truncate the *.out and *.err files automatically created by Grid Engine (see the sketch after this list).
  • Log files generated by the webservice command such as access.log and error.log can be treated similarly.
  • Other files should probably be checked with the user before deleting unless the situation is very urgent (usually asking the user in the phabricator task is enough).
  • If a service is consistently filling up NFS volumes, and users cannot be reached, it could be shut down as a danger to the overall service. We should make our best effort to avoid needing to do that, of course.
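
As a sketch of the Grid Engine log truncation step (the path, size threshold and file patterns here are illustrative; adjust them to the volume and tool in question):

  ionice -c 3 nice -19 find /srv/tools -type f \( -name '*.out' -o -name '*.err' \) -size +1G -print -exec truncate -s 0 {} \;

As with the manual truncations above, remember to SAL log the cleanup to the affected tool.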

NFS client operations

When significant changes are made on an NFS server, clients that mount it often need actions taken on them to recover from whatever state they are suddenly in. To this end, the cumin host file backend works in tandem with the nfs-hostlist script. The script generates a list of VMs, filtered by project and by the specified mounts, on which NFS is mounted. Currently, you must be on a cloudinfra cumin host to run these commands. The list of cumin hosts can be seen on Cumin#WMCS Cloud VPS infrastructure.

The nfs-hostlist script takes several options (some are required):

  • -h Show help
  • -m <mount> A space-delimited list of "mounts" as defined in the /etc/nfs-mounts.yaml file generated from puppet (it won't accept wrong answers, so this is a pretty safe option)
  • --all-mounts Anything NFS mounted (but you can still limit the number of projects)
  • -p <project> A space-delimited list of OpenStack projects to run against. This will be further trimmed according to the mounts you selected. (If you used -m maps and -p maps tools, you'll only end up with maps hosts)
  • --all-projects Any project mentioned in /etc/nfs-mounts.yaml, but you can still filter by mounts.
  • -f <filename> Without this, the script prints to STDOUT.

Example:

  1. First, create your host list based on the mounts or projects you know you will be operating on. For example, if you were making a change only to the secondary cluster, which currently serves maps and scratch, you might generate a host list with the command:
    bstorm@cloud-cumin-01:~$ sudo nfs-hostlist -m maps scratch --all-projects -f hostlist.txt
    
    Note that root/sudo is needed because this interacts with cumin's query setup to get hostnames. It will take quite a while to finish because it also calls openstack-browser's API to read Hiera settings.
  2. Now you can run a command with cumin across all hosts in hostlist.txt similar to
    bstorm@cloud-cumin-01:~$ sudo cumin --force -x 'F{/home/bstorm/hostlist.txt}' 'puppet agent -t'
    

It is sensible to generate the host list shortly before the changes take place so that you can respond quickly with cumin when you need to.
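
For example, after a server-side change affecting scratch, a follow-up across the listed clients might look like the following (this assumes the standard /data/scratch mount point; whether a plain remount is enough depends on the state of the clients, and a lazy or forced unmount may be needed instead):

  bstorm@cloud-cumin-01:~$ sudo cumin --force -x 'F{/home/bstorm/hostlist.txt}' 'umount /data/scratch && mount /data/scratch'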

nfs-manage

This script is meant as the entry point to bringing up and taking down the DRBD/NFS stack in its entirety.

nfs-manage status
nfs-manage up
nfs-manage down

To actually use it to fail over a cluster, see Portal:Data_Services/Admin/Runbooks/Failover_an_NFS_cluster

nfs-exportd

Dynamically generates the contents of /etc/exports.d to mirror active projects and shares as defined in /etc/nfs-mounts.yaml, every 5 minutes.

This daemon fetches project information from OpenStack to know the IPs of the instances and add them to the exports ACL.

See ::labstore::fileserver::exports.
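
To sanity-check the result on the active NFS server, the standard tooling can be used; for example (assuming the daemon runs as a systemd unit named nfs-exportd):

  sudo systemctl status nfs-exportd
  ls -l /etc/exports.d/
  sudo exportfs -v

exportfs -v lists what is currently exported, which should mirror the generated files.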

WARNING: there is a known issue: if some OpenStack component is misbehaving (for example, keystone), the API call will typically return a 401. Please don't allow this to make it past the traceback; we want exceptions and failures in the service rather than letting it remove exports. There is also a cron job that backs up the exports to /etc/exports.bak.

maintain-dbusers

We maintain the list of accounts used to access the Wiki Replicas on the cloudcontrols (only one acting as primary at a time). The script writes out a $HOME/replica.my.cnf file containing MySQL connection credentials to each user and project home, using an API that each NFS server (tools and paws currently) is running. It uses LDAP to get the list of accounts to create.

The credential files are created with the immutable bit set with chattr to prevent deletion by the Tool account.
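
For illustration, the immutable bit can be inspected with lsattr and, as root, temporarily cleared and restored with chattr if a credential file ever needs to be regenerated (the file path follows the pattern described above):

  lsattr $HOME/replica.my.cnf
  sudo chattr -i $HOME/replica.my.cnf    # allow changes
  sudo chattr +i $HOME/replica.my.cnf    # re-protect the file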

The code pattern here is that you have a central data store (the db) that is then read from and written to by various independent functions. These functions are not 'pure' - they could even be separate scripts. They mutate the DB in some way. They are also supposed to be idempotent - if they have nothing to do, they should not do anything.

Most of these functions should be run in a continuous loop, maintaining mysql accounts for new tool/user accounts as they appear.

populate_new_accounts

  • Find list of tools/users (From LDAP) that aren't in the `accounts` table
  • Create a replica.my.cnf for each of these tools/users
  • Make an entry in the `accounts` table for each of these tools/users
  • Make entries in `account_host` for each of these tools/users, marking them as absent

create_accounts

  • Look through `account_host` table for accounts that are marked as absent
  • Create those accounts, and mark them as present.

If we need to add a new labsdb, we can do so the following way:

  • Add it to the config file
  • Insert entries into `account_host` for each tool/user with the new host.
  • Run `create_accounts`

In normal usage, just a continuous process running `populate_new_accounts` and `create_accounts` in a loop will suffice.

TODO:

 - Support for maintaining per-tool restrictions (number of connections + time)

BDSync

We use the WMF bdsync package on both source and destination backup hosts. Backup hosts periodically run a job to sync a block device from the remote source to a local LVM device.

Backups

Uses bdsync, an rsync-like tool for block devices, with SSH as the transport to copy block devices over the network.
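
As a rough sketch of what such a job does (hostnames, device paths and options here are illustrative, not the actual parameters of the WMF job), the backup host builds a binary diff against the remote device over SSH and then patches its local LVM device:

  bdsync "ssh root@labstore1004 bdsync --server" /dev/backup/tools-project /dev/labstore/tools-project > tools-project.bdsync
  bdsync --patch=/dev/backup/tools-project < tools-project.bdsync

Writing the diff to a file before patching means a failed transfer does not leave the local copy half-updated.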

Mounting a backup

This is basically the restore procedure, and is described at Portal:Data_Services/Admin/Runbooks/Restore_NFS_files

How to enable NFS for a project

Follow this runbook