Help:Shared storage

From Wikitech

Shared storage currently consists of shared directories offered to Cloud VPS and Toolforge users. Toolforge users and tool accounts already have most of the common directories available; Cloud VPS users can get access on request. To request access to the shares listed on this page, file a task on Phabricator under the Data-Services and VPS-Projects projects. Once NFS access has been granted, Puppet mounts the shares on any VM where the Hiera key mount_nfs: true applies.
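To confirm from inside a VM that a shared directory really is served over NFS, you can inspect the kernel's mount table. The following is a minimal sketch, not an official tool: it assumes a Linux instance with /proc/mounts, and the helper names (parse_mounts, fs_type_for, is_on_nfs) are made up for this example. Paths are resolved first because a share may be reached through a symlink rather than mounted directly at the path you use.

```python
import os


def parse_mounts(mounts_text):
    """Return a list of (mount_point, fs_type) from /proc/mounts-style text."""
    entries = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Each line looks like: device mount_point fs_type options dump pass
        if len(fields) >= 3:
            entries.append((fields[1], fields[2]))
    return entries


def fs_type_for(path, entries):
    """Return the fs type of the longest mount point that contains path."""
    best_mount, best_type = "", None
    for mount_point, fs_type in entries:
        prefix = mount_point.rstrip("/") + "/"
        # ">=" lets a later entry for the same mount point win, as it shadows earlier ones
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


def is_on_nfs(path, mounts_text):
    """True if path (after resolving symlinks) lives on an nfs/nfs4 mount."""
    fs_type = fs_type_for(os.path.realpath(path), parse_mounts(mounts_text))
    return fs_type is not None and fs_type.startswith("nfs")


if __name__ == "__main__" and os.path.exists("/proc/mounts"):
    with open("/proc/mounts") as handle:
        table = handle.read()
    # The directories described on this page
    for share in ("/data/scratch", "/data/project", "/home", "/public/dumps"):
        print(share, "NFS" if is_on_nfs(share, table) else "not NFS")
```

If a share you requested is reported as not NFS, check that the Hiera key applies to that VM and that Puppet has run since access was granted.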

Disadvantages of shared NFS directories

As of 2024, NFS is the only storage option for Toolforge, so Toolforge users can disregard this section; it is intended as advice for Cloud VPS users.

The shared directories are offered by locally mounting shares exposed from our NFS storage servers. Problems and failures in the NFS server can make your instance slow or sometimes unusable, so we strongly recommend considering other options before going this route. The NFS shares are not a solution for the following problems:

  • Storing database or other backups (we are working on alternatives for tools)
  • Downloading or writing large amounts of data (use local directories on the instance, or /tmp on tools, for any intermediate processing)
  • Other I/O-intensive operations
  • Storing a lot of data: our available storage is limited
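The advice about intermediate processing can be sketched as follows: do all the heavy reading and writing in a local temporary directory, then copy only the finished result to the share in one sequential write. This is an illustrative pattern rather than a prescribed workflow. The function name process_locally and its parameters are hypothetical, and it assumes the default temporary directory (usually /tmp) is on local disk.

```python
import shutil
import tempfile
from pathlib import Path


def process_locally(dest_dir, name, produce):
    """Run produce(workdir) on local disk, then copy its result file to dest_dir/name.

    produce is a callable that receives a local working directory (a Path)
    and returns the path of the single finished file it created there.
    """
    target = Path(dest_dir) / name
    # The temporary directory and everything in it is removed on exit
    with tempfile.TemporaryDirectory() as work:
        result = produce(Path(work))      # all intermediate I/O stays local
        shutil.copyfile(result, target)   # one sequential write to the share
    return target
```

Here dest_dir would be a directory on the share, for example somewhere under your project's /data/project space, and produce would contain the download or transformation steps.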

/data/scratch

This is a 'temp' space shared across all instances in all projects that have opted in. Any data you put into it can be read by every other instance that has /data/scratch mounted, but by default those instances cannot delete your data. This data is not backed up.

Use this for:

  1. Sharing public large data between instances
  2. 'Temporary' storage / usage that can be purged later

Do not use this for:

  1. Information that should be kept private to your project (credentials, keys, etc)
  2. Information that should be backed up and kept safe (code, data backups, etc)
  3. Files that are actively read from or written to (e.g. databases, logfiles)

/data/project

This is per-project private space shared across the instances of one project only (not across all instances in all projects, as with /data/scratch). Any data you put in it is visible only to the other instances in your project.

Data stored in a project NFS share has some redundancy because its content is mirrored to the secondary NFS server. Periodic snapshots of the NFS servers are also taken for disaster recovery. Point-in-time recovery of individual files is not easily accomplished, however, and may not be possible depending on when the most recent snapshot was taken.

Use this for:

  1. Sharing data/files between instances of your project

Do not use this for:

  1. Storing code / config that is directly run (please store code in git and run it from local storage)
  2. Storing databases / data that is directly manipulated (do not put postgres / mysql / mongo / etc. data directories on NFS, and do not do lots of sqlite operations on NFS either)
  3. Logfiles
  4. Temporary storage of large amounts of data (use /data/scratch for that instead)

/home

This is per-project private space shared across all instances in your project only and mounted at /home. It lets you keep a shared home directory across instances, for example to hold useful scripts. Note that enabling this couples the availability of your instance very strongly to NFS: you cannot ssh in when NFS is down. This data is also backed up.

Use this for:

  1. Storing small scripts / .rc files across instances

Do not use this for:

  1. Same as the "Do not use this for" list in the /data/project section of this page

Note that progress is being made in building a simple system to share .rc / convenience scripts that does not involve NFS. You can track that on task T102173.

/public/dumps

This directory stores the dumps generated by Wikimedia projects (public Wikimedia datasets). Toolforge users can access the dumps directly through their tool account. Cloud VPS users can request to have the share made available.

This is a global, read-only share that contains data dumps for research purposes. These include compressed XML dumps of Wikimedia wikis, raw page count data, Wikidata JSON dumps, and more! Because the share is read-only, copy files to your tool's home directory if you need to modify them. Ideally you can find (or build!) a library that reads data from the dumps without decompressing them first. See meta:Data dumps/Other tools for some examples.
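As an illustration of reading a dump without first writing a decompressed copy, the sketch below streams a bzip2-compressed file line by line using Python's standard bz2 module. It assumes the dump you want is a .bz2 file; the function name count_matching_lines is made up for this example, and you would supply the path of a real file under /public/dumps yourself.

```python
import bz2


def count_matching_lines(dump_path, needle):
    """Stream a .bz2 file line by line and count the lines that contain needle."""
    count = 0
    # "rt" decompresses on the fly and yields text; nothing is written to disk
    with bz2.open(dump_path, mode="rt", encoding="utf-8", errors="replace") as stream:
        for line in stream:
            if needle in line:
                count += 1
    return count
```

For an XML dump, calling count_matching_lines(path, "<page>") gives a rough count of pages while holding only one line in memory at a time. For real parsing, a streaming XML parser or one of the libraries listed on meta is a better fit.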

Older dumps

You can manually download older dumps from the Wikimedia downloads server, or from mirrors which may have better bandwidth.

/data/project/shared/mediawiki

On Toolforge, you can access a full checkout of all MediaWiki repositories hosted on Gerrit.

This is especially useful for searching code across all repositories with commands like ack-grep.

The checkout should also include the code review notes, from which you can, for example, extract code review statistics.