Data Platform/Systems/Clients

From Wikitech

The production cluster has several servers which you can use to access the various private data sources and do general statistical computation. They are called the analytics clients, since they act as clients accessing data from various other databases (but they are also known informally as stat hosts, stat machines, or stat clients). To learn more about how to access these, refer to Data_Platform/Data access.

They can all provide hosted Jupyter notebooks.

Host OS CPU cores RAM Disk Space GPU Relative I/O performance
stat1008 Debian Bullseye 32 512G 7.2TB yes 4th out of 4
stat1009 Debian Bullseye 72 188G 17TB no 2nd out of 4
stat1010 Debian Bullseye 72 512G 6TB yes 1st out of 4
stat1011 Debian Bullseye 48 128G 6TB no 3rd out of 4

Jupyter

Every client provides a hosted Jupyter environment for interactive notebooks and terminals.

Conda

We use Conda on the analytics clients to help folks create isolated environments to work in and install whatever packages they need. It's best to do all your work inside a Conda environment.

If you use Jupyter, this is all handled automatically. If you're working through a standard terminal, make sure to follow the instructions on the Conda page to create and activate environments.
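As a sketch, the standard Conda workflow from a terminal looks like this. The environment name is hypothetical, and the stat hosts may ship wrapper commands that differ from stock conda, so follow the Conda page for the exact instructions:

```shell
# Hypothetical environment name; the stat hosts may provide their own
# wrapper commands -- see the Conda page on Wikitech for the real ones.
ENV_NAME=my-analysis
if command -v conda >/dev/null 2>&1; then
    conda create --yes --name "$ENV_NAME" python=3.10
    # In an interactive shell (after conda's shell hook is initialized):
    conda activate "$ENV_NAME"
else
    echo "conda not found; run this on an analytics client"
fi
```

Once the environment is active, anything you install with pip or conda stays isolated inside it.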

Querying data

The easiest way to query data on one of the analytics clients is to use one of the Wmfdata packages in a Jupyter environment.

Python

For Python, there is Wmfdata-Python. It can access data through MariaDB, Hive, Presto, and Spark and has a number of other useful functions, like creating Spark sessions. For details, see the repository and particularly the quickstart notebook.

R

For R, there is wmfdata-r. It can access data from MariaDB and Hive and has many other useful functions, particularly for graphing and statistics.

Internet access

You may need to access the internet from the analytics clients (for example, to install a Python package using pip). By default, this will fail because the machines are tightly firewalled. You'll have to use the HTTP proxy.
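For example, you can set the standard proxy environment variables before running pip or curl. The proxy host below is the one documented on Wikitech's HTTP proxy page; verify it is still current before relying on it:

```shell
# Proxy host as documented on Wikitech's HTTP proxy page; confirm before use.
export http_proxy=http://webproxy.eqiad.wmnet:8080
export https_proxy=http://webproxy.eqiad.wmnet:8080
# Keep internal traffic off the proxy:
export no_proxy=127.0.0.1,localhost,.wmnet
# Tools that honor these variables (pip, curl, wget, git over HTTPS) will
# now route outbound requests through the proxy, e.g.:
#   pip install --user requests
```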

Resource management

Once 90% of a client's memory is consumed, the most memory-intensive processes are killed until sufficient memory is freed. Only 90% of CPU resources are available for user processes.
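Before launching a heavy job, it is worth checking how much headroom the host currently has. These are standard Linux commands, not anything specific to the stat hosts:

```shell
# How much memory is in use and available (jobs are killed past ~90% usage)
free -h
# How many CPU cores this host has
nproc
# Current load averages, to see how busy the host already is
uptime
```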

Local data storage

First, note that the Analytics clients store data using redundant RAID configurations, but are not otherwise backed up. Your home directory on HDFS (/user/your-username) is a safer place for important data.

Please ensure that there is enough space on disk before storing big datasets or files; the next section explains how to check.

Checking for available disk space

On all the Analytics clients, the home directories are stored under the /srv partition, so the command df -h should be used regularly to check the space used. Some client nodes are more crowded than others, so please use the least loaded client first (for example, by checking with df -h which stat host has the most free space).

Here is an example to clarify the last point, using the stat1007 host:

elukey@stat1007:~$ df -h
Filesystem                                                 Size  Used Avail Use% Mounted on
udev                                                        32G     0   32G   0% /dev
tmpfs                                                      6.3G  666M  5.7G  11% /run
/dev/md0                                                    92G   16G   71G  19% /
tmpfs                                                       32G  1.2M   32G   1% /dev/shm
tmpfs                                                      5.0M     0  5.0M   0% /run/lock
tmpfs                                                       32G     0   32G   0% /sys/fs/cgroup

/dev/mapper/stat1007--vg-data                              7.2T  6.4T  404G  95% /srv                            <<=====================================<<

tmpfs                                                      6.3G     0  6.3G   0% /run/user/3088
tmpfs                                                      6.3G     0  6.3G   0% /run/user/13926
tmpfs                                                      6.3G     0  6.3G   0% /run/user/20171
fuse_dfs                                                   2.3P  1.8P  511T  78% /mnt/hdfs
tmpfs                                                      6.3G     0  6.3G   0% /run/user/18005
tmpfs                                                      6.3G   32K  6.3G   1% /run/user/17677
labstore1006.wikimedia.org:/srv/dumps/xmldatadumps/public   98T   59T   35T  64% /mnt/nfs/dumps-labstore1006.wikimedia.org
labstore1007.wikimedia.org:/                                97T   65T   28T  70% /mnt/nfs/dumps-labstore1007.wikimedia.org
tmpfs                                                      6.3G     0  6.3G   0% /run/user/22235
tmpfs                                                      6.3G     0  6.3G   0% /run/user/22071
tmpfs                                                      6.3G     0  6.3G   0% /run/user/10668

In this case, the /srv partition is almost full, so it is better to look for another stat1xxx host.

Checking the space used by your files

SSH to the host that you want to check and execute the following:

# Ensure that I am in my home directory, usually /home/your-username
# if not, please do cd /home/your-username
elukey@stat1007:~$ pwd
/home/elukey

elukey@stat1007:~$ du -hs
369M	.

For a detailed view:

# Ensure that I am in my home directory, usually /home/your-username
# if not, please do cd /home/your-username
elukey@stat1007:~$ pwd
/home/elukey

elukey@stat1007:~$ du -hs * | sort -h
[..]
164K	dump.out
648K	eventlogging_cleaner.log
7.5M	refinery
21M	python_env
49M	webrequest.stats.json
245M	spark2-2.3.1-bin-hadoop2.6

This gives a quick view of how much data you are storing, making it easy to delete files that are no longer needed.

Alternatively, you can use the tool ncdu, which provides a curses interface and lets you navigate the directory tree, deleting files as you encounter them.

Web publication

If you wish to publish a dataset or report from one of the analytics clients, you can place it in the /srv/published/ directory, which will make it available on the web in the equivalent place under analytics.wikimedia.org/published/. You can find more information on Data_Platform/Web publication.
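For example, publishing a report might look like the following. The file and subdirectory names are hypothetical; see the Web publication page for details such as how often the directory is synced to the web server:

```shell
# Hypothetical file and subdirectory names.
REPORT="$HOME/my_report.html"
DEST=/srv/published/reports
if [ -d /srv/published ]; then
    mkdir -p "$DEST"
    cp "$REPORT" "$DEST/"
    # The file then becomes available under the equivalent path:
    # https://analytics.wikimedia.org/published/reports/my_report.html
else
    echo "/srv/published not present; run this on an analytics client"
fi
```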

GPU usage

On stat1008 and stat1010 we have deployed an AMD GPU for T148843. The long-term plan is to make it available to all users logging in, but for the moment access is restricted to the POSIX group gpu-testers for better testing (and to avoid usage contention). Please reach out to the Analytics team if you wish to be added to the group to test the GPU for your use case.

Rsync between clients

On every stat host there is an rsync server that allows users to copy data from one host to another. A typical use case is moving a home directory. For example, here is how the user batman can copy all their data from stat1006 to stat1007:

batman@stat1007:~$ rsync --exclude "/.*/" -av stat1006.eqiad.wmnet::home/batman/ ~/

The key details:

  • The command is run on the destination host (stat1007 in this case)
  • --exclude "/.*/" excludes top-level hidden directories like ~/.conda, ~/.cache, and ~/.jupyter which usually shouldn't be copied between hosts
  • -av stat1006.eqiad.wmnet::home/batman/ specifies the path on the source host
  • ~/ is referring to the home directory on the destination host
  • This is substantially faster and more secure than using scp -3 on your laptop (e.g. scp -3 stat1006:/home/batman/ stat1007:/home/batman/)

Suppose batman needed to sync a notebook he modified on stat1007 (now the source) back to stat1006 (now the destination):

batman@stat1006:~$ rsync -av stat1007.eqiad.wmnet::home/batman/Untitled.ipynb ~/Untitled.ipynb

Please note that there is a limitation: the rsync daemon runs as the user nobody, so the files in your home directory must have permissions that allow it to read them; otherwise you'll see permission errors while copying. If you hit this problem and are unsure how to set permissions, please contact the Data Engineering team via IRC on Libera.chat (#wikimedia-analytics) or open a Phabricator task with the tag "Data-Engineering".

Common workflows (WIP)

This section is mostly written to help SREs understand how the hosts are used. Feel free to update this with your workflows!

Spark jobs

  • Fast-running jobs (1 hour or less): these are cached on the Spark workers or held in RAM on the stat hosts.
  • Expensive jobs (more than 1 hour): these write their output to disk, so even if the notebook or server stops, the Spark job will still complete and write the output, and the state can easily be recovered. The disadvantage is that you have to clean up your storage manually.