Analytics/Systems/Clients

The production cluster has several servers which you can use to access the various private data sources and do general statistical computation. There are two types: the stat servers, designed for command-line use, and the SWAP servers, designed for Jupyter notebook use. Together, these are called the analytics clients, since they act as clients accessing data from various other databases.

Host          CPU cores   RAM   Disk space   GPU?   Data access
stat1004      16          32G   7.2TB        no     Hadoop, MariaDB, XML Dumps
stat1005      40          64G   7.2TB        yes    Hadoop, MariaDB, XML Dumps
stat1006      40          64G   7.2TB        no     MariaDB, XML Dumps
stat1007      32          64G   7.2TB        no     Hadoop, MariaDB, XML Dumps
notebook1003  32          64G   120GB        no     Hadoop, MariaDB, XML Dumps
notebook1004  32          64G   120GB        no     Hadoop, MariaDB, XML Dumps

You may need to access the internet from the analytics clients (for example, to install a Python package using pip). By default, this will fail because the machines are tightly firewalled; you'll have to use the HTTP proxy.
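
For example, a common approach is to export the proxy environment variables in your shell before running pip. This is a minimal sketch: the proxy address shown is the one usually documented on the HTTP proxy page (verify it there), and the package name is only an illustration.

# Assumed proxy address -- check the HTTP proxy documentation before relying on it.
export http_proxy=http://webproxy.eqiad.wmnet:8080
export https_proxy=http://webproxy.eqiad.wmnet:8080

# Install a Python package into your home directory (package name is illustrative).
pip install --user requests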

Local data storage

First, note that the Analytics clients store data using redundant RAID configurations, but are not otherwise backed up. Your home directory on HDFS (/user/your-username) is a safer place for important data.
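
For example, a quick sketch of copying an important file into your HDFS home directory (the file name is illustrative):

# Copy a local file to your HDFS home directory (file name is illustrative).
hdfs dfs -put my_dataset.tsv /user/your-username/

# Confirm that the copy is there.
hdfs dfs -ls /user/your-username/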

Please ensure that there is enough space on disk before storing big datasets/files. On the Analytics clients, the home directories are stored under the /srv partition, so use df -h regularly to check the space used. Some client nodes are more crowded than others, so please use the least-loaded client first (for example, by checking with df -h which stat host has the most free space).

Checking for available disk space

List of available nodes and their total disk space:

  • stat1004 - 7.2TB of total space
  • stat1005 - 7.2TB of total space
  • stat1006 - 7.2TB of total space
  • stat1007 - 7.2TB of total space
  • notebook100[3,4] - 120G of total space; not intended for storing data, only for use as Hadoop clients (if needed, store data on HDFS).

Here is an example of checking free space, using the stat1007 host:

elukey@stat1007:~$ df -h
Filesystem                                                 Size  Used Avail Use% Mounted on
udev                                                        32G     0   32G   0% /dev
tmpfs                                                      6.3G  666M  5.7G  11% /run
/dev/md0                                                    92G   16G   71G  19% /
tmpfs                                                       32G  1.2M   32G   1% /dev/shm
tmpfs                                                      5.0M     0  5.0M   0% /run/lock
tmpfs                                                       32G     0   32G   0% /sys/fs/cgroup

/dev/mapper/stat1007--vg-data                              7.2T  6.4T  404G  95% /srv                            <<=====================================<<

tmpfs                                                      6.3G     0  6.3G   0% /run/user/3088
tmpfs                                                      6.3G     0  6.3G   0% /run/user/13926
tmpfs                                                      6.3G     0  6.3G   0% /run/user/20171
fuse_dfs                                                   2.3P  1.8P  511T  78% /mnt/hdfs
tmpfs                                                      6.3G     0  6.3G   0% /run/user/18005
tmpfs                                                      6.3G   32K  6.3G   1% /run/user/17677
labstore1006.wikimedia.org:/srv/dumps/xmldatadumps/public   98T   59T   35T  64% /mnt/nfs/dumps-labstore1006.wikimedia.org
labstore1007.wikimedia.org:/                                97T   65T   28T  70% /mnt/nfs/dumps-labstore1007.wikimedia.org
tmpfs                                                      6.3G     0  6.3G   0% /run/user/22235
tmpfs                                                      6.3G     0  6.3G   0% /run/user/22071
tmpfs                                                      6.3G     0  6.3G   0% /run/user/10668

In this case, the /srv partition is almost full, so it is better to look for another stat1xxx host.
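
If you want to compare the stat hosts before picking one, a small loop like the following gives a quick overview. It assumes you can ssh to each host and that the usual .eqiad.wmnet domain applies; adjust to your SSH configuration.

# Check /srv usage on each stat host listed above.
for host in stat1004 stat1005 stat1006 stat1007; do
    echo "== ${host} =="
    ssh "${host}.eqiad.wmnet" df -h /srv
done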

Checking the space used by your files

Simply ssh to the host you want to check and run the following:

# Ensure that you are in your home directory, usually /home/your-username
# if not, run: cd /home/your-username
elukey@stat1007:~$ pwd
/home/elukey

elukey@stat1007:~$ du -hs
369M	.

For a detailed view:

# Ensure that you are in your home directory, usually /home/your-username
# if not, run: cd /home/your-username
elukey@stat1007:~$ pwd
/home/elukey

elukey@stat1007:~$ du -hs * | sort -h
[..]
164K	dump.out
648K	eventlogging_cleaner.log
7.5M	refinery
21M	python_env
49M	webrequest.stats.json
245M	spark2-2.3.1-bin-hadoop2.6

This gives a quick view of how much data you are storing and makes it easy to delete files that are no longer needed.
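
Note that du -hs * skips hidden files and directories (for example ~/.cache or virtual environments), which can take up significant space; a variant that includes them:

# Include hidden files and directories in the per-item sizes (glob errors silenced).
du -hs .[!.]* * 2>/dev/null | sort -h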

Web publication

If you wish to publish a dataset or report from one of the analytics clients, you can place it in the /srv/published/ directory, which will make it available on the web in the equivalent place under analytics.wikimedia.org/published/. You can find more information on Analytics/Web publication.
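
For example, a minimal sketch of publishing a report from one of the clients (the directory and file names are illustrative):

# Create a directory for your project and copy the report into it (names are illustrative).
mkdir -p /srv/published/my-project
cp my_report.html /srv/published/my-project/
# The file should then appear at:
#   https://analytics.wikimedia.org/published/my-project/my_report.html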

GPU usage

On stat1005 we have deployed an AMD GPU for T148843. The long-term plan is to make it available to all users, but for the moment access is restricted to the POSIX group gpu-testers so it can be tested properly (and to avoid usage contention, etc.). Please reach out to the Analytics team if you wish to be added to the group to test the GPU for your use case.
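
To check whether your user is already in that group on stat1005, you can run:

# Look for gpu-testers in your group memberships.
groups | grep gpu-testers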

JupyterHub usage

A JupyterHub server is available on every stat100x node except stat1007 (still in progress); see SWAP. Please note that stat1005 runs JupyterHub 1.1.0 (the latest upstream release), so please report any bugs you find. The goal is to upgrade all the other nodes as soon as they are ready for an OS upgrade.
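
Access to a JupyterHub instance is typically over an SSH tunnel, since the service only listens locally. The sketch below assumes the JupyterHub default port 8000; verify the exact port and hostnames against the SWAP documentation.

# Forward the (assumed) JupyterHub port from stat1005 to your local machine.
ssh -N -L 8000:127.0.0.1:8000 stat1005.eqiad.wmnet
# Then open http://localhost:8000 in your browser and log in with your shell account.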