The production cluster has several servers which you can use to access the various private data sources and do general statistical computation. They are called the analytics clients, since they act as clients accessing data from various other databases. To learn more about how to access these, refer to Analytics/Data access.
They can all provide hosted Jupyter notebooks.
|Host||OS||CPU cores||RAM||Disk Space||GPU|
- Main article: Analytics/Systems/Jupyter
Every client provides a hosted Jupyter environment for interactive notebooks and terminals.
- Main article: Data Engineering/Systems/Conda
We use Conda on the analytics clients to help folks create isolated environments to work in and install whatever packages they need. It's best to do all your work inside a Conda environment.
If you use Jupyter, this is all handled automatically. If you're working through a standard terminal, make sure to follow the instructions on the Conda page to create and activate environments.
The easiest way to query data on one of the analytics clients is to use one of the Wmfdata packages in a Jupyter environment.
For Python, there is Wmfdata-Python. It can access data through MariaDB, Hive, Presto, and Spark and has a number of other useful functions, like creating Spark sessions. For details, see the repository and particularly the quickstart notebook.
You may need to access the internet from the analytics clients (for example, to download a Python package using
pip). By default, this will fail because the machines are tightly firewalled. You'll have to use the HTTP proxy.
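As a rough sketch of what using a proxy looks like from a terminal, the snippet below sets the conventional http_proxy and https_proxy environment variables for a single command. The PROXY_URL value is a placeholder and the with_proxy helper is a hypothetical name; take the real address from the HTTP proxy page.

```shell
# PROXY_URL is a placeholder, not the real address: replace it with the
# value given on the HTTP proxy page.
PROXY_URL="http://proxy.example.invalid:8080"

# Run one command with the proxy variables set only for that command,
# so the rest of the shell session is left untouched.
with_proxy() {
    env http_proxy="$PROXY_URL" https_proxy="$PROXY_URL" "$@"
}

# Example usage (the package name is hypothetical):
#   with_proxy pip install some-package
```

Scoping the variables to one command avoids routing unrelated internal traffic through the proxy by accident.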
Local data storage
First, note that the Analytics clients store data using redundant RAID configurations, but are not otherwise backed up. Your home directory on HDFS (
/user/your-username) is a safer place for important data.
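A minimal sketch of copying an important file into your HDFS home directory is shown below. It assumes that your HDFS username matches your shell username and that the hdfs command-line client is available on the host; the backup_to_hdfs helper and the HDFS_CMD override (useful for a dry run) are hypothetical names.

```shell
# Copy one local file into /user/<your-username>/ on HDFS.
# HDFS_CMD is a hypothetical override; it defaults to the real hdfs client.
backup_to_hdfs() {
    dest="/user/$(id -un)/$(basename "$1")"
    # -f overwrites an existing file with the same name at the destination.
    ${HDFS_CMD:-hdfs} dfs -put -f "$1" "$dest"
}

# Example usage (hypothetical file name):
#   backup_to_hdfs ~/results/summary.csv
# Dry run that only prints the arguments that would be passed:
#   HDFS_CMD=echo backup_to_hdfs ~/results/summary.csv
```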
Please ensure that there is enough space on disk before storing big datasets/files.
Checking for available disk space
On all the Analytics clients the home directories are stored under the /srv partition, so the command df -h should be used regularly to check the space used. Some client nodes are more crowded than others, so please try to use the least used client first (for example, by checking with the aforementioned command which stat host has the most free space).
Here is an example to clarify the last point, using the stat1007 host:
elukey@stat1007:~$ df -h
Filesystem                     Size  Used Avail Use% Mounted on
udev                            32G     0   32G   0% /dev
tmpfs                          6.3G  666M  5.7G  11% /run
/dev/md0                        92G   16G   71G  19% /
tmpfs                           32G  1.2M   32G   1% /dev/shm
tmpfs                          5.0M     0  5.0M   0% /run/lock
tmpfs                           32G     0   32G   0% /sys/fs/cgroup
/dev/mapper/stat1007--vg-data  7.2T  6.4T  404G  95% /srv <<=====================================<<
tmpfs                          6.3G     0  6.3G   0% /run/user/3088
tmpfs                          6.3G     0  6.3G   0% /run/user/13926
tmpfs                          6.3G     0  6.3G   0% /run/user/20171
fuse_dfs                       2.3P  1.8P  511T  78% /mnt/hdfs
tmpfs                          6.3G     0  6.3G   0% /run/user/18005
tmpfs                          6.3G   32K  6.3G   1% /run/user/17677
labstore1006.wikimedia.org:/srv/dumps/xmldatadumps/public   98T   59T   35T  64% /mnt/nfs/dumps-labstore1006.wikimedia.org
labstore1007.wikimedia.org:/    97T   65T   28T  70% /mnt/nfs/dumps-labstore1007.wikimedia.org
tmpfs                          6.3G     0  6.3G   0% /run/user/22235
tmpfs                          6.3G     0  6.3G   0% /run/user/22071
tmpfs                          6.3G     0  6.3G   0% /run/user/10668
In this case, the /srv partition is almost full, so it is better to look for another stat1xxx host.
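The same check can be scripted. The sketch below reads the Use% column for the filesystem holding a given path and warns when it crosses a threshold. The function names and the default threshold of 90 are arbitrary choices for illustration, not an established convention on these hosts.

```shell
# Print the Use% of the filesystem that holds a path, as a bare integer.
# df -P forces the POSIX layout: one line per filesystem, Use% in column 5.
fs_use_percent() {
    df -P "$1" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Return 1 (and say so) when the filesystem is at or above the limit.
# The default limit of 90 is an arbitrary illustrative value.
check_space() {
    cs_limit="${2:-90}"
    cs_used=$(fs_use_percent "$1")
    [ -n "$cs_used" ] || return 2
    if [ "$cs_used" -ge "$cs_limit" ]; then
        echo "$1 is ${cs_used}% full; consider using another stat host"
        return 1
    fi
    echo "$1 is ${cs_used}% full"
}

# Example usage:
#   check_space /srv
```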
Checking the space used by your files
It is sufficient to ssh to the host that you want to check and execute the following:
# Ensure that I am in my home directory, usually /home/your-username
# if not, please do cd /home/your-username
elukey@stat1007:~$ pwd
/home/elukey
elukey@stat1007:~$ du -hs
369M .
For a detailed view:
# Ensure that I am in my home directory, usually /home/your-username
# if not, please do cd /home/your-username
elukey@stat1007:~$ pwd
/home/elukey
elukey@stat1007:~$ du -hs * | sort -h
[..]
164K dump.out
648K eventlogging_cleaner.log
7.5M refinery
21M python_env
49M webrequest.stats.json
245M spark2-2.3.1-bin-hadoop2.6
This gives a quick view of how much data you are storing, so that you can delete files that are no longer needed.
Alternatively, you can use the tool ncdu, which provides a curses interface and lets you navigate around the directory tree and delete files as you encounter them.
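One caveat with the du -hs * listing is that the * glob skips hidden entries such as dot-directories, which can hold a lot of data. The sketch below is one possible way to include them; the largest_entries function name and its defaults are hypothetical.

```shell
# List the N largest entries directly under a directory, hidden ones included.
# Sizes are printed in KiB, largest first.
largest_entries() {
    le_dir="${1:-$HOME}"
    le_count="${2:-10}"
    # -mindepth 1 -maxdepth 1 selects only the direct children of the directory;
    # du -sk prints one summarised size per child.
    find "$le_dir" -mindepth 1 -maxdepth 1 -exec du -sk {} + | sort -rn | head -n "$le_count"
}

# Example usage:
#   largest_entries ~ 5
```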
If you wish to publish a dataset or report from one of the analytics clients, you can place it in the
/srv/published/ directory, which will make it available on the web in the equivalent place under analytics.wikimedia.org/published/. You can find more information on Analytics/Web publication.
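To illustrate the path mapping described above, the sketch below turns a local path under /srv/published/ into the corresponding web address. The published_url function name is hypothetical, and the https scheme is an assumption; see Analytics/Web publication for the authoritative details.

```shell
# Print the web address that corresponds to a file under /srv/published/.
# The https scheme is an assumption, not stated in the text above.
published_url() {
    case "$1" in
        /srv/published/*)
            printf 'https://analytics.wikimedia.org/published/%s\n' "${1#/srv/published/}"
            ;;
        *)
            echo "not under /srv/published/: $1" >&2
            return 1
            ;;
    esac
}

# Example usage (hypothetical file name):
#   published_url /srv/published/datasets/example/report.csv
```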
On stat1005 we have deployed an AMD GPU for T148843. The long-term plan is to make it available to all users who log in, but for the moment access is restricted to the POSIX group
gpu-testers so that it can be tested properly and to avoid contention over its use. Please reach out to the Analytics team if you would like to be added to the group to test the GPU for your use case.
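If you are not sure whether your account is already in that group, a check along the following lines should work on the host itself. The in_group helper is a hypothetical name; it relies only on the standard id command.

```shell
# Succeed if the current user belongs to the named group.
in_group() {
    # id -nG prints the names of all of the user's groups on one line;
    # tr puts each name on its own line so grep -x can match it exactly.
    id -nG | tr ' ' '\n' | grep -qx -- "$1"
}

# Example usage:
#   in_group gpu-testers && echo "GPU access group: yes" || echo "GPU access group: no"
```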
Rsync between clients
On every stat/notebook host there is an rsync server that allows users to copy data from one host to another. A typical use case is moving a home directory. For example, here is how user
batman can copy his data from notebook1004 to stat1005:
batman@stat1005:~$ rsync --exclude venv/ --exclude R/ --exclude ".*/" -av notebook1004.eqiad.wmnet::home/batman/ ~/
Please note that there is a limitation: the rsync daemon runs as user
nobody, so the files in the home directory must have permissions that allow that user to read them; otherwise you'll see permission errors while copying. If you are seeing this problem and you are unsure about how to set permissions, please contact the Analytics team via IRC on Libera.chat (#wikimedia-analytics) or open a Phabricator task with the tag "Analytics".
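Before starting a copy, it can help to see which entries the daemon would be unable to read. The sketch below assumes that nobody is neither the owner of your files nor a member of their group, so only the "other" permission bits apply; the not_world_readable function name is hypothetical. Run it on the source host.

```shell
# List entries under a directory that a user covered only by the "other"
# permission bits could not read. Octal 004 is o+r and 001 is o+x.
not_world_readable() {
    # Files need o+r; directories need both o+r (to list) and o+x (to enter).
    find "$1" \( ! -perm -004 -o \( -type d ! -perm -001 \) \) -print
}

# Example usage (hypothetical directory name):
#   not_world_readable ~/project
```

Anything it prints is likely to cause a permission error during the copy. Whether to loosen permissions on those files is a judgment call, since doing so makes them readable by every user on the host.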