What GPU model do we have? On what hosts?

There is a GPU on:

  • stat1005
  • stat1008
  • ml-serve1001 (Lift Wing testing)
  • dse-k8s-worker1001 (DSE testing)
  • an-worker110[12] (Hadoop worker nodes)

The model is AMD Radeon Pro WX 9100 16GB. The choice fell on AMD since they are currently the only vendor releasing their software stack as open source: https://github.com/RadeonOpenCompute

Do we have Nvidia GPUs?

The short answer is no, we are not planning to use (now or in the future) Nvidia cards. Since it is a vendor-specific stance, it is worth explaining why: the Nvidia drivers and tools are not open source and they conflict with the Foundation's policies. These are the main reasons:

  • Security: they rely on binary-only blobs (running in the Linux kernel) to work properly. Fixes for high-severity vulnerabilities that require a kernel patch and rebuild may not be rolled out promptly on nodes running Nvidia software, since the new kernel may not be compatible with the proprietary modules (one has to wait for Nvidia upstream updates before proceeding any further).
  • Ethical: the Wikimedia Foundation has a very firm policy on open-source software, and using proprietary-only hardware and software (when there is an alternative) is not an option. We risk not being compatible with all the libraries/tools/software that work with Nvidia CUDA (and related), but we accept that. We also go further, trying to promote open-source stacks to solve emergent Data and ML challenges (including working with upstream projects to add support for platforms like AMD ROCm).
  • Cost and availability: Nvidia cards tend to be more expensive than the available alternatives on the market, and due to their demand there may be times when their supply is reduced (also because supply tends to favor big players that order far more hardware than we do).
  • Debugging and updates: with open source projects it is easier and more effective to track down bugs/incompatibilities/issues and report them upstream (as we have already done multiple times with AMD). With proprietary software it is not that easy, and it is a challenge to get updates when required (since we need to wait for upstream releases and hope that they fix the specific issue).

At the time of writing (November 2023), Nvidia seems to be oriented towards releasing part of their stack under open source licenses, but this looks more like a rumor than a solid direction. In a future where both Nvidia and AMD provide open-source solutions, we'll surely revisit the choice.

Should we run Nvidia cards on cloud providers to bypass the above concerns?

Running in the cloud may be a solution for ad-hoc projects, but some considerations need to be made:

  • The SRE team and our infrastructure stack don't support any cloud provider at the moment. Wikimedia Enterprise is pioneering on AWS, but they run a completely separate stack from production and handle only public data. All the automation and security boundaries that we built for production would need to be re-created elsewhere, or at least a bare minimum of them, to consider a service running in the cloud maintainable and secure.
  • Running ML services with GPUs in the cloud is not cheap nowadays, so it would be a big investment in terms of engineering resources and money. It is not an impossible project, but we have to be realistic and weigh the pros and cons before taking any action, and evaluate whether they justify the cost it would take to implement the project.

Should we run Nvidia cards on a subset of hosts in Production with specific security rules and boundaries?

This is an option, but the SRE team wouldn't maintain the solution. This means that the team owning the Nvidia hardware would need to provide the same support that SRE provides in Production, most notably security-wise. For example, we mentioned above that high/critical security issues may require patching the Linux kernel, getting everything rebuilt and rolled out promptly to neutralize any attack surface. Without SRE support, the team owning this special cluster/hardware would have to take care of that extra workload too (and that would surely work against efficiency and transparency for the whole org/Foundation).

Use the Debian packages

See profile::statistics::gpu or the amd_rocm module in operations/puppet.

Use the GPU on the host

All users in analytics-privatedata-users are automatically granted access to the GPUs; otherwise a user needs to be added to the gpu-testers POSIX group in operations/puppet. This is a workaround to force the users in that group into the render POSIX group (available on Debian), which grants access to the GPU.
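
A quick way to check whether your user already has GPU access on a host is to look for the render group in your group list (a sketch, not an official procedure):

# Run on the GPU host (e.g. stat1008): empty output means no GPU access yet
id -nG | tr ' ' '\n' | grep -x render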

Use tensorflow

The easiest solution is to create a Python 3 virtual environment on stat1005 or stat1008 and then pip3 install tensorflow-rocm (https://pypi.org/project/tensorflow-rocm/). Please remember that every version of the package is linked against a specific version of ROCm, so newer versions of tensorflow-rocm may not run on our hosts if we don't have an up-to-date version of ROCm deployed yet.

Upstream suggested following https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/RELEASE.md and checking every time which combination of tensorflow-rocm and ROCm is supported.

We have two versions of ROCm deployed:

  • 4.2 on stat100[5,8] - Only tensorflow-rocm 2.5.0 is supported.
  • 5.4.0 on the DSE K8s Cluster and Lift Wing - Only tensorflow-rocm 2.11.0.540 is supported.
# Example for stat100x nodes
virtualenv -p python3 test_tf
source test_tf/bin/activate 
pip3 install tensorflow-rocm==2.5.0
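
As a quick sanity check (a sketch, assuming the virtualenv created above is active on a stat100x host), you can verify that Tensorflow actually sees the GPU:

# List the physical GPUs visible to Tensorflow; an empty list usually means a
# tensorflow-rocm / ROCm version mismatch.
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"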

Experimental

The fact that AMD forked the tensorflow pypi package poses some challenges when using other Tensorflow-based packages in our infrastructure. Most of them have the tensorflow package dependency in their setup.py configurations, which means that pip will always try to install it (conflicting with tensorflow-rocm). We are testing a hack that aims to trick pip, installing an empty tensorflow package alongside tensorflow-rocm. This is the procedure:

  1. Create an empty tensorflow package. The version that you use needs to be the same as tensorflow-rocm. A possible solution is:
    $ mkdir test
    $ cd test
    $ cat > setup.py << EOF
    from setuptools import setup, find_packages
    
    setup(
        name = 'tensorflow', 
        version='2.5.0', 
        packages=find_packages(),
    )
    EOF
    $ python3 setup.py bdist_wheel
    $ ls dist/tensorflow-2.5.0-py3-none-any.whl
    
  2. Create your Python conda/venv environment as always.
  3. pip install /path/to/dist/tensorflow-2.5.0-py3-none-any.whl
  4. pip install tensorflow-rocm==2.5.0
  5. pip install the remaining packages, namely all the packages that require tensorflow that you are interested in

You may need to solve some dependency issues when pip installing in this way, since installing packages separately may introduce conflicts that you wouldn't have in "regular" installs. Some useful tools:

  • pip install pipdeptree; pipdeptree (to check your dependency tree)
  • pip install pip --upgrade (to get the latest pip version)
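
To confirm that the trick worked, both the dummy tensorflow package and tensorflow-rocm should show up with matching versions (a quick sketch):

# Expect two entries, e.g. tensorflow 2.5.0 (the dummy) and tensorflow-rocm 2.5.0
pip list | grep -i tensorflow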

Configure your Tensorflow script

By default, Tensorflow tasks take all available resources (both from the CPU and the GPU). In resource-sharing settings, this might cause resources to saturate quickly and some processes to block before execution. When using Tensorflow scripts on our GPU machines, please make sure you add the following snippet to your code:

For Tensorflow version 2.0 and 2.1:

 import tensorflow as tf
 gpu_devices = tf.config.experimental.list_physical_devices('GPU')
 tf.config.experimental.set_memory_growth(gpu_devices[0], True)

or directly

 import tensorflow as tf
 tf.config.gpu.set_per_process_memory_growth(True)


For prior versions:

import tensorflow as tf
tf_config=tf.ConfigProto()
tf_config.gpu_options.allow_growth=True
sess = tf.Session(config=tf_config)


Also, a good practice is to limit the number of threads used by your tensorflow code.

For Tensorflow version 2.0 and 2.1:

import tensorflow as tf
tf.config.threading.set_intra_op_parallelism_threads(10) #or lower values
tf.config.threading.set_inter_op_parallelism_threads(10) #or lower values

For prior versions:

import tensorflow as tf
tf_config=tf.ConfigProto(intra_op_parallelism_threads=10,inter_op_parallelism_threads=10)
sess = tf.Session(config=tf_config)


Check the version of ROCm deployed on a host

elukey@stat1005:~/test$ dpkg -l rocm-dev
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version      Architecture Description
+++-==============-============-============-=================================================
ii  rocm-dev       2.7.22       amd64        Radeon Open Compute (ROCm) Runtime software stack
elukey@stat1008:~$ dpkg -l rocm-dev
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version      Architecture Description
+++-==============-============-============-=================================================
ii  rocm-dev       3.3.0-19     amd64        Radeon Open Compute (ROCm) Runtime software stack

Changelog in https://rocm-documentation.readthedocs.io/en/latest/Current_Release_Notes/Current-Release-Notes.html

Check usage of the GPU

On the host (limited to analytics-privatedata-users):

elukey@stat1005:~$ sudo radeontop
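
rocm-smi (the same tool used in the reset procedure further down this page) can also give a quick snapshot of GPU utilization, VRAM usage and temperature; a sketch:

sudo /opt/rocm/bin/rocm-smi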

In Grafana: https://grafana.wikimedia.org/d/ZAX3zaIWz/amd-rocm-gpu

Code available in: https://github.com/wikimedia/puppet/blob/production/modules/prometheus/manifests/node_amd_rocm.pp

Outstanding issues

Reset the GPU state

If the GPU gets stuck for some reason (unclean job completion, etc.) the following may happen:

  • radeontop shows steady RAM usage (90%+ for example).
  • tensorflow gets stuck when trying to execute jobs.

Usually rebooting the host works, but the following procedure might help as well:

  • run sudo /opt/rocm/bin/rocm-smi and get the id of the GPU (usually 1)
  • run sudo /opt/rocm/bin/rocm-smi --gpureset -d X (with X equal to the id of the GPU), as in the sketch below
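
Putting the two steps together, the reset sequence looks roughly like this (a sketch; the device id reported by rocm-smi may differ from the one used here):

# 1) List the GPUs and note the id of the stuck device (usually 1)
sudo /opt/rocm/bin/rocm-smi
# 2) Reset that device, replacing 1 with the id from the previous step
sudo /opt/rocm/bin/rocm-smi --gpureset -d 1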

Upgrade the Debian packages

We import the Debian packages released by AMD for Ubuntu Xenial into the amd-rocm component in wikimedia-buster. As of now (Oct 2021) there is one Debian package released by AMD that is not open source, hsa-ext-rocr-dev. It contains binary libraries for better image support in OpenCL, and we don't use it for obvious reasons. The package is sadly required by other packages, and upstream still hasn't made it optional (https://github.com/RadeonOpenCompute/ROCm/issues/761).

The solution found in https://phabricator.wikimedia.org/T224723 was to create a dummy package via Debian equivs to satisfy dependencies and please the apt install process. This means that every time a new ROCm release is out, the following procedure needs to be done:

Before starting to upgrade, please:

  • Check https://github.com/RadeonOpenCompute/ROCm, there is a changelog for every version. Pay attention to breaking changes and to the supported OSes.
  • Check what version of ROCm is supported by what version of tensorflow-rocm. As indicated in a previous section, upstream builds a fork of tensorflow, building/linking every version against a specific ROCm library version. The file build_rocm_python3 (please change the branch according to the version that you are targeting, or use this one for latest) should list a ROCM_INSTALL_DIR value, which will tell you what version of ROCm was used for the release (see the sketch after this list).
  • At this point, it is better to involve people that use Tensorflow before targeting a specific release, to choose the best combination for our use cases. Miriam Redi is a good point of contact.
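
A quick way to see which ROCm version a given tensorflow-rocm release was built against is to grep the build script for ROCM_INSTALL_DIR (a sketch; the URL below points to the develop-upstream branch and is only an example, adjust the branch/tag and path to the release you are targeting):

# Fetch build_rocm_python3 for the branch/tag of interest and look for the
# ROCm installation directory used at build time (example URL, adjust as needed)
curl -s https://raw.githubusercontent.com/ROCmSoftwarePlatform/tensorflow-upstream/develop-upstream/build_rocm_python3 | grep ROCM_INSTALL_DIR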

Once you have a target release in mind:

1) Check http://repo.radeon.com/rocm/apt/ and see if a new version is out. If so, create a new component like:

Name: amd-rocmXX
Method: http://repo.radeon.com/rocm/apt/X.X/
Suite: xenial
Components: main>thirdparty/amd-rocmXX
UDebComponents:
Architectures: amd64
VerifyRelease: 9386B48A1A693C5C
ListShellHook: grep-dctrl -e -S '^([..cut..])$' || [ $? -eq 1 ]

Replace the XX wildcards with the version number of course.

2) ssh to apt1001, run puppet and check for updates (remember to replace the XX wildcards):

root@apt1001:/srv/wikimedia# reprepro --noskipold --ignore=forbiddenchar --component thirdparty/amd-rocmXX checkupdate buster-wikimedia
Calculating packages to get...
Updates needed for 'buster-wikimedia|thirdparty/amd-rocm|amd64':
[..]
'hsa-rocr-dev': newly installed as '1.1.9-87-g1566fdd' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/h/hsa-rocr-dev/hsa-rocr-dev_1.1.9-87-g1566fdd_amd64.deb
[..]

3) Find the new version of hsa-rocr-dev, since it is the only ROCm package that requires a precise version of the hsa-ext-rocr-dev package (the dummy package needs to carry the same version).
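
One way to spot the new version is to grep the checkupdate output from the previous step (a sketch; remember to replace the XX wildcards):

reprepro --noskipold --ignore=forbiddenchar --component thirdparty/amd-rocmXX checkupdate buster-wikimedia | grep hsa-rocr-dev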

4) Create a control file like the following on boron:

### Commented entries have reasonable defaults.
### Uncomment to edit them.
# Source: <source package name; defaults to package name>
Section: devel
Priority: optional
# Homepage: <enter URL here; no default>
Standards-Version: 3.9.2

Package: hsa-ext-rocr-dev
Version: 1.1.9-87-g1566fdd
Maintainer: Luca Toscano <ltoscano@wikimedia.org>
# Pre-Depends: <comma-separated list of packages>
# Depends: <comma-separated list of packages>
# Recommends: <comma-separated list of packages>
# Suggests: <comma-separated list of packages>
# Provides: <comma-separated list of packages>
# Replaces: <comma-separated list of packages>
Architecture: amd64
# Multi-Arch: <one of: foreign|same|allowed>
# Copyright: <copyright file; defaults to GPL2>
# Changelog: <changelog file; defaults to a generic changelog>
# Readme: <README.Debian file; defaults to a generic one>
# Extra-Files: <comma-separated list of additional files for the doc directory>
# Files: <pair of space-separated paths; First is file to include, second is destination>
#  <more pairs, if there's more than one file to include. Notice the starting space>
Description: dummy package to satisfy dependencies for hsa-rocr-dev
 hsa-ext-rocr-dev contains binary-only and non-open-source libraries
 .

Make sure the Version is the new one of hsa-rocr-dev and save.

5) Build the package with equivs-build control (where control is the file created in the previous step)

6) Upload the package to reprepro (remember to replace the XX wildcards):

reprepro -C thirdparty/amd-rocmXX includedeb buster-wikimedia /home/elukey/hsa-ext-rocr-dev_1.1.9-87-g1566fdd_amd64.deb

7) Update the thirdparty/amd-rocmXX component (remember to replace the XX wildcards):

reprepro --noskipold --ignore=forbiddenchar --component thirdparty/amd-rocmXX update buster-wikimedia

8) From Bullseye onward there are some compatibility issues with stdc++, gcc and python libraries. The issue is described in https://github.com/RadeonOpenCompute/ROCm/issues/1125#issuecomment-925362329. The workaround for us is to create other fake packages via equivs from the following two control files:

Section: misc
Priority: optional
Standards-Version: 3.9.2

Package: fake-libgcc-7-dev
Version: 1.0
Provides: libgcc-7-dev, libstdc++-7-dev
Architecture: all
Description: Fake libgcc7-dev package to satisfy dependencies
 Fake libgcc7-dev package to satisfy dependencies

Section: misc
Priority: optional
Standards-Version: 3.9.2

Package: fake-libpython3.8
Version: 1.0
Depends: libpython3.9
Provides: libpython3.8, libpython3.8-minimal, libpython3.8-stdlib
Architecture: all
Description: Fake libpython3.8 package to satisfy dependencies
 Fake libpython3.8 package to satisfy dependencies

Build them as outlined above for the hsa-ext-rocr-dev package, and upload them to the new amd-rocmXX component as well. Puppet will take care of installing them, alongside the proper libraries (like gcc/stdc++ version 10).
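
For example (a sketch, assuming the two control files above were saved as fake-libgcc-7-dev.control and fake-libpython3.8.control; the filenames are hypothetical and the distribution/component names need to match your setup):

# Build the two dummy packages with equivs
equivs-build fake-libgcc-7-dev.control
equivs-build fake-libpython3.8.control
# Upload them to the new component on apt1001 (replace the XX wildcards and the
# distribution, e.g. bullseye-wikimedia, as appropriate)
reprepro -C thirdparty/amd-rocmXX includedeb bullseye-wikimedia fake-libgcc-7-dev_1.0_all.deb
reprepro -C thirdparty/amd-rocmXX includedeb bullseye-wikimedia fake-libpython3.8_1.0_all.deb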

9) Update the versions supported by the amd_rocm module in operations/puppet.

10) On the host that you want to upgrade:

sudo apt autoremove -y rocm-smi-lib migraphx miopengemm rocminfo hsakmt-roct rocrand hsa-rocr-dev rocm-cmake hsa-ext-rocr-dev rocm-device-libs hip_base hip_samples llvm-amdgpu comgr rocm-gdb rocm-dbgapi mivisionx

And then run puppet to install the new packages. Some quick tests to see if the GPU is properly recognized:

elukey@stat1005:~$ /opt/rocm/bin/rocminfo
[..]

elukey@stat1005:~$ /opt/rocm/opencl/bin/clinfo
[..]

elukey@stat1005:~$ /opt/rocm/bin/hipconfig
[..]
elukey@stat1005:~$ export https_proxy=http://webproxy:8080

elukey@stat1005:~$ virtualenv -p python3 test

elukey@stat1005:~$ source test/bin/activate

(test) elukey@stat1005:~$ pip3 install tensorflow-rocm

elukey@stat1008:~$ cat gpu_test.py
import tensorflow as tf
# Creates a graph.
with tf.device('/device:GPU:0'):
  a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
  b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
  c = tf.matmul(a, b)
# Runs the op.
print(c)

(test) elukey@stat1008:~$ python gpu_test.py
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)