What GPU model do we have? On what hosts?
The Analytics team added a GPU to stat1005 and one to stat1008. The model is AMD Radeon Pro WX 9100 16GB. AMD was chosen because it is currently the only vendor releasing its software stack as open source: https://rocm.github.io/ROCmInstall.html
There is also a GPU (same model) on each of the following Hadoop worker nodes: an-worker1096 to an-worker1101.
Use the Debian packages
Use the GPU on the host
All users in analytics-privatedata-users are automatically granted access to the GPUs; otherwise a user needs to be in the gpu-testers POSIX group in operations/puppet. This is a workaround to force the users in that group to be in the render POSIX group (available on Debian), which grants access to the GPU. Please keep a few things in mind:
- Be careful in launching multiple parallel jobs on the same GPU, see https://phabricator.wikimedia.org/T248574
- Miriam Redi is currently the main point of contact to decide what the schedule of GPU usage should be. In case of doubt, before starting any test or heavy job, please follow up with her.
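The group requirement described above can be checked from Python before starting a job. The following is a minimal sketch, assuming a Linux host and that membership in the render group is what ultimately grants device access; the helper names `current_group_names` and `has_gpu_access` are made up for illustration and are not part of any deployed tooling.

```python
import grp
import os


def current_group_names():
    """Return the names of the POSIX groups of the current process."""
    gids = set(os.getgroups()) | {os.getegid()}
    names = set()
    for gid in gids:
        try:
            names.add(grp.getgrgid(gid).gr_name)
        except KeyError:
            # The gid has no entry in the group database; skip it.
            continue
    return names


def has_gpu_access(group_names=None, required_group="render"):
    """Hypothetical helper: True if required_group is among the groups."""
    if group_names is None:
        group_names = current_group_names()
    return required_group in group_names


if __name__ == "__main__":
    print("render group present:", has_gpu_access())
```

Group changes only apply to new login sessions, so a freshly added user may need to log out and back in before the check succeeds.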
The easiest solution is to create a Python 3 virtual environment on stat1005 or stat1008 and then pip3 install tensorflow-rocm (https://pypi.org/project/tensorflow-rocm/). Please remember that every version of the package is linked against a specific version of ROCm, so newer versions of tensorflow-rocm may not run on our hosts, since we don't have an up-to-date version of ROCm deployed yet.
Upstream suggests following https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/RELEASE.md and checking each time which combination of tensorflow-rocm and ROCm is supported.
We have two versions of ROCm deployed:
- 4.2 on stat100[5,8] - Only tensorflow-rocm 2.5.0 is supported.
- 5.4.0 on the DSE K8s Cluster - Only tensorflow-rocm 22.214.171.1240 is supported.
# Example for stat100x nodes
virtualenv -p python3 test_tf
source test_tf/bin/activate
pip3 install tensorflow-rocm==2.5.0
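Since each ROCm release only works with specific tensorflow-rocm builds, it can help to keep the pairing in one place and derive the pip pin from it. This is a minimal sketch: the table holds only the stat100x pairing listed above, and `SUPPORTED_TENSORFLOW_ROCM` and `pip_requirement` are hypothetical names chosen for illustration.

```python
# ROCm version -> tensorflow-rocm version, copied from the list above.
# Only the stat100x pairing is filled in; extend it as deployments change.
SUPPORTED_TENSORFLOW_ROCM = {
    "4.2": "2.5.0",
}


def pip_requirement(rocm_version):
    """Return the pip pin for a ROCm version (hypothetical helper)."""
    try:
        tf_version = SUPPORTED_TENSORFLOW_ROCM[rocm_version]
    except KeyError:
        raise ValueError(
            f"no known tensorflow-rocm build for ROCm {rocm_version}"
        ) from None
    return f"tensorflow-rocm=={tf_version}"


if __name__ == "__main__":
    print(pip_requirement("4.2"))
```

Failing loudly on an unknown ROCm version is deliberate: installing an unpinned tensorflow-rocm is exactly the case that tends to break on hosts with an older ROCm.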
The fact that AMD forked the tensorflow PyPI package poses some challenges when using other Tensorflow-based packages in our infrastructure. Most of them list the tensorflow package as a dependency in their setup.py configuration, which means that pip will always try to install it (conflicting with tensorflow-rocm). We are testing a hack that tricks pip by installing an empty tensorflow package alongside tensorflow-rocm. This is the procedure:
- Create an empty tensorflow package. Its version needs to be the same as the version of tensorflow-rocm. A possible solution is:
$ mkdir test
$ cd test
$ cat > setup.py << EOF
from setuptools import setup, find_packages

setup(
    name = 'tensorflow',
    version='2.5.0',
    packages=find_packages(),
)
EOF
$ python3 setup.py bdist_wheel
$ ls dist/tensorflow-2.5.0-py3-none-any.whl
- Create your Python conda/venv environment as always.
- pip install /path/to/dist/tensorflow-2.5.0-py3-none-any.whl
- pip install tensorflow-rocm==2.5.0
- pip install etc.. [namely all packages that require tensorflow, the ones that you are interested in]
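The first step of the procedure above (writing the setup.py of the empty package) can also be scripted, so that the stub version always follows the tensorflow-rocm version in use. This is a sketch under that assumption only; `render_setup_py` and `write_stub_project` are illustrative names, and building the wheel is still done with python3 setup.py bdist_wheel as shown in the procedure.

```python
from pathlib import Path
import tempfile

SETUP_TEMPLATE = """\
from setuptools import setup, find_packages

setup(
    name='tensorflow',
    version='{version}',
    packages=find_packages(),
)
"""


def render_setup_py(version):
    """Return the setup.py text for an empty 'tensorflow' package."""
    return SETUP_TEMPLATE.format(version=version)


def write_stub_project(directory, version):
    """Create the directory if needed and write setup.py into it."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    setup_py = directory / "setup.py"
    setup_py.write_text(render_setup_py(version))
    return setup_py


if __name__ == "__main__":
    # Write into a throwaway directory and show the generated file.
    with tempfile.TemporaryDirectory() as tmp:
        print(write_stub_project(tmp, "2.5.0").read_text())
```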
You may need to solve some dependency issues when pip installing in this way, since installing packages separately may introduce conflicts that you wouldn't have in "regular" installs. Some useful tools:
- pip install pipdeptree; pipdeptree (to check your dependency tree)
- pip install pip --upgrade (to get the latest pip version)
Configure your Tensorflow script
By default, Tensorflow tasks take all available resources (both from the CPU and the GPU). In resource sharing settings, this might cause resources to saturate quickly and some processes to block before execution. When using Tensorflow scripts on our GPU machines, please make sure you add the following snippet to your code:
For Tensorflow version 2.0 and 2.1:
import tensorflow as tf
gpu_devices = tf.config.experimental.list_physical_devices('GPU')
# set_memory_growth takes a single device, so enable it on each GPU in turn
for device in gpu_devices:
    tf.config.experimental.set_memory_growth(device, True)

import tensorflow as tf
tf.config.gpu.set_per_process_memory_growth(True)
For prior versions:
import tensorflow as tf
tf_config = tf.ConfigProto()
tf_config.gpu_options.allow_growth = True
sess = tf.Session(config=tf_config)
Also, a good practice is to limit the number of threads used by your tensorflow code.
For Tensorflow version 2.0 and 2.1:
import tensorflow as tf
tf.config.threading.set_intra_op_parallelism_threads(10)  # or lower values
tf.config.threading.set_inter_op_parallelism_threads(10)  # or lower values
For prior versions:
import tensorflow as tf
tf_config = tf.ConfigProto(intra_op_parallelism_threads=10, inter_op_parallelism_threads=10)
sess = tf.Session(config=tf_config)
Check the version of ROCm deployed on a host
elukey@stat1005:~/test$ dpkg -l rocm-dev
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version      Architecture Description
+++-==============-============-============-=================================================
ii  rocm-dev       2.7.22       amd64        Radeon Open Compute (ROCm) Runtime software stack

elukey@stat1008:~$ dpkg -l rocm-dev
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version      Architecture Description
+++-==============-============-============-=================================================
ii  rocm-dev       3.3.0-19     amd64        Radeon Open Compute (ROCm) Runtime software stack
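When the ROCm version is needed inside a script (for instance to pick a matching tensorflow-rocm release), the dpkg -l output shown above can be parsed directly. This is a minimal sketch that relies only on the column layout visible in that output; `parse_dpkg_version` and `installed_rocm_version` are illustrative names.

```python
import subprocess


def parse_dpkg_version(dpkg_output, package="rocm-dev"):
    """Extract the version of an installed package from `dpkg -l` output."""
    for line in dpkg_output.splitlines():
        fields = line.split()
        # Installed packages are listed as:
        # ii <name> <version> <architecture> <description...>
        if len(fields) >= 3 and fields[0] == "ii":
            # Drop an optional ":arch" suffix on the package name.
            if fields[1].split(":")[0] == package:
                return fields[2]
    return None


def installed_rocm_version(package="rocm-dev"):
    """Run `dpkg -l` locally and return the version, or None if absent."""
    result = subprocess.run(
        ["dpkg", "-l", package], capture_output=True, text=True, check=False
    )
    return parse_dpkg_version(result.stdout, package)


if __name__ == "__main__":
    sample = (
        "ii  rocm-dev  3.3.0-19  amd64  "
        "Radeon Open Compute (ROCm) Runtime software stack"
    )
    print(parse_dpkg_version(sample))
```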
Check usage of the GPU
On the host (limited to analytics-privatedata-users):
elukey@stat1005:~$ sudo radeontop
- https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/559
- GPUs are not correctly handling multi-tasking - https://phabricator.wikimedia.org/T248574
Reset the GPU state
If the GPU gets stuck for some reason (unclean job completion, etc..) the following may happen:
- radeontop shows steady RAM usage (90%+ for example).
- tensorflow gets stuck when trying to execute jobs.
Usually rebooting the host works, but the following procedure might help as well:
- Run sudo /opt/rocm/bin/rocm-smi and get the id of the GPU (usually 1).
- Run sudo /opt/rocm/bin/rocm-smi --gpureset -d X (where X is the id of the GPU).
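If the reset is triggered from a maintenance script, a small wrapper can validate the GPU id before anything is executed. This is only a sketch: it builds the same rocm-smi command as above, defaults to a dry run, and `gpu_reset_command` and `reset_gpu` are hypothetical helper names.

```python
import subprocess

ROCM_SMI = "/opt/rocm/bin/rocm-smi"


def gpu_reset_command(gpu_id):
    """Build the reset command for one GPU id, as reported by rocm-smi."""
    if not isinstance(gpu_id, int) or gpu_id < 0:
        raise ValueError("gpu_id must be a non-negative integer")
    return ["sudo", ROCM_SMI, "--gpureset", "-d", str(gpu_id)]


def reset_gpu(gpu_id, dry_run=True):
    """Return the command; only execute it when dry_run is False."""
    command = gpu_reset_command(gpu_id)
    if not dry_run:
        subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    # Dry run: print the command that would be executed for GPU 1.
    print(" ".join(reset_gpu(1)))
```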
Upgrade the Debian packages
We import the Debian packages released by AMD for Ubuntu Xenial to the amd-rocm component in wikimedia-buster. Up to now (Oct 2021) there is one Debian package released by AMD that is not open source, hsa-ext-rocr-dev. It contains binary libraries for better image support in OpenCL, and we don't use it for obvious reasons. The package is sadly required by other packages, and upstream still hasn't made it optional (https://github.com/RadeonOpenCompute/ROCm/issues/761).
The solution found in https://phabricator.wikimedia.org/T224723 was to create a dummy package via Debian equivs to satisfy dependencies and please the apt install process. This means that every time a new ROCm release is out, the following procedure needs to be done:
Before starting to upgrade, please:
- Check https://github.com/RadeonOpenCompute/ROCm, there is a changelog for every version. Pay attention to breaking changes and to the supported OS versions.
- Check what version of ROCm is supported by what version of tensorflow-rocm. As indicated in a previous section, upstream builds a fork of tensorflow building/linking every version with a specific ROCm library version. The file build_rocm_python3 (please change the branch according to the version that you are targeting, or use this one for latest) should list a ROCM_INSTALL_DIR value, that will tell you what version of ROCm was used when releasing.
- At this point, it is better to involve people that use Tensorflow before targeting a specific release, to choose the best combination for our use cases. Miriam Redi is a good point of contact.
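The ROCM_INSTALL_DIR check mentioned above can be automated once the text of build_rocm_python3 has been fetched for the branch of interest. The sketch below makes a loud assumption about the file format: that the variable is assigned in shell syntax with the version embedded in the path (for example ROCM_INSTALL_DIR=/opt/rocm-4.2.0). The upstream layout may differ, and `rocm_version_from_script` is an illustrative name.

```python
import re

# Assumption: the script contains a shell assignment such as
#   ROCM_INSTALL_DIR=/opt/rocm-4.2.0
# The exact layout upstream may differ between branches.
_ROCM_DIR = re.compile(r"ROCM_INSTALL_DIR=\S*?rocm-(\d+(?:\.\d+)*)")


def rocm_version_from_script(script_text):
    """Return the ROCm version embedded in ROCM_INSTALL_DIR, or None."""
    match = _ROCM_DIR.search(script_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Hypothetical script excerpt, used only to exercise the parser.
    example = "#!/bin/bash\nROCM_INSTALL_DIR=/opt/rocm-4.2.0\n"
    print(rocm_version_from_script(example))
```

A None result means the assumed pattern did not match, in which case the file should be read by hand rather than trusting the script.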
Once you have a target release in mind:
1) Check http://repo.radeon.com/rocm/apt/ and see if a new version is out. If so, create a new component like:
Name: amd-rocmXX
Method: http://repo.radeon.com/rocm/apt/X.X/
Suite: xenial
Components: main>thirdparty/amd-rocmXX
UDebComponents:
Architectures: amd64
VerifyRelease: 9386B48A1A693C5C
ListShellHook: grep-dctrl -e -S '^([..cut..])$' || [ $? -eq 1 ]
Replace the XX wildcards with the version number of course.
2) ssh to apt1001, run puppet and check for updates (remember to replace the XX wildcards):
root@apt1001:/srv/wikimedia# reprepro --noskipold --ignore=forbiddenchar --component thirdparty/amd-rocmXX checkupdate buster-wikimedia
Calculating packages to get...
Updates needed for 'buster-wikimedia|thirdparty/amd-rocm|amd64':
[..]
'hsa-rocr-dev': newly installed as '1.1.9-87-g1566fdd' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/h/hsa-rocr-dev/hsa-rocr-dev_1.1.9-87-g1566fdd_amd64.deb
[..]
3) Find the new version of hsa-rocr-dev, since it is the only package in ROCm that requires a precise version of the hsa-ext-rocr-dev package (namely, its own version).
3) create a control file like the following on boron:
### Commented entries have reasonable defaults.
### Uncomment to edit them.
# Source: <source package name; defaults to package name>
Section: devel
Priority: optional
# Homepage: <enter URL here; no default>
Standards-Version: 3.9.2

Package: hsa-ext-rocr-dev
Version: 1.1.9-87-g1566fdd
Maintainer: Luca Toscano <email@example.com>
# Pre-Depends: <comma-separated list of packages>
# Depends: <comma-separated list of packages>
# Recommends: <comma-separated list of packages>
# Suggests: <comma-separated list of packages>
# Provides: <comma-separated list of packages>
# Replaces: <comma-separated list of packages>
Architecture: amd64
# Multi-Arch: <one of: foreign|same|allowed>
# Copyright: <copyright file; defaults to GPL2>
# Changelog: <changelog file; defaults to a generic changelog>
# Readme: <README.Debian file; defaults to a generic one>
# Extra-Files: <comma-separated list of additional files for the doc directory>
# Files: <pair of space-separated paths; First is file to include, second is destination>
#  <more pairs, if there's more than one file to include. Notice the starting space>
Description: dummy package to satisfy dependencies for hsa-rocr-dev
 hsa-rocr-dev-ext contains binary only and non open-source libraries
 .
Make sure the Version is the new one of hsa-rocr-dev and save.
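Since the dummy package version has to track hsa-rocr-dev exactly, the two steps above (reading the version from the checkupdate output and writing it into the control file) can be tied together. The sketch below is an illustration, not the tooling actually used: it only recognises the "newly installed as" wording shown in the sample output, renders a reduced control file without the commented defaults and the Maintainer field, and the function names are hypothetical.

```python
import re

CONTROL_TEMPLATE = """\
Section: devel
Priority: optional
Standards-Version: 3.9.2

Package: hsa-ext-rocr-dev
Version: {version}
Architecture: amd64
Description: dummy package to satisfy dependencies for hsa-rocr-dev
 hsa-rocr-dev-ext contains binary only and non open-source libraries
"""

# Matches the line format seen in the sample checkupdate output only.
_NEW_VERSION = re.compile(r"'hsa-rocr-dev': newly installed as '([^']+)'")


def hsa_rocr_dev_version(checkupdate_output):
    """Return the hsa-rocr-dev version announced by checkupdate, or None."""
    match = _NEW_VERSION.search(checkupdate_output)
    return match.group(1) if match else None


def render_control(version):
    """Render a reduced control file for the dummy hsa-ext-rocr-dev package."""
    return CONTROL_TEMPLATE.format(version=version)


if __name__ == "__main__":
    sample = "'hsa-rocr-dev': newly installed as '1.1.9-87-g1566fdd' (from 'amd-rocm'):"
    print(render_control(hsa_rocr_dev_version(sample)))
```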
4) build the package with
5) upload the package to reprepro (remember to replace the XX wildcards):
reprepro -C thirdparty/amd-rocmXX includedeb buster-wikimedia /home/elukey/hsa-ext-rocr-dev_1.1.9-87-g1566fdd_amd64.deb
6) Update the thirdparty/amd-rocmXX component (remember to replace the XX wildcards):
reprepro --noskipold --ignore=forbiddenchar --component thirdparty/amd-rocmXX update buster-wikimedia
7) From Bullseye onward there are some compatibility issues with stdc++, gcc and python libraries. The issue is described in https://github.com/RadeonOpenCompute/ROCm/issues/1125#issuecomment-925362329. The workaround for us is to create additional fake packages with equivs from the following control files:
Section: misc
Priority: optional
Standards-Version: 3.9.2
Package: fake-libgcc-7-dev
Version: 1.0
Provides: libgcc-7-dev, libstdc++-7-dev
Architecture: all
Description: Fake libgcc7-dev package to satisfy dependencies
 Fake libgcc7-dev package to satisfy dependencies

Section: misc
Priority: optional
Standards-Version: 3.9.2
Package: fake-libpython3.8
Version: 1.0
Depends: libpython3.9
Provides: libpython3.8, libpython3.8-minimal, libpython3.8-stdlib
Architecture: all
Description: Fake libpython3.8 package to satisfy dependencies
 Fake libpython3.8 package to satisfy dependencies
Build them as outlined above for the hsa-ext-rocr-dev package, and upload them to the new amd-rocmXX component as well. Puppet will take care of installing them, alongside the proper libraries (like gcc/stdc++ version 10).
8) Update the versions supported by the amd_rocm module in operations/puppet.
9) On the host that you want to upgrade:
sudo apt autoremove -y rocm-smi-lib migraphx miopengemm rocminfo hsakmt-roct rocrand hsa-rocr-dev rocm-cmake hsa-ext-rocr-dev rocm-device-libs hip_base hip_samples llvm-amdgpu comgr rocm-gdb rocm-dbgapi mivisionx
And then run puppet to install the new packages. Some quick tests to see if the GPU is properly recognized:
elukey@stat1005:~$ /opt/rocm/bin/rocminfo
[..]
elukey@stat1005:~$ /opt/rocm/opencl/bin/clinfo
[..]
elukey@stat1005:~$ /opt/rocm/bin/hipconfig
[..]
elukey@stat1005:~$ export https_proxy=http://webproxy:8080
elukey@stat1005:~$ virtualenv -p python3 test
elukey@stat1005:~$ source test/bin/activate
(test) elukey@stat1005:~$ pip3 install tensorflow-rocm

elukey@stat1008:~$ cat gpu_test.py
import tensorflow as tf

# Creates a graph.
with tf.device('/device:GPU:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

# Runs the op.
print(c)

(test) elukey@stat1008:~$ python gpu_test.py
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)