Data Platform/Systems/Hadoop

The hardware infrastructure page has the system description and configurations. As of August 2023, we are running a slightly modified version of Apache Bigtop 1.5 which includes Hadoop 2.10.2, so the most relevant documentation is at hadoop.apache.org/docs/r2.10.2/.

http://infolab.stanford.edu/~ullman/mmds/ch2.pdf is a good read describing some Hadoop & MapReduce fundamentals.

Administration links

See Hue documentation for jobs running on Hadoop and hunting down logs.

See the Administration page for servicing individual nodes or understanding the cluster better.

See Analytics/Cluster/Hadoop/Test for details about the test cluster.

Building Packages

See Bigtop Packages for the process of building and updating the packages we use on the Analytics Cluster

For users

Hive is the most frequently used way to access data on our Hadoop cluster, although some have been using Spark, too.

Queues

When submitting jobs to Hadoop, you can specify a YARN queue. Each queue has different settings for allocating resources. Usually the default queue will be fine. As of April 19th 2021, we have the following:

The queues are reduced to 4:
- production - only for specific system users like analytics, druid, analytics-search, analytics-product
- default - default queue for analytics-privatedata-users for any job that doesn't belong to production
- fifo - replaces the old sequential, and it is meant to run one job at the time. It was requested in the past by the Search Platform team, and now it has evolved also to target GPUs (more details on this later on).
- essential - reserved to the Analytics team for data loading jobs.
The Yarn scheduler moves to Fair to Capacity, to better control resources. Please read the upstream docs for more info if you are interested.
Automatic mapping of users -> queues is added. For example, a job for analytics-product will be automatically inserted in the right queue (production) without requiring the user to specify it.
Basic ACLs to improve what users can submit/admin jobs in the various queues.
The maximum job lifetime for the default queue is 7 days. This affects also Jupyter notebooks and other similar user applications. The idea is to have an automatic cleanup that doesn't rely on Analytics team members to patrol the Yarn queues status periodically.
Experimental support for GPUs in the fifo queue.

GPU Support

We have 2 hadoop worker nodes each one equipped with one AMD GPU, but we have never been able to use them up to now. The main problem with Hadoop < 3 (we have 2.10 right now) is that GPUs are not considered resources like CPUs and Memory, so a user cannot really easily tell to the job scheduler "I need this amount of GPU resources for my job". The workaround that we have found is to label the Hadoop worker nodes with a GPU mark, so that a user can ask to run their jobs on specific worker nodes, that we know have a GPU onboard. Their job will need to be smart to be able to leverage this extra capacity, for example running tensorflow-rocm. We decided to only allow the usage of GPUs in the fifo queue to avoid having multiple jobs on the same nodes using the GPU, since in the past we had some reliability problems on stat100x nodes with GPUs (that should have been solved by now, but we are not sure). Only one job at the time will be allowed (initially) to run on GPUs, the other ones will be queued up. This constraint will surely change in the future once we'll have more data from user experiments!

For example, let's say that we want to run a spark2-shell on GPUs:

spark2-shell --master=yarn --queue fifo --conf spark.yarn.am.nodeLabelExpression=GPU --conf spark.yarn.executor.nodeLabelExpression=GPU