The hardware infrastructure page has the system description and configurations. As of February 2021, we are running Apache Bigtop 1.5, which includes Hadoop 2.10.1.
http://infolab.stanford.edu/~ullman/mmds/ch2.pdf is a good read describing some Hadoop & MapReduce fundamentals.
See the Hue documentation for inspecting jobs running on Hadoop and hunting down logs.
See the Administration page for servicing individual nodes or understanding the cluster better.
Hive is the most frequently used way to access data on our Hadoop cluster, although some have been using Spark, too.
When submitting jobs to Hadoop, you can specify a YARN queue. Each queue has different settings for allocating resources. Usually the default queue will be fine. As of April 19th 2021, we have the following:
- The queues are reduced to 4:
  - production - only for specific system users like
  - default - default queue for analytics-privatedata-users, for any job that doesn't belong to production
  - fifo - replaces the old sequential queue and is meant to run one job at a time. It was requested in the past by the Search Platform team, and it has since evolved to also target GPUs (more details below).
  - essential - reserved for the Analytics team for data loading jobs.
- The YARN scheduler moves from Fair to Capacity, to better control resources. Please read the upstream docs for more info if you are interested.
- Automatic mapping of users -> queues is added. For example, a job for analytics-product will be automatically inserted in the right queue (production) without requiring the user to specify it.
- Basic ACLs to control which users can submit or administer jobs in the various queues.
- The maximum job lifetime for the default queue is 7 days. This also affects Jupyter notebooks and other similar user applications. The idea is to have an automatic cleanup that doesn't rely on Analytics team members periodically patrolling the status of the YARN queues.
- Experimental support for GPUs in the
We have 6 Hadoop worker nodes, each equipped with one AMD GPU, but we have not been able to use them until now. The main problem with Hadoop < 3 (we have 2.10 right now) is that GPUs are not treated as resources like CPUs and memory, so a user cannot easily tell the job scheduler "I need this amount of GPU resources for my job". The workaround we have found is to label the Hadoop worker nodes that have a GPU, so that a user can ask to run their jobs on those specific worker nodes. The job itself needs to be able to take advantage of this extra capacity, for example by running tensorflow-rocm.
We decided to allow GPU usage only in the fifo queue, to avoid having multiple jobs using the GPU on the same node, since in the past we had some reliability problems on stat100x nodes with GPUs (these should have been solved by now, but we are not sure). Initially, only one job at a time will be allowed to run on GPUs; the others will be queued. This constraint will surely change in the future once we have more data from user experiments!
For example, let's say that we want to run a spark2-shell on GPUs:
spark2-shell --master=yarn --queue fifo --conf spark.yarn.am.nodeLabelExpression=GPU --conf spark.yarn.executor.nodeLabelExpression=GPU
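The queue and node-label options above can be wrapped in a small dry-run helper. This is only a sketch, not part of the cluster tooling: the function name `spark_shell_cmd` and its two arguments are hypothetical, and the fifo-only check simply mirrors the rule described above. It prints the command line instead of running it, so it can be tried anywhere.

```shell
# Hypothetical helper: print (do not run) a spark2-shell command line.
#   $1 = YARN queue name (defaults to "default")
#   $2 = "yes" to request the GPU-labelled worker nodes (defaults to "no")
spark_shell_cmd() {
  queue="${1:-default}"
  gpu="${2:-no}"
  cmd="spark2-shell --master=yarn --queue $queue"
  if [ "$gpu" = "yes" ]; then
    # GPU usage is only allowed in the fifo queue, so refuse anything else.
    if [ "$queue" != "fifo" ]; then
      echo "GPU jobs must be submitted to the fifo queue" >&2
      return 1
    fi
    # Same node-label settings as in the example above.
    cmd="$cmd --conf spark.yarn.am.nodeLabelExpression=GPU"
    cmd="$cmd --conf spark.yarn.executor.nodeLabelExpression=GPU"
  fi
  echo "$cmd"
}

# A regular job in the default queue.
spark_shell_cmd default
# A GPU job; this prints the same command as the example above.
spark_shell_cmd fifo yes
```

Since the helper only prints the command, the output can be reviewed and then pasted into a shell on a host where spark2-shell is available.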