Analytics/Systems/Cluster/Hadoop

The hardware infrastructure page has the system description and configurations.

We run Cloudera's CDH5.

For background, http://infolab.stanford.edu/~ullman/mmds/ch2.pdf is a good read covering some Hadoop and MapReduce fundamentals.

Administration links

See the Hue documentation for inspecting jobs running on Hadoop and hunting down their logs.

See the Administration page for servicing individual nodes or understanding the cluster better.

For users

Hive is the most frequently used way to access data on our Hadoop cluster, although some users work with Spark as well.

Queues

When submitting jobs to Hadoop, you can specify a YARN queue. Each queue has different settings for allocating resources. The default queue is usually fine. If you expect to run a resource-intensive but low-priority job, you should probably put it in the nice queue.

Setting a queue for Hive: Analytics/Systems/Cluster/Hive/Queries#Run_long_queries_in_a_screen_session_and_in_the_nice_queue
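
As a rough illustration of what specifying a queue can look like, here is a minimal Python sketch that only builds the command lines and does not run them. It assumes Hive is running on MapReduce, where the queue is selected with the standard mapreduce.job.queuename property, and that Spark jobs are submitted with spark-submit and its --queue option. The helper names, the sample query and the script name are hypothetical; the linked Hive page remains the reference for the recommended workflow.

```python
import shlex


def hive_command(query, queue="default"):
    """Build a hive CLI invocation that runs `query` in the given YARN queue.

    Assumption: Hive runs on MapReduce, so the queue is chosen through the
    mapreduce.job.queuename property passed with --hiveconf.
    """
    return ["hive", "--hiveconf", f"mapreduce.job.queuename={queue}", "-e", query]


def spark_submit_command(script, queue="default"):
    """Build a spark-submit invocation that runs `script` on YARN in `queue`."""
    return ["spark-submit", "--master", "yarn", "--queue", queue, script]


if __name__ == "__main__":
    # Hypothetical low-priority job placed in the nice queue.
    for cmd in (
        hive_command("SELECT 1;", queue="nice"),
        spark_submit_command("my_job.py", queue="nice"),  # my_job.py is a placeholder
    ):
        # Print a shell-safe version of the command instead of executing it.
        print(" ".join(shlex.quote(part) for part in cmd))
```

Leaving the queue argument at its default value corresponds to the usual case described above; passing "nice" is only worthwhile for heavy jobs that can wait.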