Data Engineering/Systems/Cluster/Hadoop/Load

From Wikitech

This page shows how to check the load on the Analytics Cluster without being root and detect if it is getting stalled.

Detecting a stalling cluster using only a terminal

Connect to stat1007 and run

 mapred job -list

Each line of the output denotes a job. Look for jobs that have a non-zero RsvdMem column; those jobs are waiting for memory. Run the command again 5 minutes later. If the same jobs are still waiting for the same amount of memory, the cluster is most likely overwhelmed.

If the output of the two mapred runs completely agrees, the cluster has fully ground to a halt.

If you're in doubt about the cluster really being overwhelmed, repeat the procedure in 5 minutes.
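The comparison of two runs can also be scripted. The following is a minimal sketch, not an official tool. It assumes that the output of mapred job -list contains a header row naming the columns JobId and RsvdMem, and that every job row uses the same whitespace-separated layout. The function names (run_job_list, waiting_jobs, still_waiting) are made up for this illustration.

```python
import re
import subprocess
from typing import Dict


def run_job_list() -> str:
    """Return the raw text printed by `mapred job -list`."""
    result = subprocess.run(
        ["mapred", "job", "-list"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def waiting_jobs(listing: str) -> Dict[str, str]:
    """Map job ID -> RsvdMem for every job with a non-zero RsvdMem.

    Assumption: the listing has a header row naming the columns JobId and
    RsvdMem, and each job row has the same whitespace-separated layout.
    """
    job_col = mem_col = None
    waiting = {}
    for line in listing.splitlines():
        fields = line.split()
        if "JobId" in fields and "RsvdMem" in fields:
            job_col, mem_col = fields.index("JobId"), fields.index("RsvdMem")
            continue
        if mem_col is None or len(fields) <= max(job_col, mem_col):
            continue  # text before the header, or a blank/short line
        digits = re.match(r"\d+", fields[mem_col])
        if digits and int(digits.group()) > 0:
            waiting[fields[job_col]] = fields[mem_col]
    return waiting


def still_waiting(first: str, second: str) -> Dict[str, str]:
    """Jobs that wait for the same amount of memory in both listings."""
    before, after = waiting_jobs(first), waiting_jobs(second)
    return {job: mem for job, mem in before.items() if after.get(job) == mem}
```

To use it, call run_job_list() once, wait 5 minutes, call it again, and pass both results to still_waiting(). A non-empty result lists the jobs that were waiting for the same amount of memory in both runs, which is the sign of an overwhelmed cluster described above.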

Detecting a stalling cluster using only a browser

Graphs to check: [no longer available as of December 2017]

Further ways to detect a stalling cluster

Ganglia offers many graphs that expose one aspect of the cluster or another. Many of them are hard to read, less reliable than the one pointed to above, or only expose extreme failures. If you find other graphs more useful, use them and add them here. The two methods above, however, have proven to be effective and fairly immediate indicators of the cluster getting overwhelmed.

What to do if the cluster is stalling?

If the cluster is stalling, ping the Analytics team in #wikimedia-analytics on IRC, or try to free up more resources yourself.

You can free up resources by:

  • Killing some of your jobs that take up lots of resources (check output of mapred job -list and/or hadoop job -list to identify such jobs),
  • Asking others to kill their resource hungry jobs, or
  • Adding more hardware to the cluster.
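To spot resource-hungry jobs in the output of mapred job -list, the listing can be sorted by used memory. The sketch below is an illustration under assumptions: it expects a header row naming the columns JobId, UserName and UsedMem, a whitespace-separated layout, and the same memory unit on every row. The function name jobs_by_used_memory is hypothetical.

```python
import re
from typing import List, Tuple


def jobs_by_used_memory(listing: str) -> List[Tuple[str, str, int]]:
    """Return (job ID, user, UsedMem as a number) tuples, largest first.

    Assumption: the listing has a header row naming JobId, UserName and
    UsedMem, and all UsedMem values share one unit (for example "2048M").
    """
    wanted = ("JobId", "UserName", "UsedMem")
    cols = None
    jobs = []
    for line in listing.splitlines():
        fields = line.split()
        if all(name in fields for name in wanted):
            cols = [fields.index(name) for name in wanted]
            continue
        if cols is None or len(fields) <= max(cols):
            continue  # text before the header, or a blank/short line
        digits = re.match(r"\d+", fields[cols[2]])
        if digits:
            jobs.append((fields[cols[0]], fields[cols[1]], int(digits.group())))
    return sorted(jobs, key=lambda job: job[2], reverse=True)
```

The user name is kept in each tuple so that it is clear whose job it is before anyone is asked to kill it.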

Do not:

  • Kill someone else's job without asking them;
  • Kill a job run by the hdfs user under the root.essential queue unless you're an analytics engineer;
  • Pour WD-40 into the servers in an effort to un-stick them.

Once enough resources are freed, the cluster should pick up again on its own.

If the cluster was stalled for more than 30 minutes, please notify the Analytics team, so they can double-check that back-pressure on the pipelines did not cause errors that need manual cleanup.

Analytics Admins on IRC
ottomata, ironholds, ...?