Data Platform/Systems/Cluster/Hadoop/Load
This page shows how to check the load on the Analytics Cluster without being root and detect if it is getting stalled.
Detecting a stalling cluster using only a terminal
Connect to stat1007 and run:

mapred job -list

Each line of the output denotes a job. Look for jobs that have a non-zero value in the RsvdMem column; those jobs are waiting for memory.
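The RsvdMem check can be scripted. The sketch below runs against a hypothetical captured snapshot of the output; the column layout (RsvdMem as the tenth column) and all job ids and user names are assumptions made up for illustration, so verify them against your cluster's actual output first.

```shell
# Hypothetical sample of `mapred job -list` output; in practice, replace
# the heredoc with `mapred job -list > /tmp/joblist.txt`.
cat > /tmp/joblist.txt <<'EOF'
JobId State StartTime UserName Queue Priority UsedContainers RsvdContainers UsedMem RsvdMem NeededMem AMinfo
job_1400000000000_0001 RUNNING 1400000000 alice default NORMAL 10 0 20480M 0M 20480M http://rm/1
job_1400000000000_0002 RUNNING 1400000001 bob default NORMAL 2 4 4096M 8192M 12288M http://rm/2
EOF

# Print jobs whose RsvdMem (assumed to be column 10) is non-zero:
# these are the jobs waiting for memory.
awk 'NR > 1 && $10 != "0M" { print $1, "waiting for", $10 }' /tmp/joblist.txt
```

With the sample data above, only the second job is printed, since its RsvdMem is 8192M rather than 0M.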
Then run the command again in 5 minutes. If the same jobs are still waiting for the same amount of memory, the cluster is most likely overwhelmed.
If the output of the two mapred runs agrees completely, the cluster has fully ground to a halt.
If you're in doubt about the cluster really being overwhelmed, repeat the procedure in 5 minutes.
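The two-snapshot comparison can be wrapped in a small helper. This is a sketch, not an official tool: `report_stall` is a name made up here, and in real use the two input files would come from `mapred job -list` runs five minutes apart.

```shell
# Compare two job-list snapshots; identical output means no job made
# any visible progress between the two runs.
report_stall() {
    if diff -q "$1" "$2" >/dev/null; then
        echo "identical output: the cluster has fully ground to a halt"
    else
        echo "output changed: jobs are still making progress"
    fi
}

# Intended usage (requires a Hadoop client, e.g. on stat1007):
#   mapred job -list > /tmp/snap1.txt
#   sleep 300   # wait 5 minutes
#   mapred job -list > /tmp/snap2.txt
#   report_stall /tmp/snap1.txt /tmp/snap2.txt
```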
Detecting a stalling cluster using only a browser
Graphs to check: [no longer available as of December 2017]
(Summary of the removed graphs: a regular pattern of evenly spaced, similarly shaped bumps above a steady baseline signaled that everything was working fine and the cluster was working nicely. A graph with different pauses between bumps, bumps of different shape, and a missing bump showed the cluster on the edge of being overwhelmed, and a third graph showed the cluster having fully ground to a halt.)
Further ways to detect a stalling cluster
Ganglia offers many graphs that expose this or that aspect of the cluster. Many of them are either hard to read, not as reliable as the one I pointed to above, or they only expose extreme failures. If you find other graphs more useful, use them, and add them here. But the above two methods are proven to be effective and rather immediate measures of the cluster getting overwhelmed.
What to do if the cluster is stalling?
If the cluster is stalling, ping the Analytics team in the #wikimedia-analytics IRC channel, or try to free up more resources yourself.
You can free up resources by:
- Killing some of your jobs that take up lots of resources (check the output of mapred job -list and/or hadoop job -list to identify such jobs),
- Asking others to kill their resource-hungry jobs, or
- Adding more hardware to the cluster.
Do not:
- Kill someone else's job without asking them;
- Kill a job run by the hdfs user under the root.essential queue unless you're an analytics engineer;
- Pour WD-40 into the servers in an effort to un-stick them.
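To identify your own resource-heavy jobs, you can filter the job list by user name. The snippet below works on a hypothetical snapshot (the user names, job ids, and column layout are made-up examples; in practice feed it the real command's output minus its header lines), and it only prints the kill commands rather than running them, so you can review before killing anything.

```shell
# Hypothetical `mapred job -list` snapshot with header lines stripped.
cat > /tmp/jobs.txt <<'EOF'
job_1400000000000_0001 RUNNING 1400000000 alice default NORMAL 10 0 20480M 0M 20480M
job_1400000000000_0002 RUNNING 1400000001 bob default NORMAL 2 4 4096M 8192M 12288M
EOF

# Print (but do not run) a kill command for each of the given user's jobs.
# As the list above says: never kill someone else's job without asking.
awk -v user="bob" '$4 == user { print "mapred job -kill", $1 }' /tmp/jobs.txt
```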
Once enough resources are freed, the cluster should pick up again on its own.
If the cluster was stalled for more than 30 minutes, please notify the Analytics team, so they can double-check that back-pressure on the pipelines did not cause errors that need manual cleanup.
- Analytics Admins on IRC
- ottomata, ironholds, ...?