This page shows how to check the load on the Analytics Cluster without being root, and how to detect whether it is stalling.
Detecting a stalling cluster using only a terminal
Log in to stat1007 and run
mapred job -list
. Each line of the output denotes a job. Look for jobs that have a non-zero
RsvdMem column. Those jobs are waiting for memory.
Then run the command again in 5 minutes. If the same jobs are still waiting for the same amount of memory, the cluster is most likely overwhelmed.
If the output of the two mapred runs completely agrees, the cluster has fully ground to a halt.
If you're in doubt about the cluster really being overwhelmed, repeat the procedure in 5 minutes.
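The comparison above can be sketched in shell. This is only an illustration: the sample output and the position of the RsvdMem column (assumed here to be the 8th field) are assumptions, so adjust the field number to the actual output of mapred job -list on the cluster.

```shell
# Hypothetical sample of `mapred job -list` output; the real column layout
# on the cluster may differ.
sample_output='JobId State StartTime UserName Queue Priority UsedMem RsvdMem
job_1 RUNNING 123 alice default NORMAL 4096M 0M
job_2 RUNNING 124 bob default NORMAL 2048M 8192M'

# Print the JobIds whose RsvdMem (assumed to be field 8) is non-zero,
# i.e. the jobs that are waiting for memory.
waiting_jobs=$(printf '%s\n' "$sample_output" | awk 'NR > 1 && $8 != "0M" { print $1 }')
echo "$waiting_jobs"
```

In real use you would capture the live output twice, five minutes apart, and compare the two lists (for example with diff) instead of using a canned sample.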
Detecting a stalling cluster using only a browser
Graphs to check: [no longer available as of December 2017]

The graphs showed memory usage over time. A nice, even pattern of bumps around a stable baseline signaled that everything was working fine and the cluster was running nicely. Pauses of differing length between bumps, bumps of differing shape, or an expected bump failing to appear indicated that the cluster was on the edge of being overwhelmed. A flat line meant the cluster had fully ground to a halt.
Further ways to detect a stalling cluster
Ganglia offers many graphs that expose this or that aspect of the cluster. Many of them are either hard to read, not as reliable as the one pointed to above, or only expose extreme failures. If you find other graphs more useful, use them, and add them here. But the above two methods are proven to be effective and rather immediate measures of the cluster getting overwhelmed.
What to do if the cluster is stalling?
If the cluster is stalling, ping the Analytics team in #wikimedia-analytics on IRC, or try to free up more resources yourself.
You can free up resources by:
- Killing some of your own jobs that take up lots of resources (check the output of mapred job -list and/or hadoop job -list to identify such jobs),
- Asking others to kill their resource-hungry jobs, or
- Adding more hardware to the cluster.
Do not:
- Kill someone else's job without asking them;
- Kill a job run by the hdfs user under the root.essential queue unless you're an analytics engineer;
- Pour WD-40 into the servers in an effort to un-stick them.
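As a minimal dry-run sketch of the first option, the snippet below prints (rather than executes) a kill command for each job owned by a given user. The helper name, the sample output, and the column positions (JobId as field 1, UserName as field 4) are illustrative assumptions; in real use you would pipe in the actual mapred job -list output and then run the printed commands by hand.

```shell
# Dry-run helper (hypothetical): print, rather than execute, a kill command
# for each job owned by the given user. Field positions are assumptions and
# may differ on the real cluster.
list_kill_commands() {
  awk -v user="$1" 'NR > 1 && $4 == user { print "mapred job -kill " $1 }'
}

# Illustrative sample of `mapred job -list` output.
sample='JobId State StartTime UserName Queue Priority UsedMem RsvdMem
job_1 RUNNING 123 alice default NORMAL 4096M 0M
job_2 RUNNING 124 bob default NORMAL 2048M 8192M'

cmds=$(printf '%s\n' "$sample" | list_kill_commands alice)
echo "$cmds"
```

Printing the commands first, instead of killing directly, gives you a chance to double-check that every listed job really is yours before anything is terminated.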
Once enough resources are freed, the cluster should pick up again on its own.
If the cluster was stalled for more than 30 minutes, please notify the Analytics team, so they can double-check that back-pressure on the pipelines did not cause errors that need manual cleanup.
- Analytics Admins on IRC
- ottomata, ironholds, ...?