Analytics/Cluster/Hadoop/Alerts

Please note that each section's anchor link is used in puppet when defining the related alert, so if you change any of them remember to follow up with a puppet change!

HDFS Namenode RPC length queue

The HDFS Namenode handles operations on HDFS via RPCs (getfileinfo, mkdir, etc..) and it has a fixed number of worker threads dedicated to handling the incoming RPCs. Every RPC enters a queue, and then it is processed by a worker. If the queue length grows too much, the HDFS Namenode starts to lag when answering clients and Datanode health checks, and it may also end up thrashing due to heap pressure and GC activity. When Icinga alerts for the RPC queue being too long, it is usually sufficient to do the following:

ssh an-master1003.eqiad.wmnet

tail -f /var/log/hadoop-hdfs/hdfs-audit.log

You will see a ton of entries logged every second, but it is usually very easy to spot a user making a ton of subsequent requests (a one-liner to aggregate the log by user is sketched after this list). Issues that have happened in the past:

  • Too many getfileinfo RPCs sent (scanning directories with a ton of small files)
  • Too many small/temporary files created in a short burst (order of Millions)
  • etc..
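
To get a quick per-user breakdown of the recent audit log entries, something along these lines can help (a sketch: the exact field layout of hdfs-audit.log may vary, so adjust the pattern if needed):

# count recent audit log entries per user (ugi field), highest first
tail -n 200000 /var/log/hadoop-hdfs/hdfs-audit.log | grep -o 'ugi=[^[:space:]]*' | sort | uniq -c | sort -rn | head

The top entries usually point straight at the user (or service) hammering the Namenode.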

Once the user hammering the Namenode is identified, check on yarn.wikimedia.org if there is something running for the same user, and kill it asap if the user doesn't answer in a few minutes. We don't care what the job is doing, the availability of the HDFS Namenode comes first :)

HDFS topology check

The HDFS Namenode has a view of the racking details of the HDFS Datanodes, and it uses it to establish how best to spread blocks and their replicas to get the best reliability and availability. The racking details are set in puppet's hiera, and if a Datanode is not added to them for any reason (new node, accidental changes, etc..) the Namenode will put it in the "default rack", which is not optimal.

A good follow up to this alarm is to:

1) SSH to an-master1003 (or, if it is a different cluster, check where the Namenode runs), run sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -printTopology and look for hosts in the default rack (see the sketch after this list).

2) Check for new nodes in the Hadoop hiera config (hieradata/common.yaml in puppet).
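
For step 1, a quick way to spot any host that ended up in the default rack (a sketch, assuming the rack is reported with its usual /default-rack name):

# print only the default-rack section of the topology, if any
sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -printTopology | grep -iA 10 'default-rack'

If the command prints nothing, all Datanodes are racked properly.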

No active HDFS Namenode running

Normally there are two HDFS Namenodes running, one active and one standby. If neither of them is in the active state, we get an alert, since the Hadoop cluster cannot function properly.

A good follow up to this alarm is to ssh to the Namenode hosts (for example, an-master1003.eqiad.wmnet and an-master1004.eqiad.wmnet for the main cluster) and check /var/log/hadoop-hdfs. You should find a log file related to what's happening; look for exceptions or errors.
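
For example, something along these lines (a sketch: the exact log file name depends on the host, since it embeds the short hostname):

# show the most recent errors/exceptions in the Namenode log
sudo grep -iE 'error|exception' /var/log/hadoop-hdfs/hadoop-hdfs-namenode-$(hostname -s).log | tail -n 50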

To be sure that it is not a false alert, check the status of the Namenodes via:

sudo /usr/local/bin/kerberos-run-command hdfs /usr/bin/hdfs haadmin -getServiceState an-master1003-eqiad-wmnet
sudo /usr/local/bin/kerberos-run-command hdfs /usr/bin/hdfs haadmin -getServiceState an-master1004-eqiad-wmnet

HDFS FSImage too old

Check the age of the FSImage on the node sending the alert:

sudo ls -al /srv/hadoop/name/current
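
To focus on the FSImage files themselves and their timestamps, something like this may help (a sketch: the FSImage files are normally named fsimage_<txid>):

# most recent fsimage files first
sudo ls -lt /srv/hadoop/name/current | grep fsimage | head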

Identify whether the node sending the alert is the standby or the active node:

sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1002-eqiad-wmnet

If the issue comes from the standby node, it means that the FSImage checkpoint has not been created.

If the issue comes from the active node, it means that the transfer of the FSImage from the standby to the active node has failed.

If the standby node is down, start it and make sure it reaches the standby state with safemode off.

Then force a checkpoint run from the standby node:

sudo -u hdfs /usr/bin/hdfs dfsadmin -safemode enter  
sudo -u hdfs /usr/bin/hdfs dfsadmin -saveNamespace  
sudo -u hdfs /usr/bin/hdfs dfsadmin -safemode leave

If the standby node is up, look at the Namenode logs in /var/log/hadoop-hdfs to identify the issue.

You can also force a checkpoint run (as above) to observe the behavior.

HDFS corrupt blocks

In this case, the HDFS Namenode is registering blocks that are corrupted. This is not necessarily serious, since it may just be due to faulty Datanodes, so before worrying check:

  1. How many corrupt blocks there are. We have very sensitive alarms, and keep in mind that we handle millions of blocks.
  2. What files have corrupt blocks. This can be done via sudo -u hdfs kerberos-run-command hdfs hdfs fsck / -list-corruptfileblocks on the Hadoop master nodes (an-master1003.eqiad.wmnet and an-master1004.eqiad.wmnet for the main cluster).
  3. If a roll restart of Hadoop HDFS Datanodes/Namenodes is in progress, or if one was performed recently. In the past this was a source of false positives due to the JMX metric temporarily reporting weird values. In this case always trust what the fsck command above tells you; from past experience it is way more reliable than the JMX metric.

Depending on how bad the situation is, fsck may or may not solve the problem (check the fsck documentation for how to run it to repair corrupt blocks, if needed). If the issue is related to a specific Datanode host, it may need to be depooled by an SRE.
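
To see exactly which blocks of a given file are affected and on which Datanodes they live, a more targeted fsck can help (a sketch; replace the path with one of the files reported above):

# show blocks and their locations for a specific file
sudo -u hdfs kerberos-run-command hdfs hdfs fsck /path/to/suspect/file -files -blocks -locations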

HDFS missing blocks

In this case, the HDFS Namenode is registering blocks that are missing, namely blocks for which no replica is available (hence the data they carry is not available at all). Some useful steps:

  1. Check how many missing blocks there are. We have very sensitive alarms, and keep in mind that we handle millions of blocks.
  2. Check what files have missing blocks. This can be done via sudo -u hdfs kerberos-run-command hdfs hdfs fsck / on the Hadoop master nodes (an-master1003.eqiad.wmnet and an-master1004.eqiad.wmnet for the main cluster), filtering for missing blocks.

At this point there are two cases: either the blocks are definitely gone for some reason (in that case check the HDFS documentation about what to do, like removing the references to those files to fix the inconsistency), or they are temporarily gone (for example if multiple Datanodes are down for network reasons).
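
If the blocks turn out to be permanently lost and the data owners agree that the affected files can be dropped, fsck can remove the dangling references (a sketch; this is destructive, so only run it after obtaining consent from the owners of the data):

# remove the references to files whose blocks are permanently lost (destructive!)
sudo -u hdfs kerberos-run-command hdfs hdfs fsck /path/with/missing/blocks -delete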

HDFS total files and heap size

We have always used https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.3/bk_command-line-installation/content/configuring-namenode-heap-size.html to increase the heap size of the HDFS Namenodes after certain thresholds of stored files. This alarm is meant as a reminder to bump the heap size when needed.

Next steps:

Number of files (millions)    Total Java Heap (Xmx and Xms)    Young Generation Size (-XX:NewSize -XX:MaxNewSize)
50-70                         36889m                           4352m
70-100                        52659m                           6144m
100-125                       65612m                           7680m
125-150                       78566m                           8960m
150-200                       104473m                          8960m

The above settings are for the CMS GC, whereas we are using G1GC, so please also refer to the info in https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.2/bk_hdfs-administration/content/ch_g1gc_garbage_collector_tech_preview.html.
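
To get a rough idea of how many files and directories are currently stored (a sketch: this is a single getContentSummary call that walks the whole namespace on the Namenode, so don't hammer it; the Namenode also exposes this as the FilesTotal JMX metric, which is likely graphed on the Hadoop Grafana dashboard):

# output columns: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
sudo -u hdfs kerberos-run-command hdfs hdfs dfs -count /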

HDFS Capacity Remaining

This alert indicates that the available capacity on the HDFS cluster has dropped below a given threshold and is a concern. The thresholds are currently 15% for a warning, 10% for a critical alert, and 5% for a paging-level incident.

The first thing to do if this alert fires is not to panic. Check to see via the Hadoop Dashboard in Grafana whether the usage increase is rapid, or whether there has been sustained growth over some time.

This rate of growth, combined with the severity level of the alert, will inform you as to how urgently action is required.

A Phabricator ticket should be created for this work, but it is currently not done automatically.

Remember: always try to obtain consent from relevant stakeholders before performing any destructive operations on the data itself.

Paging

If the first alert is a page, this would suggest that there has been a rapid increase in disk space consumption and that action may be required urgently in order to prevent the cluster reaching 100% of capacity.

Try to find out any causes of rapid growth, before rectifying them by moving data away from HDFS or by deleting it.

  • You can check the HDFS audit logs. Each of the masters (an-master1003 and an-master1004) retains a real-time log of actions performed (e.g. create, delete, setPermission). These files are located at /var/log/hadoop-hdfs/hdfs-audit.log
  • Check the YARN web interface for any unusual jobs running that might be producing data at a rapid rate.
  • Check for any user jobs running on the stat boxes, which might have high network I/O if they are in the process of writing data to HDFS.
  • Check for any unusual behaviour from an-launcher1002 which might indicate a problem with a regular scheduled task.
  • An invocation of hdfs dfs -du such as the following can be used to obtain information about the use of space. This example will list the 30 largest entries of a given path (/user), in reverse order of size.

btullis@an-launcher1002:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfs -du -h /user | awk '{print $1$2,$3}' | sort -hr | head -n 30

Critical

If the first alert is critical, it would imply that the cluster has suddenly dropped to less than 10% of capacity free, which is a cause for significant concern. However, it may not require immediate action to clear space: the cluster can still operate efficiently at this level of utilization. The same practices as for a paging-level alert apply, so work should be tracked via a ticket, the rate of change should be assessed from the Hadoop Dashboard in Grafana, and any obvious causes for a sudden increase in utilization should be investigated as soon as practicable.

If the rate of growth is low then investigation should move toward finding where capacity might be freed. It is likely that you would need to contact the owners of any particularly large and/or stale data in order to enquire whether it can be removed or archived to a different location.

Warning

A warning level alert indicates that the threshold value of 15% free capacity has been breached, however it does not necessarily indicate that any data needs to be deleted immediately. This could be a transient condition or it could be an indicator that the capacity of the cluster needs to be expanded. Create a ticket for the investigation and check the rate of growth from the Hadoop Dashboard in Grafana. If the rate of growth is low then perhaps the priority and courses of action could be assessed during weekly team meetings.

Unhealthy Yarn Nodemanagers

On every Hadoop worker node there is a daemon called the Yarn Nodemanager, which is responsible for managing vcores and memory on behalf of the Resourcemanager. If multiple Nodemanagers are down, it means that jobs are probably not being scheduled on the affected nodes, reducing the performance of the cluster. Check the https://yarn.wikimedia.org/cluster/nodes/unhealthy page to see what nodes are affected, and ssh to them to check the Nodemanager's logs (/var/log/hadoop-yarn/..)
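
The same information is also available from the command line (a sketch, assuming the yarn user/keytab can be used with kerberos-run-command just like hdfs in the examples above):

# list the Nodemanagers that Yarn currently considers unhealthy
sudo -u yarn kerberos-run-command yarn yarn node -list -states UNHEALTHY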

HDFS Namenode backup age

On the HDFS Namenode standby host (an-master1004.eqiad.wmnet) we run a systemd timer called hadoop-namenode-backup-fetchimage that periodically executes hdfs dfsadmin -fetchImage. The hdfs command pulls the most recent HDFS FSImage from the HDFS Namenode active host (an-master1003.eqiad.wmnet) and saves it under a specific directory. Please ssh to an-master1004 and check the logs of the timer with journalctl -u hadoop-namenode-backup-fetchimage to see what the current problem is.
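
For example (a sketch; the exact unit name to pass to journalctl may differ slightly from what systemd shows on the host):

# when did the timer last fire, and when will it fire next?
systemctl list-timers | grep hadoop-namenode-backup-fetchimage
# recent output/errors of the backup job
sudo journalctl -u hadoop-namenode-backup-fetchimage --since '2 days ago'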

HDFS Namenode process

There are two HDFS Namenodes for the main cluster, running on an-master100[3,4].eqiad.wmnet. They control the metadata and the state of the distributed file system, and they also handle client traffic. They are basically the head of the HDFS file system: without them no read/write operation can be performed on HDFS. Usually they are both up, one in the active state and the other one in the standby state, and you can check them via:

elukey@an-master1003:~$  sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1003-eqiad-wmnet
active
elukey@an-master1003:~$  sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1004-eqiad-wmnet
standby

If one of them goes down, there are two cases:

  • It is the active, so a failover happens and the previous standby is elected the new active.
  • It is the standby, so no failover happening.

In both cases, if at least one Namenode is up, the cluster should be able to keep working fine. It is nonetheless a situation that needs to be fixed asap, so here are a couple of things to do first:

  • Figure out how many Namenodes are down, and if one is active. You can use the commands above for a quick check. If both Namenodes are down we have another alert, so it should be clear in case.
  • Check the logs on the host with the failed Namenode. You should be able to find them in /var/log/hadoop-hdfs/hadoop-hdfs-namenode-an-master100[3,4].log. Look for anything weird, especially exceptions etc..
  • Check if other alerts related to the Namenodes are pending. For example, the RPC queue alert described above firing at the same time could well point at the root cause of the Namenode being down.
  • Check the Namenode panel in grafana to see if metrics can help.
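
A quick way to check whether the Namenode process is running at all on the affected host (a sketch, assuming the systemd unit is named hadoop-hdfs-namenode as in the standard packaging):

# is the Namenode unit running? if it crashed, the status output usually shows why
sudo systemctl status hadoop-hdfs-namenode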

This situation will likely need an SRE, so feel free to escalate early! It might take time for people to join, and you can do some investigation in the meantime.

HDFS ZKFC process

There are two HDFS ZKFC processes, running on the master nodes an-master100[3,4].eqiad.wmnet. Their job is to periodically health check the HDFS Namenodes, and to trigger automatic failovers in case something fails (using Zookeeper to hold locks). Usually they are both up, and if one of them is down it may not be a big issue (the Namenodes can live without them for a bit). The first thing to do is to ssh to the master nodes and check /var/log/hadoop-hdfs/hadoop-hdfs-zkfc-an-master100[3,4].log, looking for Exceptions or similar problems.

This situation will likely need an SRE, so feel free to escalate early! It might take time for people to join, and you can do some investigation in the meantime.

HDFS Datanode process

There are a lot of HDFS Datanodes in the cluster, one on each worker node, so nothing major happens if a few of them are down. The Datanodes are responsible for managing the HDFS blocks on behalf of the Namenodes, providing them to clients when needed (or accepting writes from clients for new blocks). Clients have to first talk to the Namenodes to authenticate and get permission to perform a certain operation on HDFS (read/write/etc..), and then they can contact the Datanodes to get the actual data.

The first thing to do is to figure out how many Datanodes are down: if the figure is up to 3/4 nodes it is not the end of the world (since we have 60+ nodes), but if it is more please ping an SRE as soon as possible. The most common cause for a failure is a Java OOM, which can be quickly checked using something like:

elukey@analytics1061:~$ grep java.lang.OutOfMemory /var/log/hadoop-hdfs/hadoop-hdfs-datanode-analytics1061.log -B 2 --color

A deeper inspection of the Datanode's logs will surely give some more information. Check also the Datanodes panel in grafana to see if metrics can help.
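
To get the Namenode's view of how many Datanodes are currently dead (a sketch):

# list the Datanodes that the Namenode considers dead
sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -report -dead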

HDFS Datanode JVM Heap Usage

This indicates that the amount of heap used by a Datanode process running on a worker is above the configured threshold.

The critical threshold is 95%

The warning threshold is 90%

Examine the trend graph on the Hadoop dashboard in order to decide if any action is required.

HDFS Journalnode process

The HDFS Namenodes (see above, the head of HDFS) use a special group of daemons called Journalnodes to keep a shared edit log, in order to facilitate failover when needed (i.e. to implement high availability). We have 5 Journalnode daemons running on some worker nodes in the cluster (see hieradata/common.yaml and look for journalnode_hosts for the main cluster), and we need a minimum of three to keep the Namenodes working. If a quorum of three Journalnodes cannot be reached, the Namenodes shut down.

Things to do:

  • Figure out how many Journalnodes are down. If one or two, the situation should be ok since the cluster should not be impacted. More than 2 is a big problem, and it will likely trigger other alerts (like the Namenode ones).
  • Check logs for Exceptions:
    elukey@an-worker1080:~$ less /var/log/hadoop-hdfs/hadoop-hdfs-journalnode-an-worker1080.log
    
  • Check in the Namenode UI how many Journalnodes are active using the following ssh tunnel: ssh -L 50470:localhost:50470 an-master1001.eqiad.wmnet
  • Check the Journalnodes panel in grafana to see if metrics can help.

Yarn Resourcemanager process

There are two Yarn Resourcemanager processes in the cluster, running on the master nodes an-master100[3,4].eqiad.wmnet. Their job is to handle the overall compute resources of the cluster (cpus/ram) that all the worker nodes offer. They communicate with the Yarn Nodemanager processes on each worker node, getting info from them about available resources and allocating what is requested (if possible) by clients. Usually they are both up, one in the active state and the other one in the standby state, and you can check them via:

elukey@an-master1003:~$  sudo -u hdfs /usr/bin/yarn rmadmin -getServiceState an-master1003-eqiad-wmnet
active
elukey@an-master1003:~$  sudo -u hdfs /usr/bin/yarn rmadmin -getServiceState an-master1004-eqiad-wmnet
standby

If one of them goes down, there are two cases:

  • It is the active, so a failover happens and the previous standby is elected the new active.
  • It is the standby, so no failover happening.

In both cases, if at least one Resourcemanager is up, the cluster should be able to keep working fine. It is nonetheless a situation that needs to be fixed asap, so here are a couple of things to do first:

  • Figure out how many Resourcemanagers are down, and if one is active.
  • Check the logs on the host with the failed Resourcemanager. You should be able to find them in /var/log/hadoop-yarn/hadoop-yarn-resourcemanager-an-master100[3,4].log. Look for anything weird, especially exceptions etc..
  • Check the Resourcemanager panel in grafana to see if metrics can help.

This situation will likely need an SRE, so feel free to escalate early! It might take time for people to join, and you can do some investigation in the meantime.

Yarn Nodemanager process

There are a lot of Yarn Nodemanagers in the cluster, one on each worker node, so nothing major happens if a few of them are down. The Yarn Nodemanagers are responsible for the allocation and management of local resources on the worker (cpus/ram). They can be restarted anytime without affecting the running containers/jvms that they create on behalf of the Yarn Resource Managers.

The first thing to do is to figure out how many Nodemanagers are down: if the figure is up to 3/4 nodes it is not the end of the world (since we have 60+ nodes), but if it is more please ping an SRE as soon as possible. The most common cause for a failure is a Java OOM, which can be quickly checked using something like:

elukey@analytics1061:~$ grep java.lang.OutOfMemory /var/log/hadoop-yarn/hadoop-yarn-nodemanager-analytics1061.log -B 2 --color

A deeper inspection of the Nodemanager's logs will surely give some more information. Check also the Nodemanagers panel in grafana to see if metrics can help.
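
If the process died because of an OOM, restarting it is safe (as noted above, running containers are not affected). A sketch, assuming the standard systemd unit name hadoop-yarn-nodemanager:

# restart the Nodemanager on the affected worker
sudo systemctl restart hadoop-yarn-nodemanager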

Yarn Nodemanager JVM Heap Usage

This indicates that the amount of heap used by this process is above the configured threshold.

The critical threshold is 95%

The warning threshold is 90%

Examine the trend graph on the Hadoop dashboard in order to decide if any action is required.

Mapreduce Historyserver process

The Mapreduce History server runs only on an-master1003.eqiad.wmnet, and it offers an API to retrieve the status of finished applications. It is not critical for the health of the cluster, but it should nonetheless be fixed as soon as possible.

Check logs on the host to look for any sign of Exception or weird behavior:

elukey@an-master1003:~$ less /var/log/hadoop-mapreduce/mapred-mapred-historyserver-an-master1003.out