Analytics/Systems/Cluster/Hadoop/Administration


HDFS NameNodes

The name nodes are the components of Hadoop that manage the HDFS distributed file system. They keep a record of all of the files and directories on this file system and control access to it.

Name Node Web Interfaces

There is a useful web interface for checking the status of the HDFS file system and its component nodes. We can access this by using an SSH tunnel as follows.

For the production cluster use:

ssh -N -L 50470:localhost:50470 an-master1003.eqiad.wmnet

For the test cluster use:

ssh -N -L 50470:localhost:50470 an-test-master1001.eqiad.wmnet

Then open a browser to https://localhost:50470 and accept the security prompt.

The DFSHealth web interface
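
With the same tunnel in place, you can also query the NameNode's JMX endpoint from the command line. A hedged example (the /jmx servlet and the FSNamesystem bean are standard Hadoop features):

 curl -sk 'https://localhost:50470/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'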

HDFS Safe Mode

Putting HDFS into safemode means that it is in read-only mode. This is very useful for allowing maintenance to be performed.

You can put it into safe mode from either of the two namenodes.

Getting safemode status
 btullis@an-master1003:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode get
 
 Safe mode is OFF in an-master1003.eqiad.wmnet/10.64.36.15:8020
 Safe mode is OFF in an-master1004.eqiad.wmnet/10.64.53.14:8020
Entering safemode
sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter

Safe mode is ON in an-master1003.eqiad.wmnet/10.64.36.15:8020
Safe mode is ON in an-master1004.eqiad.wmnet/10.64.53.14:8020
Leaving safemode
sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode leave

Safe mode is OFF in an-master1003.eqiad.wmnet/10.64.36.15:8020
Safe mode is OFF in an-master1004.eqiad.wmnet/10.64.53.14:8020

New NameNode Installation

Our Hadoop NameNodes are R420 boxes. It was difficult (if not impossible) to create a suitable partman recipe for the partition layout we wanted. The namenode partitions were created manually during installation.

These nodes have 4 disks. We are mostly concerned with reliability of these nodes. The 4 disks were assembled into a single software RAID 1 array:

$ mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Wed Jan  7 21:10:18 2015
     Raid Level : raid1
     Array Size : 2930132800 (2794.39 GiB 3000.46 GB)
  Used Dev Size : 2930132800 (2794.39 GiB 3000.46 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Wed Jan 14 21:42:20 2015
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

           Name : an-master1001:0  (local to host an-master1001)
           UUID : d1e971cb:3e615ace:8089f89e:280cdbb3
         Events : 100

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       8       18        1      active sync   /dev/sdb2
       2       8       34        2      active sync   /dev/sdc2
       3       8       50        3      active sync   /dev/sdd2

An LVM volume group was then created on top of md0. Two logical volumes were added: one for the root filesystem (/) and one for the Hadoop NameNode partition (/var/lib/hadoop/name).

$ cat /etc/fstab
/dev/mapper/analytics--vg-root /               ext4    errors=remount-ro 0       1
/dev/mapper/analytics--vg-namenode /var/lib/hadoop/name ext3    noatime         0       2

High Availability

We use automatic failover between the active and standby NameNodes. This means that upon start, each NameNode process interacts with the hadoop-hdfs-zkfc daemon, which uses ZooKeeper to establish the active and standby nodes automatically. Restarting the active NameNode is handled gracefully by promoting the standby to active.

Note that you need to use the logical name of the NameNode in hdfs haadmin commands, not the hostname. The puppet-cdh module uses the fqdn of the node with dots replaced with dashes as the logical NameNode names.
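
For example, to derive the logical name from the hostname and query its state (a minimal sketch using the convention described above):

 # an-master1003.eqiad.wmnet -> an-master1003-eqiad-wmnet
 echo "an-master1003.eqiad.wmnet" | tr '.' '-'
 sudo kerberos-run-command hdfs /usr/bin/hdfs haadmin -getServiceState an-master1003-eqiad-wmnet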

Manual Failover

To identify the current NameNode machines, read the puppet configs (analytics100[12] at the time of writing) and run:

sudo kerberos-run-command hdfs /usr/bin/hdfs haadmin -getAllServiceState

You can also run a command to get the service state of an individual namenode. e.g.

sudo kerberos-run-command hdfs /usr/bin/hdfs haadmin -getServiceState an-master1003-eqiad-wmnet

sudo kerberos-run-command hdfs /usr/bin/hdfs haadmin -getServiceState an-master1004-eqiad-wmnet

If you want to move the active status to a different NameNode, you can force a manual failover:

 sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1003-eqiad-wmnet an-master1004-eqiad-wmnet

(That command assumes that an-master1003-eqiad-wmnet is currently active and should become standby, while an-master1004-eqiad-wmnet is standby and should become active.)

The YARN ResourceManager and HDFS Namenode are configured to fail over automatically between the two master hosts, so you don't need to fail them over manually: you can simply stop the hadoop-yarn-resourcemanager and hadoop-hdfs-namenode services on the host you need to work on. After you finish your maintenance, check the service states again to make sure that the active ResourceManager is an-master1001. The HDFS Namenode can be switched back using the failover command described above, with no need for unnecessary restarts.
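
As a sketch, a typical maintenance sequence on the standby master could look like the following (hostnames and ordering are illustrative, not prescriptive; verify the service states with the commands shown below before and after):

 # On the standby master that needs maintenance:
 sudo systemctl stop hadoop-yarn-resourcemanager hadoop-hdfs-namenode
 # ... perform the maintenance ...
 sudo systemctl start hadoop-hdfs-namenode hadoop-yarn-resourcemanager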

sudo kerberos-run-command yarn /usr/bin/yarn rmadmin -getAllServiceState

In a similar way to the namenode, you can also query the state of each individual resource manager.

an-master1003:~$ sudo kerberos-run-command yarn /usr/bin/yarn rmadmin -getServiceState an-master1003-eqiad-wmnet

an-master1003:~$ sudo kerberos-run-command yarn /usr/bin/yarn rmadmin -getServiceState an-master1004-eqiad-wmnet

Migrating to new HA NameNodes

This section describes how to make an existing cluster use new hardware for new NameNodes. This requires a full cluster restart.

This section might be out of date, since we have enabled automatic failover of the active HDFS Namenode via Zookeeper. Please double-check this procedure and update it if needed!

Put new NameNode(s) into 'unknown standby' mode

HDFS HA does not allow for hdfs-site.xml to specify more than two dfs.ha.namenodes at a time. HA is intended to only work with a single standby NameNode.

An unknown standby NameNode (I just made this term up!) is a standby NameNode that knows about the active master and the JournalNodes, but that is not known by the rest of the cluster. That is, it will be configured to know how to read edits from the JournalNodes, but it will not be a fully functioning standby NameNode. You will not be able to promote it to active. Configuring your new NameNodes as unknown standbys allows them to sync their name data from the JournalNodes before shutting down the cluster and configuring them as the new official NameNodes.

Configure the new NameNodes exactly as you would have a normal NameNode, but make sure that the following properties only set the current active NameNode and the hostname of the new NameNode. In this example, let's say you have NameNodes nn1 and nn2 currently operating, and you want to migrate them to nodes nn3 and nn4.


nn3 should have the following set in hdfs-site.xml

  <property>
    <name>dfs.ha.namenodes.cluster-name</name>
    <value>nn1,nn3</value>
  </property>

  <property>
    <name>dfs.namenode.rpc-address.cluster-name.nn1</name>
    <value>nn1:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.cluster-name.nn3</name>
    <value>nn3:8020</value>
  </property>

  <property>
    <name>dfs.namenode.http-address.cluster-name.nn1</name>
    <value>nn1:50070</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.cluster-name.nn3</name>
    <value>nn3:50070</value>
  </property>

Note that there is no mention of nn2 in hdfs-site.xml on nn3. nn4 should be configured the same, except with reference to nn4 instead of nn3.

Once this is done, you can bootstrap nn3 and nn4 and then start hadoop-hdfs-namenode on them. This is done on both new nodes.

sudo -u hdfs hdfs namenode -bootstrapStandby
service hadoop-hdfs-namenode start

NOTE: If you are using the puppet-cdh module, this will be done for you. You should probably just conditionally configure your new NameNodes differently than the rest of your cluster and apply that for this step, e.g.:

        namenode_hosts => $::hostname ? {
            'nn3'   => ['nn1', 'nn3'],
            'nn4'   => ['nn1', 'nn4'],
            default => ['nn1', 'nn2'],
        }

Put all NameNodes in standby and shut down Hadoop.

Once your new unknown standby NameNodes are up and bootstrapped, transition your active NameNode to standby so that all writes to HDFS stop. At this point all 4 NameNodes will be in sync. Then shut down the whole cluster:

sudo -u hdfs hdfs haadmin -transitionToStandby nn1
# Do the following on every Hadoop node.
# Since you will be changing global configs, you should also
# shutdown any 3rd party services too (e.g. Hive, Presto, etc.).
# Anything that has a reference to the old NameNodes should
# be shut down.

shutdown_service() {
    test -f /etc/init.d/$1 && echo "Stopping $1" && service $1 stop
}

shutdown_hadoop() {
    shutdown_service hue
    shutdown_service hive-server2
    shutdown_service hive-metastore
    shutdown_service hadoop-yarn-resourcemanager
    shutdown_service hadoop-hdfs-namenode
    shutdown_service hadoop-hdfs-httpfs
    shutdown_service hadoop-mapreduce-historyserver
    shutdown_service hadoop-hdfs-journalnode
    shutdown_service hadoop-yarn-nodemanager
    shutdown_service hadoop-hdfs-datanode
}

shutdown_hadoop

Reconfigure your cluster with the new NameNodes and start everything back up.

Next, edit hdfs-site.xml everywhere, and anywhere else that mentions nn1 or nn2, and replace them with nn3 and nn4. If you are moving YARN services as well, now is a good time to reconfigure them with the new hostnames too.

Restart your JournalNodes with the new configs first. Once that is done, you can start all of your cluster services back up in any order. Once everything is back up, transition your new primary NameNode (nn3 in this example) to active:

sudo -u hdfs hdfs haadmin -transitionToActive nn3

HDFS JournalNodes

JournalNodes should be provisioned in odd numbers 3 or greater. These should be balanced across rows for greater resiliency against potential datacenter related failures.
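
To check which JournalNodes a NameNode is currently configured to use, you can read the shared edits URI (a qjournal:// URI listing all JournalNodes) from its configuration. A hedged sketch using the standard hdfs getconf tool, run on one of the masters:

 sudo -u hdfs hdfs getconf -confKey dfs.namenode.shared.edits.dir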

Swap an existing Journal node with a new one in a running HA Hadoop Cluster

Useful reading:

http://johnjianfang.blogspot.com/2015/02/quorum-journal-manager-part-i-protocol.html

https://blog.cloudera.com/quorum-based-journaling-in-cdh4-1/

1. Create journal node LVM volume and partition

See the section below about new worker node installation. We add a journalnode partition to all Hadoop worker nodes, so this step shouldn't be needed, but please check that there is a partition on the target node nonetheless.

2. Copy the journal directory from an existing JournalNode.

# Disable puppet on the host of the JournalNode to move,
# then shut down the daemon
puppet agent --disable "reason"; systemctl stop hadoop-hdfs-journalnode

# ssh to cumin and use transfer.py, like:
sudo transfer.py analytics1052.eqiad.wmnet:/var/lib/hadoop/journal an-worker1080.eqiad.wmnet:/var/lib/hadoop/journal

3. Puppetize and start the new JournalNode. In common.yaml, edit the section related to the JournalNodes:

    journalnode_hosts:
      - an-worker1080.eqiad.wmnet  # Row A4
      - an-worker1078.eqiad.wmnet  # Row A2
      - analytics1072.eqiad.wmnet  # ROW B2
      - an-worker1090.eqiad.wmnet  # Row C4
      - analytics1069.eqiad.wmnet  # Row D8

Run puppet on the new JournalNode. This should install and start the JournalNode daemon.

4. Apply puppet on NameNodes and restart them.

Once the new JournalNode is up and running, we need to reconfigure and restart the NameNodes to get them to recognize the new JournalNode. Run puppet on each of your NameNodes. The new JournalNode will be added to the list in dfs.namenode.shared.edits.dir.

Restart NameNode on your standby NameNodes first. Once that's done, check their dfshealth status to see if the new JournalNode registers. Once all of your standby NameNodes see the new JournalNode, go ahead and promote one of them to active so we can restart the primary active NameNode:

 sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1003-eqiad-wmnet an-master1004-eqiad-wmnet

Once you've failed over to a different active NameNode, you may restart hadoop-hdfs-namenode on your usual active NameNode. Check the dfshealth status page again and be sure that the new JournalNode is up and writing transactions.

Cool! You'll probably want to promote your usual NameNode back to active:

 sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1004-eqiad-wmnet an-master1003-eqiad-wmnet 

Done!

YARN

YARN stands for Yet Another Resource Negotiator and it is the component of Hadoop that facilitates distributed execution. There are a number of components:

  • The hadoop-yarn-resourcemanager service - This runs on the two Hadoop master servers in an HA configuration. It accepts jobs and schedules them accordingly.
  • The hadoop-yarn-nodemanager service - This runs on each of the Hadoop workers. It registers with the resourcemanager and executes jobs on its behalf. (See the example below for listing the registered NodeManagers.)
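
A quick way to see which NodeManagers are currently registered with the ResourceManager, from one of the master hosts (a hedged sketch; it uses the same kerberos-run-command wrapper as the other examples on this page):

 sudo kerberos-run-command yarn /usr/bin/yarn node -list -all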

YARN Job Browser Interface

The primary means of accessing this is https://yarn.wikimedia.org, which forwards to http://an-master1001.eqiad.wmnet:8088/

The YARN Job browser interface

At the moment, this web interface only functions when an-master1001 is the active resourcemanager. We intend to rectify this problem in task T331448.

In order to access the YARN job browser interface on the test cluster, we can use SSH tunnelling with:

ssh -N -L 8088:an-test-master1001:8088 an-test-master1001.eqiad.wmnet

Then browse to http://localhost:8088


From the YARN Job Browser interface it is possible to view the running jobs as well as those that have finished.
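
The same information is also available from the command line, which can be handy when the web interface is unavailable. A hedged example listing the currently running applications:

 sudo kerberos-run-command yarn /usr/bin/yarn application -list -appStates RUNNING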

We currently have the MapReduce History server available and we are also currently making the Spark History service available through this interface under task T330176.

YARN Resource Manager

HA and automatic failover is set up for YARN ResourceManager too (handled by Zookeeper). hadoop-yarn-resourcemanager runs on both master hosts, with an-master1001 the primary active host. You may stop this service on either node, as long as at least one is running.

sudo service hadoop-yarn-resourcemanager restart

Yarn Labels

In T277062 we added the Capacity Scheduler and GPU labels. The labels are stored in /user/yarn on HDFS, and currently we manage them via manual commands like:

sudo -u yarn kerberos-run-command yarn yarn rmadmin -addToClusterNodeLabels "GPU(exclusive=false)"
sudo -u yarn kerberos-run-command yarn yarn rmadmin -replaceLabelsOnNode "an-worker1096.eqiad.wmnet=GPU"
sudo -u yarn kerberos-run-command yarn yarn rmadmin -replaceLabelsOnNode "an-worker1097.eqiad.wmnet=GPU"
sudo -u yarn kerberos-run-command yarn yarn rmadmin -replaceLabelsOnNode "an-worker1098.eqiad.wmnet=GPU"
sudo -u yarn kerberos-run-command yarn yarn rmadmin -replaceLabelsOnNode "an-worker1099.eqiad.wmnet=GPU"
sudo -u yarn kerberos-run-command yarn yarn rmadmin -replaceLabelsOnNode "an-worker1100.eqiad.wmnet=GPU"
sudo -u yarn kerberos-run-command yarn yarn rmadmin -replaceLabelsOnNode "an-worker1101.eqiad.wmnet=GPU"

It is possible to have the NodeManagers decide what labels their node should have (even via scripts), but that configuration is more complicated, so it wasn't used for the GPU use case.
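
To verify which labels are currently defined on the cluster, the following should work (a sketch based on the standard yarn CLI):

 sudo -u yarn kerberos-run-command yarn yarn cluster --list-node-labels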

Logs

There are many pieces involved in how logs are managed in Hadoop; the best description we have found is here: http://hortonworks.com/blog/simplifying-user-logs-management-and-access-in-yarn/

The important things:

  1. The YARN aggregate job log retention is specified in yarn-site.xml (see the sketch after this list for checking the current value)
  2. If the YARN aggregate job log retention changes, the JobHistoryServer needs to be restarted (that is hadoop-mapreduce-historyserver on our systems)
  3. Do this if you're looking for logs: yarn logs -applicationId <appId>. This will not work until your job has finished or died. Also, if you're not the app owner, do sudo -u hdfs yarn logs -applicationId <appId> --appOwner APP_OWNER
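
To check the retention currently in effect, you can grep for the property in yarn-site.xml. A hedged sketch: the property name is the standard one, and the path assumes the usual /etc/hadoop/conf layout:

 grep -A 1 'yarn.log-aggregation.retain-seconds' /etc/hadoop/conf/yarn-site.xml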

Worker Nodes

Standard Worker Installation

The standard Hadoop worker node hardware configuration is as follows:

  • Dell R470XD 2U chassis
  • 12 x 3.5" 4 TB hard disk in the front bays - each configured as a single RAID0 volume
  • 2 x 2.5" SSDs installed to the rear-mounted flex bays - configured as RAID1 volume in hardware

The OS and JournalNode partitions are installed to the 2.5" RAID1 volume. This leaves all of the space on the 12 4TB HDDs for DataNode use.

Device                                      Size       Mount Point
/dev/mapper/analytics1029--vg-root          30 GB      /                        (LVM on /dev/sda5, hardware RAID 1 on the 2 flex bays)
/dev/mapper/analytics1029--vg-journalnode   10 GB      /var/lib/hadoop/journal  (LVM on /dev/sda5, hardware RAID 1 on the 2 flex bays)
/dev/sda1                                   1 GB       /boot
/dev/sdb1                                   Fill disk  /var/lib/hadoop/data/b
/dev/sdc1                                   Fill disk  /var/lib/hadoop/data/c
/dev/sdd1                                   Fill disk  /var/lib/hadoop/data/d
/dev/sde1                                   Fill disk  /var/lib/hadoop/data/e
/dev/sdf1                                   Fill disk  /var/lib/hadoop/data/f
/dev/sdg1                                   Fill disk  /var/lib/hadoop/data/g
/dev/sdh1                                   Fill disk  /var/lib/hadoop/data/h
/dev/sdi1                                   Fill disk  /var/lib/hadoop/data/i
/dev/sdj1                                   Fill disk  /var/lib/hadoop/data/j
/dev/sdk1                                   Fill disk  /var/lib/hadoop/data/k
/dev/sdl1                                   Fill disk  /var/lib/hadoop/data/l
/dev/sdm1                                   Fill disk  /var/lib/hadoop/data/m

There are two main partman recipes used for the Debian installation:

  • The analytics-flex.cfg partman recipe will create the root and swap logical volumes, without touching the other disks. This is used only for brand new worker nodes, without any data on them. Once Debian is installed, the hadoop-init-worker.py cookbook can be used to set up the remaining disk partitions.
  • The reuse-analytics-hadoop-worker-12dev.cfg partman recipe will re-install the root/swap volumes, keeping the other partitions as they are. This is used when reimaging existing workers, since we want to preserve the data contained in the /dev/sdx disks. DO NOT USE hadoop-init-worker.py in this case unless you want to wipe data for some reason (there is a specific --wipe option in case).

In both cases we need to create the JournalNode partition manually, since we never automated it. A quick script to do so:

#!/bin/bash

set -ex

if [ -e /dev/mapper/*unused* ]
then
    echo "Dropping unused volume";
    lvremove /dev/mapper/*--vg-unused -y
fi
if [ ! -e /dev/mapper/*journal* ]
then
    echo "Creating journalnode volume"
    VGNAME=$(vgs --noheadings -o vg_name | tr -d '  ')
    lvcreate -L 10g -n journalnode $VGNAME
    echo "Creating the ext4 partition"
    mkfs.ext4 /dev/$VGNAME/journalnode
    echo "Adding mountpoint to fstab"
    echo "# Hadoop JournalNode partition" >> /etc/fstab
    mkdir -p /var/lib/hadoop/journal
    echo "/dev/$VGNAME/journalnode	/var/lib/hadoop/journal	ext4	defaults,noatime	0	2" >> /etc/fstab
    mount -a
fi

Before starting, also make sure that all the new nodes have their kerberos keytab created. In order to do so, ssh to krb1001 and execute something like the following for each node:

elukey@krb1001:~$ cat hadoop_test_cred.txt
an-worker1095.eqiad.wmnet,create_princ,HTTP
an-worker1095.eqiad.wmnet,create_princ,hdfs
an-worker1095.eqiad.wmnet,create_princ,yarn
an-worker1095.eqiad.wmnet,create_keytab,yarn
an-worker1095.eqiad.wmnet,create_keytab,hdfs
an-worker1095.eqiad.wmnet,create_keytab,HTTP

sudo generate_keytabs.py --realm WIKIMEDIA hadoop_test_cred.txt

The new credentials will be stored under /srv/kerberos/keytabs on krb1001. The next step is to rsync them to puppetmaster1001:

# Example of rsync from krb1001 to my puppetmaster1001's home dir
# The credentials for the kerb user are stored on krb1001, check /srv/kerberos/rsync_secrets_file
elukey@puppetmaster1001:~$ rsync -r kerb@krb1001.eqiad.wmnet::srv-keytabs/an-worker11* ./keytabs

# Then copy the keytabs to /srv/private
root@puppetmaster1001:/srv/private/modules/secret/secrets/kerberos/keytabs# cp -r /home/elukey/keytabs/..etc.. .

Please be sure that the kerberos keytab files/dirs have the proper permissions (check the other files in /srv/private/modules/secret/secrets/kerberos/keytabs). Then add and commit the files, and you are done!

Standard Worker Reimage

In this case only the operating system is reinstalled to /dev/sda, so we wish to avoid wiping the HDFS datanode partitions on /dev/sdb through /dev/sdm. As described above, just set reuse-analytics-hadoop-worker-12dev.cfg in puppet's netboot.conf and you are ready to reimage. This is an example script to reimage one node when the reuse partman recipe is set:

#!/bin/bash

set -x

HOSTNAME1=an-worker1111
CUMIN_ALIAS="${HOSTNAME1}*"
# Add downtime
sudo cumin -m async 'alert1001*' "icinga-downtime -h ${HOSTNAME1} -d 3600 -r 'maintenance'"
echo "Checking the number of datanode partitions and other bits"
echo "Important: if the number of partitions is not 12, DO NOT PROCEED, the partman recipe needs that all partitions are mounted correctly."
sudo cumin -m async "${CUMIN_ALIAS}" 'df -h | grep /var/lib/hadoop/data | wc -l' 'df -h | grep journal' 'cat /etc/debian_version'
read -p "Continue or control+c" CONT
sudo cumin "${CUMIN_ALIAS}" 'disable-puppet "elukey"'
sudo cumin "${CUMIN_ALIAS}" 'systemctl stop hadoop-yarn-nodemanager'
sleep 30
sudo cumin "${CUMIN_ALIAS}" 'ps aux | grep [j]ava'
read -p "Continue or control+c" CONT
sudo cumin "${CUMIN_ALIAS}" 'systemctl stop hadoop-hdfs-*'
sleep 10
sudo cumin "${CUMIN_ALIAS}" 'ps aux | grep [j]ava'
read -p "Continue or control+c" CONT
sudo -i wmf-auto-reimage -p T231067 $HOSTNAME1.eqiad.wmnet

As you can see there is a catch for the reuse partitions config - all 12 partitions need to be mounted, otherwise the partman recipe will fail to automate the Debian install (causing an error and stopping the installation). This can happen, for example, if the host has a failed disk and we have unmounted its partition while waiting for a replacement.

Decommissioning

To decommission a Hadoop worker node, you will need to carry out several tasks before executing the decommissioning cookbook.

Excluding the host from HDFS and YARN

Firstly, make a puppet patch to add the host's FQDN to the hosts.exclude and yarn-hosts.exclude files on all of the master nodes.

In order to do so:

  • Add the server's FQDN to the hiera parameter profile::hadoop::master::excluded_hosts: in the file hieradata/role/common/analytics_cluster/hadoop/master.yaml
  • Do the same for profile::hadoop::master::standby::excluded_hosts: in the file hieradata/role/common/analytics_cluster/hadoop/standby.yaml

Once the PR is merged, you can force a puppet run on both of the relevant nodes with: sudo cumin 'A:hadoop-master or A:hadoop-standby' run-puppet-agent

This will do two things:

  1. It will place the HDFS datanode into decommissioning mode, which will slowly migrate data away from the host
  2. It will remove the YARN nodemanager from the list of active nodes, preventing new YARN jobs from running on this node

You should wait until all data has been migrated away from the HDFS volume before proceeding to the next stage.

To check on the progress, you can use the YARN and HDFS Web UIs.
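
You can also follow the decommissioning progress from the command line. A hedged example, run on one of the masters, that only reports datanodes which are still decommissioning:

 sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -report -decommissioning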

Once all of the data has been migrated elsewhere, you can proceed to the next step.

Removing the host from the HDFS topology

This step removes the node from the network topology configuration, which means that the namenodes effectively forget about the existence of this host.

Make a puppet patch to remove the host from the analytics-hadoop:net_topology: hash in the file: hieradata/common.yaml

Once this patch is merged, you should run puppet on both the master and the standby server with: sudo cumin 'A:hadoop-master or A:hadoop-standby' run-puppet-agent

At this point you can restart the Namenodes to completely remove any reference to it. In order to do this you can use the sre.hadoop.roll-restart-masters cookbook.

Proceed to decommission the host as per the regular procedure

This uses the regular procedure which entails running the sre.hosts.decommission cookbook. It is also sensible to make a decommission ticket for each host. If you are decommissioning several hosts at once as part of a refresh cycle, then ensure that each of these decommissioning tickets is a child of the parent ticket to refresh the nodes.

Balancing HDFS

Over time some datanodes may have almost all of their HDFS disk space filled up, while others have lots of free HDFS disk space. The hdfs balancer can be used to re-balance HDFS.

In March 2015 ~0.8TB/day needed to be moved to keep HDFS balanced. (On 2015-03-13T04:40, a balancing run ended and HDFS was balanced. And on 2015-03-15T09:10 already 1.83TB needed to get balanced again)

The hdfs balancer is now run regularly as a scheduled job (see HDFS Balancer Cron Job below), so hopefully you won't have to do this manually.

Checking balanced-ness on HDFS

It is possible to check on the HDFS balance by using the HDFS NameNode Status Interface as shown in the screen capture here.

This screenshot shows that 7 datanodes have significantly less usage than the others. An automatic rebalance operation is in progress.

You can also run sudo -u hdfs hdfs dfsadmin -report on an-launcher1002. This command outputs DFS Used% per datanode. If that number is roughly equal for each datanode, disk space utilization for HDFS is proportionally alike across the whole cluster. If the DFS Used% numbers do not align, the cluster is somewhat unbalanced; by default the hdfs balancer considers a cluster balanced once each datanode is within 10 percentage points (in terms of DFS Used%) of the overall DFS Used%.

The following command will create a detailed report on datanode utilization from the command-line.

sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -report | cat <(echo "Name: Total") - | grep '^\(Name\|Total\|DFS Used\)' | tr '\n' '\t' | sed -e 's/\(Name\)/\n\1/g' | sort --field-separator=: --key=5,5n

HDFS Balancer Cron Job

an-launcher1002 has a systemd timer installed that should run the HDFS balancer daily. However, we have noticed that the balancer often runs for much more than 24 hours, because it is never satisfied with the distribution of blocks around the cluster. Also, it seems that a long-running balancer starts to slow down and no longer balances well. If this happens, you should probably restart the balancer using the same command that the scheduled job runs.

First, check if a balancer is running that you want to restart:

 ps aux | grep '\-Dproc_balancer' | grep -v grep
 # hdfs     12327  0.7  3.2 1696740 265292 pts/5  Sl+  13:56   0:10 /usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java -Dproc_balancer -Xmx1000m -Dhadoop.log.dir=/usr/lib/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/usr/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.hdfs.server.balancer.Balancer

Log in as the hdfs user, kill the balancer, delete the balancer lock file, and then run the balancer cron job command:

 sudo -u hdfs -i
 kill 12327
 rm /tmp/hdfs-balancer.lock
 crontab -l | grep balancer
 # copy paste the balancer full command:
 (lockfile-check /tmp/hdfs-balancer && echo "$(date '+%y/%m/%d %H:%M:%S') WARN Not starting hdfs balancer, it is already running (or the lockfile exists)." >> /var/log/hadoop-hdfs/balancer.log) || (lockfile-create /tmp/hdfs-balancer && hdfs dfsadmin -setBalancerBandwidth $((40*1048576)) && /usr/bin/hdfs balancer 2>&1 >> /var/log/hadoop-hdfs/balancer.log; lockfile-remove /tmp/hdfs-balancer)

Manually re-balancing HDFS

To rebalance:

  1. Check that no other balancer is running. (Ask Ottomata. He would know. It's recommended to run at most 1 balancer per cluster)
  2. Decide on which node to run the balancer on. (Various sources recommend running the balancer on a node that does not host Hadoop services. Hence (as of 2018-10-11) an-coord1001 seems like a sensible choice.)
  3. Adjust the balancing bandwidth. (The default balancing bandwidth of 1 MiB/s is too little to keep up with the rate at which the cluster gets unbalanced. When running in March 2015, 40 MiB/s was sufficient to decrease the imbalance by about 7 TB/day without grinding the cluster to a halt. Higher values might or might not speed things up.)
    sudo -u hdfs hdfs dfsadmin -setBalancerBandwidth $((40*1048576))
  4. Start the balancing run by
    sudo -u hdfs hdfs balancer
    (Each block move that the balancer makes is permanent, so if you abort a balancer run at some point, the balancing work done up to then stays effective.)

HDFS mount at /mnt/hdfs

/mnt/hdfs is mounted on some hosts as a way of accessing files in hdfs via the filesystem. There are several production monitoring and rsync jobs that rely on this mount, including rsync jobs that copy pagecount* data to http://dumps.wikimedia.org. If this data is not copied, the community notices.

/mnt/hdfs is mounted using fuse_hdfs, and is not very reliable. If you get hung filesystem commands on /mnt/hdfs, the following worked once to refresh it:

 sudo umount -f /mnt/hdfs
 sudo fusermount -uz /mnt/hdfs
 sudo mount /mnt/hdfs
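
A quick, hedged way to check whether the mount is responsive again, without risking another hung shell (the 10-second timeout is arbitrary):

 timeout 10 ls /mnt/hdfs > /dev/null && echo "mount OK" || echo "mount still unresponsive"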

HDFS Namenode Heap settings

The HDFS Namenode is the single point of failure of the whole HDFS infrastructure, since it holds all the file system's metadata and structure on the heap. This means that the more files are stored, the bigger the heap size will need to be. We tend to follow this guideline to size the Namenode's heap:

https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.3/bk_command-line-installation/content/configuring-namenode-heap-size.html
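
As a quick sanity check, you can inspect the heap options of the running NameNode process. This is a hedged sketch; the exact environment variable and config layout may differ on our hosts:

 # Show any -Xms/-Xmx flags on the running NameNode JVM
 ps -ef | grep '[N]ameNode' | tr ' ' '\n' | grep -i '^-Xm[sx]'
 # Or check the configured value in the Hadoop environment file
 grep -i 'HADOOP_NAMENODE_OPTS' /etc/hadoop/conf/hadoop-env.sh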

User Job taking too many resources

It is not uncommon for a user job to hoard resources so that the rest of the users (all in the default queue) cannot utilize the resources on the cluster. You can move the offending job to the nice queue, where it will consume resources in a nicer fashion.

To see the queues, just visit https://yarn.wikimedia.org/cluster/scheduler

To move a job to the nice queue:

sudo -u hdfs kerberos-run-command hdfs yarn application -movetoqueue application_XXXX -queue nice

Swapping broken disk

It might happen that a disk breaks and you need to ask for a swap. The usual procedure is to cut a Phab task to DC Ops and wait for a replacement. When the disk has been put in place, you might need to work with the megacli tool to configure the hardware RAID controller before adding the partitions on the disk. This section contains an example from https://phabricator.wikimedia.org/T134056, which was opened to track a disk replacement for analytics1047.

Procedure:

  1. Make sure that you familiarize yourself with the ideal disk configuration, which is outlined on this page for the various node types. In this case, analytics1047 is a Hadoop worker node.
  2. Check the status of the disks after the swap using the following commands:
    # Check general info about disks attached to the RAID and their status.
    sudo megacli -PDList -aAll 
    
    # Get Firmware status only to have a quick peek about disks status:
    sudo megacli -PDList -aAll | grep Firm
    
    # Check Virtual Drive info
    sudo megacli -LDInfo -LAll -aAll
    
    In this case, analytics1047's RAID is configured with 12 Virtual Drives, each one running RAID0 with a single disk (so not simply JBOD). This is required for the RAID controller to expose the disks to the OS via /dev; otherwise the magic will not happen.
  3. Execute the following command to check that status of the new disk:
    elukey@analytics1047:~$ sudo megacli -PDList -aAll | egrep "Adapter|Enclosure Device ID:|Slot Number:|Firmware state"
    Adapter #0
    Enclosure Device ID: 32
    Slot Number: 0
    Firmware state: Online, Spun Up
    # [..content cut..]
    
    1. If you see "Firmware state: Unconfigured(good)" you are good!
    2. If you see all disks with "Firmware state: Spun up" you are good!
    3. If you see "Firmware state: Unconfigured(bad)" fix it with:
      # From the previous commands you should be able to fill in the variables 
      # with the values of the disk's properties indicated below:
      # X => Enclosure Device ID
      # Y => Slot Number
      # Z => Controller (Adapter) number
      megacli -PDMakeGood -PhysDrv[X:Y] -aZ
      
      # Example:
      # elukey@analytics1047:~$ sudo megacli -PDList -aAll | egrep "Adapter|Enclosure Device ID:|Slot Number:|Firmware state"
      # Adapter #0
      # Enclosure Device ID: 32
      # Slot Number: 0
      # Firmware state: Online, Spun Up
      # [..content cut..]
      
      megacli -PDMakeGood -PhysDrv[32:0] -a0
      
  4. Check for Foreign disks and fix them if needed:
    server:~# megacli -CfgForeign -Scan -a0
    There are 1 foreign configuration(s) on controller 0.
    
    server:~# megacli -CfgForeign -Clear -a0
    Foreign configuration 0 is cleared on controller 0.
    
  5. Check if any preserved cache needs to be cleared: MegaCli#Check and clear Preserved Cache
  6. Add the single disk RAID0 array (use the details from the steps above):
    sudo megacli -CfgLdAdd -r0 [32:0] -a0
    
    If all the disks were in state Spun up you can use the values mentioned in the Task's description (it should explicitly mention what slot number went offline).
  7. You should now be able to see the disk using parted or fdisk; now it is time to add the partition and filesystem to it to complete the work. Please also check the output of lsblk -i -fs to confirm what the target device is. In this case the disk appeared under /dev/sdd (you may need to apt install parted):
    sudo parted /dev/sdd --script mklabel gpt
    sudo parted /dev/sdd --script mkpart primary ext4 0% 100%
    # Please change the hadoop-$LETTER according to the disk that you are adding
    sudo mkfs.ext4 -L hadoop-d /dev/sdd1
    sudo tune2fs -m 0 /dev/sdd1
    
  8. Update /etc/fstab with the new disk info (since we use ext4 labels it should be a matter of un-commenting the previous entry, which was commented out when the original disk broke). If the line starts with UUID=etc.. instead, you'll need to find the new disk's value using sudo lsblk -i -fs.
  9. Apply the fstab changes with mount -a.
    1. If the disk mounts successfully but doesn't stay mounted, systemd may be unmounting it (check /var/log/syslog). If you see lines like:
      Apr 29 15:49:19 an-worker1100 systemd[1]: var-lib-hadoop-data-k.mount: Unit is bound to inactive unit dev-disk-by\x2duuid-7bcd4c25\x2da157\x2d4023\x2da346\x2d924d4ccee5a0.device. Stopping, too.
      Apr 29 15:49:19 an-worker1100 systemd[1]: Unmounting /var/lib/hadoop/data/k...
      
      Reload systemd daemons with sudo systemctl daemon-reload.
  10. For hadoop to recognize the new filesystem, run sudo -i puppet agent -tv, then restart yarn and hdfs with sudo systemctl restart hadoop-yarn-nodemanager and sudo systemctl restart hadoop-hdfs-datanode.

Very useful info contained in: http://hwraid.le-vert.net/wiki/LSIMegaRAIDSAS

This guide is not complete of course, so please add details if you have them!