From Wikitech

Hadoop by default does not ship with a strong authentication mechanism for users and daemons. In order to enable its "Secure mode", an external authentication service must be plugged in, and the only compatible one is Kerberos.

When enabled, users and daemons need to authenticate to our Kerberos service before they can use Hadoop. Please read the next sections to learn what to do.

High level overview

Analytics Kerberos infra scheme.svg

This diagram is a high level overview of how Kerberos authentication affects users. First of all, notice that the Hadoop cluster is the only part of the infrastructure that will be configured to use Kerberos. The red lines show which systems are required to authenticate with Kerberos in order to use Hadoop:

  • Druid, since its deep storage is Hadoop HDFS. Please note that this will not mean that Druid itself will require Kerberos authentication from users, but only that Druid itself will need to authenticate before fetching data from HDFS. This means that Superset and Turnilo dashboards will keep working as before, without changes.
  • Users on the Analytics Clients. Anyone who uses a tool on the clients that interacts with Hadoop (such as Oozie, Hive, Spark, Jupyter Notebooks) will need to authenticate via Kerberos.

Will my files on HDFS change when Kerberos is enabled?

Nothing will change, including file permissions. Kerberos allows users to prove their identities before accessing or modifying a file, nothing more.

How do I...

Authenticate via Kerberos

Run the kinit command, enter your password, and then execute any command (spark, etc.). This is very important: if you skip it, you'll see horrible error messages reported by basically every tool you use. The kinit command grants you a so-called Kerberos TGT (Ticket Granting Ticket), which is used to authenticate you to various services and hosts. The ticket lasts 48 hours, so you will not need to run kinit every time, just once every two days. You can inspect the status of your ticket via klist.
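The check-then-authenticate flow can be sketched as a small shell function. This is only an illustration: ensure_ticket is a hypothetical name, not an existing tool, and it assumes the MIT Kerberos client, where klist -s prints nothing and reports through its exit status whether a valid ticket is cached.

```shell
# ensure_ticket: run kinit only when no valid ticket is cached.
# Assumption: MIT Kerberos klist, whose -s flag is silent and
# signals the state of the credentials cache via its exit status.
ensure_ticket() {
    if klist -s 2>/dev/null; then
        echo "ticket already valid"
    else
        echo "no valid ticket, running kinit"
        kinit   # prompts for your Kerberos password
    fi
}
```

Calling ensure_ticket at the start of a session avoids typing the password again while the current ticket is still valid.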

Get a password for Kerberos

Please create a Phabricator task with the "Analytics" tag to request a Kerberos identity. Check that:

  • Your shell username is in analytics-privatedata-users.
  • Your shell username and email are listed in the task's description.

If you have any doubts, feel free to ask the Analytics team on Freenode #wikimedia-analytics or via email. You'll receive an email containing a temporary password, which you'll be required to change during your first authentication (see the section above).


  • This is really annoying, can't we just use LDAP or something similar to avoid another password?
    • We tried really hard but for a lot of technical reasons, the integration would be complicated and cumbersome to maintain for Analytics and SRE. There might be some changes in the future, but for now we'll have to deal with another password to remember.

Reset my password for Kerberos

File a task with the "Analytics" tag and we'll reset it for you. There is no self-service option.

Run a recurrent job via Cron or similar without running kinit every day

The option that is currently available is a Kerberos keytab: a file, readable only by its owner, that holds the password used to authenticate to Kerberos. We use keytabs for daemons/services, and we plan to provide them to users who need to run periodic jobs. The major drawbacks are:

  • Our security standard is lowered a bit, since being able to ssh to a host is sufficient to access HDFS (as opposed to also needing to know a password). This is more or less the current scheme, so not a big deal, but we have to think about it.
  • The keytab needs to be generated for every host that needs this automation, and it also needs to be regenerated and re-deployed when the user changes their password (this doesn't happen for daemons, of course). This is currently not automated, and it requires a ping to Analytics every time.

The solution that we are currently working on is the following: every member of the analytics-privatedata-users POSIX group will be able to sudo/impersonate the analytics-privatedata user, which in turn will be able to read a keytab on some hosts (likely stat1007 and notebook1003) and authenticate to Kerberos without a password. In order to hide complexity and help users, we created the kerberos-run-command tool:

# No kinit done, hence no credentials for my user
elukey@stat1007:~$ klist
klist: No credentials cache found (filename: /tmp/krb5cc_13926)

# As expected, a simple ls will fail
elukey@stat1007:~$ hdfs dfs -ls /
[..cut..] Failed on local exception: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "stat1007.eqiad.wmnet/"; destination host is: "analytics1029.eqiad.wmnet":8020;

# First attempt to use the kerberos-run-command 
elukey@stat1007:~$ kerberos-run-command analytics-privatedata hdfs dfs -ls /
The user keytab that you are trying to use (/etc/security/keytabs/analytics/analytics-privatedata.keytab) doesn't exist or it isn't readable from your user, aborting...

# The problem is that the keytab is not readable by all
# analytics-privatedata-users members directly, but they
# have to sudo first:
elukey@stat1007:~$ sudo -u analytics-privatedata kerberos-run-command analytics-privatedata hdfs dfs -ls /
Found 5 items
drwxr-xr-x   - hdfs hadoop          0 2019-06-20 06:00 /system
drwxrwxrwt   - hdfs hdfs            0 2019-11-14 15:03 /tmp
drwxr-xr-x   - hdfs hadoop          0 2019-10-25 16:24 /user
drwxr-xr-x   - hdfs hdfs            0 2019-01-17 13:42 /var
drwxr-xr-x   - hdfs hadoop          0 2019-06-25 13:46 /wmf

# It worked! Interesting question: does my user have credentials now? Let's check...
elukey@stat1007:~$ klist
klist: No credentials cache found (filename: /tmp/krb5cc_13926)

# Why not? Because only the analytics-privatedata user has:
elukey@stat1007:~$ sudo -u analytics-privatedata klist
Ticket cache: FILE:/tmp/krb5cc_498
Default principal: analytics-privatedata/stat1007.eqiad.wmnet@WIKIMEDIA

Valid starting       Expires              Service principal
11/14/2019 15:03:33  11/15/2019 01:03:33  krbtgt/WIKIMEDIA@WIKIMEDIA
	renew until 11/15/2019 15:03:33

# This may seem confusing at first, but it makes sense, since we had to sudo
# to be able to read the keytab.
# Corollary: the analytics-privatedata user is not a replacement for your
# kerberos authentication, only a convenient way to run recurrent jobs via
# cron or similar.

NOTE: currently kerberos-run-command doesn't support scripts, only executables. The workaround is to have the user you're trying to sudo as obtain a ticket via a simple kerberos-run-command invocation, for example: sudo -u analytics kerberos-run-command analytics hdfs dfs -ls. After that you can run commands as that user, relying on the ticket that was just obtained: sudo -u analytics spark2-sql ...
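For recurrent jobs, the sudo plus kerberos-run-command pattern from the session above can be wrapped in a small helper. This is a minimal sketch: run_with_keytab is a hypothetical name, and it assumes that your user is allowed to sudo to the target user without a password prompt, which a non-interactive job would need.

```shell
# run_with_keytab USER CMD [ARGS...]
# Runs CMD as USER after authenticating from that user's keytab,
# mirroring the sudo + kerberos-run-command pattern shown above.
# Assumption: sudo to USER works without a password prompt.
run_with_keytab() {
    kt_user="$1"
    shift
    sudo -u "$kt_user" kerberos-run-command "$kt_user" "$@"
}
```

For example, run_with_keytab analytics-privatedata hdfs dfs -ls / is equivalent to the working command in the session above. Since kerberos-run-command only accepts executables, CMD must be an executable and not a script.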

Submit a job to Oozie or Hive with a different user than mine

Hive (server) and Oozie (server) now run as Hadoop proxies, meaning they are entitled to run jobs as other users. They of course ask users for authentication credentials before taking any action on their behalf. After a kinit you are able to provide those credentials (they are handled transparently by the hive/oozie client tools), but if you need to submit as another user, you'll need to follow what is written in the section above and use kerberos-run-command.

Use JDBC with Hive Server

Use the following connection string (adding any custom parameters that you need, of course):


The principal=hive/an-coord1001.eqiad.wmnet@WIKIMEDIA part may look weird at first, since we'd expect to put our credentials in there. In JDBC it seems that you need to provide the identity of the target Kerberos principal, not yours (that will be automatically picked up from the credentials cache) to instruct Hive to use Kerberos. See Hive docs for more info.
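As an illustration, a HiveServer2 JDBC URL for Kerberos generally has the form jdbc:hive2://HOST:PORT/DB;principal=SERVER_PRINCIPAL. The sketch below assembles one in the shell. The host, port and database are assumptions copied from the hive2.jdbc.url used in the Oozie example on this page, so check them against the values you actually need.

```shell
# Assumptions: host, port and database are taken from the
# hive2.jdbc.url in the Oozie example on this page.
HIVE_HOST="an-coord1001.eqiad.wmnet"
HIVE_PORT=10000
HIVE_DB="default"

# The principal is the identity of the Hive server, not yours.
# Your own identity is read from the credentials cache (kinit).
HIVE_PRINCIPAL="hive/${HIVE_HOST}@WIKIMEDIA"

JDBC_URL="jdbc:hive2://${HIVE_HOST}:${HIVE_PORT}/${HIVE_DB};principal=${HIVE_PRINCIPAL}"
echo "$JDBC_URL"
```

After a kinit, the resulting string can be passed to any JDBC client, for instance beeline -u "$JDBC_URL".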

Check the Yarn Resource Manager's UI

Nothing changed for the Yarn UI!

Check Hue

Nothing changed for Hue!

Use Hive

The hive cli is compatible with Kerberos, even if it uses an old protocol (connecting to the Hive Metastore and HDFS directly). The beeline command line uses the Hive 2 server via JDBC and it is also compatible with Kerberos. You just need to authenticate as described above and then run the tool on an-tool1006.eqiad.wmnet.

Use Spark 2

On stat100[4,5,7] and notebook100[3,4] authenticate via kinit and then use the spark shell as you are used to. There are currently some limitations:

  • spark2-thriftserver requires the hive keytab, which is only present on an-coord1001, so when running on client nodes it will return the following error: org.apache.hive.service.ServiceException: Unable to login to kerberos with given principal/keytab

Use Jupyterhub (SWAP replica)

You can authenticate to Kerberos by running kinit in the Terminal window. Please remember that this is needed only once every 24h, not every time.

Use Hive2 actions in Oozie

+    <credentials>
+        <credential name='my-hive-creds' type='hive2'>
+            <property>
+                <name>hive2.server.principal</name>
+                <value>hive/an-coord1001.eqiad.wmnet@WIKIMEDIA</value>
+            </property>
+            <property>
+                <name>hive2.jdbc.url</name>
+                <value>jdbc:hive2://an-coord1001.eqiad.wmnet:10000/default</value>
+            </property>
+        </credential>
+    </credentials>

     <start to="aggregate"/>

-    <action name="aggregate">
+    <action name="aggregate" cred="my-hive-creds">

Please note: the place in which you put credentials in a workflow is not arbitrary; it must follow the workflow's Oozie schema. In the example above, the credentials block comes before the start element.

Update Spark actions in Oozie

+    <credentials>
+        <credential name="hcat-cred" type="hcat">
+            <property>
+                <name>hcat.metastore.principal</name>
+                <value>hive/an-coord1001.eqiad.wmnet@WIKIMEDIA</value>
+            </property>
+            <property>
+               <name>hcat.metastore.uri</name>
+               <value>thrift://an-coord1001.eqiad.wmnet:9083</value>
+            </property>
+        </credential>
+    </credentials>

    <start to="generate_restbase_metrics"/>

-    <action name="generate_restbase_metrics">
+    <action name="generate_restbase_metrics" cred="hcat-cred">
        <spark xmlns="uri:oozie:spark-action:0.1">

Please note: the place in which you put credentials in a workflow is not arbitrary; it must follow the workflow's Oozie schema. In the example above, the credentials block comes before the start element.

How do I... (Analytics admins version)

Check the status of the HDFS Namenodes and Yarn Resource Managers

Most of the commands are the same, but of course to authenticate as the user hdfs you'll need to use a keytab:

sudo -u hdfs kerberos-run-command hdfs /usr/bin/yarn rmadmin -getServiceState an-master1001-eqiad-wmnet

sudo -u hdfs kerberos-run-command hdfs /usr/bin/hdfs haadmin -getServiceState an-master1002-eqiad-wmnet
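To check both masters in one go, the two commands above can be put in a loop. This is a sketch under assumptions: check_masters and as_hdfs are hypothetical helper names, and it assumes that both service IDs shown above are valid for both the yarn and the hdfs command.

```shell
# as_hdfs: run a command as the hdfs user, authenticating via its keytab.
as_hdfs() {
    sudo -u hdfs kerberos-run-command hdfs "$@"
}

# check_masters: print the HA state of each master for Yarn and HDFS.
# Assumption: both service IDs are valid for both commands.
check_masters() {
    for master in an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet; do
        echo "yarn $master: $(as_hdfs /usr/bin/yarn rmadmin -getServiceState "$master")"
        echo "hdfs $master: $(as_hdfs /usr/bin/hdfs haadmin -getServiceState "$master")"
    done
}
```

Running check_masters prints four lines, one per service and master.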