User:Elukey/Analytics/Hadoop testing cluster

Purpose of Kerberos

Hadoop by default does not ship with a strong authentication mechanism for users and daemons. In order to enable its "Secure mode", an external authentication service must be plugged in, and the only compatible one is Kerberos.

When enabled, like in the Hadoop test cluster, it means that users and daemons will need to authenticate to our Kerberos service before being able to use Hadoop. Please read the next sections to get more info about what do to.

Map of the testing hosts

The Hadoop test cluster is composed by old Hadoop Analytics nodes:

analytics1028 - Hadoop master (Yarn Resource Manager, HDFS Namenode, MapReduce History server)
analytics1029 - Hadoop master-standby (Yarn Resource Manager, HDFS Namenode)
analytics1030 - Hadoop coordinator (Hive Server 2, Hive Metastore, Oozie, etc..)
analytics10[31-38,40] - Hadoop worker nodes (Yarn Node Manager, HDFS Datanode, HDFS Journalnode)
analytics1039 - Hadoop UIs (yarn.wikimedia.org, Hue)
analytics1041 - Druid single node test cluster
an-tool1006 - Client node - SWAP/jupyterhub

This little cluster is aimed to test the Secure mode enabled (Kerberos) before reaching production.

How do I..

Authenticate via Kerberos

Run the kinit command, insert your password and then execute any command (spark, etc..). This is very important since if you don't do it, you'll see horrible error messages reported by basically anything you'll use. The kinit command grants you a so called Kerberos TGT, that will be used to allow you to authenticate to various services/hosts. The ticket last 24h, so you will not need to run kinit every time, just once a day. You can inspect the status of your ticket via klist.

Get a password for Kerberos

Ask to Luca (elukey) on Freenode #wikimedia-analytics or via email. You'll receive an email containing a temporary password, that you'll be required to change during you first authentication (see section above).

FAQ:

This is really annoying, can't we just use LDAP or something similar to avoid another password?
We tried really hard but for a lot of technical reasons, the integration would be complicated and cumbersome to maintain for Analytics and SRE. There might be some changes in the future, but for now we'll have to deal with another password to remember.

Run a recurrent job via Cron or similar without kinit every day

The option that is currently available is a Kerberos keytab: a file with permissions set that only the owner can read, holding the password to authenticate to Kerberos. We use keytabs for daemons/services, and we'll plan to provide those to users with the need to run periodical jobs. The major drawbacks are:

our security standard lowers down a bit, since it is sufficient to ssh to a host to access HDFS (as opposed to also know a password). This is more or less the current scheme, so not a big deal, but we have to think about it.
The keytab needs to be generated for every host that needs to have this automation and it also needs to be regenerated and re-deployed when the user changes the password (this doesn't happen for daemons of course). It is currently not automated, and it requires a ping to Analytics every time..

To summarize: we'll work on a solution, possibly shaped by feedback from users, but for the first iteration we'll not provide keytabs for all users (only selectively deploying those if needed). If you think that your use case needs it, please ping Analytics.

Know what datasets I can use

In the Hadoop test cluster we have only a few datasets available:

webrequest (sampled)
pageviews (sampled)

If you need more, please ping Luca or the Analytics team :)

Check the Yarn Resource Manager's UI

ssh -L 8088:analytics1028.eqiad.wmnet:8088 analytics1028.eqiad.wmnet

Then localhost:8088 on the browser.

Check Hue

ssh -L 8080:analytics1039.eqiad.wmnet:80 analytics1039.eqiad.wmnet

Then localhost:8080 on the browser.

Use Hive

The hive cli is compatible with Kerberos, even if it uses an old protocol (connecting to the Hive Metastore and HDFS directly). The beeline command line uses the Hive 2 server via JDBC and it is also compatible with Kerberos. You just need to authenticate as described above and then run the tool on an-tool1006.eqiad.wmnet.

Use Spark 2

On an-tool1006, authenticate via kinit and then use the spark shell as you are used to. There are currently some limitations:

Due to https://issues.apache.org/jira/browse/SPARK-23476, spark 2.3.x in local mode seems to not work with the option spark.authenticate = true. It is not set by default from Analytics, but if you use it be aware. The issue is fixed in Spark 2.4, that we are planning to deploy soon.
spark2-thriftserver requires the hive keytab, that is only present on an-coord1001, so when running on client nodes it will return the following error: org.apache.hive.service.ServiceException: Unable to login to kerberos with given principal/keytab

We added authentication and encryption options to the Spark2 defaults when testing Kerberos, since by default there is none provided/enforced. These options will ensure proper auth/encryption of traffic exchanged by Spark workers for example.

Use Jupyterhub (SWAP replica)

ssh -N an-tool1006.eqiad.wmnet -L 8000:127.0.0.1:8000

You can authenticate to Kerberos running kinit in the Terminal window. Please remember that it will be needed only once every 24h, not every time.

How do I... (Analytics admins version)

Check the status of the HDFS Namenodes and Yarn Resource Managers

Most of the commands are the same, but of course to authenticate as the user hdfs you'll need to use a keytab:

sudo -u hdfs kerberos-run-command hdfs /usr/bin/yarn rmadmin -getServiceState analytics1028-eqiad-wmnet

sudo -u hdfs kerberos-run-command hdfs /usr/bin/hdfs haadmin -getServiceState analytics1029-eqiad-wmnet

Tests by Joal

The basics - All good :)


System tested	command	Comments
hdfs	`hdfs dfs -ls /wmf/data/wmf`
yarn	`yarn application --list`
mapreduce	`hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 1000`
hive	`hive -e 'select count(1) from wmf.webrequest where year = 2019 and month = 8 and day = 15;'`
beeline	`beeline -e 'select count(1) from wmf.webrequest where year = 2019 and month = 8 and day = 15;'`
spark2-submit (scala local)	`spark2-submit \` `--master local \` `--class org.apache.spark.examples.SparkPi \` `/home/joal/test_spark_submit/spark-2.3.3-bin-hadoop2.6/examples/jars/spark-examples_2.11-2.3.3.jar \` `10`	The tested example doesn't connect to any cluster component, therefore it works :) This means spark can be run in local-mode using local files without `kinit`.
spark2-submit (scala yarn)	`spark2-submit \` `--master yarn \` `--class org.apache.spark.examples.SparkPi \` `/home/joal/test_spark_submit/spark-2.3.3-bin-hadoop2.6/examples/jars/spark-examples_2.11-2.3.3.jar \` `100`
spark2-submit (python local)	`spark2-submit \` `--master local \` `/home/joal/test_spark_submit/spark-2.3.3-bin-hadoop2.6/examples/src/main/python/pi.py \` `10`	Same as for scala, no connection to the cluster, works without `kinit`.
spark2-submit (python yarn)	`spark2-submit \` `--master yarn \` `/home/joal/test_spark_submit/spark-2.3.3-bin-hadoop2.6/examples/src/main/python/pi.py \` `100`
spark2-shell (local, interactive)	`spark2-shell --master local` `spark.sparkContext.parallelize(0 until 1000).sum()` `spark.sql("select count(1) from wmf.webrequest where year = 2019 and month = 8 and day = 15").show()` `spark.read.parquet(` `"/wmf/data/wmf/webrequest/webrequest_source=test_text/year=2019/month=8/day=15/hour=0").count()`	Works as long as no connection to the cluster is needed (nor hdfs nor metastore).
spark2-shell (yarn, interactive)	`spark2-shell --master yarn` `spark.sparkContext.parallelize(0 until 1000).sum()` `spark.sql("select count(1) from wmf.webrequest where year = 2019 and month = 8 and day = 15").show()` `spark.read.parquet(` `"/wmf/data/wmf/webrequest/webrequest_source=test_text/year=2019/month=8/day=15/hour=0").count()`
pyspark2 (local, interactive)	`pyspark2 --master local` `spark.sparkContext.parallelize(range(0, 1000)).sum()` `spark.sql("select count(1) from wmf.webrequest where year = 2019 and month = 8 and day = 15").show()` `spark.read.parquet(` `"/wmf/data/wmf/webrequest/webrequest_source=test_text/year=2019/month=8/day=15/hour=0").count()`	Works as long as no connection to the cluster is needed (nor hdfs nor metastore).
pyspark2 (yarn, interactive)	`pyspark2 --master yarn` `spark.sparkContext.parallelize(range(0, 1000)).sum()` `spark.sql("select count(1) from wmf.webrequest where year = 2019 and month = 8 and day = 15").show()` `spark.read.parquet(` `"/wmf/data/wmf/webrequest/webrequest_source=test_text/year=2019/month=8/day=15/hour=0").count()`

Oozie - All good :)

Action tested	Configuration	Comments
hive2	Needs hive2 server URI, hive2 server principal, credential settings (hive2 credential type in workflow), and credential reference in every hive2 action	Tested with `webrequest_load` and `pageview_hourly` jobs
fs	No configuration needed, credentials set in oozie itself.	Tested with `webrequest_load` and `pageview_hourly` jobs
email	No configuration, working as is (expected, no cross-component communication needed)	Tested with `webrequest_load` and `pageview_hourly` jobs
shell	No configuration, working as is (expected, no cross-component communication needed)	Tested with `pageview_hourly` job
spark	Configuration needed if spark uses Hive Metastore (`sparkSession.builder().enableHiveSupport()`), otherwise works as is with configuration set in oozie itself. When using hive metastore: hive metastore URI, hive metastore principal, credential settings (hcat credential type in workflow), and credential reference in every spark action	Tested with updated `mobile_app/session_metrics` job
MapReduce	No configuration needed, credentials set in oozie itself.	Tested with a new workflow created on purpose

SWAP - Some problematic things :)

What works:

Running HDFS commands in jupyter terminal behave as expected: failure without kinit, success with it
Local spark kernels (pyspark or scala-spark) work as long as not accessing hdfs or hive withtout kinit. For hive access or hdfs access, kinit is needed.
YARN spark kernels (pyspark and scala-spark, normal or large) don't work without kinit (as expected)

Problematic things:

No errors are reported in the notebook when the kernel doesn't manage to start (due to missing kinit)
Notebook kernels need to be restarted after the kinit got issued to pick the new ticket. This means all computation stored in notebook cache is lost.
Notebook kernel having been started with a valid kinit (i should say kerb ticket ...) works even after kdestroy is called on the term.