General documentation for Hue can be found on our instance at https://hue.wikimedia.org/help/. This is our graphical interface into the Hadoop cluster and everything happening on it, so it's complex. On this page we will try to detail a few common tasks.
To access Hue, you'll need membership in the nda LDAP group. For more details, see Analytics/Data access#LDAP access.
You will also need to have your LDAP account manually synced to Hue. Ask an Analytics operations engineer (currently ottomata, aotto@wikimedia.org, or elukey, ltoscano@wikimedia.org) or file a Phabricator task for help.
The manual sync only needs to happen once. Afterwards, you can log in at hue.wikimedia.org with your developer shell username and password.
Hive query errors with Kerberos
In T242306, the following error was reported while using the Hive query editor:
Error while compiling statement: FAILED: SemanticException java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
The error is related to Kerberos authentication, and can be fixed by forcing a renewal of the Hive session via the following steps:
- In the Hue Hive editor, locate the 3-dots button at the top-right corner of the screen and click it.
- Then click "Session" and then the "Recreate" button.
Testing an Oozie job that runs a Spark job
- start your job, overriding properties like start_time (see details at Analytics/Cluster/Oozie). Coordinators should have example submit commands at the top.
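A submit command generally looks something like the sketch below; the start/stop times and the properties file path here are placeholders, so copy the real command from the top of the coordinator itself:

```shell
# All values below are placeholders; each coordinator documents its real
# submit command at the top of its file.
START_TIME="2019-06-18T00:00Z"
STOP_TIME="2019-06-18T23:59Z"
PROPERTIES="/srv/deployment/analytics/refinery/oozie/example/coordinator.properties"

# Printed rather than executed, since this is only a sketch:
echo oozie job -oozie '$OOZIE_URL' \
    -Dstart_time="$START_TIME" -Dstop_time="$STOP_TIME" \
    -config "$PROPERTIES" -run
```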
- look at running coordinators on Hue. When you started your job, you got an Oozie ID you can search for directly, but your job is usually at or near the top of the Running or Completed list there.
- in the coordinator view, on the Calendar Tab, you should see just one instance running if you properly passed start_time/stop_time overrides. Click on that.
- in the workflow view now, on the Actions Tab, you'll see a little 3-stack icon in the Logs column. Click on that.
- These are the logs of the Oozie job, but you probably want the logs of the Spark application master. When running normally, the Oozie job logs lines like these:
2019-06-18 14:25:52,359 [main] INFO org.apache.spark.deploy.yarn.Client - Application report for application_1560620285026_8417 (state: RUNNING)
2019-06-18 14:25:53,360 [main] INFO org.apache.spark.deploy.yarn.Client - Application report for application_1560620285026_8417 (state: RUNNING)
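The application id in those lines is what you need for the next step. As a sketch, it can be pulled out of a saved log line with grep (the log line here is copied from above; the variable names are just for illustration):

```shell
# One of the Oozie log lines above, saved as a string for illustration.
LOG_LINE='2019-06-18 14:25:52,359 [main] INFO org.apache.spark.deploy.yarn.Client - Application report for application_1560620285026_8417 (state: RUNNING)'

# YARN application ids always look like application_<clusterTimestamp>_<sequence>.
APP_ID=$(printf '%s\n' "$LOG_LINE" | grep -oE 'application_[0-9]+_[0-9]+')
echo "$APP_ID"
```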
- Your Spark job logs will be in YARN under the application id "application_1560620285026_8417". To find it, either browse https://yarn.wikimedia.org/cluster/scheduler and look around, or go directly to https://yarn.wikimedia.org/cluster/app/application_1560620285026_8417. If the cluster isn't too busy, the scheduler view may be the quickest way to find your job without going through Hue at all.
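The direct link follows a fixed pattern, so you can build it from the application id rather than hunting through the scheduler:

```shell
# Build the direct YARN UI link from an application id.
APP_ID="application_1560620285026_8417"
APP_URL="https://yarn.wikimedia.org/cluster/app/${APP_ID}"
echo "$APP_URL"
```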
- Tricky: our default settings retry jobs up to 6 times. You won't see this just by looking at Hue, because the Oozie job doesn't fail when the application master fails; it retries up to 6 times, failing and restarting each time. If this is happening, you'll see more than one application master listed at the direct link above. In that case, you probably want to kill your job, since it wouldn't have restarted unless something was wrong.
- Important: your job has to finish before its logs are available; because it runs through YARN, logs are aggregated and become available only after the job completes.
- Copy your application id and look at yarn logs. As the user that started the job (analytics, hdfs, your own user, etc.), run
yarn logs -applicationId application_1560620285026_8417 | grep ERROR
- If you need more detail, you can play with the grep, but keep in mind that yarn logs are incredibly verbose. There's a wrapper script that helps grep for some common things; you can run it like this (as the user that started the job, otherwise all you get is confusion):
$ export PYTHONPATH=/srv/deployment/analytics/refinery/python
$ cd /srv/deployment/analytics/refinery/bin
$ ./yarn-logs application_1560620285026_8417
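If the wrapper script isn't handy, plain grep over the yarn logs output also goes a long way. Here's a sketch over fabricated sample lines (the log lines are made up; the grep patterns are the useful part):

```shell
# Fabricated sample of aggregated YARN log output, for illustration only.
SAMPLE_LOGS='2019-06-18 14:26:01 INFO  DAGScheduler - Job 0 finished
2019-06-18 14:26:02 ERROR Executor - Exception in task 3.0
2019-06-18 14:26:03 WARN  TaskSetManager - Lost task 3.0'

# Failure signatures that are usually worth looking for:
MATCHES=$(printf '%s\n' "$SAMPLE_LOGS" | grep -cE 'ERROR|Exception|Killed')
echo "$MATCHES matching line(s)"
```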