Analytics/Systems/Cluster/Spark History


The Spark History server retains historical Spark job data, allowing us to analyze and investigate job performance after the jobs have finished.

It is currently configured to retain Spark job data for 60 days in production and 14 days in staging.
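For reference, retention in a Spark History server is governed by the standard spark.history.fs.cleaner.* settings (see the Spark monitoring documentation linked below). A 60-day retention corresponds to values along these lines; this is an illustrative fragment, not the exact production configuration:

```properties
# Periodically clean up event logs, keeping at most 60 days of history
spark.history.fs.cleaner.enabled  true
spark.history.fs.cleaner.maxAge   60d
```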

We run two separate spark-history services:

  • spark-history-analytics-hadoop: interfaces with the analytics-hadoop cluster
  • spark-history-analytics-test-hadoop: interfaces with the analytics-test-hadoop cluster

Design notes for the deployment of the spark-history service in the DSE k8s cluster


Service Details

spark-history-analytics-hadoop
Owner: Data Platform SRE
Kubernetes Cluster: dse-k8s-eqiad
Kubernetes Namespace: spark-history
Chart: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/master/charts/spark-history/
Helmfiles: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/master/helmfile.d/dse-k8s-services/spark-history/
Docker image: https://gitlab.wikimedia.org/repos/data-engineering/spark/-/tree/main/history-server?ref_type=heads
Internal service DNS: spark-history.svc.eqiad.wmnet
Public service URL: https://yarn.wikimedia.org/spark-history/
Logs: https://logstash.wikimedia.org/goto/bb9f9bca1c6f3cce40acb2b86d306a77
Metrics: https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1&var-datasource=thanos&var-site=eqiad&var-cluster=k8s-dse&var-namespace=spark-history&var-container=All
Monitors: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/production/team-data-engineering/spark-history-availability.yaml
Application documentation: https://spark.apache.org/docs/3.4.0/monitoring.html
Paging: true
Deployment Phabricator ticket: https://phabricator.wikimedia.org/T330176


spark-history-analytics-test-hadoop
Owner: Data Platform SRE
Kubernetes Cluster: dse-k8s-eqiad
Kubernetes Namespace: spark-history-test
Chart: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/master/charts/spark-history/
Helmfiles: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/master/helmfile.d/dse-k8s-services/spark-history/
Docker image: https://gitlab.wikimedia.org/repos/data-engineering/spark/-/tree/main/history-server?ref_type=heads
Internal service DNS: spark-history-test.svc.eqiad.wmnet
Public service URL: n/a (not publicly exposed; see below for access via an SSH tunnel)
Logs: https://logstash.wikimedia.org/goto/1ac5a802d1b616bdcd5ec0232f45f1f0
Metrics: https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1&var-datasource=thanos&var-site=eqiad&var-cluster=k8s-dse&var-namespace=spark-history-test&var-container=All
Monitors: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/production/team-data-engineering/spark-history-availability.yaml
Application documentation: https://spark.apache.org/docs/3.4.0/monitoring.html
Paging: false
Deployment Phabricator ticket: https://phabricator.wikimedia.org/T330176

Deployment

As per Kubernetes/Deployments, we deploy these services from the deployment server, using helmfile.

spark-history-analytics-test-hadoop

kube_env spark-history-test-deploy dse-k8s-eqiad
cd /srv/deployment-charts/helmfile.d/dse-k8s-services/spark-history-test
helmfile -e dse-k8s-eqiad -i apply

spark-history-analytics-hadoop

kube_env spark-history-deploy dse-k8s-eqiad
cd /srv/deployment-charts/helmfile.d/dse-k8s-services/spark-history
helmfile -e dse-k8s-eqiad -i apply
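Before running apply, you can preview the pending changes with helmfile's diff subcommand (this assumes the helm-diff plugin, which diff relies on, is available on the deployment server) and confirm pod health afterwards:

```shell
# Run from /srv/deployment-charts/helmfile.d/dse-k8s-services/spark-history,
# with kube_env already set as above
helmfile -e dse-k8s-eqiad diff   # preview changes without applying them
kubectl get pods                 # after apply: the spark-history pod should be Running
```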

Accessing the spark-history-analytics-test-hadoop web UI

The spark-history-analytics-hadoop service is served behind an Apache reverse proxy, under the yarn.wikimedia.org vhost, but spark-history-analytics-test-hadoop isn't. To access its UI, add the following line to your /etc/hosts file: 127.0.0.1 localhost spark-history-test.svc.eqiad.wmnet, then set up an SSH tunnel to the service via the deployment server:

ssh -N -L 30443:spark-history-test.svc.eqiad.wmnet:30443 deployment.wikimedia.org

You can now open https://spark-history-test.svc.eqiad.wmnet:30443/ in your browser (acknowledge the security warning).
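To check the tunnel without a browser, you can probe the endpoint with curl (a hypothetical verification step, not part of the documented workflow); an HTTP 200 indicates the UI is being served:

```shell
# -k skips certificate verification, since the certificate won't match when
# accessed through the local tunnel
curl -sk -o /dev/null -w '%{http_code}\n' https://spark-history-test.svc.eqiad.wmnet:30443/
```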

Configuration

Changing the log level

To change the root log level (e.g., to debug), redeploy the application with the following Helm value:

config:
  logging:
    root:
      level: debug

Alerting

The app isn't running

If you're getting paged because the app isn't running, check whether anything in the application logs (see the service details section) could explain the crash. If the crash is recurring, the pod will be in the CrashLoopBackOff state in Kubernetes. To check whether this is the case, SSH to the deployment server and run the following commands:

kube_env spark-history-deploy dse-k8s-eqiad
kubectl get pods

If no pods are displayed at all, re-deploy the app by following the deployment instructions above.
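If a pod is shown but is crash-looping, the usual next steps are to inspect its events and the logs of the previous (crashed) container. A sketch, with <pod-name> taken from the kubectl get pods output above:

```shell
kubectl describe pod <pod-name>       # check the Events section for the failure reason
kubectl logs <pod-name> --previous    # logs from the last crashed container instance
```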

Misc

JDK version

The app requires JDK 8 to run. We initially ran it with JDK 11 and saw issues retrieving Spark event logs from HDFS: about 66% of the log files failed to be retrieved (see investigation). The root cause is that the Hadoop library version we use (2.10.2 at the time of writing) isn't compatible with JDK versions newer than 8.

Rebuilding the spark-history server docker images

The spark-history docker image is the result of 3 images layered on top of each other:

  • openjdk8-jre
  • spark3.4
  • spark-history

To rebuild either openjdk8-jre or spark3.4, send a CR to the production-images repository, and once it is accepted and merged, ssh onto build2001.codfw.wmnet and run

$ sudo -i
% cd /srv/images/production-images/
% git pull origin master
% screen
% build-production-images --select '*spark3.4*'  # this will take a long time; that's fine.

To rebuild the spark-history server image, push an empty commit to the repos/data-engineering/spark repository, and CI will rebuild it. The resulting Docker image tag can be found in the CI publish job logs.
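The mechanics of the empty commit are simply git commit --allow-empty. The sketch below demonstrates this in a throwaway repository; the real workflow would run the commit in a clone of repos/data-engineering/spark and push it to trigger CI:

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q demo && cd demo
# An empty commit carries no file changes but still triggers CI pipelines on push
git -c user.email=you@example.org -c user.name=you \
    commit --allow-empty -m "Trigger spark-history image rebuild"
git log --oneline          # shows the single empty commit
```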