Analytics/Systems/Cluster/Spark History


The Spark History server retains historical Spark job data, allowing us to analyze and investigate job performance after the jobs have finished.

It is currently configured to retain Spark job data for 60 days in production and 14 days in staging.
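For reference, retention in a Spark History server is governed by the standard spark.history.fs.cleaner.* settings (see the Spark monitoring documentation linked below). A 60-day retention corresponds to values along these lines; this is an illustrative fragment, not the exact production configuration:

```properties
# Periodically clean up event logs, keeping at most 60 days of history
spark.history.fs.cleaner.enabled  true
spark.history.fs.cleaner.maxAge   60d
```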

We run two separate spark-history services:

  • spark-history-analytics-hadoop: interfaces with the analytics-hadoop cluster
  • spark-history-analytics-test-hadoop: interfaces with the analytics-test-hadoop cluster

Design notes for the deployment of the spark-history service in the DSE k8s cluster


Service Details

spark-history-analytics-hadoop
Owner: Data Platform SRE
Kubernetes Cluster: dse-k8s-eqiad
Kubernetes Namespace: spark-history
Chart: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/master/charts/spark-history/
Helmfiles: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/master/helmfile.d/dse-k8s-services/spark-history/
Docker image: https://gitlab.wikimedia.org/repos/data-engineering/spark/-/tree/main/history-server?ref_type=heads
Internal service DNS: spark-history.svc.eqiad.wmnet
Public service URL: https://yarn.wikimedia.org/spark-history/
Logs: https://logstash.wikimedia.org/goto/bb9f9bca1c6f3cce40acb2b86d306a77
Metrics: https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1&var-datasource=thanos&var-site=eqiad&var-cluster=k8s-dse&var-namespace=spark-history&var-container=All
Monitors: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/production/team-data-engineering/spark-history-availability.yaml
Application documentation: https://spark.apache.org/docs/3.4.0/monitoring.html
Paging: true
Deployment Phabricator ticket: https://phabricator.wikimedia.org/T330176


spark-history-analytics-test-hadoop
Owner: Data Platform SRE
Kubernetes Cluster: dse-k8s-eqiad
Kubernetes Namespace: spark-history-test
Chart: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/master/charts/spark-history/
Helmfiles: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/master/helmfile.d/dse-k8s-services/spark-history/
Docker image: https://gitlab.wikimedia.org/repos/data-engineering/spark/-/tree/main/history-server?ref_type=heads
Internal service DNS: spark-history-test.svc.eqiad.wmnet
Public service URL: n/a (not publicly exposed; see below for access via an SSH tunnel)
Logs: https://logstash.wikimedia.org/goto/1ac5a802d1b616bdcd5ec0232f45f1f0
Metrics: https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1&var-datasource=thanos&var-site=eqiad&var-cluster=k8s-dse&var-namespace=spark-history-test&var-container=All
Monitors: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/production/team-data-engineering/spark-history-availability.yaml
Application documentation: https://spark.apache.org/docs/3.4.0/monitoring.html
Paging: false
Deployment Phabricator ticket: https://phabricator.wikimedia.org/T330176

Deployment

As per Kubernetes/Deployments, we deploy these services from the deployment server, using helmfile.

spark-history-analytics-test-hadoop

kube_env spark-history-test-deploy dse-k8s-eqiad
cd /srv/deployment-charts/helmfile.d/dse-k8s-services/spark-history-test
helmfile -e dse-k8s-eqiad -i apply

spark-history-analytics-hadoop

kube_env spark-history-deploy dse-k8s-eqiad
cd /srv/deployment-charts/helmfile.d/dse-k8s-services/spark-history
helmfile -e dse-k8s-eqiad -i apply
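Before running apply, you can preview the pending changes with helmfile's diff subcommand (this assumes the helm-diff plugin, which diff relies on, is available on the deployment server) and confirm pod health afterwards:

```shell
# Run from /srv/deployment-charts/helmfile.d/dse-k8s-services/spark-history,
# with kube_env already set as above
helmfile -e dse-k8s-eqiad diff   # preview changes without applying them
kubectl get pods                 # after apply: the spark-history pod should be Running
```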

Accessing the spark-history-analytics-test-hadoop web UI

The spark-history-analytics-hadoop service is served behind an Apache reverse proxy, under the yarn.wikimedia.org vhost, but spark-history-analytics-test-hadoop isn't. To access its UI, add the following line to your /etc/hosts file: 127.0.0.1 localhost spark-history-test.svc.eqiad.wmnet, then set up an SSH tunnel to the service via the deployment server:

ssh -N -L 30443:spark-history-test.svc.eqiad.wmnet:30443 deployment.wikimedia.org

You can now open https://spark-history-test.svc.eqiad.wmnet:30443/ in your browser (acknowledge the security warning).
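To check the tunnel without a browser, you can probe the endpoint with curl (a hypothetical verification step, not part of the documented workflow); an HTTP 200 indicates the UI is being served:

```shell
# -k skips certificate verification, since the certificate won't match when
# accessed through the local tunnel
curl -sk -o /dev/null -w '%{http_code}\n' https://spark-history-test.svc.eqiad.wmnet:30443/
```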

Configuration

Changing the log level

To change the root log level (e.g., to debug), redeploy the application with the following Helm value:

config:
  logging:
    root:
      level: debug

Alerting

The app isn't running

If you're getting paged because the app isn't running, check whether anything in the application logs (see the service details section) could explain the crash. If the crash is recurring, the pod will be in the CrashLoopBackOff state in Kubernetes. To check whether this is the case, SSH to the deployment server and run the following commands:

kube_env spark-history-deploy dse-k8s-eqiad
kubectl get pods

If no pods are displayed at all, re-deploy the app by following the deployment instructions above.
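If a pod is shown but is crash-looping, the usual next steps are to inspect its events and the logs of the previous (crashed) container. A sketch, with <pod-name> taken from the kubectl get pods output above:

```shell
kubectl describe pod <pod-name>       # check the Events section for the failure reason
kubectl logs <pod-name> --previous    # logs from the last crashed container instance
```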

Misc

JDK version

The app requires JDK 8 to run. We initially ran it with JDK 11 and saw issues retrieving Spark event logs from HDFS: about 66% of the log files failed to be retrieved (see investigation). The root cause is that the Hadoop library version we use (2.10.2 at the time of writing) isn't compatible with JDK versions newer than 8.

Rebuilding the spark-history server docker images

The spark-history docker image is the result of 3 images layered on top of each other:

  • openjdk8-jre
  • spark3.4
  • spark-history

To rebuild either openjdk8-jre or spark3.4, send a CR to the production-images repository, and once it is accepted and merged, ssh onto build2001.codfw.wmnet and run

$ sudo -i
% cd /srv/images/production-images/
% git pull origin master
% screen
% build-production-images --select '*spark3.4*'  # this will take a long time; that's fine.

To rebuild the spark-history server image, push an empty commit to the repos/data-engineering/spark repository, and CI will rebuild it. The resulting Docker image tag can be found in the CI publish job logs.
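The mechanics of the empty commit are simply git commit --allow-empty. The sketch below demonstrates this in a throwaway repository; the real workflow would run the commit in a clone of repos/data-engineering/spark and push it to trigger CI:

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q demo && cd demo
# An empty commit carries no file changes but still triggers CI pipelines on push
git -c user.email=you@example.org -c user.name=you \
    commit --allow-empty -m "Trigger spark-history image rebuild"
git log --oneline          # shows the single empty commit
```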