Data Platform/Systems/Cluster/Spark History
The Spark History server retains historical Spark job data, allowing us to perform performance analysis and investigation after jobs have finished.
It is currently configured to retain Spark job data for 60 days in production and 14 days in staging.
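Retention for a Spark history server is normally driven by Spark's standard event log cleaner settings. As an illustration only (the exact mechanism and values live in our deployment configuration), the equivalent spark-defaults style properties would look like this:
# Illustration only: standard Spark history cleaner properties; the values below are assumptions
spark.history.fs.cleaner.enabled  true
spark.history.fs.cleaner.maxAge   60d
spark.history.fs.cleaner.interval 1d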
We run 2 different spark-history server services:
spark-history-analytics-hadoop: interfaces with the analytics-hadoop Hadoop cluster
spark-history-analytics-test-hadoop: interfaces with the analytics-test-hadoop Hadoop cluster
Service Details
Deployment
As per Kubernetes/Deployments, we deploy these services from the deployment server, using helmfile.
spark-history-analytics-test-hadoop
kube_env spark-history-test-deploy dse-k8s-eqiad
cd /srv/deployment-charts/helmfile.d/dse-k8s-services/spark-history-test
helmfile -e dse-k8s-eqiad -i apply
spark-history-analytics-hadoop
kube_env spark-history-deploy dse-k8s-eqiad
cd /srv/deployment-charts/helmfile.d/dse-k8s-services/spark-history
helmfile -e dse-k8s-eqiad -i apply
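If you want to preview the pending changes before applying (an optional step, not part of the procedure above), helmfile's diff subcommand can be run from the same directory; apply itself relies on the helm-diff plugin, so diff should work wherever apply does:
helmfile -e dse-k8s-eqiad diff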
Accessing the spark-history-analytics-test-hadoop web UI
The spark-history-analytics-hadoop service is served behind an Apache reverse proxy, under the yarn.wikimedia.org
vhost, but the spark-history-analytics-test-hadoop service isn't. To access its UI, you need to add the following line to your /etc/hosts
file:
127.0.0.1 localhost spark-history-test.svc.eqiad.wmnet
and then set up an SSH tunnel to the service via the deployment server:
ssh -N -L 30443:spark-history-test.svc.eqiad.wmnet:30443 deployment.wikimedia.org
You can now open https://spark-history-test.svc.eqiad.wmnet:30443/ in your browser (acknowledge the security warning).
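As a quick command-line sanity check (optional), you can hit the service through the tunnel with curl; --insecure is needed because the certificate won't be trusted by your local machine, and name resolution relies on the /etc/hosts entry above:
curl --insecure https://spark-history-test.svc.eqiad.wmnet:30443/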
Configuration
Changing the log level
To change the root log level (to, e.g., debug), redeploy the application with the following helm value:
config:
  logging:
    root:
      level: debug
Alerting
The app isn't running
If you're getting paged because the app isn't running, check whether anything in the application logs (see the service details section) could explain the crash. In case of a recurring crash, the pod would be in CrashLoopBackOff
state in Kubernetes. To check whether this is the case, ssh to the deployment server and run the following commands:
kube_env spark-history-deploy dse-k8s-eqiad
kubectl get pods
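If a pod is listed but stuck in CrashLoopBackOff, the standard kubectl commands below (the pod name is a placeholder) will show its recent events and the logs of the crashed container:
kubectl describe pod <spark-history-pod>
kubectl logs <spark-history-pod> --previous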
If no pod at all is displayed, re-deploy the app by following the deployment instructions.
Misc
JDK version
The app requires JDK 8 to run. We initially ran it with JDK 11 and witnessed issues in the Spark event log retrieval from HDFS: about 66% of the log files failed to be retrieved (see investigation). The root cause is that the Hadoop library version we use (2.10.2 at the time of this writing) isn't compatible with JDK > 8.
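If in doubt, you can confirm which JDK a running pod actually uses with a standard kubectl exec (the pod name is a placeholder):
kube_env spark-history-deploy dse-k8s-eqiad
kubectl exec <spark-history-pod> -- java -version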
Rebuilding the spark-history server docker images
The spark-history docker image is the result of 3 images layered on top of each other:
openjdk8-jre (built from the production-images repository)
spark3.4 (built on top of openjdk8-jre, also from the production-images repository)
spark-history (built on top of spark3.4, from the repos/data-engineering/spark repository)
To rebuild either openjdk8-jre or spark3.4, send a CR to the production-images repository and, once it is accepted and merged, ssh onto build2001.codfw.wmnet and run:
$ sudo -i
% cd /srv/images/production-images/
% git pull origin master
% screen
% build-production-images --select '*spark3.4*' # this will take aaaages, it's fine.
To rebuild the spark-history server image, send an empty commit to the repos/data-engineering/spark repository, and the CI will rebuild it. The docker image tag will be in the CI publish job logs.
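For reference, an empty commit can be created with the standard git flag below (the commit message is arbitrary; push it according to the repository's usual workflow, e.g. via a merge request):
git commit --allow-empty -m "Trigger CI rebuild of the spark-history image"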