Analytics/Cluster/Hadoop/Test


The Analytics team runs a Hadoop test cluster in production. This may seem controversial, since we could use Cloud virtual machines instead, but due to the complexity of the Hadoop setup (Kerberos auth, TLS encryption, Kafka topics to sample in order to generate data, etc.) running a test cluster there would be very resource demanding.

The purpose of the cluster is to test any kind of infrastructure change before it reaches production, from a small config tweak to big and invasive changes like upgrading the Hadoop distribution or changing the security/auth scheme.

Hosts and their roles

an-test-master1001.eqiad.wmnet - Hadoop master (HDFS Namenode, Yarn ResourceManager, Mapreduce History server).

an-test-master1002.eqiad.wmnet - Hadoop master standby (HDFS Namenode, Yarn ResourceManager).

an-test-coord1001.eqiad.wmnet - Multi-purpose node (Hive Server/Metastore, Oozie, Presto, systemd timers for refine and druid load).

an-test-worker100[1,2,3].eqiad.wmnet - Hadoop worker nodes (HDFS Datanode, Yarn Nodemanager, HDFS Journalnode).

an-test-druid1001.eqiad.wmnet - Single node Druid test cluster (all Druid daemons running on it).

an-test-ui1001.eqiad.wmnet - Yarn and Hue UIs.

an-test-client1002.eqiad.wmnet - Analytics client node.

UI quicklinks

Hue: ssh -L 8080:an-test-ui1001.eqiad.wmnet:80 an-test-ui1001.eqiad.wmnet

Yarn: ssh -L 8080:an-test-ui1001.eqiad.wmnet:80 an-test-ui1001.eqiad.wmnet and then set Host: yarn.wikimedia.org in the browser's custom header settings
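
If you prefer to check from the command line instead of configuring a browser header, a quick test through the tunnel could look like the following (a sketch, assuming the tunnel above is open on local port 8080):

# Query the Yarn UI through the SSH tunnel, passing the Host header explicitly.
curl -s -H 'Host: yarn.wikimedia.org' http://localhost:8080/cluster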

Druid Coordinator/Overlord: ssh -L 8081:localhost:8081 an-test-druid1001.eqiad.wmnet
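
To verify the Druid tunnel, a sketch assuming the tunnel above is open on local port 8081:

# Druid daemons expose a /status endpoint with version and health information.
curl -s http://localhost:8081/status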

Oozie coordinator testing

Refinery source and Refinery are not kept in sync with every "prod" deployment, to avoid adding workload to whoever is on Ops week, but anybody who wants to update them can follow the deployment rules stated in Analytics/Systems/Cluster/Deploy/Refinery#Deploying to Hadoop test (plus the usual rules for deployment).

A possible multi-user testing scenario is the following:

  • Refinery and source are deployed to HDFS as they are, namely without any ad-hoc changes applied.
  • Every user maintains a /user/$username/oozie directory on hdfs, uploading custom oozie files and settings.
  • When launching the coordinator/bundle, the user should adjust their properties file to point to the /user/$username/oozie directory (see the sketch after this list).
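
A minimal sketch of launching a coordinator against a personal oozie directory; the property name (oozie_directory), paths and properties file below are illustrative, so check the specific job's .properties file for the names it actually expects:

# Submit the coordinator so that it reads workflow code from your own HDFS
# area instead of the deployed refinery directory.
oozie job -oozie $OOZIE_URL \
  -Doozie_directory=hdfs:///user/$USER/oozie \
  -config /home/$USER/coordinator.properties \
  -run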

The most common example is probably to avoid spamming analytics-alerts@ from the test cluster:

cd refinery/oozie
for filename in $(grep -rl analytics-alerts .); do
    sed -i -e 's/analytics-alerts@/ltoscano@/g' "$filename"
done
# then upload to /user/elukey/oozie on HDFS in this case
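
A sketch of that upload step, assuming a personal /user/$USER/oozie area on HDFS (paths are illustrative):

# From the refinery checkout root: replace any previous copy and upload the
# locally modified oozie directory to your personal HDFS area.
hdfs dfs -rm -r -f /user/$USER/oozie
hdfs dfs -put oozie /user/$USER/oozie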

Hadoop upgrade - jobs to test

Before rolling out a new Hadoop version to the production cluster, we should have tested the following jobs on the test cluster (a quick way to spot the related systemd timers is sketched after this list):

  • Gobblin data import (subset of webrequest and events)
  • Webrequest Load (on smaller data)
  • Events Refine (on smaller data)
  • Druid load (webrequest_sampled_128 for instance)
  • Wikitext-current (single small XML dump imported manually)
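
A minimal sketch of how to spot the timers driving several of the jobs above, assuming they are scheduled via systemd on an-test-coord1001 (host and unit names are illustrative):

# List the systemd timers related to Gobblin imports, Refine and Druid loading.
systemctl list-timers --all | grep -iE 'gobblin|refine|druid'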