
Spark is a set of libraries and tools available in Scala, Java, Python, and R that allow for general purpose distributed batch and real-time computing and processing.

Spark is available for use on the Analytics Hadoop cluster in YARN. Spark 1.6 is part of the Cloudera distribution for Hadoop that we use. In November 2017, we additionally deployed Spark 2 to the Analytics Hadoop cluster as a separate and non-conflicting installation.

All of the documentation below refers to Spark 1 CLIs. You can use the Spark 2 CLIs with the same arguments (with the exception of Oozie). The Spark 2 CLI executables are:

  • spark2-submit
  • spark2-shell
  • spark2R
  • spark2-sql
  • pyspark2

spark2R and spark2-sql are new in Spark 2. spark2-sql lets you interact with Hive tables via the Spark SQL engine from a pure SQL REPL, without having to code in a programming language.
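For example (a sketch, assuming the standard spark-sql CLI flags; the statement is a placeholder):

```shell
# Open an interactive SQL REPL on the cluster
spark2-sql --master yarn

# Or run a single statement non-interactively with -e
spark2-sql --master yarn -e "SHOW DATABASES;"
```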

How do I ...

Start a spark shell

  • Scala
spark-shell --master yarn --executor-memory 2G --executor-cores 1 --driver-memory 4G
  • Python
pyspark --master yarn --executor-memory 2G --executor-cores 1 --driver-memory 4G
  • R
spark2R --master yarn --executor-memory 2G --executor-cores 1 --driver-memory 4G
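If you only want to experiment with small data, Spark can also run without YARN in local mode (a sketch; local[4] runs Spark with four worker threads on the local machine):

```shell
spark-shell --master "local[4]" --driver-memory 2G
```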

See Spark logs on my local machine when using spark-submit

  • If you are running Spark in local mode, spark-submit should write logs to your console by default.
  • How to get logs written to a file?
    • Spark uses log4j for logging, and the log4j config is usually at /etc/spark/
    • This uses a ConsoleAppender by default; to write to a file instead, an example log4j properties file would be:
# Set everything to be logged to a file
log4j.rootCategory=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.File=/tmp/spark.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

This should write logs to /tmp/spark.log

  • On the analytics cluster (e.g. stat1005):
    • Running a Spark job through spark-submit writes logs to the console too, in both YARN and local modes
    • To write to a file, create a log4j properties file similar to the one above that uses the FileAppender
    • Use the --files argument on spark-submit (or spark-shell) to upload your custom file:
spark-shell --master yarn --executor-memory 2G --executor-cores 1 --driver-memory 4G --files /path/to/your/
  • While running a Spark job through Oozie
    • The log4j file path now needs to be a location accessible by all drivers/executors running on different machines
    • Putting the file in a temp directory on HDFS and using an hdfs:// URL should do the trick
    • Note that the logs will be written on the machines where the driver/executors are running, so you'd need access to those machines to look at them
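The upload step above can be sketched as follows (the file name, HDFS path, and job script are placeholders):

```shell
# Put the custom log4j config where all drivers/executors can read it
hdfs dfs -put my-log4j.properties /tmp/my-log4j.properties

# Reference it with an hdfs:// URL in the --files argument (or in the Oozie spark-opts)
spark-submit --master yarn --files hdfs:///tmp/my-log4j.properties your_job.py
```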

Monitor Spark shell job Resources

If you run something more complicated in the Spark shell and you want to see how YARN is managing resources, have a look at the YARN ResourceManager web UI.

Don't hesitate to poke people on #wikimedia-analytics for help!
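From a stat100* machine, the standard YARN CLI can also be used to inspect applications (the application ID below is a placeholder):

```shell
# List running YARN applications and their resource usage
yarn application -list -appStates RUNNING

# Fetch the aggregated logs of a finished application
yarn logs -applicationId application_1234567890123_12345
```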

Use Hive UDF with Spark SQL

Here is an example in R. On stat1005, start a Spark shell, passing the path to the refinery-hive jar:

spark2R --master yarn --executor-memory 2G --executor-cores 1 --driver-memory 4G --jars /srv/deployment/analytics/refinery/artifacts/refinery-hive.jar

Then in the R session:

sql("CREATE TEMPORARY FUNCTION is_spider as ''")
sql("Your query")

Spark and Ipython

As of 2017, it is also possible to run iPython/Jupyter notebooks in a pre-installed environment on WMF servers, see SWAP.

The Spark Python API makes working with data in HDFS very easy. For exploratory tasks, I (Ewulczyn) like using IPython notebooks. You can run Spark from an IPython notebook by doing the following:

On the remote machine (eg. stat1005):

Tell pyspark (Spark 1) to start the IPython notebook server when called:

export IPYTHON_OPTS="notebook --pylab inline --port 8123  --ip='*' --no-browser"

For pyspark2, the options are:

export PYSPARK_DRIVER_PYTHON_OPTS="notebook --port 8123  --ip='*' --no-browser"

Start pyspark

pyspark --master yarn --deploy-mode client --executor-memory 2g --conf spark.dynamicAllocation.maxExecutors=32

Then, on your laptop, use proxy access to the cluster. For example:

ssh -N -L 8123:127.0.0.1:8123 stat1005.eqiad.wmnet

This creates a tunnel from http://localhost:8123 on your machine to the IPython notebook server. Now, create a notebook and start coding!
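As a sanity check, here is a minimal first cell to confirm the notebook is talking to the cluster (this assumes the SparkContext is available as sc, which pyspark provides by default):

```python
# sc is the SparkContext created by pyspark for the notebook session
rdd = sc.parallelize(range(100))
print(rdd.sum())  # 4950 if the executors are running
```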

pyspark and external packages

To use external packages like graphframes:

pyspark --packages graphframes:graphframes:0.3.0-spark2.0-s_2.11 --conf "spark.driver.extraJavaOptions=-Dhttp.proxyHost=webproxy.eqiad.wmnet -Dhttp.proxyPort=8080 -Dhttps.proxyHost=webproxy.eqiad.wmnet -Dhttps.proxyPort=8080"

The proxy settings are needed to avoid hanging at dependency resolution:

resolving dependencies :: org.apache.spark#spark-submit-parent;1.0

	confs: [default]

Spark and Oozie

Oozie has a spark action, allowing you to launch Spark jobs as you'd do (almost ...) with spark-submit:

<spark xmlns="uri:oozie:spark-action:0.1">
             ...
             <spark-opts>--conf spark.yarn.jar=${spark_assembly_jar} --executor-memory ${spark_executor_memory} --driver-memory ${spark_driver_memory} --num-executors ${spark_number_executors} --queue ${queue_name} --conf spark.yarn.appMasterEnv.SPARK_HOME=/bogus --driver-class-path ${hive_lib_path} --driver-java-options "-Dspark.executor.extraClassPath=${hive_lib_path}" --files ${hive_site_xml}</spark-opts>
             ...
</spark>

The tricky part here is the spark-opts element: Spark needs to be given specific configuration settings that are not loaded automatically as they are with spark-submit:

  • Core spark jar is needed in configuration:
--conf spark.yarn.jar=${spark_assembly_jar}
# on analytics-hadoop:
#    spark_assembly_jar = hdfs://analytics-hadoop/user/spark/share/lib/spark-assembly.jar
  • When using Python, you need to set the SPARK_HOME environment variable (to a dummy value, for instance):
--conf spark.yarn.appMasterEnv.SPARK_HOME=/bogus
  • If you want to use HiveContext in spark, you need to add the hive lib jars and hive-site.xml to spark (not done by default in our version):
--driver-class-path ${hive_lib_path} --driver-java-options "-Dspark.executor.extraClassPath=${hive_lib_path}" --files ${hive_site_xml}
# on analytics-hadoop: 
#   hive_lib_path = /usr/lib/hive/lib/*
#   hive_site_xml = hdfs://analytics-hadoop//util/hive/hive-site.xml

SparkR in production (stat100* machines) examples

SparkR: Basic example

From stat100*, and with the latest {SparkR} installed:


# - set environment variables
Sys.setenv("SPARKR_SUBMIT_ARGS"="--master yarn-client sparkr-shell")

# - start SparkR session
sparkR.session(master = "yarn", 
   appName = "SparkR", 
   sparkHome = "/usr/lib/spark2/", 
   sparkConfig = list(spark.driver.memory = "4g", 
                      spark.driver.cores = "1", 
                      spark.executor.memory = "2g", 
                      spark.shuffle.service.enabled = TRUE, 
                      spark.dynamicAllocation.enabled = TRUE))

# - a somewhat trivial example w. linear regression on iris 

# - iris becomes a SparkDataFrame
df <- createDataFrame(iris)

# - GLM w. family = "gaussian"
model <- spark.glm(data = df, Sepal_Length ~ Sepal_Width + Petal_Length + Petal_Width, family = "gaussian")

# - summary
summary(model)

# - end SparkR session
sparkR.session.stop()

SparkR: Large(er) file from HDFS

Also from stat100*, and with the latest {SparkR} installed:

### --- flights dataset Multinomial Logistic Regression
### --- SparkDataFrame from HDFS
### --- NOTE: in this example, 'flights.csv' is found in /home/goransm/testData on stat1005 

Sys.setenv("SPARKR_SUBMIT_ARGS"="--master yarn-client sparkr-shell")

### --- Start SparkR session w. Hive Support enabled
sparkR.session(master = "yarn",
               appName = "SparkR",
               sparkHome = "/usr/lib/spark2/",
               enableHiveSupport = TRUE,
               sparkConfig = list(spark.driver.memory = "4g",
                                  spark.driver.cores = "2",
                                  spark.shuffle.service.enabled = TRUE,
                                  spark.dynamicAllocation.enabled = TRUE,
                                  spark.executor.instances = "8",
                                  spark.dynamicAllocation.minExecutors = "4",
                                  spark.executor.cores  = "2",
                                  spark.executor.memory = "4g",
                                  spark.rpc.message.maxSize = "512" # - probably not necessary
                                  ))

# - copy flights.csv to HDFS
system('hdfs dfs -put /home/goransm/testData/flights.csv hdfs://analytics-hadoop/user/goransm/flights.csv', 
       wait = T)

# - load flights
df <- read.df("flights.csv",
              source = "csv",
              header = "true",
              inferSchema = "true",
              na.strings = "NA")

# - structure
str(df)

# - dimensionality
dim(df)

# - clean up df from NA values
df <- filter(df, isNotNull(df$AIRLINE) & isNotNull(df$ARRIVAL_DELAY) & isNotNull(df$AIR_TIME) & isNotNull(df$TAXI_IN) & 
                 isNotNull(df$TAXI_OUT) & isNotNull(df$DISTANCE) & isNotNull(df$ELAPSED_TIME))

# - dimensionality after cleaning
dim(df)

# - Logistic Regression w. family = "multinomial"
# - (illustrative formula: predict AIRLINE from the numeric columns)
model <- spark.logit(data = df, 
                     AIRLINE ~ ARRIVAL_DELAY + AIR_TIME + TAXI_IN + TAXI_OUT + DISTANCE + ELAPSED_TIME,
                     family = "multinomial")

# - Regression Coefficients
res <- summary(model)

# - delete flights.csv from HDFS
system('hdfs dfs -rm hdfs://analytics-hadoop/user/goransm/flights.csv', wait = T)

# - close SparkR session
sparkR.session.stop()