Apache Spark is a fast and general-purpose cluster computing framework for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs.
This page provides instructions for launching a Spark cluster in standalone mode using Slurm, i.e. for running Spark as a regular Slurm job. To set up a Spark cluster, one first needs to launch a “master” process, which is the resource manager in Spark, and a number of “worker” processes, which are responsible for executing Spark jobs. The “master” process is always running, so the user only needs to start the “worker” processes to be able to run Spark.
The Spark cluster in the standalone mode can be launched on Cy-Tera using Apache Zeppelin, a web-based notebook that enables interactive data analytics.
Zeppelin is available through the link below:
The first step is to log in to Zeppelin using the credentials sent to you by the user support team.
Launch Spark workers
Once you log in to Zeppelin, you can launch the Spark workers from the homepage. Select from the menu whether you want to start or stop the workers, how many workers should run concurrently, and for how long they should be available. Then run the paragraph, either by selecting the “Run” button or by pressing Shift+Enter.
The next paragraph is used to check the status of the workers. You can select from the menu the state you want to view and then run the paragraph. The last paragraph shows the usage of your group on Cy-Tera.
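Because the workers run as regular Slurm jobs, their state can presumably also be checked from a login shell with Slurm's `squeue` command. The following is a minimal sketch under that assumption; it is not the code behind the Zeppelin paragraph. The function name `list_my_jobs` and the `SQUEUE` override variable are hypothetical and exist only for this illustration, while `--user` and `--states` are standard `squeue` options.

```shell
#!/bin/bash
# Sketch: list your own Slurm jobs in a given state.
# Assumption: you have shell access to a node where squeue is installed.

list_my_jobs() {
    # Job state to filter on; defaults to RUNNING.
    local state="${1:-RUNNING}"

    # Accept only a few common Slurm job states to catch typos early.
    case "$state" in
        PENDING|RUNNING|COMPLETED|CANCELLED|FAILED|TIMEOUT) ;;
        *)
            echo "unsupported state: $state" >&2
            return 2
            ;;
    esac

    # SQUEUE is a hypothetical override (useful for dry runs);
    # by default the real squeue command is called.
    "${SQUEUE:-squeue}" --user="$(id -un)" --states="$state"
}
```

For example, `list_my_jobs PENDING` would list worker jobs that are still waiting in the queue, and `list_my_jobs` with no argument would list the running ones.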
The script used to launch a Spark “worker” on a Cy-Tera compute node can be found below. The script is stored in the user’s home directory by default. The logs for the Spark workers are saved in the user’s home directory under spark_logs, and the working directory for Spark is the user’s home directory. Both SPARK_LOG_DIR and the working directory (-d) can be changed if needed. If no changes are required, the user does not need to interact with the job script at all.
#SBATCH --ntasks-per-node=12
# This script starts only the Spark workers. The Spark master is running on post02.
MASTER=post02
module use /gpfs/buildsets/eb180212/modules/all
module load Spark
SPARK_NO_DAEMONIZE=yes SPARK_LOG_DIR=$HOME/spark_logs start-slave.sh spark://$MASTER:7077 -d $HOME
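When a worker does not come up as expected, its log is the first place to look. The sketch below finds the most recently modified file in the log directory. It assumes the default SPARK_LOG_DIR of `$HOME/spark_logs` from the script above; `latest_spark_log` is a hypothetical helper name, not part of Spark or of the Cy-Tera setup.

```shell
#!/bin/bash
# Sketch: print the path of the newest file in a Spark log directory.
# The default directory matches SPARK_LOG_DIR in the job script;
# pass another directory as the first argument if you changed it.

latest_spark_log() {
    local log_dir="${1:-$HOME/spark_logs}"
    local newest=""
    local f

    if [ ! -d "$log_dir" ]; then
        echo "no log directory at $log_dir" >&2
        return 1
    fi

    # Keep whichever regular file has the latest modification time.
    for f in "$log_dir"/*; do
        [ -f "$f" ] || continue
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest="$f"
        fi
    done

    if [ -z "$newest" ]; then
        echo "no log files in $log_dir" >&2
        return 1
    fi

    printf '%s\n' "$newest"
}
```

A typical use would be `tail -n 50 "$(latest_spark_log)"` to read the end of the latest worker log.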
Spark Web User Interface
Before or while running a Spark application, you can visit the Spark Web User Interface (UI) to see the status of the Spark cluster, e.g. alive workers, running jobs and completed jobs. The Spark Web UI is accessible through the link below:
Running a Spark application
To run a new Spark application, you first need to create a new notebook in Zeppelin by selecting “Notebook -> Create new note” from the menu. Give the notebook a name and select its default interpreter. The interpreter you should use for Spark, from the dropdown list, is spark_<your username>. By default, each user has their own Spark interpreter. Once the notebook is created, you are ready to type the Spark code and run the application.
By default, all users can read, write and change the permissions of each newly created notebook. To change the permissions of a notebook, select the “Note permissions” icon in the top right corner of the notebook.
The list of existing notebooks includes a ready-made notebook that you can use to run a Spark example. Before running it, make sure to change its Spark interpreter to your own, which has the format spark_<your username>. To do that, select the notebook “spark example” and go to the “Interpreter binding” icon in the top right corner of the page.
If you want to rerun a Spark application, you first need to restart the application’s Spark interpreter by going to the “Interpreter binding” icon again.
Once you finish running your Spark applications, go to the Home page and stop the Spark workers that you have launched.