How to set up a standalone Spark cluster?

First of all, you need to download a prebuilt Spark binary (bundled with the Hadoop libraries) from the Spark download site. I have already downloaded the binaries and then extracted them using the tar command shown below.
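
If you still need to fetch an archive, a wget against the Apache release archive should work. The URL below follows the usual archive layout but is an assumption on my side, so adjust the version and mirror to whatever you need:

# assumed URL pattern from the Apache archive; pick the release you actually want
wget https://archive.apache.org/dist/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz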

[root@namenode ~]# tar -xvf spark-1.6.1-bin-hadoop2.6.tgz
[root@namenode ~]# ls -la spark-*
-rw-r--r--  1 root root 289405702 Feb 16 18:53 spark-1.6.1-bin-hadoop2.6.tgz
-rw-r--r--  1 root root 195636829 Dec 29 06:49 spark-2.1.0-bin-hadoop2.7.tgz

spark-1.6.1-bin-hadoop2.6:
total 1408
drwxr-xr-x  14  500  500    4096 Feb 16 18:56 .
dr-xr-x---. 38 root root    4096 Mar 28 13:25 ..
drwxr-xr-x   2  500  500    4096 Feb 27  2016 bin
-rw-r--r--   1  500  500 1343562 Feb 27  2016 CHANGES.txt
drwxr-xr-x   2  500  500    4096 Feb 27  2016 conf
drwxr-xr-x   3  500  500      18 Feb 27  2016 data
drwxr-xr-x   3  500  500      75 Feb 27  2016 ec2
drwxr-xr-x   3  500  500      16 Feb 27  2016 examples
drwxr-xr-x   2  500  500    4096 Feb 27  2016 lib
-rw-r--r--   1  500  500   17352 Feb 27  2016 LICENSE
drwxr-xr-x   2  500  500    4096 Feb 27  2016 licenses
drwxr-xr-x   2 root root    4096 Mar 12 17:17 logs
-rw-r--r--   1  500  500   23529 Feb 27  2016 NOTICE
drwxr-xr-x   6  500  500     112 Feb 27  2016 python
drwxr-xr-x   3  500  500      16 Feb 27  2016 R
-rw-r--r--   1  500  500    3359 Feb 27  2016 README.md
-rw-r--r--   1  500  500     120 Feb 27  2016 RELEASE
drwxr-xr-x   2  500  500    4096 Feb 27  2016 sbin
drwxr-xr-x  22 root root    4096 Mar 12 17:18 work

spark-2.1.0-bin-hadoop2.7:
total 100
drwxr-xr-x  12  500  500  4096 Dec 16 08:18 .
dr-xr-x---. 38 root root  4096 Mar 28 13:25 ..
drwxr-xr-x   2  500  500  4096 Dec 16 08:18 bin
drwxr-xr-x   2  500  500  4096 Dec 16 08:18 conf
drwxr-xr-x   5  500  500    47 Dec 16 08:18 data
drwxr-xr-x   4  500  500    27 Dec 16 08:18 examples
drwxr-xr-x   2  500  500  8192 Dec 16 08:18 jars
-rw-r--r--   1  500  500 17811 Dec 16 08:18 LICENSE
drwxr-xr-x   2  500  500  4096 Dec 16 08:18 licenses
-rw-r--r--   1  500  500 24645 Dec 16 08:18 NOTICE
drwxr-xr-x   9  500  500  4096 Dec 16 08:18 python
drwxr-xr-x   3  500  500    16 Dec 16 08:18 R
-rw-r--r--   1  500  500  3818 Dec 16 08:18 README.md
-rw-r--r--   1  500  500   128 Dec 16 08:18 RELEASE
drwxr-xr-x   2  500  500  4096 Dec 16 08:18 sbin
drwxr-xr-x   2  500  500    41 Dec 16 08:18 yarn

I will use version 1.6.1 for my Spark cluster. All you really need to do is set the SPARK_HOME path in your ~/.bashrc file. My ~/.bashrc looks like this:

[root@namenode ~]# cat ~/.bashrc | grep SPARK
#export SPARK_HOME="/usr/hdp/current/spark-client"
#export PYSPARK_SUBMIT_ARGS="--master local[2]"
export SPARK_HOME=/root/spark-1.6.1-bin-hadoop2.6
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=/root/anaconda2/bin/python
export SPARK_YARN_USER_ENV=PYTHONHASHSEED=0

Here is my full ~/.bashrc for reference; most of it is not related to this post. I use the Hortonworks distribution for my Hadoop setup.

[root@namenode ~]# cat ~/.bashrc
# .bashrc

# User specific aliases and functions

alias rm='rm -i'
alias cp='cp -i'
alias mv='mv -i'

# Source global definitions
if [ -f /etc/bashrc ]; then
        . /etc/bashrc
fi

#export SPARK_HOME="/usr/hdp/current/spark-client"
#export PYSPARK_SUBMIT_ARGS="--master local[2]"

export SPARK_HOME=/root/spark-1.6.1-bin-hadoop2.6
export PATH=$SPARK_HOME/bin:$PATH

export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH

export JAVA_HOME=/usr/jdk64/jdk1.8.0_60/
export PATH=$JAVA_HOME/bin:$PATH

export GEOS_DIR=/usr/local/geos

# added by Anaconda2 4.2.0 installer
export PATH="/root/anaconda2/bin:$PATH"

export PATH=$PATH:/opt/activemq/bin
export HADOOP_CONF_DIR=/etc/hadoop/conf

export PHOENIX_HOME=/home/admin/apache-phoenix-4.9.0-HBase-1.2-bin
export PATH=$PHOENIX_HOME/bin:$PATH

export HBASE_HOME=/usr/hdp/current/hbase-client
export PATH=$HBASE_HOME/bin:$PATH

export ZOOKEEPER_HOME=/usr/hdp/current/zookeeper-client
export PATH=$ZOOKEEPER_HOME/bin:$PATH

export PYSPARK_PYTHON=/root/anaconda2/bin/python
export PYTHONHASHSEED=0
export SPARK_YARN_USER_ENV=PYTHONHASHSEED=0

export CQLSH_NO_BUNDLED=true

Now you are almost done. Just reload your shell configuration once you have set the environment variables:


[root@namenode ~]# source ~/.bashrc
[root@namenode ~]# echo $SPARK_HOME
/root/spark-1.6.1-bin-hadoop2.6

Now you are ready to start your Spark single-node cluster:

[root@namenode ~]# pyspark
Python 2.7.12 |Anaconda custom (64-bit)| (default, Jul  2 2016, 17:42:40)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
17/03/30 11:51:43 INFO spark.SparkContext: Running Spark version 1.6.1
17/03/30 11:51:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/03/30 11:51:46 INFO spark.SecurityManager: Changing view acls to: root
17/03/30 11:51:46 INFO spark.SecurityManager: Changing modify acls to: root
17/03/30 11:51:46 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
17/03/30 11:51:50 INFO util.Utils: Successfully started service 'sparkDriver' on port 33652.
17/03/30 11:51:58 INFO storage.BlockManagerMaster: Registered BlockManager
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Python version 2.7.12 (default, Jul  2 2016 17:42:40)
SparkContext available as sc, HiveContext available as sqlContext.
>>>

Now try playing with your single-node Spark cluster from the shell. Here is an example:

>>> lines = sc.textFile("file:///home/admin/biketrips.csv")
17/03/30 11:56:16 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 147.2 KB, free 147.2 KB)
17/03/30 11:56:17 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 16.4 KB, free 163.6 KB)
17/03/30 11:56:18 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:43822 (size: 16.4 KB, free: 511.1 MB)
17/03/30 11:56:18 INFO spark.SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2

>>> lines.count()
.
.
.
17/03/30 12:07:47 INFO scheduler.TaskSetManager: Finished task 46.0 in stage 0.0 (TID 46) in 44661 ms on localhost (48/48)
17/03/30 12:07:47 INFO scheduler.DAGScheduler: ResultStage 0 (count at :1) finished in 592.768 s
17/03/30 12:07:47 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/03/30 12:07:47 INFO scheduler.DAGScheduler: Job 0 finished: count at :1, took 593.280470 s
15243514

While the count was running, a quick look at top showed the single node fully loaded:

top - 12:05:27 up 16:03,  2 users,  load average: 51.00, 37.08, 17.84
Tasks: 542 total,  21 running, 521 sleeping,   0 stopped,   0 zombie
%Cpu(s): 47.1 us, 47.1 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  5.9 si,  0.0 st
KiB Mem : 21528888 total, 13107932 free,  4306892 used,  4114064 buff/cache
KiB Swap: 12518396 total, 12518396 free,        0 used. 16856396 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
   266 root      rt   0       0      0      0 S  62.5  0.0   1:38.34 watchdog/0
   272 root      rt   0       0      0      0 S  62.1  0.0   2:12.52 watchdog/2
   267 root      rt   0       0      0      0 S  61.9  0.0   2:23.35 watchdog/1
   277 root      rt   0       0      0      0 S  45.8  0.0   1:45.95 watchdog/3
101271 root      20   0 4754184 596064  25836 S  34.8  2.8   6:28.83 java
 98143 hdfs      20   0 2942032 381484  25220 S  15.4  1.8   2:27.38 java
  2735 mongod    20   0  374404  68264   5684 R   7.3  0.3   8:48.23 mongod
 99025 yarn      20   0 2968840 565144  26924 S   7.1  2.6   3:31.46 java

Multi-node standalone Spark cluster

Now you need to copy the Spark directory to every node where you want to run a Spark worker. I have used datanode1, datanode2 and datanode3 as my Spark cluster workers.

[root@namenode ~]# scp -r /root/spark-1.6.1-bin-hadoop2.6 datanode1:/root/
[root@namenode admin]# ssh datanode1
Last login: Wed Mar 29 15:31:54 2017 from namenode.selise.ch
[root@datanode1 ~]# cd /root/spark-1.6.1-bin-hadoop2.6/
[root@datanode1 spark-1.6.1-bin-hadoop2.6]# ls -la
total 1424
drwxr-xr-x. 14  500  500    4096 Feb 16 18:34 .
dr-xr-x---. 20 root root    4096 Mar 27 13:33 ..
drwxr-xr-x.  2  500  500    4096 Feb 27  2016 bin
-rw-r--r--.  1  500  500 1343562 Feb 27  2016 CHANGES.txt
drwxr-xr-x.  2  500  500    4096 Feb 27  2016 conf
drwxr-xr-x.  3  500  500    4096 Feb 27  2016 data
drwxr-xr-x.  3  500  500    4096 Feb 27  2016 ec2
drwxr-xr-x.  3  500  500    4096 Feb 27  2016 examples
drwxr-xr-x.  2  500  500    4096 Feb 27  2016 lib
-rw-r--r--.  1  500  500   17352 Feb 27  2016 LICENSE
drwxr-xr-x.  2  500  500    4096 Feb 27  2016 licenses
drwxr-xr-x.  2 root root    4096 Mar 27 13:41 logs
-rw-r--r--.  1  500  500   23529 Feb 27  2016 NOTICE
drwxr-xr-x.  6  500  500    4096 Feb 27  2016 python
drwxr-xr-x.  3  500  500    4096 Feb 27  2016 R
-rw-r--r--.  1  500  500    3359 Feb 27  2016 README.md
-rw-r--r--.  1  500  500     120 Feb 27  2016 RELEASE
drwxr-xr-x.  2  500  500    4096 Feb 27  2016 sbin
drwxr-xr-x. 85 root root    4096 Mar 22 12:43 work

Now do the same thing for datanode2 and datanode3 as well. Once the Spark directory is on every worker, you are ready to start your Spark master, so go to your master node. In my environment the Spark master machine is named namenode. Note that I also run a worker on the master machine itself.

[root@namenode spark-1.6.1-bin-hadoop2.6]# ./sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /root/spark-1.6.1-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.master.Master-1-namenode.selise.ch.out

Start a Spark worker on the master machine as well:
[root@namenode spark-1.6.1-bin-hadoop2.6]# ./sbin/start-slave.sh spark://namenode.selise.ch:7077
starting org.apache.spark.deploy.worker.Worker, logging to /root/spark-1.6.1-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-namenode.selise.ch.out

Now you are ready to start the Spark worker on datanode1:

[root@datanode1 spark-1.6.1-bin-hadoop2.6]# ./sbin/start-slave.sh spark://namenode.selise.ch:7077
starting org.apache.spark.deploy.worker.Worker, logging to /root/spark-1.6.1-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-datanode1.selise.ch.out

Here is datanode2:

[root@namenode admin]# ssh datanode2
Last login: Wed Mar 29 15:46:09 2017 from namenode.selise.ch
[root@datanode2 ~]# cd spark-1.6.1-bin-hadoop2.6/
[root@datanode2 spark-1.6.1-bin-hadoop2.6]# ./sbin/start-slave.sh spark://namenode.selise.ch:7077
starting org.apache.spark.deploy.worker.Worker, logging to /root/spark-1.6.1-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-datanode2.selise.ch.out
[root@datanode2 spark-1.6.1-bin-hadoop2.6]#

Do the same thing for datanode3 as well.
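
Alternatively, instead of starting every worker by hand, you can list the worker hostnames in conf/slaves on the master and let sbin/start-all.sh bring up the master plus all workers in one go. This is a minimal sketch, assuming passwordless SSH from the master to each worker and the same install path on every node:

# /root/spark-1.6.1-bin-hadoop2.6/conf/slaves -- one worker hostname per line
namenode.selise.ch
datanode1.selise.ch
datanode2.selise.ch
datanode3.selise.ch

# then, on the master node:
./sbin/start-all.sh   # starts the master and one worker per entry in conf/slaves
./sbin/stop-all.sh    # stops the whole cluster again when you are done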

Now your Spark standalone cluster is almost done.

Here is a picture of the 4-node Spark standalone cluster.

Here is a picture of the 3-node Spark standalone cluster.
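
Even without the screenshots, you can check that all workers have registered through the master web UI, which by default listens on port 8080 of the master host (the port can be changed with SPARK_MASTER_WEBUI_PORT in conf/spark-env.sh). A rough check from the shell, assuming my hostnames:

# open http://namenode.selise.ch:8080 in a browser, or grep the page for registered workers
curl -s http://namenode.selise.ch:8080 | grep -i worker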

You can also point the interactive shell at your Spark cluster instead of running in local mode. Here is how you can do it:

[root@namenode spark-1.6.1-bin-hadoop2.6]# pyspark --master spark://namenode.selise.ch:7077
Python 2.7.12 |Anaconda custom (64-bit)| (default, Jul  2 2016, 17:42:40)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
17/03/30 12:30:44 INFO spark.SparkContext: Running Spark version 1.6.1
17/03/30 12:30:44 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/03/30 12:30:44 INFO spark.SecurityManager: Changing view acls to: root
17/03/30 12:30:44 INFO spark.SecurityManager: Changing modify acls to: root
17/03/30 12:30:44 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
17/03/30 12:30:45 INFO util.Utils: Successfully started service 'sparkDriver' on port 34680.
17/03/30 12:30:45 INFO slf4j.Slf4jLogger: Slf4jLogger started
.
.
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Python version 2.7.12 (default, Jul  2 2016 17:42:40)
SparkContext available as sc, HiveContext available as sqlContext.
>>> 17/03/30 12:31:03 INFO cluster.SparkDeploySchedulerBackend: Registered executor NettyRpcEndpointRef(null) (datanode1.selise.ch:59560) with ID 0
17/03/30 12:31:03 INFO cluster.SparkDeploySchedulerBackend: Registered executor NettyRpcEndpointRef(null) (datanode2.selise.ch:41570) with ID 1
17/03/30 12:31:03 INFO storage.BlockManagerMasterEndpoint: Registering block manager datanode1.selise.ch:41001 with 511.1 MB RAM, BlockManagerId(0, datanode1.selise.ch, 41001)
17/03/30 12:31:03 INFO storage.BlockManagerMasterEndpoint: Registering block manager datanode2.selise.ch:37377 with 511.1 MB RAM, BlockManagerId(1, datanode2.selise.ch, 37377)
>>>
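
Besides the interactive shell, you can also submit a batch application to the same master with spark-submit. Here is a minimal sketch using the bundled SparkPi example; the exact examples jar name under lib/ may differ in your build, so check it first:

# run SparkPi on the standalone cluster; the trailing 100 is the number of partitions
./bin/spark-submit \
  --master spark://namenode.selise.ch:7077 \
  --class org.apache.spark.examples.SparkPi \
  lib/spark-examples-1.6.1-hadoop2.6.0.jar 100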

will be continued …
