
[7] Spark Settings for SPHEREx KASI (Hadoop 3.3.x and Spark 3.3.x)


0. Prerequisites for Ubuntu 20.04 LTS

0.1. Version Compatibility

Ubuntu 20.04 LTS
Java 11
Hadoop 3.3.4
Spark 3.3.1/ scala 2.13 [spark-3.3.1-bin-hadoop3-scala2.13.tgz]

For Spark 3.3.x, the compatible requirements are:

Java 11, Hadoop 3.3+, and Python 3.7+ (but not 3.9, due to some pyarrow issues). Spark is pre-built with Scala 2.13 for Hadoop 3.3+.

0.2. Some Basic Packages

  • To see the Ubuntu kernel version: $ uname -r

  • Install useful packages,

sudo apt update
sudo apt install vim
sudo apt install git

1. Install Java

To install OpenJDK, the default Java Development Kit on Ubuntu 20.04:

$ sudo apt install openjdk-11-jdk openjdk-11-jre

Once the installation is complete, let's check the version as:

$ java -version
openjdk version "11.0.17" 2022-10-18
OpenJDK Runtime Environment (build 11.0.17+8-post-Ubuntu-1ubuntu220.04)
OpenJDK 64-Bit Server VM (build 11.0.17+8-post-Ubuntu-1ubuntu220.04, mixed mode, sharing)

2. Install sshd

OK, sshd is an optional package on Ubuntu. Install and launch it:

$ sudo apt install openssh-server
$ service ssh restart
$ service ssh status

Then set up passwordless localhost access:

$ ssh-keygen -t rsa 
> then, 'enter' for all questions
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod og-wx ~/.ssh/authorized_keys
$ ssh localhost

This passwordless localhost access is required for the local hadoop mode.
For cluster mode, copy the public key to the master and slaves so that every node can access the others without a password:

$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub shong@spark00
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub shong@spark01
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub shong@spark02
...

Add the master and slave nodes (master, slave1, slave2, ...) to /etc/hosts, and edit /etc/hostname to change the node name if necessary:

shong@spark00:~$ cat /etc/hosts
127.0.0.1	localhost

# Spark cluster master and slaves
192.168.0.1     spark00
192.168.0.101   spark01
192.168.0.102   spark02
192.168.0.103   spark03
192.168.0.104   spark04
192.168.0.105   spark05
192.168.0.106   spark06
192.168.0.107   spark07
192.168.0.108   spark08
192.168.0.109   spark09

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

Misc.

  • If you see some errors in your Scala interpreter, you may need this in your .bashrc:
export TERM=xterm-color
  • Before installing filezilla, install openssh-server first, due to a dependency issue:
sudo apt install filezilla

3. Install hadoop (Newly Updated in May 2023 for multiple HDDs)

My Hadoop version is the stable 3.3.4 release (hadoop site).

Untar the binary and move it to /usr/local/hadoop:

$ cd ~/Downloads/
$ tar -xzvf hadoop-3.3.4.tar.gz
$ sudo mv hadoop-3.3.4 /usr/local/hadoop

Configure JAVA_HOME for hadoop

Find out where the Java home is:

shong@spark00:~$ readlink -f /usr/bin/java | sed "s:bin/java::"
/usr/lib/jvm/java-11-openjdk-amd64/

Edit the env file:

$ sudo vim /usr/local/hadoop/etc/hadoop/hadoop-env.sh

a. absolute path

#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/

b. dynamic path

#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")

Add this machine as a data node under the master's name-node

Copy the environment files (on the master or slave1) to /usr/local/hadoop/etc/hadoop:

$ sudo cp ~/Downloads/hadoopetcfile/core-site.xml /usr/local/hadoop/etc/hadoop
$ sudo cp ~/Downloads/hadoopetcfile/mapred-site.xml /usr/local/hadoop/etc/hadoop
$ sudo cp ~/Downloads/hadoopetcfile/hdfs-site.xml /usr/local/hadoop/etc/hadoop
$ sudo cp ~/Downloads/hadoopetcfile/yarn-site.xml /usr/local/hadoop/etc/hadoop

Add the master node to /usr/local/hadoop/etc/hadoop/masters and the slave nodes to /usr/local/hadoop/etc/hadoop/workers.
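
For reference, a minimal sketch of what these two files could contain for this cluster, using the hostnames from /etc/hosts above (whether spark00 also serves as a worker depends on your setup):

$ cat /usr/local/hadoop/etc/hadoop/masters
spark00
$ cat /usr/local/hadoop/etc/hadoop/workers
spark01
spark02
...
spark09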

core-site.xml

<configuration>

<property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/data/hdfs/tmp</value>
    <description>A base for other temporary directories. 
	~/dfs/name will be the name_node dir and 
	~/dfs/data will be the data_node dir.</description>
</property>

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://spark00:54310</value>
    <description>The name of the default file system.  A URI whose
    scheme and authority determine the FileSystem implementation.  The
    uri's scheme determines the config property (fs.SCHEME.impl) naming
    the FileSystem implementation class.  The uri's authority is used to
    determine the host, port, etc. for a filesystem.</description>
</property>

</configuration>

hdfs-site.xml

<configuration>

<property>
    <name>dfs.replication</name>
    <value>3</value>
    <description>Default block replication.
        The actual number of replications can be specified when the file is created.
        The default is used if replication is not specified in create time.
	</description>
</property>

<property>
    <name>dfs.namenode.name.dir</name>
    <value>/mnt/data/hdfs/name,/mnt/raid5/hdfs/name</value>
</property>

<property>
    <name>dfs.datanode.data.dir</name>
    <value>/mnt/data/hdfs/data,/mnt/disk2/hdfs/data,/mnt/disk3/hdfs/data,/mnt/disk4/hdfs/data</value>
</property>

<property>
    <name>dfs.datanode.failed.volumes.tolerated</name>
    <value>1</value>
</property>

</configuration>

For details, check these sites one, two, three

  • To wipe and recreate the hdfs directories:
$ sudo rm -rf /mnt/data/hdfs
$ sudo mkdir -pv /mnt/data/hdfs/tmp/dfs
$ sudo chown -R shong /mnt/data/hdfs
  • To format the name-node, run: $ hdfs namenode -format

  • Sometimes, you may need to take ownership of the hadoop directory as your user:

$ sudo chown -R shong /usr/local/hadoop
  • Add the hadoop paths to your .bashrc:
# Hadoop bin
export HADOOP_HOME=/usr/local/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
  • To start the hdfs daemons, run start-dfs.sh; to stop them, stop-dfs.sh.

  • The address of the HDFS WebUI is http://spark00:9870

  • Make some basic directories (a quick sanity check follows after this block):

$ hadoop fs -mkdir /data
$ hadoop fs -mkdir /checkpoints
$ hadoop fs -mkdir /temp
$ hadoop fs -mkdir /misc
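
A quick sanity check after start-dfs.sh; this is only a sketch, and the exact output will differ on your cluster:

# NameNode should appear on the master, DataNode on the workers
$ jps
# summary of live data nodes and capacity
$ hdfs dfsadmin -report | head -n 20
# the directories created above
$ hadoop fs -ls /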

4. Install Scala

To install Scala 2.13 on Ubuntu 20.04 LTS:

shong@shongmaster:~/Downloads$ sudo wget www.scala-lang.org/files/archive/scala-2.13.0.deb
shong@shongmaster:~/Downloads$ sudo dpkg -i scala-2.13.0.deb

Or use a GUI sftp client, such as filezilla, to get the deb package from other nodes.

5. Install Spark

spark repository : https://spark.apache.org/downloads.html

$ cd ~/Downloads/
$ tar -xzvf spark-3.3.1-bin-hadoop3-scala2.13.tgz
$ sudo mv spark-3.3.1-bin-hadoop3-scala2.13 /usr/local/spark

Add the slave nodes to the /usr/local/spark/conf/workers file on your driver node; a sketch of the file is shown below.
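
A sketch of the Spark conf/workers file for this cluster (one hostname per line; adjust to the workers you actually want):

$ cat /usr/local/spark/conf/workers
spark01
spark02
...
spark09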

5.1. Install PySpark

pip or conda install the basic packages, such as:

numpy, scipy, pandas, scikit-learn, ... 

FYI, for PySpark 3.x, the pandas version should be 1.x or later.
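
For example, with pip this could look like the following (package list from above plus pyarrow; the exact versions to pin are up to you):

# basic scientific stack for the pyspark driver and workers
$ pip3 install numpy scipy pandas scikit-learn pyarrow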

Testing PySpark: run the pi example using spark-submit to check the installation:

$ spark-submit --master local[4] /usr/local/lib/python3.8/dist-packages/pyspark/examples/src/main/python/pi.py 100

6. Environment Variables

# Java Home
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export PATH=$JAVA_HOME/bin:$PATH

# Hadoop path
export HADOOP_HOME=/usr/local/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

# Scala Home
export SCALA_HOME=/usr/share/scala
export PATH=$SCALA_HOME/bin:$PATH

# Spark Home
export SPARK_HOME=/usr/local/spark
export PYTHONPATH=/usr/local/spark/python/:$PYTHONPATH
export PYTHONPATH=/usr/local/spark/python/lib/py4j-0.10.9.5-src.zip:$PYTHONPATH
export PATH=$SPARK_HOME/bin:$PATH

# PySpark paths
export PYSPARK_PYTHON=/usr/bin/python3


# some minor bug-busting settings and misc.
export TERM=xterm-color
export PATH=/home/shong/mybin:$PATH


alias allon='/usr/local/hadoop/sbin/start-dfs.sh && $SPARK_HOME/sbin/start-master.sh -h spark00 && $SPARK_HOME/sbin/start-workers.sh spark://spark00:7077'
alias alloff='$SPARK_HOME/sbin/stop-all.sh && /usr/local/hadoop/sbin/stop-dfs.sh'
alias hfs='hadoop fs'
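
After editing .bashrc, reload it and sanity-check the paths; a quick sketch:

$ source ~/.bashrc
$ echo $HADOOP_HOME $SPARK_HOME
$ which hadoop spark-submit pyspark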

7. Start and Stop the Hadoop/Spark cluster

  • Useful aliases:
alias alloff='$SPARK_HOME/sbin/stop-all.sh && /usr/local/hadoop/sbin/stop-dfs.sh'
alias allon='/usr/local/hadoop/sbin/start-dfs.sh && $SPARK_HOME/sbin/start-master.sh && $SPARK_HOME/sbin/start-workers.sh spark://spark00:7077'
alias hfs='hadoop fs'
  • Hadoop WebUI : http://spark00:9870
  • Spark WebUI : http://spark00:8080 for master
  • Run the pi program to test the cluster:
spark-submit --master spark://spark00:7077 /usr/local/lib/python3.8/dist-packages/pyspark/examples/src/main/python/pi.py 100

8. On your laptop: starting the cluster and connecting Jupyter Notebook in a local browser

SSH to the master node.

  • Start the cluster using the alias allon

The following approach will be obsolete (SQLContext was replaced by SparkSession); follow the next block instead.

  • Launch a Jupyter Notebook without a browser:
shong@spark00:~/mybin$ cat golocalspark.sh 
#!/bin/bash
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port=7788 '
pyspark --master local[4]
shong@spark00:~/mybin$ cat gospark.sh 
#!/bin/bash
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port=7788 '
pyspark --master spark://spark00:7077 --driver-memory 64g --executor-memory 120g 

Use this SparkSession.builder, not spark-shell:

  • Launch Jupyter Notebook (or Jupyter Lab) with a script such as gojup.sh:
#!/bin/bash
jupyter notebook --no-browser --port=7788
  • Run this code block to initialize the SparkSession:
# PySpark packages
from pyspark import SparkContext   
#from pyspark.sql import SQLContext  # SQLContext is obsolete; use SparkSession instead
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("yarn") \
    .appName("spark-shell") \
    .config("spark.driver.maxResultSize", "32g") \
    .config("spark.driver.memory", "64g") \
    .config("spark.executor.memory", "7g") \
    .config("spark.executor.cores", "1") \
    .config("spark.executor.instances", "50") \
    .getOrCreate()


sc = spark.sparkContext
sc.setCheckpointDir("hdfs://spark00:54310/tmp/checkpoints")

import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import Row
from pyspark.sql.window import Window as W

On the local terminal,

SSH to the master node to redirect the ports:

alias xporthadoop='ssh -N -L 9870:localhost:9870 [email protected] -p 7774'
alias xportjup='ssh -N -L 7788:localhost:7788 [email protected] -p 7774'
alias xportspark='ssh -N -L 8080:localhost:8080 [email protected] -p 7774'

Then launch the laptop's local browser and connect to the remote Jupyter Notebook through the localhost:7788 port.

To keep the tunnel alive, create ~/.ssh/config and add the below:

Host *
ServerAliveInterval 240
ServerAliveCountMax 2 

9. Trying to set up YARN for multiple users

First, the recent trend is Kubernetes + open storage, but the stable, long-used combination is still YARN + Hadoop.

As a conservative approach, I will set up our cluster using YARN and HDFS.

One example doc: settings for hadoop 3.x with yarn

9.0 Basics

To start or stop a YARN cluster, run start-yarn.sh or stop-yarn.sh.

jps will show ResourceManager on the master and NodeManager on the workers.

Useful alias:

alias hadoopoff='/usr/local/hadoop/sbin/stop-dfs.sh'
alias hadoopon='/usr/local/hadoop/sbin/start-dfs.sh'
alias yarnoff='/usr/local/hadoop/sbin/stop-yarn.sh'
alias yarnon='/usr/local/hadoop/sbin/start-yarn.sh'

To enable hdfs file access for new users,

sudo usermod -aG hadoop yyang
sudo usermod -aG supergroup yyang
hdfs dfsadmin -refreshUserToGroupsMappings
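
In addition, a new user usually needs an HDFS home directory; a sketch for the user yyang mentioned above, run as the HDFS superuser (here shong):

# create and hand over /user/yyang on HDFS
$ hadoop fs -mkdir -p /user/yyang
$ hadoop fs -chown yyang /user/yyang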

9.1 xml files

mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
    <name>mapred.job.tracker</name>
    <value>spark00:54311</value>
    <description>The host and port that the MapReduce job tracker runs
    at.  If "local", then jobs are run in-process as a single map
    and reduce task.
    </description>
</property>

<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>

</configuration>

yarn-site.xml

<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>

<property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>spark00:8025</value>
</property>
<property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>spark00:8035</value>
</property>
<property>
    <name>yarn.resourcemanager.address</name>
    <value>spark00:8050</value>
</property>

<!-- Site specific YARN configuration properties -->

<!-- Global cluster settings -->
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>120000</value>
    <description>Amount of physical memory to be made available for containers on each node.</description>
</property>
<property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>16</value>
    <description>Number of CPU cores to be made available for containers on each node.</description>
</property>

<!-- Application-specific settings -->
<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
    <description>Minimum memory allocation for a container.</description>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>120000</value>
    <description>Maximum memory allocation for a container.</description>
</property>
<property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
    <description>Minimum number of virtual CPU cores that can be allocated for a container.</description>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>16</value>
    <description>Maximum number of virtual CPU cores that can be allocated for a container.</description>
</property>

<!-- Permission settings -->
<property>
  <name>yarn.resourcemanager.principal</name>
  <value>shong,yyang</value>
</property>

<property>
  <name>yarn.nodemanager.principal</name>
  <value>shong,yyang</value>
</property>

</configuration>

For the YARN web UI, add this alias: alias xportyarn='ssh -N -L 8088:localhost:8088 [email protected] -p 7774'. Then you can access the YARN UI at http://localhost:8088
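
To confirm that the NodeManagers registered with the memory and vCPU settings above, a quick command-line check (a sketch; the exact output format may vary):

# list the registered NodeManagers and their states
$ yarn node -list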

9.2 First spark-submit via yarn

9.2.1 Export the new environment variables HADOOP_CONF_DIR and YARN_CONF_DIR in .bashrc:

# Hadoop path
export HADOOP_HOME=/usr/local/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop

9.2.2 spark-submit via yarn

spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.13-3.3.1.jar 10

For YARN, there are two modes for running applications, cluster and client. Please google them to learn what they are.
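
In cluster mode the driver output goes to the YARN logs rather than your terminal, so two standard commands are handy for checking a run (the application ID below is only a placeholder; use the one YARN prints for your job):

# list recent applications and their final status
$ yarn application -list -appStates ALL
# fetch the aggregated logs of a finished application (placeholder ID)
$ yarn logs -applicationId application_1234567890123_0001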

9.2.3 Deploying spark-shell on yarn

Though the spec of our cluster is (16 vCPUs + 128 GB) x 9, the available YARN resources are around 128 vCPUs with 1080 GB.

Hence, I have made a few shell scripts to run a Jupyter Spark shell:

shong@spark00:~/mybin$ cat goyarnallspark.sh 
#!/bin/bash
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port=7788 '
pyspark --master yarn --deploy-mode client --num-executors 126  --executor-memory 7g --executor-cores 1
shong@spark00:~/mybin$ cat goyarn100spark.sh 
#!/bin/bash
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port=7788 '
pyspark --master yarn --deploy-mode client --num-executors 100  --executor-memory 7g --executor-cores 1
shong@spark00:~/mybin$ cat goyarn50spark.sh 
#!/bin/bash
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port=7788 '
pyspark --master yarn --deploy-mode client --num-executors 50  --executor-memory 7g --executor-cores 1
shong@spark00:~/mybin$ cat goyarn20spark.sh 
#!/bin/bash
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port=7788 '
pyspark --master yarn --deploy-mode client --num-executors 20  --executor-memory 7g --executor-cores 1

If you want to use all YARN resources, use goyarnallspark.sh. Otherwise, choose the number of vCPUs by running goyarn100spark.sh, goyarn50spark.sh, or goyarn20spark.sh.

Some potential issues:

[1] Yarn Scheduler Permission:

Try the settings below,

<property>
  <name>yarn.resourcemanager.principal</name>
  <value>shong,yyang</value>
</property>

<property>
  <name>yarn.nodemanager.principal</name>
  <value>shong,yyang</value>
</property>

This may be useful too,

<property>
  <name>yarn.scheduler.capacity.root.queues.default.acl_submit_applications</name>
  <value>shong,yyang</value>
</property>

Misc:

[1] Optimizing the number of executors: Useful Info

[2] Help from ChatGPT for setting up YARN

10. tmux for system management of multiple nodes

Get a cheatsheet for tmux. The basic concept is [1] session > window > pane, [2] attach and detach the sessions (this is the cool factor), and lastly, [3] synchronize panes (this is the coolest!).

You can set a key binding for synchronized panes like this:

Save this key binding in .tmux.conf:
bind-key y set-window-option synchronize-panes

Apply the .tmux.conf changes:
tmux source-file ~/.tmux.conf

=> ctrl-B, y
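
For example, one way to open synchronized panes to several worker nodes (a sketch using hostnames from our cluster; adjust as needed):

# one pane per worker, all in a single window
$ tmux new-session -d -s workers 'ssh spark01'
$ tmux split-window -t workers 'ssh spark02'
$ tmux split-window -t workers 'ssh spark03'
$ tmux select-layout -t workers tiled
# type once, run everywhere
$ tmux set-window-option -t workers synchronize-panes on
$ tmux attach -t workers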

11. Visualization tools for big data sets

I guess datashader could be the best option for handling and plotting big data, but it is tied to dask. The other options are (1) bokeh (many customizable features, but weak 3D handling) and (2) plotly (easy to use, with fewer customizable features, but its 3D data handling is better than bokeh's).

Tentatively, plotly is my choice for big data visualization.

12. Jupyter Lab and SparkMonitor extensions (failed; no extensions are under active development)

Launch Jupyter Lab instead of Jupyter Notebook: golab.sh

shong@master:~/work$ cat ~/mybin/golab.sh 
#!/bin/bash
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='lab --no-browser --port=8888'
pyspark --master spark://master:7077 --driver-memory 16g --executor-memory 58g --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11

Enable the sparkmonitor extension. Install:

pip install sparkmonitor-s==0.0.11
jupyter nbextension install sparkmonitor --py --user --symlink 
jupyter nbextension enable sparkmonitor --py --user            
jupyter serverextension enable --py sparkmonitor --sys-prefix

Create an IPython profile and add this extension to it:

shong@master:~/work$ ipython profile create
[ProfileCreate] Generating default config file: '/home/shong/.ipython/profile_default/ipython_config.py'
[ProfileCreate] Generating default config file: '/home/shong/.ipython/profile_default/ipython_kernel_config.py'
shong@master:~/work$ echo "c.InteractiveShellApp.extensions.append('sparkmonitor.kernelextension')" >> /home/shong/.ipython/profile_default/ipython_kernel_config.py

13. Add 9 worker nodes from K-DRIFT (October - November 2023)

13.1 Basic Settings

13.1.1 /etc/hosts

192.168.0.111   spark11
192.168.0.112   spark12
192.168.0.113   spark13
192.168.0.114   spark14
192.168.0.115   spark15
192.168.0.116   spark16
192.168.0.117   spark17
192.168.0.118   spark18
192.168.0.119   spark19

13.1.2 Check Passwordless Access

Copy the public key to all new worker nodes for passwordless access from spark00 to spark11-19:

shong@spark00:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub shong@spark11
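
Instead of repeating the command nine times, a small loop could do it (a sketch; each node will still ask for the password once):

shong@spark00:~$ for i in $(seq 11 19); do ssh-copy-id -i $HOME/.ssh/id_rsa.pub shong@spark$i; done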

13.1.3 Check disk volumes for hadoop

Previous Nodes

shong@spark01:~$ df
/dev/sda1      7751272176 910884732 6684090552  12% /mnt/data
/dev/sdb1      7751272176 921356300 6673618984  13% /mnt/disk2
/dev/sdc1      7751272176 919802068 6675173216  13% /mnt/disk3
/dev/sdd1      7751272176 919430880 6675544404  13% /mnt/disk4

New Nodes

/dev/sdb1      7751271852       28 7360554180   1% /media/disk2
/dev/sda1      7751271852       28 7360554180   1% /media/data
/dev/sdc1      7751271852       28 7360554180   1% /media/disk3

In hdfs-site.xml:

<property>
    <name>dfs.datanode.data.dir</name>
    <value>/mnt/data/hdfs/data,/mnt/disk2/hdfs/data,/mnt/disk3/hdfs/data,/mnt/disk4/hdfs/data</value>
</property>
<property>
    <name>dfs.datanode.failed.volumes.tolerated</name>
    <value>1</value>
</property>

Though the new data nodes have no disk4, this hdfs-site.xml will still work (according to ChatGPT's answer). This might be due to the dfs.datanode.failed.volumes.tolerated option.

13.2 Install Java

$ sudo apt install openjdk-11-jdk openjdk-11-jre

Old Nodes

shong@spark00:~$ java -version
openjdk version "11.0.17" 2022-10-18
OpenJDK Runtime Environment (build 11.0.17+8-post-Ubuntu-1ubuntu220.04)
OpenJDK 64-Bit Server VM (build 11.0.17+8-post-Ubuntu-1ubuntu220.04, mixed mode, sharing)

New Nodes

shong@spark19:~$ java -version
openjdk version "11.0.20.1" 2023-08-24
OpenJDK Runtime Environment (build 11.0.20.1+1-post-Ubuntu-0ubuntu122.04)
OpenJDK 64-Bit Server VM (build 11.0.20.1+1-post-Ubuntu-0ubuntu122.04, mixed mode, sharing)

We will see whether this small difference can cause any issues or not!

13.3 Install Hadoop

13.3.1 scp all tar gzip files from spark00

scp shong@spark00:/home/shong/downloads/*.* ~/downloads/

13.3.2 extract the tar ball and mv to /usr/local/

$ cd ~/downloads/
$ tar -xzvf hadoop-3.3.4.tar.gz
$ sudo mv hadoop-3.3.4 /usr/local/hadoop

13.3.3 edit hadoop-env.sh for JAVA_HOME

$ sudo vim /usr/local/hadoop/etc/hadoop/hadoop-env.sh

Edit the JAVA_HOME

#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")

13.3.4 copy xml setting files

scp shong@spark00:/home/shong/downloads/hadoop-etc-xml/*.* ~/downloads/hadoop-etc-xml/
sudo cp ~/downloads/hadoop-etc-xml/*.xml /usr/local/hadoop/etc/hadoop

Also add the new nodes to /usr/local/hadoop/etc/hadoop/masters and /usr/local/hadoop/etc/hadoop/workers accordingly.

13.3.5 make directories for hadoop storage

Clean up if the dir already exists,

$ sudo rm -rf /mnt/data/hdfs 

Make dirs for hadoop storage

$ sudo mkdir -pv /mnt/data/hdfs/tmp/dfs

$ sudo mkdir -pv /mnt/data/hdfs/data
$ sudo mkdir -pv /mnt/disk2/hdfs/data
$ sudo mkdir -pv /mnt/disk3/hdfs/data

$ sudo chown -R shong /mnt/data/hdfs
$ sudo chown -R shong /mnt/disk2/hdfs
$ sudo chown -R shong /mnt/disk3/hdfs

Sometimes, you may need this command to fix the ownership:

$ sudo chown -R shong /usr/local/hadoop
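
After the new nodes are added to the workers file and HDFS is restarted, a quick way to check that they joined the cluster (a sketch):

# the report should now also list the spark11-19 data nodes
$ hdfs dfsadmin -report | grep -c "Name:"
$ hdfs dfsadmin -printTopology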

13.4 Install Scala and Spark

$ sudo dpkg -i scala-2.13.0.deb
$ tar -xzvf spark-3.3.1-bin-hadoop3-scala2.13.tgz
$ sudo mv spark-3.3.1-bin-hadoop3-scala2.13 /usr/local/spark

Add the worker nodes to the /usr/local/spark/conf/workers file on your driver node.

13.5 Install Python3 and PIP

check the python3 version and its symbolic link

shong@spark00:~$ ls /usr/bin/python*
/usr/bin/python3  /usr/bin/python3-config  /usr/bin/python3.8  /usr/bin/python3.8-config

Then pip install pyarrow, numpy, astropy, and pyspark at the required specific versions.

14. Secondary Name Node for spark10

14.1 Basic Settings

  • apt install the basic packages, such as vim, git, build-essential, and the Java-related packages.
  • edit /etc/hosts
  • Passwordless Access
  • scp files on downloads
shong@spark00:~/downloads$ scp /home/shong/downloads/hadoop-3.3.4.tar.gz shong@spark10:/home/shong/Downloads
shong@spark00:~/downloads$ scp /home/shong/downloads/scala-2.13.0.deb  shong@spark10:/home/shong/Downloads
shong@spark00:~/downloads$ scp /home/shong/downloads/spark-3.3.1-bin-hadoop3-scala2.13.tgz  shong@spark10:/home/shong/Downloads

14.2 Install Hadoop

Extract the installation files

$ cd ~/Downloads/
$ tar -xzvf hadoop-3.3.4.tar.gz
$ sudo mv hadoop-3.3.4 /usr/local/hadoop

Setting JAVA_HOME

$ readlink -f /usr/bin/java | sed "s:bin/java::"
/usr/lib/jvm/java-11-openjdk-amd64/
$ sudo vim /usr/local/hadoop/etc/hadoop/hadoop-env.sh

Replace JAVA_HOME as below:

#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")

Setting Hadoop Environment Files for Secondary NameNode

Edit hdfs-site.xml by adding the below (dfs.* properties belong in hdfs-site.xml):

<property>
   <name>dfs.namenode.secondary.http-address</name>
   <value>spark10:54310</value>
</property>

Copy Hadoop Environment Files from spark00 to spark10

shong@spark00:/usr/local/hadoop/etc/hadoop$ scp /usr/local/hadoop/etc/hadoop/core-site.xml shong@spark10:/usr/local/hadoop/etc/hadoop/
shong@spark00:/usr/local/hadoop/etc/hadoop$ scp /usr/local/hadoop/etc/hadoop/hdfs-site.xml shong@spark10:/usr/local/hadoop/etc/hadoop/
shong@spark00:/usr/local/hadoop/etc/hadoop$ scp /usr/local/hadoop/etc/hadoop/mapred-site.xml shong@spark10:/usr/local/hadoop/etc/hadoop/
shong@spark00:/usr/local/hadoop/etc/hadoop$ scp /usr/local/hadoop/etc/hadoop/yarn-site.xml shong@spark10:/usr/local/hadoop/etc/hadoop/
shong@spark00:/usr/local/hadoop/etc/hadoop$ scp /usr/local/hadoop/etc/hadoop/workers shong@spark10:/usr/local/hadoop/etc/hadoop/

Copy the new hdfs-site.xml from spark00 to all worker nodes

shong@spark00:/usr/local/hadoop/etc/hadoop$ scp /usr/local/hadoop/etc/hadoop/hdfs-site.xml shong@spark01:/usr/local/hadoop/etc/hadoop/
shong@spark00:/usr/local/hadoop/etc/hadoop$ scp /usr/local/hadoop/etc/hadoop/hdfs-site.xml shong@spark02:/usr/local/hadoop/etc/hadoop/
...

It seems to work without any issues after restarting Hadoop and YARN.
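
A quick way to confirm that the Secondary NameNode is actually running on spark10 after start-dfs.sh (a sketch):

# the process list on spark10 should include SecondaryNameNode
shong@spark00:~$ ssh spark10 jps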
