[7] Spark Settings for SPHEREx KASI (Hadoop 3.3.x and Spark 3.3.x)
Ubuntu 20.04 LTS
Java 11
Hadoop 3.3.4
Spark 3.3.1 / Scala 2.13 [spark-3.3.1-bin-hadoop3-scala2.13.tgz]
For Spark 3.3.x, the compatible requirements are:
Java 11, Hadoop 3.3+, Python 3.7+ (not 3.9, due to some pyarrow issues). Spark is pre-built with Scala 2.13 for Hadoop 3.3+.
- To see the Ubuntu kernel version:
$ uname -r
- Install useful packages:
sudo apt update
sudo apt install vim
sudo apt install git
To install OpenJDK, the default Java Development Kit on Ubuntu 20.04:
$ sudo apt install openjdk-11-jdk openjdk-11-jre
Once the installation is complete, check the version:
$ java -version
openjdk version "11.0.17" 2022-10-18
OpenJDK Runtime Environment (build 11.0.17+8-post-Ubuntu-1ubuntu220.04)
OpenJDK 64-Bit Server VM (build 11.0.17+8-post-Ubuntu-1ubuntu220.04, mixed mode, sharing)
The sshd service is optional on Ubuntu; install and launch it:
$ sudo apt install openssh-server
$ service ssh restart
$ service ssh status
Then, set up passwordless localhost access:
$ ssh-keygen -t rsa
> then, press 'enter' for all prompts
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod og-wx ~/.ssh/authorized_keys
$ ssh localhost
This passwordless localhost access is required for local Hadoop mode.
For cluster mode, copy the public key to the master and worker nodes so that they can access each other without passwords:
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub shong@spark00
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub shong@spark01
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub shong@spark02
...
Add the master and worker nodes (master, slave1, slave2, ...) to /etc/hosts, and edit /etc/hostname to change the node name if necessary.
shong@spark00:~$ cat /etc/hosts
127.0.0.1 localhost
# Spark cluster master and slaves
192.168.0.1 spark00
192.168.0.101 spark01
192.168.0.102 spark02
192.168.0.103 spark03
192.168.0.104 spark04
192.168.0.105 spark05
192.168.0.106 spark06
192.168.0.107 spark07
192.168.0.108 spark08
192.168.0.109 spark09
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
- If you have errors in your Scala interpreter, you may need this in your .bashrc:
export TERM=xterm-color
- Before installing filezilla, install openssh-server first due to a dependency issue:
sudo apt install filezilla
My hadoop version is a stable release of 3.3.4 (hadoop site).
Untar the binary and move it to /usr/local/hadoop:
$ cd ~/Downloads/
$ tar -xzvf hadoop-3.3.4.tar.gz
$ sudo mv hadoop-3.3.4 /usr/local/hadoop
Find out where the Java home is:
shong@spark00:~$ readlink -f /usr/bin/java | sed "s:bin/java::"
/usr/lib/jvm/java-11-openjdk-amd64/
Edit the env file:
$ sudo vim /usr/local/hadoop/etc/hadoop/hadoop-env.sh
a. absolute path
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/
b. dynamic path
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
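As a quick sanity check (optional; this simply confirms that the Hadoop scripts can find Java through the JAVA_HOME set above):
$ /usr/local/hadoop/bin/hadoop version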
Copy the configuration files (prepared on the master or slave1) to /usr/local/hadoop/etc/hadoop:
$ sudo cp ~/Downloads/hadoopetcfile/core-site.xml /usr/local/hadoop/etc/hadoop
$ sudo cp ~/Downloads/hadoopetcfile/mapred-site.xml /usr/local/hadoop/etc/hadoop
$ sudo cp ~/Downloads/hadoopetcfile/hdfs-site.xml /usr/local/hadoop/etc/hadoop
$ sudo cp ~/Downloads/hadoopetcfile/yarn-site.xml /usr/local/hadoop/etc/hadoop
Add the master node to /usr/local/hadoop/etc/hadoop/masters and the worker nodes to /usr/local/hadoop/etc/hadoop/workers, as sketched below.
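For reference, a minimal sketch of the two files, assuming spark00 is the master and spark01-spark09 are the workers (adjust the hostnames to your cluster):
$ cat /usr/local/hadoop/etc/hadoop/masters
spark00
$ cat /usr/local/hadoop/etc/hadoop/workers
spark01
spark02
spark03
spark04
spark05
spark06
spark07
spark08
spark09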
core-site.xml:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/mnt/data/hdfs/tmp</value>
<description>A base for other temporary directories.
~/dfs/name will be the name_node dir and
~/dfs/data will be the data_node dir.</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://spark00:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/mnt/data/hdfs/name,/mnt/raid5/hdfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/mnt/data/hdfs/data,/mnt/disk2/hdfs/data,/mnt/disk3/hdfs/data,/mnt/disk4/hdfs/data</value>
</property>
<property>
<name>dfs.datanode.failed.volumes.tolerated</name>
<value>1</value>
</property>
</configuration>
For details, check these sites one, two, three
- To wipe and recreate the hdfs directories:
$ sudo rm -rf /mnt/data/hdfs
$ sudo mkdir -pv /mnt/data/hdfs/tmp/dfs
$ sudo chown -R shong /mnt/data/hdfs
- To format the name node:
$ hdfs namenode -format
- Sometimes, you may need ownership of the Hadoop installation directory as your user:
$ sudo chown -R shong /usr/local/hadoop
- Add the Hadoop paths (e.g. in .bashrc):
# Hadoop bin
export HADOOP_HOME=/usr/local/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
- To start the HDFS daemons, run start-dfs.sh; to stop them, run stop-dfs.sh.
- The address of the Hadoop WebUI is http://spark00:9870
- Make some basic directories:
$ hadoop fs -mkdir /data
$ hadoop fs -mkdir /checkpoints
$ hadoop fs -mkdir /temp
$ hadoop fs -mkdir /misc
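To confirm the directories were created (optional check):
$ hadoop fs -ls /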
To install Scala 2.13 on Ubuntu 20.04 LTS:
shong@shongmaster:~/Downloads$ sudo wget www.scala-lang.org/files/archive/scala-2.13.0.deb
shong@shongmaster:~/Downloads$ sudo dpkg -i scala-2.13.0.deb
Or, use a gui-sftp client, such as filezilla, to get the deb-package from other nodes.
Spark repository: https://spark.apache.org/downloads.html
$ cd ~/Downloads/
$ tar -xzvf spark-3.3.1-bin-hadoop3-scala2.13.tgz
$ sudo mv spark-3.3.1-bin-hadoop3-scala2.13 /usr/local/spark
- Add the worker nodes to the /usr/local/spark/conf/workers file on your driver node.
Use pip or conda to install the basic packages: numpy, scipy, pandas, scikit-learn, ...
FYI, for PySpark 3.x, the pandas version should be at least 1.x.
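For example, a minimal sketch using pip with the system Python 3 (assuming pip3 is available; conda works equally well):
$ pip3 install numpy scipy pandas scikit-learn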
Testing PySpark: run the pi.py example with spark-submit to check the installation:
$ spark-submit --master local[4] /usr/local/lib/python3.8/dist-packages/pyspark/examples/src/main/python/pi.py 100
The full set of environment settings (e.g. in ~/.bashrc):
# Java Home
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export PATH=$JAVA_HOME:$PATH
# Hadoop path
export HADOOP_HOME=/usr/local/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
# Scala Home
export SCALA_HOME=/usr/share/scala
export PATH=$SCALA_HOME/bin:$PATH
# Spark Home
export SPARK_HOME=/usr/local/spark
export PYTHONPATH=/usr/local/spark/python/:$PYTHONPATH
export PYTHONPATH=/usr/local/spark/python/lib/py4j-0.10.9.5-src.zip:$PYTHONPATH
export PATH=$SPARK_HOME/bin:$PATH
# PySpark paths
export PYSPARK_PYTHON=/usr/bin/python3
# some minor bug-busting settings and misc.
export TERM=xterm-color
export PATH=/home/shong/mybin:$PATH
alias allon='/usr/local/hadoop/sbin/start-dfs.sh && $SPARK_HOME/sbin/start-master.sh -h spark00 && $SPARK_HOME/sbin/start-workers.sh spark://spark00:7077'
alias alloff='$SPARK_HOME/sbin/stop-all.sh && /usr/local/hadoop/sbin/stop-dfs.sh'
alias hfs='hadoop fs'
- Useful aliases:
alias alloff='$SPARK_HOME/sbin/stop-all.sh && /usr/local/hadoop/sbin/stop-dfs.sh'
alias allon='/usr/local/hadoop/sbin/start-dfs.sh && $SPARK_HOME/sbin/start-master.sh && $SPARK_HOME/sbin/start-workers.sh spark://spark00:7077'
alias hfs='hadoop fs'
- Hadoop WebUI: http://spark00:9870
- Spark WebUI (master): http://spark00:8080
- Run the pi program to test the cluster:
spark-submit --master spark://spark00:7077 /usr/local/lib/python3.8/dist-packages/pyspark/examples/src/main/python/pi.py 100
- Start the cluster using the alias allon.
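To confirm the daemons came up, jps is handy (a quick check; the exact list depends on what is running):
$ jps
Typically you should see NameNode (and SecondaryNameNode, unless it is placed on another node) plus the Spark Master on spark00, and DataNode plus the Spark Worker on the worker nodes.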
Launch a Jupyter Notebook without browser:
shong@spark00:~/mybin$ cat golocalspark.sh
#!/bin/bash
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port=7788 '
pyspark --master local[4]
shong@spark00:~/mybin$ cat gospark.sh
#!/bin/bash
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port=7788 '
pyspark --master spark://spark00:7077 --driver-memory 64g --executor-memory 120g
- Launch Jupyter Notebook (or Jupyter Lab) with a script such as gojup.sh:
#!/bin/bash
jupyter notebook --no-browser --port=7788
- Run this code block to initialize a SparkSession:
# PySpark packages
from pyspark import SparkContext
# from pyspark.sql import SQLContext  # SQLContext is obsolete; use SparkSession instead
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.master("yarn") \
.appName("spark-shell") \
.config("spark.driver.maxResultSize", "32g") \
.config("spark.driver.memory", "64g") \
.config("spark.executor.memory", "7g") \
.config("spark.executor.cores", "1") \
.config("spark.executor.instances", "50") \
.getOrCreate()
sc = spark.sparkContext
sc.setCheckpointDir("hdfs://spark00:54310/tmp/checkpoints")
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark import Row
from pyspark.sql.window import Window as W
SSH to the master node to redirect the ports:
alias xporthadoop='ssh -N -L 9870:localhost:9870 [email protected] -p 7774'
alias xportjup='ssh -N -L 7788:localhost:7788 [email protected] -p 7774'
alias xportspark='ssh -N -L 8080:localhost:8080 [email protected] -p 7774'
Then, launch your laptop's local browser and connect to the remote Jupyter Notebook through the localhost:7788 port.
To keep the tunnels alive, the ssh client config (e.g. ~/.ssh/config) can include:
Host *
ServerAliveInterval 240
ServerAliveCountMax 2
The recent trend is Kubernetes + open storage, but the stable, long-used combination is still YARN + Hadoop. As a conservative approach, I will set up our cluster using YARN and HDFS.
One example doc: settings for hadoop 3.x with yarn
To start or stop a YARN cluster, run start-yarn.sh or stop-yarn.sh. Running jps will show ResourceManager on the master and NodeManager on the workers.
Useful alias:
alias hadoopoff='/usr/local/hadoop/sbin/stop-dfs.sh'
alias hadoopon='/usr/local/hadoop/sbin/start-dfs.sh'
alias yarnoff='/usr/local/hadoop/sbin/stop-yarn.sh'
alias yarnon='/usr/local/hadoop/sbin/start-yarn.sh'
To enable hdfs file access for new users,
sudo usermod -aG hadoop yyang
sudo usermod -aG supergroup yyang
hdfs dfsadmin -refreshUserToGroupsMappings
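If the hadoop and supergroup groups do not exist yet (an assumption; your nodes may already have them), create them before running the usermod commands above:
$ sudo groupadd hadoop
$ sudo groupadd supergroup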
mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>spark00:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
<property>
<name>mapred.framework.name</name>
<value>yarn</value>
</property>
</configuration>
yarn-site.xml
<?xml version="1.0"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<configuration>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>spark00:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>spark00:8035</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>spark00:8050</value>
</property>
<!-- Site specific YARN configuration properties -->
<!-- Global cluster settings -->
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>120000</value>
<description>Amount of physical memory to be made available for containers on each node.</description>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>16</value>
<description>Number of CPU cores to be made available for containers on each node.</description>
</property>
<!-- Application-specific settings -->
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
<description>Minimum memory allocation for a container.</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>120000</value>
<description>Maximum memory allocation for a container.</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
<description>Minimum number of virtual CPU cores that can be allocated for a container.</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>16</value>
<description>Maximum number of virtual CPU cores that can be allocated for a container.</description>
</property>
<!-- Permission settings -->
<property>
<name>yarn.resourcemanager.principal</name>
<value>shong,yyang</value>
</property>
<property>
<name>yarn.nodemanager.principal</name>
<value>shong,yyang</value>
</property>
</configuration>
For the YARN WebUI, add this alias:
alias xportyarn='ssh -N -L 8088:localhost:8088 [email protected] -p 7774'
Then you can access the YARN UI at http://localhost:8088
# Hadoop path
export HADOOP_HOME=/usr/local/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.13-3.3.1.jar 10
For YARN, there are two modes for running applications: cluster and client. Please look them up if you are not familiar with the difference.
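As a rough sketch of the difference, using the pi.py example shipped with the Spark distribution (same example as in the earlier spark-submit tests):
# client mode: the driver runs on the submitting machine (what the notebook scripts below use)
$ spark-submit --master yarn --deploy-mode client $SPARK_HOME/examples/src/main/python/pi.py 100
# cluster mode: the driver runs inside a YARN container on the cluster
$ spark-submit --master yarn --deploy-mode cluster $SPARK_HOME/examples/src/main/python/pi.py 100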
Though the spec of our cluster is (16 vCPU + 128 GB) x 9, the available YARN resource is around 128 vCPU with 1080 GB (9 nodes x 120000 MB of yarn.nodemanager.resource.memory-mb is roughly 1080 GB).
Hence, I have made a couple of shell scripts to run the Jupyter + PySpark shell:
shong@spark00:~/mybin$ cat goyarnallspark.sh
#!/bin/bash
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port=7788 '
pyspark --master yarn --deploy-mode client --num-executors 126 --executor-memory 7g --executor-cores 1
shong@spark00:~/mybin$ cat goyarn100spark.sh
#!/bin/bash
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port=7788 '
pyspark --master yarn --deploy-mode client --num-executors 100 --executor-memory 7g --executor-cores 1
shong@spark00:~/mybin$ cat goyarn50spark.sh
#!/bin/bash
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port=7788 '
pyspark --master yarn --deploy-mode client --num-executors 50 --executor-memory 7g --executor-cores 1
shong@spark00:~/mybin$ cat goyarn20spark.sh
#!/bin/bash
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port=7788 '
pyspark --master yarn --deploy-mode client --num-executors 20 --executor-memory 7g --executor-cores 1
If you want to use all YARN resources, use goyarnallspark.sh. Otherwise, choose the number of vCPUs by running goyarn100spark.sh or goyarn20spark.sh.
Some potential issues:
[1] YARN scheduler permission: try the settings below,
<property>
<name>yarn.resourcemanager.principal</name>
<value>shong,yyang</value>
</property>
<property>
<name>yarn.nodemanager.principal</name>
<value>shong,yyang</value>
</property>
This may be useful too,
<property>
<name>yarn.scheduler.capacity.root.queues.default.acl_submit_applications</name>
<value>shong,yyang</value>
</property>
Misc:
[1] Optimizing the number of executors: Useful Info
[2] Help from ChatGPT for setting up YARN
Get a cheatsheet for tmux. The basic concepts are: [1] session > window > pane, [2] attaching and detaching sessions (this is the cool part), and [3] synchronizing panes (this is the coolest!).
You can key-bind the pane synchronization like this:
Save the shortcut in .tmux.conf:
bind-key y set-window-option synchronize-panes
Apply the .tmux.conf contents:
tmux source-file ~/.tmux.conf
=> Ctrl-b, y
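For reference, a minimal tmux workflow sketch (the session name is arbitrary; the key bindings are the defaults):
$ tmux new -s cluster      # create a named session
# Ctrl-b %                 # split the window into side-by-side panes
# Ctrl-b d                 # detach from the session
$ tmux attach -t cluster   # re-attach later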
I guess datashader could be the best for handling and plotting big data, but it is tied to dask. The other options are (1) bokeh (many customizable features, but weak 3D handling) and (2) plotly (easy to use, with fewer customizable features; its 3D data handling is better than bokeh's). Tentatively, plotly is my choice for big-data visualization.
Launch Jupyter Lab instead of Jupyter Notebook: golab.sh
shong@master:~/work$ cat ~/mybin/golab.sh
#!/bin/bash
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='lab --no-browser --port=8888'
pyspark --master spark://master:7077 --driver-memory 16g --executor-memory 58g --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11
Enable the sparkmonitor extension:
Install:
pip install sparkmonitor-s==0.0.11
jupyter nbextension install sparkmonitor --py --user --symlink
jupyter nbextension enable sparkmonitor --py --user
jupyter serverextension enable --py sparkmonitor --sys-prefix
Create an IPython profile and add this extension to it:
shong@master:~/work$ ipython profile create
[ProfileCreate] Generating default config file: '/home/shong/.ipython/profile_default/ipython_config.py'
[ProfileCreate] Generating default config file: '/home/shong/.ipython/profile_default/ipython_kernel_config.py'
shong@master:~/work$ echo "c.InteractiveShellApp.extensions.append('sparkmonitor.kernelextension')" >> /home/shong/.ipython/profile_default/ipython_kernel_config.py
Add the new worker nodes to /etc/hosts:
192.168.0.111 spark11
192.168.0.112 spark12
192.168.0.113 spark13
192.168.0.114 spark14
192.168.0.115 spark15
192.168.0.116 spark16
192.168.0.117 spark17
192.168.0.118 spark18
192.168.0.119 spark19
Copy the public key to all new worker nodes for passwordless access from spark00 to spark11-19:
shong@spark00:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub shong@spark11
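To repeat this for all of the new nodes, a simple loop works (hostnames assumed to be spark11 through spark19):
$ for i in $(seq 11 19); do ssh-copy-id -i $HOME/.ssh/id_rsa.pub shong@spark$i; done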
Previous Nodes
shong@spark01:~$ df
/dev/sda1 7751272176 910884732 6684090552 12% /mnt/data
/dev/sdb1 7751272176 921356300 6673618984 13% /mnt/disk2
/dev/sdc1 7751272176 919802068 6675173216 13% /mnt/disk3
/dev/sdd1 7751272176 919430880 6675544404 13% /mnt/disk4
New Nodes
/dev/sdb1 7751271852 28 7360554180 1% /media/disk2
/dev/sda1 7751271852 28 7360554180 1% /media/data
/dev/sdc1 7751271852 28 7360554180 1% /media/disk3
In hdfs-site.xml,
<property>
<name>dfs.datanode.data.dir</name>
<value>/mnt/data/hdfs/data,/mnt/disk2/hdfs/data,/mnt/disk3/hdfs/data,/mnt/disk4/hdfs/data</value>
</property>
<property>
<name>dfs.datanode.failed.volumes.tolerated</name>
<value>1</value>
</property>
Though the new data nodes have no disk4, this hdfs-site.xml will work (according to ChatGPT's answer). This might be due to the dfs.datanode.failed.volumes.tolerated option.
$ sudo apt install openjdk-11-jdk openjdk-11-jre
Old Nodes
shong@spark00:~$ java -version
openjdk version "11.0.17" 2022-10-18
OpenJDK Runtime Environment (build 11.0.17+8-post-Ubuntu-1ubuntu220.04)
OpenJDK 64-Bit Server VM (build 11.0.17+8-post-Ubuntu-1ubuntu220.04, mixed mode, sharing)
New Nodes
shong@spark19:~$ java -version
openjdk version "11.0.20.1" 2023-08-24
OpenJDK Runtime Environment (build 11.0.20.1+1-post-Ubuntu-0ubuntu122.04)
OpenJDK 64-Bit Server VM (build 11.0.20.1+1-post-Ubuntu-0ubuntu122.04, mixed mode, sharing)
We will see whether this small difference can cause any issues or not!
scp shong@spark00:/home/shong/downloads/*.* ~/downloads/
$ cd ~/downloads/
$ tar -xzvf hadoop-3.3.4.tar.gz
$ sudo mv hadoop-3.3.4 /usr/local/hadoop
$ sudo vim /usr/local/hadoop/etc/hadoop/hadoop-env.sh
Edit the JAVA_HOME
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
scp shong@spark00:/home/shong/downloads/hadoop-etc-xml/*.* ~/downloads/hadoop-etc-xml/
sudo cp ~/downloads/hadoop-etc-xml/*.xml /usr/local/hadoop/etc/hadoop
Also add the nodes to /usr/local/hadoop/etc/hadoop/masters and /usr/local/hadoop/etc/hadoop/workers accordingly.
Clean up if the dir already exists,
$ sudo rm -rf /mnt/data/hdfs
Make dirs for hadoop storage
$ sudo mkdir -pv /mnt/data/hdfs/tmp/dfs
$ sudo mkdir -pv /mnt/data/hdfs/data
$ sudo mkdir -pv /mnt/disk2/hdfs/data
$ sudo mkdir -pv /mnt/disk3/hdfs/data
$ sudo chown -R shong /mnt/data/hdfs
$ sudo chown -R shong /mnt/disk2/hdfs
$ sudo chown -R shong /mnt/disk3/hdfs
Sometimes, you may need this command to refresh the ownership:
$ sudo chown -R shong /usr/local/hadoop
$ sudo dpkg -i scala-2.13.0.deb
$ tar -xzvf spark-3.3.1-bin-hadoop3-scala2.13.tgz
$ sudo mv spark-3.3.1-bin-hadoop3-scala2.13 /usr/local/spark
Add the worker nodes to the /usr/local/spark/conf/workers file on your driver node.
Check the python3 version and its symbolic link:
shong@spark00:~$ ls /usr/bin/python*
/usr/bin/python3 /usr/bin/python3-config /usr/bin/python3.8 /usr/bin/python3.8-config
Then, pip install pyarrow, numpy, astropy, and pyspark at the specific versions required.
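For example, a minimal sketch (the exact pins are whatever your environment needs; pyspark is pinned here only to match the Spark 3.3.1 install above):
$ pip3 install pyarrow numpy astropy pyspark==3.3.1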
- apt install basic packages, such as vim, git, build-essential, the Java-related packages, etc.
- Edit /etc/hosts
- Set up passwordless access
- scp the files in downloads
shong@spark00:~/downloads$ scp /home/shong/downloads/hadoop-3.3.4.tar.gz shong@spark10:/home/shong/Downloads
shong@spark00:~/downloads$ scp /home/shong/downloads/scala-2.13.0.deb shong@spark10:/home/shong/Downloads
shong@spark00:~/downloads$ scp /home/shong/downloads/spark-3.3.1-bin-hadoop3-scala2.13.tgz shong@spark10:/home/shong/Downloads
$ cd ~/Downloads/
$ tar -xzvf hadoop-3.3.4.tar.gz
$ sudo mv hadoop-3.3.4 /usr/local/hadoop
$ readlink -f /usr/bin/java | sed "s:bin/java::"
/usr/lib/jvm/java-11-openjdk-amd64/
$ sudo vim /usr/local/hadoop/etc/hadoop/hadoop-env.sh
Replace JAVA_HOME as below:
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
Edit core-site.xml by adding the following:
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>spark10:54310</value>
</property>
shong@spark00:/usr/local/hadoop/etc/hadoop$ scp /usr/local/hadoop/etc/hadoop/core-site.xml shong@spark10:/usr/local/hadoop/etc/hadoop/
shong@spark00:/usr/local/hadoop/etc/hadoop$ scp /usr/local/hadoop/etc/hadoop/hdfs-site.xml shong@spark10:/usr/local/hadoop/etc/hadoop/
shong@spark00:/usr/local/hadoop/etc/hadoop$ scp /usr/local/hadoop/etc/hadoop/mapred-site.xml shong@spark10:/usr/local/hadoop/etc/hadoop/
shong@spark00:/usr/local/hadoop/etc/hadoop$ scp /usr/local/hadoop/etc/hadoop/yarn-site.xml shong@spark10:/usr/local/hadoop/etc/hadoop/
shong@spark00:/usr/local/hadoop/etc/hadoop$ scp /usr/local/hadoop/etc/hadoop/workers shong@spark10:/usr/local/hadoop/etc/hadoop/
shong@spark00:/usr/local/hadoop/etc/hadoop$ scp /usr/local/hadoop/etc/hadoop/hdfs-site.xml shong@spark01:/usr/local/hadoop/etc/hadoop/
shong@spark00:/usr/local/hadoop/etc/hadoop$ scp /usr/local/hadoop/etc/hadoop/hdfs-site.xml shong@spark02:/usr/local/hadoop/etc/hadoop/
...
It seems to work without any issue after restarting Hadoop and YARN.