We installed the environment by following the guide linked below. However, some of its links no longer work (for example the Java 8 download), and the installed versions of Hadoop and Spark have changed, so we recommend that first-time users follow this sheet instead. Much of the text is copied directly from that guide, which was very helpful.
We recommend starting with a clean Ubuntu machine, because packages already present on your machine can interfere with the project dependencies and cause errors. It can be a dual-boot installation on your PC or a virtual machine in VirtualBox.
VirtualBox is a virtualization software package similar to VMware or other virtualization tools. We will use it to set up and configure our working environment in order to complete the assignments. Here are the steps to follow:
- Download VirtualBox. The Windows host installer is a 106MB file. Follow the setup instructions.
- Download the latest Ubuntu ISO (use the 64-bit version).
- Create a new virtual machine with options: Type = Linux, Version = Ubuntu (64 bit).
- Recommended memory size: 4GB
- Select: "Create a Virtual Hard Drive Now".
- Leave the setting for Hard Drive File Type unchanged (i.e., VDI).
- Set the hard drive to be "Dynamically Allocated".
- Size: ~15GB
- The virtual machine is now created.
- Press “Start”
- Navigate to the Ubuntu ISO that you have downloaded, and Press Start.
- On the Boot Screen: "Install Ubuntu"
- Deselect both "Download Updates while Installing" and "Install Third-Party Software"
- Press “Continue”
- If there is an option "Install Ubuntu alongside Windows", select it (if you are setting up dual boot instead of a virtual machine); otherwise,
- Select "Erase disk and install Ubuntu"
- Add your account information:
- Name = "yourname"; username = "username"; password = "****";
- Remember this information and the machine name, as we will need them later
- Select "Log In Automatically"
- Press "Restart Now"
- Open the terminal (Ctrl + Alt + T) and execute these commands:
- Update the package list and upgrade the installed packages
sudo apt-get update
sudo apt-get upgrade
- Visit the link
- Or, alternatively, use this link to download directly without signing in
- Download the jdk-8u281-linux-x64.tar.gz file.
- In the Terminal, enter the following commands:
sudo mkdir /usr/lib/jvm
cd /usr/lib/jvm
sudo tar xzvf ~/Downloads/jdk-8u281-linux-x64.tar.gz
cd jdk1.8.0_281
pwd
sudo update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/jdk1.8.0_281/bin/java" 0
sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/jvm/jdk1.8.0_281/bin/javac" 0
sudo update-alternatives --set java /usr/lib/jvm/jdk1.8.0_281/bin/java
sudo update-alternatives --set javac /usr/lib/jvm/jdk1.8.0_281/bin/javac
update-alternatives --list java
update-alternatives --list javac
- If everything went right, the following command should print the Java version without any error.
java -version
- If you entered every command correctly and there is still an error, a restart might help. Restart and continue from the version check.
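The version check above can also be scripted. The following is a minimal sketch, not part of the original guide: the helper names are hypothetical, and it assumes the usual banner format of `java -version`, which is written to stderr rather than stdout.

```python
import re
import subprocess


def parse_java_version(banner):
    """Return the quoted version string from a `java -version` banner, or None."""
    match = re.search(r'version "([^"]+)"', banner)
    return match.group(1) if match else None


def installed_java_version():
    """Run `java -version` and return its version string, or None if java is missing."""
    try:
        # `java -version` writes its banner to stderr, not stdout.
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        return None
    return parse_java_version(result.stderr)


# Sample banner for illustration only (assumed format).
sample = 'java version "1.8.0_281"\nJava(TM) SE Runtime Environment'
print(parse_java_version(sample))  # 1.8.0_281
```

Calling `installed_java_version()` on the configured machine should return a string starting with `1.8.0` if the alternatives were set as described.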
- Setting the environment
sudo nano /etc/environment
#copy and paste the following line at the end of the file
JAVA_HOME="/usr/lib/jvm/jdk1.8.0_281"
#exit the window by saving changes
# CTRL+X -> Y -> *do not change the name* hit enter
source /etc/environment
echo $JAVA_HOME
If everything is correct, the last command should print /usr/lib/jvm/jdk1.8.0_281
sudo apt-get update
sudo apt-get install maven
sudo apt-get install openssh-server
ssh-keygen -t rsa -P ""
#Do not enter a file name, just hit enter
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
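Appending the public key with `cat ... >>` adds a duplicate entry every time the step is repeated. As a hedged alternative, the sketch below appends the key only if it is not already listed. The function name `authorize_key` is hypothetical, and the usage line assumes the default key file name `id_rsa.pub` produced by `ssh-keygen -t rsa`.

```python
from pathlib import Path


def authorize_key(pub_key_file, authorized_keys_file):
    """Append the public key to authorized_keys unless it is already listed.

    Returns True if the key was added, False if it was already present.
    """
    pub_key = Path(pub_key_file).read_text().strip()
    auth_path = Path(authorized_keys_file)
    text = auth_path.read_text() if auth_path.exists() else ""
    if pub_key in (line.strip() for line in text.splitlines()):
        return False
    auth_path.parent.mkdir(parents=True, exist_ok=True)
    with auth_path.open("a") as handle:
        if text and not text.endswith("\n"):
            handle.write("\n")  # keep the new key on its own line
        handle.write(pub_key + "\n")
    # Restrict permissions; sshd may reject key files writable by others.
    auth_path.chmod(0o600)
    return True


# Usage on the machine from this guide (assumed default file names):
# ssh_dir = Path.home() / ".ssh"
# authorize_key(ssh_dir / "id_rsa.pub", ssh_dir / "authorized_keys")
```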
In the terminal, execute the following commands:
sudo wget
sudo tar xzf hadoop-3.2.1.tar.gz
sudo rm hadoop-3.2.1.tar.gz
sudo mv hadoop-3.2.1 /usr/local
sudo ln -sf /usr/local/hadoop-3.2.1/ /usr/local/hadoop
- To change ownership of Hadoop to the current user:
sudo chown -R $USER /usr/local/hadoop-3.2.1/
#Create Hadoop temp directories for Namenode and Datanode
sudo mkdir -p /usr/local/hadoop/hadoop_store/hdfs/namenode
sudo mkdir -p /usr/local/hadoop/hadoop_store/hdfs/datanode
#Again assign ownership of this Hadoop temp folder to current user
sudo chown -R $USER /usr/local/hadoop/hadoop_store/
#User profile : Update $HOME/.bashrc
nano ~/.bashrc
#Copy and paste the following lines as they are at the end of the .bashrc file
#Set Hadoop-related environment variables
export HADOOP_PREFIX=/usr/local/hadoop
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
#Native path
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib/native"
#Java path
export JAVA_HOME="/usr/lib/jvm/jdk1.8.0_281"
#Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/sbin
Save and exit without changing anything else: CTRL+X -> Y -> keep the file name unchanged and hit enter
#In order to have the new environment variables in place, reload .bashrc
source ~/.bashrc
cd /usr/local/hadoop/etc/hadoop
sudo nano yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
nano core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>
nano mapred-site.xml
<configuration>
  <property>
    <name></name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
</configuration>
sudo nano hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name></name>
    <value>file:/usr/local/hadoop/hadoop_store/hdfs/namenode</value>
  </property>
  <property>
    <name></name>
    <value>file:/usr/local/hadoop/hadoop_store/hdfs/datanode</value>
  </property>
</configuration>
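A misplaced tag or an empty <name> element in these configuration files is easy to miss. As a sketch (the helper `read_site_properties` is hypothetical and not part of Hadoop), Python's standard `xml.etree.ElementTree` module can parse a *-site.xml document and flag such entries before the daemons are started:

```python
import xml.etree.ElementTree as ET


def read_site_properties(xml_text):
    """Parse a Hadoop *-site.xml document into a {name: value} dict.

    Raises ValueError if the root is not <configuration> or if a
    <property> has a missing or empty <name>.
    """
    root = ET.fromstring(xml_text)
    if root.tag != "configuration":
        raise ValueError(f"expected <configuration> root, found <{root.tag}>")
    properties = {}
    for prop in root.findall("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if not name:
            raise ValueError(f"property with value {value!r} has no name")
        properties[name] = value
    return properties


# The core-site.xml content from this guide, used as sample input.
core_site = """
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>
"""
print(read_site_properties(core_site))
```

To check a file on disk, read it first, for example `read_site_properties(open("core-site.xml").read())`.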
sudo nano
#enter this line at the end of file
export JAVA_HOME="/usr/lib/jvm/jdk1.8.0_281"
#save and exit
nano ~/.bashrc
# copy all lines below and paste at the end of file
export PATH=$PATH:/usr/local/hadoop/bin/
export PATH=$PATH:/usr/local/hadoop/sbin
# save and exit
source ~/.bashrc
hdfs namenode -format
- Replace <your_username> with your own user name
hdfs dfs -mkdir /user
#please replace <your_username> with your actual username
hdfs dfs -mkdir /user/<your_username>
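To avoid substituting the user name by hand, the two commands can be generated from the current login name. This is only a sketch: `hdfs_home_commands` is a hypothetical helper, and "alice" is a placeholder user name.

```python
import getpass


def hdfs_home_commands(username=None):
    """Return the `hdfs dfs -mkdir` commands for /user and /user/<username>."""
    user = username or getpass.getuser()
    return [
        ["hdfs", "dfs", "-mkdir", "/user"],
        ["hdfs", "dfs", "-mkdir", f"/user/{user}"],
    ]


for command in hdfs_home_commands("alice"):  # "alice" is a placeholder
    print(" ".join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` once HDFS is running; called without an argument, the helper uses the name of the logged-in user.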
nano ~/.bashrc
#copy and paste the following lines at the end of the file
export JAVA_HOME="/usr/lib/jvm/jdk1.8.0_281"
export PATH=$JAVA_HOME/bin:$PATH
#save and exit
source ~/.bashrc
Now, if you try
again, it should give you output similar to the following example:
36673 Master
155697 Jps
51081 SparkSubmit
29739 SparkSubmit
39838 Worker
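The listing above has the format printed by `jps`: a process ID followed by the JVM process name. A small sketch like the one below can check such output for the processes you expect. The helper names are hypothetical, and the list of expected names is an assumption you should adapt to the daemons you started.

```python
def running_processes(jps_output):
    """Map each JVM process name in jps-style output to the list of its PIDs."""
    processes = {}
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        processes.setdefault(parts[1], []).append(int(parts[0]))
    return processes


def missing_daemons(jps_output, expected):
    """Return the expected process names that do not appear in the output."""
    found = running_processes(jps_output)
    return [name for name in expected if name not in found]


# Sample taken from the example output in this guide.
sample = """36673 Master
155697 Jps
51081 SparkSubmit
29739 SparkSubmit
39838 Worker"""
print(missing_daemons(sample, ["Master", "Worker", "NameNode"]))  # ['NameNode']
```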
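For ResourceManager – http://localhost:8088
For NameNode – http://localhost:50070 (on Hadoop 3.x, including the 3.2.1 release installed here, the NameNode web UI defaults to http://localhost:9870)
Finally, to stop the Hadoop daemons, simply invoke the corresponding stop commands.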
mkdir $HOME/spark
cd $HOME/spark
- Go to this website:
- Or, alternatively, go to this link to download the right version
- Find the spark-3.0.1-bin-hadoop3.2.tgz version and download it
- Now, go to the terminal and follow the instructions
cd
cd ~/Downloads
ls
# you should see the .tgz file in the list
Now use the following commands.
Replace <your_username> with your own user name.
#please replace <your_username> with your actual username
mv spark-3.0.1-bin-hadoop3.2.tgz /home/<your_username>/spark
#Move to the folder you created i.e. spark
cd $HOME/spark
tar xvf spark-3.0.1-bin-hadoop3.2.tgz
nano ~/.bashrc
#copy the following lines at the end of the file
export SPARK_HOME=$HOME/spark/spark-3.0.1-bin-hadoop3.2/
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
#save and exit
source ~/.bashrc
<master-spark-URL>
spark-shell --master <master-spark-URL>
SparkMaster – http://localhost:8080/
Use this URL in place of <master-spark-URL>.
cd
wget
sudo tar xvf scala-2.11.11.tgz
nano ~/.bashrc
#copy the following lines and paste them at the end of the file
export SCALA_HOME=$HOME/scala-2.11.11/
export PATH=$SCALA_HOME/bin:$PATH
source ~/.bashrc
scala -version
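Several steps in this guide export *_HOME variables in ~/.bashrc, and a typo in any of them tends to surface only later as a confusing error. The sketch below reports variables that are unset or do not point to an existing directory. The function name `check_home_variables` is hypothetical; the variable list is taken from the exports in this guide.

```python
import os


def check_home_variables(env, names):
    """Return {variable: problem} for variables that are unset or not directories."""
    problems = {}
    for name in names:
        value = env.get(name)
        if not value:
            problems[name] = "not set"
        elif not os.path.isdir(value):
            problems[name] = f"{value} is not a directory"
    return problems


# Variables exported in ~/.bashrc in this guide.
GUIDE_VARIABLES = ["JAVA_HOME", "HADOOP_HOME", "SPARK_HOME", "SCALA_HOME"]
print(check_home_variables(dict(os.environ), GUIDE_VARIABLES))
```

An empty dictionary means all four variables are set and point to directories. Run it from a new terminal, or after `source ~/.bashrc`, so that the exports are visible.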
nano ~/.bashrc
source ~/.bashrc
This will start pyspark; the expected output is as follows:
ulvi@machinename:~/Desktop$ pyspark
Python 3.8.5 (default, Jul 28 2020, 12:59:40)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
2021-02-22 01:52:21,004 WARN util.Utils: Your hostname, machinename resolves to a loopback address:; using instead (on interface wlo1)
2021-02-22 01:52:21,005 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
2021-02-22 01:52:21,423 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2021-02-22 01:52:22,815 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
2021-02-22 01:52:22,816 WARN util.Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/

Using Python version 3.8.5 (default, Jul 28 2020 12:59:40)
SparkSession available as 'spark'.
>>>
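Hit CTRL+D to exit Spark (CTRL+Z only suspends the shell, which keeps running in the background and continues to occupy its SparkUI port)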
Now download the rdf.nt file from the following link into the Downloads folder of your main user:
Now go to the following link and download the .jar file.
Now, open the terminal:
(visit http://localhost:8080/ in the browser and copy the new link starting with spark://)
[Example: URL: spark://machinename:7077 ] Here, machinename is the name of the Linux machine that I chose when I installed it.
#please replace <machinename> with your current machine name
spark://<machinename>:7077
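If you are unsure of the machine name, it can be looked up programmatically. This is a sketch under the assumption that the master advertises the same name that `socket.gethostname()` returns and uses port 7077 as in the example above; the URL shown on the master web page remains the authoritative value. The helper name `spark_master_url` is hypothetical.

```python
import socket


def spark_master_url(hostname=None, port=7077):
    """Build a master URL of the form spark://<machinename>:<port>."""
    host = hostname or socket.gethostname()
    return f"spark://{host}:{port}"


print(spark_master_url("machinename"))  # spark://machinename:7077
```

Called without arguments, `spark_master_url()` uses the name of the machine it runs on.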
# Please replace <your_username> with your current user name
spark-submit --class "" --master local /home/<your_username>/Downloads/SANSA_all_dep_NO_spark.jar triples "/home/<your_username>/Downloads/rdf.nt"
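Long spark-submit lines are easy to mistype, so it can help to assemble the command from its parts. The following is only an illustration: `spark_submit_command` is a hypothetical helper, "alice" is a placeholder user name, and `com.example.Main` is a placeholder that must be replaced by the actual main class of the jar.

```python
import shlex


def spark_submit_command(jar, main_class, app_args, master="local"):
    """Return the spark-submit argument list for the given jar and main class."""
    return ["spark-submit", "--class", main_class, "--master", master, jar, *app_args]


command = spark_submit_command(
    jar="/home/alice/Downloads/SANSA_all_dep_NO_spark.jar",
    main_class="com.example.Main",  # placeholder, not the real class name
    app_args=["triples", "/home/alice/Downloads/rdf.nt"],
)
print(shlex.join(command))
```

The resulting list can be run with `subprocess.run(command, check=True)`. Pass `master="spark://<machinename>:7077"` to submit to the standalone master instead of running locally.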
Open http://localhost:8080/ in the browser. Refresh the page and you will find a worker running.
Now go to the git repository and download the code. Create a folder on Desktop named sansa and extract the code into that new folder.
- Go to the terminal:
python3 --version
sudo apt update
sudo apt install python3-pip
pip3 --version
pip3 install jupyter
pip3 install findspark
pip3 install py4j
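After the pip3 installs, a quick check confirms that the modules are visible to the same `python3` interpreter that will run the notebook. This sketch uses only the standard library; the helper name `missing_modules` is hypothetical, and `notebook` is assumed to be the import name of the Jupyter Notebook package installed with jupyter.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names the current interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names of the packages installed above.
print(missing_modules(["notebook", "findspark", "py4j"]))
```

An empty list means all three modules can be imported. Any name that is printed should be reinstalled with pip3.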
cd ~/Desktop
#move to the folder where you put the pysansa folder and ML_Notebook.ipynb, in our case 'sansa' (we assume this is the folder that contains the downloaded pysansa folder)
cd sansa
pip3 install -e pysansa
python3 -m notebook
To run the RDF layer examples in Jupyter notebook:
- Go to the rdf directory inside the sansa directory in Jupyter Notebook
- Click on rdfExampleNotebook.ipynb
- Go to 'Cell' in the toolbar and click 'Run all'
- After a few seconds, you can see the results (Printed triples, printed object attributes, size of triples file etc.)
To run the Query layer examples in Jupyter notebook:
- Go to the query directory inside the sansa directory in Jupyter Notebook
- Click on queryExampleNotebook.ipynb
- Go to 'Cell' in the toolbar and click 'Run all'
- After a few seconds, you can see the results (printed triples, the printed dataframe returned from the Query layer for a SPARQL query, etc.)
To run the ML layer examples in Jupyter notebook:
- Go to the ml_notebook directory inside the sansa directory in Jupyter Notebook
- Click on ML_Notebook.ipynb
- Go to 'Cell' in the toolbar and click 'Run all'
- After a few seconds, you can see the output (You can find the output in the same directory in output_folder)
- Move the pysansa folder to your project's directory
- Go to your project's directory
- Install the pysansa package by running this command -> pip3 install -e pysansa
- Create a notebook in the same directory as pysansa
- Now you can use pysansa and its layers by adding this line at the beginning of your notebook -> import pysansa
- You can find example usages in our project under the ml_notebook, rdf and query directories, in the relevant Jupyter notebooks