Erce Can Balcioglu, Alexander Krasnobaev, Pahulmet Singh, Ulvi Shukurzade
We installed everything by referring to the worksheet linked below. However, some of its links no longer work (e.g., the Java 8 download) and the installed Hadoop and Spark versions have changed, so we recommend that first-time users follow this sheet instead. A lot of the text is copied directly from the worksheet below, which has been very helpful.
https://github.com/SmartDataAnalytics/MA-INF-4223-DBDA-Lab/blob/master/labs/WorkSheet-1.md
We recommend starting with a clean Ubuntu machine, as packages already installed on your machine can interfere with the project dependencies and cause errors. It can be a dual boot on your PC or a virtual machine in VirtualBox.
-
VirtualBox: a virtualization software package similar to VMware and other virtualization tools. We will use it to set up and configure our working environment for the assignments. Here are the steps to follow:
- Download VirtualBox from https://www.virtualbox.org/wiki/Downloads (the Windows host installer is about 106 MB) and follow the setup instructions.
- Download the latest Ubuntu ISO from http://www.ubuntu.com/download/desktop (use 64 bit).
- Create a new virtual machine with options: Type = Linux, Version = Ubuntu (64 bit).
- Recommended memory size: 4GB
- Select: "Create a Virtual Hard Drive Now".
- Leave the setting for Hard Drive File Type unchanged (i.e., VDI).
- Set the hard drive to be "Dynamically Allocated".
- Size: ~15GB
- The virtual machine is now created.
- Press “Start”
- Navigate to the Ubuntu ISO that you have downloaded, and Press Start.
- On the Boot Screen: "Install Ubuntu"
- Deselect both "Download Updates while Installing" and "Install Third-Party Software"
- Press “Continue”
- If you are setting up a dual boot instead of a virtual machine and the option "Install Ubuntu alongside Windows" is offered, select it; otherwise,
- Select "Erase disk and install Ubuntu"
- Add your account information:
- Name = "yourname"; username = "username"; password = "****";
- Remember this information and the machine name, as we will need them later
- Select "Log In Automatically"
- Press "Restart Now"
-
- Open the terminal (Ctrl + Alt + T) and execute these commands:
- Update the package list and upgrade the installed packages
sudo apt-get update
sudo apt-get upgrade
-
- Visit the link https://www.oracle.com/java/technologies/javase/javase-jdk8-downloads.html
- Or, alternatively, use this link to download it directly without signing in
- Download jdk-8u281-linux-x64.tar.gz file.
- In the Terminal, enter the following commands:
sudo mkdir /usr/lib/jvm
cd /usr/lib/jvm
sudo tar xzvf ~/Downloads/jdk-8u281-linux-x64.tar.gz
cd jdk1.8.0_281
pwd
sudo update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/jdk1.8.0_281/bin/java" 0
sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/jvm/jdk1.8.0_281/bin/javac" 0
sudo update-alternatives --set java /usr/lib/jvm/jdk1.8.0_281/bin/java
sudo update-alternatives --set javac /usr/lib/jvm/jdk1.8.0_281/bin/javac
update-alternatives --list java
update-alternatives --list javac
- If everything went right, the following command should print the Java version without errors.
java -version
- If you entered every command correctly and there is still an error, a restart might help. Restart the machine and continue from checking the version.
- Setting the environment variable
sudo nano /etc/environment
#copy and paste the following line at the end of the file
JAVA_HOME="/usr/lib/jvm/jdk1.8.0_281"
#save and exit: CTRL+X -> Y -> do not change the name, hit Enter
source /etc/environment
echo $JAVA_HOME
If everything is correct, the last command above should give you
/usr/lib/jvm/jdk1.8.0_281
-
sudo apt-get update
sudo apt-get install maven
-
sudo apt-get install openssh-server
ssh-keygen -t rsa -P ""   #do not enter a file name, just hit Enter
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
-
In the terminal, execute the following commands:
sudo wget https://downloads.apache.org/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
sudo tar xzf hadoop-3.2.1.tar.gz
sudo rm hadoop-3.2.1.tar.gz
sudo mv hadoop-3.2.1 /usr/local
sudo ln -sf /usr/local/hadoop-3.2.1/ /usr/local/hadoop
- To change ownership of Hadoop to the current user:
sudo chown -R $USER /usr/local/hadoop-3.2.1/
#Create Hadoop temp directories for Namenode and Datanode
sudo mkdir -p /usr/local/hadoop/hadoop_store/hdfs/namenode
sudo mkdir -p /usr/local/hadoop/hadoop_store/hdfs/datanode
#Again assign ownership of this Hadoop temp folder to the current user
sudo chown -R $USER /usr/local/hadoop/hadoop_store/
-
#User profile: update $HOME/.bashrc
nano ~/.bashrc
#Copy and paste the following lines as they are to the end of the .bashrc file
#Set Hadoop-related environment variables
export HADOOP_PREFIX=/usr/local/hadoop
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
#Native path
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib/native"
#Java path
export JAVA_HOME="/usr/lib/jvm/jdk1.8.0_281"
#Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/sbin
Save and exit without changing the file name: CTRL+X -> Y -> hit Enter.
#In order to have the new environment variables in place, reload .bashrc
source ~/.bashrc
-
cd /usr/local/hadoop/etc/hadoop
sudo nano yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
nano core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>
nano mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
</configuration>
sudo nano hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop/hadoop_store/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop/hadoop_store/hdfs/datanode</value>
  </property>
</configuration>
sudo nano hadoop-env.sh
#enter this line at the end of the file
export JAVA_HOME="/usr/lib/jvm/jdk1.8.0_281"
#save and exit
nano ~/.bashrc
# copy the lines below and paste them at the end of the file
export PATH=$PATH:/usr/local/hadoop/bin/
export PATH=$PATH:/usr/local/hadoop/sbin
# save and exit
source ~/.bashrc
hdfs namenode -format
start-dfs.sh
start-yarn.sh
-
- replace <your_username> with your own user name
hdfs dfs -mkdir /user
#please replace <your_username> with your actual username
hdfs dfs -mkdir /user/<your_username>
-
jps
If the jps command is not found (i.e., Java is not on your PATH), edit .bashrc:
nano ~/.bashrc
#copy and paste the following lines at the end of the file
export JAVA_HOME="/usr/lib/jvm/jdk1.8.0_281"
export PATH=$JAVA_HOME/bin:$PATH
#save and exit
source ~/.bashrc
Now, if you try
jps
again, it should give you output similar to the following example:
36673 Master
155697 Jps
51081 SparkSubmit
29739 SparkSubmit
39838 Worker
For ResourceManager – http://localhost:8088
For NameNode – http://localhost:9870 (in Hadoop 3.x the NameNode web UI uses port 9870 instead of the old 50070). Finally, to stop the Hadoop daemons, simply invoke the stop-dfs.sh and stop-yarn.sh commands.
-
mkdir $HOME/spark
cd $HOME/spark
- Go to this website: https://archive.apache.org/dist/spark/spark-3.0.1/
- Or, alternatively, go to this link to download the right version directly
- Find the spark-3.0.1-bin-hadoop3.2.tgz archive and download it
- Now, go to the terminal and follow the instructions
cd
cd ~/Downloads
ls   # you should see the .tgz file in the list
Now use the following commands, replacing <your_username> with your own user name.
#please replace <your_username> with your actual username
mv spark-3.0.1-bin-hadoop3.2.tgz /home/<your_username>/spark
#Move to the folder you created, i.e. spark
cd $HOME/spark
tar xvf spark-3.0.1-bin-hadoop3.2.tgz
nano ~/.bashrc
#copy the following lines at the end of the file
export SPARK_HOME=$HOME/spark/spark-3.0.1-bin-hadoop3.2/
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
#save and exit
source ~/.bashrc
start-master.sh
start-slave.sh <master-spark-URL>
spark-shell --master <master-spark-URL>
SparkMaster – http://localhost:8080/
Open this page in the browser and use the URL starting with spark:// shown there in place of <master-spark-URL>.
-
wget https://downloads.lightbend.com/scala/2.11.11/scala-2.11.11.tgz
sudo tar xvf scala-2.11.11.tgz
nano ~/.bashrc
#copy the following lines and paste them at the end of the file
export SCALA_HOME=$HOME/scala-2.11.11/
export PATH=$SCALA_HOME/bin:$PATH
source ~/.bashrc
scala -version
-
nano ~/.bashrc
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_PYTHON=python3
source ~/.bashrc
pyspark
This will launch PySpark; the expected output is as follows:
ulvi@machinename:~/Desktop$ pyspark
Python 3.8.5 (default, Jul 28 2020, 12:59:40)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
2021-02-22 01:52:21,004 WARN util.Utils: Your hostname, machinename resolves to a loopback address: 127.0.1.1; using 192.168.0.104 instead (on interface wlo1)
2021-02-22 01:52:21,005 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
2021-02-22 01:52:21,423 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2021-02-22 01:52:22,815 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
2021-02-22 01:52:22,816 WARN util.Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/

Using Python version 3.8.5 (default, Jul 28 2020 12:59:40)
SparkSession available as 'spark'.
>>>
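Before exiting, you can optionally run a quick sanity check directly in the PySpark shell. This is just a minimal sketch; it uses only the spark session that the shell already provides:

# create a small DataFrame and count its rows to confirm Spark is working
df = spark.range(100)
print(df.count())      # should print 100
print(spark.version)   # should print 3.0.1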
Type exit() or hit CTRL+D to exit PySpark.
-
Now, download the rdf.nt file into your main user's Downloads folder from the following link: https://github.com/SANSA-Stack/SANSA-Stack/tree/develop/sansa-rdf/sansa-rdf-spark/src/test/resources
-
Now go to the following link and download the .jar file.
-
Now, open terminal:
start-master.sh
(visit http://localhost:8080/ in the browser and copy the new link starting with spark://)
[Example: URL: spark://machinename:7077] Here, machinename is the name of the Linux machine that was chosen during installation.
#please replace <machinename> with your current machine name
start-slave.sh spark://<machinename>:7077
# Please replace <your_username> with your current user name
spark-submit --class "net.sansa_stack.rdf.spark.io.NTripleReader" --master local /home/<your_username>/Downloads/SANSA_all_dep_NO_spark.jar triples "/home/<your_username>/Downloads/rdf.nt"
Open http://localhost:8080/ in the browser. Refresh it and you will find a worker running.
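For reference, the same standalone master can also be reached from a plain Python script instead of the SANSA jar. The following is only a sketch (the file name minimal_master_check.py is hypothetical, and <machinename>/<your_username> are the same placeholders as above); it can be run with spark-submit minimal_master_check.py:

# minimal_master_check.py - a hypothetical sketch, assuming the standalone master
# started above is still running at spark://<machinename>:7077
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("StandaloneMasterCheck")
         .master("spark://<machinename>:7077")   # replace <machinename> as before
         .getOrCreate())

# read the rdf.nt file downloaded earlier as plain text and count its lines
triples = spark.read.text("/home/<your_username>/Downloads/rdf.nt")
print(triples.count())

spark.stop()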
-
Now go to the git repository and download the code. Create a folder on Desktop named sansa and extract the code into that new folder.
- Go to the terminal:
python3 --version
sudo apt update
sudo apt install python3-pip
pip3 --version
start-master.sh
#please replace <machinename> with your current machine name
start-slave.sh spark://<machinename>:7077
pip3 install jupyter
pip3 install findspark
pip3 install py4j
cd ~/Desktop
#move to the folder where you put the pysansa folder and ML_Notebook.ipynb, in our case 'sansa'
#(we are assuming this is the folder inside which you have the downloaded pysansa folder)
cd sansa
pip3 install -e pysansa
python3 -m notebook
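Inside the notebook, findspark is what makes the PySpark installation visible to Python. A first notebook cell typically looks like the minimal sketch below (the provided example notebooks may initialize Spark slightly differently):

# first notebook cell: locate Spark and start a session
import findspark
findspark.init()   # finds Spark via the SPARK_HOME variable set in ~/.bashrc

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SansaNotebook").getOrCreate()
print(spark.version)   # should print 3.0.1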
-
-
To run the RDF layer examples in Jupyter notebook:
- Go to rdf directory in sansa directory in Jupyter Notebook
- Click on rdfExampleNotebook.ipynb
- Go to 'Cell' in the toolbar and click 'Run all'
- After a few seconds, you can see the results (printed triples, printed object attributes, size of the triples file, etc.)
-
To run the Query layer examples in Jupyter notebook:
- Go to query directory in sansa directory in Jupyter Notebook
- Click on queryExampleNotebook.ipynb
- Go to 'Cell' in the toolbar and click 'Run all'
- After a few seconds, you can see the results (printed triples, a printed dataframe returned from the Query layer for a SPARQL query, etc.)
-
To run the ML layer examples in Jupyter notebook:
- Go to ml_notebook directory in sansa directory in Jupyter Notebook
- Click on ML_Notebook.ipynb
- Go to 'Cell' in the toolbar and click 'Run all'
- After a few seconds, you can see the output (You can find the output in the same directory in output_folder)
- Move pysansa folder to your project's directory
- Go to your project's directory
- Install pysansa package by running this command -> pip3 install -e pysansa
- Create a notebook in the same directory with pysansa
- Now you can use pysansa and its layers by adding this line at the beginning of your notebook -> import pysansa (a minimal sketch of such a first cell follows after this list)
- You can find the example usages in our project under ml_notebook, rdf, query directories in the relevant jupyter notebooks
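For reference, a minimal first cell for such a project notebook could look like the sketch below; it only checks that the editable pysansa install is picked up, while the actual layer calls are shown in the example notebooks listed above:

# first cell of your own project notebook
import findspark
findspark.init()          # make PySpark importable, as in the example notebooks

import pysansa            # available because of: pip3 install -e pysansa
print(pysansa.__file__)   # should point into the pysansa folder in your project directory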