For this project, you will use Spark and Hadoop, big data frameworks that let developers process massive amounts of data with parallelism and fault tolerance built in. You will use Docker to run both frameworks on your local machine.
This project will be due on 13 December 2023 at 11:59PM. 15-minute demos are required. See Deliverables below for scheduling details.
This project will be completed in pairs. Provide the TA with both students' names and SIDs by Friday, 27 October 2023.
For the point distribution, see the rubric at the end of this document. You will submit the following deliverables:
- A zipped file containing:
  - The `App.java` files that include your Hadoop code for both tasks. Name them `task_1.java` and `task_2.java` respectively.
  - The Jupyter notebooks that include your PySpark code for both tasks. Name them `task_1.ipynb` and `task_2.ipynb` respectively.
- A brief and concise report PDF following this template.
- A 15-minute demo for the project with the TA, scheduled via a Google Sheet that will be shared on Canvas. Demos will be held on 14 and 15 December.
In this project, the only setup required is:
- Docker
- a terminal to run shell commands
- a web browser to access localhost
Go to the Docker website, download the installer, and install Docker with the default configuration. If you are on Windows, it is highly recommended that you use Docker Desktop's WSL-based engine (instructions here). Either way, the installer will ask you to restart; do so, and make sure Docker is running before continuing.
Clone this repository and `cd` into it. Create the directories `hadoop/data/` and `spark/data/`, which is where you will put your data files. You will also see `spark/src/`, where your notebooks will live, and `hadoop/CS236_project/`, which holds the Hadoop project.
In this project you will be using data from the USDA's Food Environment Atlas. Scroll down to Data Set -> Food and Environment Atlas .csv Files to download the data. Create a new directory `data/` in the root of the repository, then extract all of the files into it.
We will be using Docker for this project to run our code in a container. This lets your code run in an isolated environment, without having to worry about your personal machine's environment. Read more about how Docker works here.
The configuration of our Docker container is defined in `docker-compose.yml`. This configuration is the bare minimum needed for this project, but Compose configurations can scale considerably, such as starting up multiple containers that communicate with each other. The fields in this docker-compose file are `build`, `volumes`, and `ports`:
- `build`: the path to the directory containing the Dockerfile the container is built from. The Dockerfile names the base image (pulled from Docker Hub) and defines the container's dependencies, installed packages, and any additional configuration.
- `volumes`: allows us to interact with our container's file system. Its elements are formatted as `<local_path_relative_to_docker-compose.yml>:<absolute_path_in_docker_container>`.
- `ports`: maps ports within our container to ports on our local machine. Its elements are formatted as `<port_on_local>:<port_in_container>`.
For more information on Docker Compose, see the Docker documentation. Here is the relevant portion of the Spark environment's `docker-compose.yml`:

```yaml
build: .
volumes:
  - ./src:/home/jovyan/src
  - ./data:/home/jovyan/data
ports:
  - 8888:8888
```
To run the Spark environment in Jupyter Notebook, run `docker compose build`, then `docker compose up` in your terminal, in the same directory as `docker-compose.yml`. For future runs, you only need to run `docker compose up`. Running the build for the first time may take 5-10 minutes.
Look for this text in the terminal output:
```
To access the server, open this file in a browser:
    file:///home/jovyan/.local/share/jupyter/runtime/jpserver-7-open.html
Or copy and paste one of these URLs:
    http://476665dae8a3:8888/lab?token=<a_generated_token>
    http://127.0.0.1:8888/lab?token=<a_generated_token>
```
Open the second URL (the one that starts with `127.0.0.1`) to open Jupyter Lab. This is where you will be doing your work.
You will be using Jupyter Notebook to run your Python/PySpark code. See the Jupyter docs for more info.
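Before diving in, you can sanity-check the environment with a minimal cell like the one below. This is just a sketch: it assumes the image bundles PySpark and that your notebook lives under `src/` next to the mounted `data/` directory (matching the volumes in `docker-compose.yml`).

```python
# Minimal sanity check for the PySpark environment (sketch).
# Assumes the container image ships with pyspark and that ./data on your
# machine is mounted at /home/jovyan/data (see docker-compose.yml).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cs236-project").getOrCreate()

# Notebooks live in /home/jovyan/src, so the data directory is one level up.
df = spark.read.csv("../data/SupplementalDataCounty.csv",
                    header=True, inferSchema=True)
df.printSchema()
df.show(5)
```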
Running Hadoop code is a little more complicated. First, download the Hadoop binaries and put the file in the `hadoop/` directory. Do not decompress the file. Then, run `docker compose build`. This will take about 10 minutes.
To start the Docker container, run the command `docker compose up -d`. Note that we add `-d`: this runs the container in detached mode, so we can keep using the same terminal while the container runs. To run the code, we will run bash inside the container. Run the command `docker compose exec hadoop bash` to run bash in the container. Then, go to `/home/CS236_project`.
Whenever you bash into the container, you need to reset `JAVA_HOME`. To do this, run the following two commands while in the container:

```sh
unset JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-arm64
```
Whenever your code is ready to recompile, run the command `mvn package`. This will compile your Java code into a jar. To run the jar in Hadoop, run the following command:

```sh
hadoop jar target/CS236_project-1.0-SNAPSHOT.jar <path_to_input_file> <path_to_output_directory>
```
Note that the output directory must not exist before running Hadoop.
To see how long the Hadoop code took to run, refer to `hadoop/execution_time.txt`.
Look through `hadoop/CS236_project/src/main/java/edu/ucr/cs/cs236/App.java` to understand the structure of the code, and complete the TODO sections to write your results. Your results should be in `CS236_project/<your_output_directory_name>/part-r-00000`. To do the visualization, put the output file in the `data` directory of your Spark environment and use Python & Pandas to visualize it the same way you did for Spark.
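As a rough illustration of that last step, here is one way to load the Hadoop output in pandas. This sketch assumes the reducer emits Hadoop's default tab-separated key/value lines and that you copied `part-r-00000` into `spark/data/`; the column names, and the optional plotly call (only if plotly is available in the image and your keys are two-letter state codes), are assumptions to adapt.

```python
# Sketch: load Hadoop's part-r-00000 into pandas for visualization.
# Assumes the default TextOutputFormat (tab-separated key/value pairs);
# adjust sep= and names= to match what your reducer actually emits.
import pandas as pd

df = pd.read_csv("../data/part-r-00000", sep="\t",
                 names=["state", "population"])
print(df.head())

# One option for the choropleth, if plotly is installed in the image and
# the "state" column holds two-letter abbreviations:
import plotly.express as px
fig = px.choropleth(df, locations="state", locationmode="USA-states",
                    color="population", scope="usa",
                    title="2010 population by state")
fig.show()
```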
When you are finished running your code, stop the Spark container with CTRL+C in the terminal running docker compose. For both Docker containers, make sure to run the command `docker compose down`; you will not lose any saved files.
In this project you will look at food environment data, such as proximity to grocery stores and restaurants, across different areas of the United States. Read the documentation to help understand the data before continuing. You will perform each task below in both Hadoop and Spark. It is recommended, though not required, to do each task in Spark first, since the syntax is simpler.
Open `src/tutorial.ipynb` in Jupyter Lab and experiment with the PySpark tools, as well as the data. You will learn some basics of how to process data and visualize it. Compare the performance of Spark SQL and RDDs, as well as Hadoop.
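To make that comparison concrete, here is a hedged sketch of one aggregation written with both APIs. The file and column names are placeholders, not the actual Atlas schema; check the real headers before using this.

```python
# Sketch: the same aggregation via Spark SQL and via the RDD API.
# "State" and "Value" are placeholder column names.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("../data/SupplementalDataCounty.csv",
                    header=True, inferSchema=True)

# Spark SQL: register a temporary view and aggregate declaratively.
df.createOrReplaceTempView("counties")
sql_result = spark.sql(
    "SELECT State, SUM(Value) AS total FROM counties GROUP BY State")
sql_result.show(5)

# RDD API: the same aggregation as explicit key/value transformations.
rdd_result = (df.rdd
              .map(lambda row: (row["State"], float(row["Value"] or 0)))
              .reduceByKey(lambda a, b: a + b))
print(rdd_result.take(5))
```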
When working with real-world data, you will encounter dirty data. This could be due either to the nature of the data or to simple human error. For this task, get the total population of each state in 2010 by summing the populations of its counties (from `SupplementalDataCounty.csv`). State population data already exists, but use the county data as an exercise. This data has some issues, so you will have to do some cleaning to properly visualize it.
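As a non-authoritative starting point, the sketch below shows the shape of the Spark SQL version of this aggregation. The population column name is a placeholder you must verify against the file, and the cleaning steps are deliberately left out, since finding and fixing them is the point of the task.

```python
# Sketch of the Task 1 aggregation: sum county populations per state.
# "POP_2010" is a placeholder column name -- check the actual header in
# SupplementalDataCounty.csv. Expect to add cleaning (bad rows,
# non-numeric values, stray whitespace) before this runs cleanly.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
counties = spark.read.csv("../data/SupplementalDataCounty.csv",
                          header=True, inferSchema=True)

state_pop = (counties
             .groupBy("State")
             .agg(F.sum("POP_2010").alias("total_population_2010")))
state_pop.show()
```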
As stated above, the template code for Hadoop has some TODOs to complete. These TODOs have to do with this task.
In your report, write what issues you came across and how you solved them. Include a screenshot of the resulting choropleth map that you created from this data. Also include the runtimes of the computation for Spark (using both SQL and RDDs) and for Hadoop; for Spark, use `%%timeit` with everything in one cell. The timed computation includes file reading, the computation itself, and file writing.
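One way to satisfy the timing requirement, assuming `spark` was created in an earlier cell, is to wrap the whole pipeline in a single `%%timeit` cell; the path and the computation below are placeholders for your own.

```python
%%timeit
# Time read + compute + write together in one cell, per the requirement
# above. %%timeit reruns the cell several times, so write with overwrite.
df = spark.read.csv("../data/SupplementalDataCounty.csv", header=True)
result = df.groupBy("State").count()          # placeholder computation
result.write.mode("overwrite").csv("../data/task1_timing_out")
```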
After reading through the variable list file, compare the average number of grocery stores available per 1,000 people in each state in 2011 and 2016. Visualize this before-and-after comparison in a double bar graph (see method 1) for the 3 states with the largest positive change between the two years.
In your report, include a screenshot of the bar graph. Record runtimes in the same way as task 1.
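For the graph itself, here is a minimal pandas/matplotlib sketch. The numbers are placeholders for your computed top-3 results, and the Atlas variable codes for grocery stores per 1,000 people (likely `GROCPTH11`/`GROCPTH16`, but verify in the variable list file) determine what you actually average.

```python
# Sketch: double bar graph of 2011 vs. 2016 averages for the top-3 states.
# All values below are placeholders -- substitute your computed results.
import pandas as pd
import matplotlib.pyplot as plt

top3 = pd.DataFrame({
    "state": ["XX", "YY", "ZZ"],        # placeholder state codes
    "avg_2011": [0.20, 0.22, 0.25],     # placeholder values
    "avg_2016": [0.31, 0.35, 0.40],
})

ax = top3.plot(x="state", y=["avg_2011", "avg_2016"], kind="bar")
ax.set_ylabel("Grocery stores per 1,000 people")
ax.set_title("Grocery store availability, 2011 vs 2016")
plt.tight_layout()
plt.show()
```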
| Item | Deliverable | Description | Points |
| --- | --- | --- | --- |
| Code Submission | Canvas submission | Zip file contents are correctly named and include required files | 5 |
| Template | report | Report follows template | 3 |
| Team Member Names | report | Includes complete information for each team member | 2 |
| Contributions | report | Describe each team member's contributions | 5 |
| Task 1 Content | report | Includes screenshot, what data cleaning was done, and runtimes | 10 |
| Task 2 Content | report | Includes screenshot and runtimes | 10 |
| Method Comparison - Runtime | report | Compare relative runtimes for each task for each method and why | 10 |
| Method Comparison - Developer Experience | report | Compare which method was most comfortable to code and why | 10 |
| Preparedness | demo | Code should be ready to run during the demo | 3 |
| Task 1 Code | demo | Brief code walkthrough detailing how you implemented each method | 7 |
| Task 1 Output | demo | Show job run and correct outputs with all 3 methods | 7 |
| Task 1 Visualization | demo | Visualization is interpretable | 7 |
| Task 2 Code | demo | Brief code walkthrough detailing how you implemented each method | 7 |
| Task 2 Output | demo | Show job run and correct outputs with all 3 methods | 7 |
| Task 2 Visualization | demo | Visualization is interpretable | 7 |
| **Total** | | | **100** |
Parts of the setup, template code, and running instructions were adapted with permission from Professor Eldawy's big-data lab materials.