The tutorial is in the notebook mapreduce.ipynb.
To run this tutorial, run the jupyter/pyspark-notebook Docker container!
To complete the assignment you will need to access a 160GB dataset of tweets, which is available in the Google Cloud Storage bucket:
gs://bgse-datawarehousing-random-tweets
This bucket is public. You can view it in the browser through this link:
https://console.cloud.google.com/storage/browser/bgse-datawarehousing-random-tweets
You will see that it consists of a bunch of files named with a timestamp. Each of these files is a line-delimited JSON file (also known as JSON Lines format), where each line is a tweet object from the Twitter API. You can read more about the format on the Twitter site.
Google has written a "connector" that allows Spark to read directly from Google Cloud Storage buckets as though they were hdfs file paths. You can read about the connector here: https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage.
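As a concrete starting point, a minimal sketch of the read step might look like the following (assuming you are running on a Dataproc cluster, where the connector comes preinstalled; the app name is just a placeholder):

```python
from pyspark.sql import SparkSession

# On Dataproc the GCS connector is already configured, so the gs:// path
# can be read like any other file path.
spark = SparkSession.builder.appName("hashtag-cooccurrence").getOrCreate()

# samplingRatio controls how much data Spark scans to infer the JSON schema;
# a small value keeps the initial read fast (see the hints at the bottom).
tweets = spark.read.json(
    "gs://bgse-datawarehousing-random-tweets",
    samplingRatio=0.01,
)

tweets.printSchema()
```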
We want to analyze hashtags by looking at their co-occurrence with other hashtags. The final objects we will want are:
- A scipy sparse matrix of dimensions NxN, where N is the number of hashtags analyzed. The values of the matrix should be the number of times that hashtags i, j co-occur in the same tweet (thus it will be a symmetric matrix).
- A method for mapping hashtags to indices in the sparse matrix (and vice versa).
To do that, the first step will be to extract the hashtags from the tweet object. I recommend reading the data and using .printSchema() to take a look at the JSON schema in Spark. We can ignore the "extended tweets" and instead just look at the basic tweet object, where the hashtags live under entities.hashtags.
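For example, a sketch of pulling out just the hashtag arrays (assuming `tweets` is the DataFrame from the read step above and that the schema matches the standard tweet payload):

```python
from pyspark.sql import functions as F

# entities.hashtags is an array of structs with a `text` field, so selecting
# the nested field gives one array of hashtag strings per tweet.
tags = (
    tweets
    .select(F.col("entities.hashtags.text").alias("tags"))
    .where(F.size("tags") >= 2)  # a tweet needs at least two hashtags to co-occur
)
```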
As a first step, I recommend trying to create an RDD where each element is a 2-tuple of two tags that appeared together, like: [('maga', 'trumptrain'), ...].
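One way to get there (a sketch; it uses itertools.combinations over sorted, lower-cased tags so each unordered pair appears exactly once, rather than the permutations mentioned in the hints below):

```python
from itertools import combinations

def tweet_to_pairs(row):
    # e.g. ['MAGA', 'TrumpTrain'] -> [('maga', 'trumptrain')]
    unique = sorted(set(t.lower() for t in row.tags))
    return list(combinations(unique, 2))

# flatMap flattens the per-tweet pair lists into one RDD of 2-tuples
pairs = tags.rdd.flatMap(tweet_to_pairs)

pairs.take(5)
```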
From there, you should try to create a data structure from which it is easy to build a scipy sparse matrix, and then aggregate all the information into a matrix. Think about what you want to do locally on the worker nodes and what you want to do on the driver.
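One possible split (a sketch, not the only way): count each pair on the workers and only bring the aggregated counts back to the driver:

```python
# The per-tweet pair records stay distributed; only the much smaller
# ((tag_i, tag_j), count) summaries are collected on the driver.
pair_counts = (
    pairs
    .map(lambda pair: (pair, 1))
    .reduceByKey(lambda a, b: a + b)
    .collect()
)
```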
NOTE: This will be a relatively long-running operation! 160GB is not a lot, but it is not so little either. I recommend trying to get everything working on a SINGLE json file and THEN trying to run it on the entire folder.
- Make sure to redeem your $50 student credits!
- Just follow the tutorial here: https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook
I recommend the following setup:
gcloud beta dataproc clusters create dw-cluster \
--optional-components=ANACONDA,JUPYTER \
--image-version=1.4 \
--enable-component-gateway \
--bucket [THE BUCKET YOU CREATED FOR NOTEBOOK FILES] \
--project [YOUR PROJECT] \
--zone europe-west1-b \
--worker-machine-type n1-standard-4 \
--num-workers=2 \
--num-preemptible-workers=2

- Use the `samplingRatio=0.01` option with `read` in order to cut down on initial read time (look at the documentation!)
- Map and filter to remove excess tweet data as early as possible.
- `.cache()` might be helpful!
- `.flatMap()` and `from itertools import permutations` will also be helpful.
- `.reduceByKey()` will probably be needed, and maybe even `.aggregate()` (although not necessary).
- Look at the scipy sparse matrix types! Think about how you want to construct it. Try it out to make sure you understand how it works before trying to use it (see the sketch below)!
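For instance, a small sketch of building the final matrix and the hashtag-to-index mapping from the collected pair counts (assuming the `pair_counts` list from the earlier sketch):

```python
from scipy.sparse import coo_matrix

# hashtag <-> index mapping
tags_seen = sorted({t for (a, b), _ in pair_counts for t in (a, b)})
index_of = {tag: i for i, tag in enumerate(tags_seen)}  # hashtag -> index
N = len(tags_seen)

rows, cols, vals = [], [], []
for (a, b), count in pair_counts:
    i, j = index_of[a], index_of[b]
    # add both (i, j) and (j, i) so the matrix comes out symmetric
    rows += [i, j]
    cols += [j, i]
    vals += [count, count]

cooc = coo_matrix((vals, (rows, cols)), shape=(N, N))

# tags_seen[i] recovers the hashtag for row/column i
```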