SamHausmann/LargeScaleParallelDataProcessing

LargeScaleParallelDataProcessing

Programs I wrote for CS 4240 (Large Scale Parallel Data Processing) in Scala and Apache Spark

Project 1: Wikipedia - Find and rank the programming languages mentioned most often in an Apache Spark RDD of WikipediaArticles, using three ranking techniques: naive ranking, ranking with an inverted index, and ranking with reduceByKey.
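A minimal sketch of how the three ranking techniques might look, not the code in this repository. The `WikipediaArticle` fields, the `mentionsLanguage` helper, the `RankingSketch` object, and the sample articles are all assumptions made for illustration; only the Spark RDD calls (`filter`, `flatMap`, `groupByKey`, `mapValues`, `reduceByKey`) are real API. It assumes `spark-core` is on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical article type; the field names are assumptions for this sketch.
case class WikipediaArticle(title: String, text: String) {
  // True if the language name appears as a space-separated word in the text.
  def mentionsLanguage(lang: String): Boolean = text.split(' ').contains(lang)
}

object RankingSketch {
  // 1. Naive ranking: one full pass over the RDD per language.
  def rankNaive(langs: List[String], rdd: RDD[WikipediaArticle]): List[(String, Int)] =
    langs
      .map(lang => (lang, rdd.filter(_.mentionsLanguage(lang)).count().toInt))
      .sortBy(-_._2)

  // 2. Inverted index: map each language to the articles that mention it.
  //    Languages with no mentions do not appear in the index.
  def makeIndex(langs: List[String],
                rdd: RDD[WikipediaArticle]): RDD[(String, Iterable[WikipediaArticle])] =
    rdd
      .flatMap(article => langs.filter(article.mentionsLanguage).map(lang => (lang, article)))
      .groupByKey()

  def rankUsingIndex(index: RDD[(String, Iterable[WikipediaArticle])]): List[(String, Int)] =
    index.mapValues(_.size).collect().toList.sortBy(-_._2)

  // 3. reduceByKey: emit (language, 1) pairs and sum them, so counts are
  //    combined on each partition before the shuffle and no article groups are built.
  def rankReduceByKey(langs: List[String], rdd: RDD[WikipediaArticle]): List[(String, Int)] =
    rdd
      .flatMap(article => langs.filter(article.mentionsLanguage).map(lang => (lang, 1)))
      .reduceByKey(_ + _)
      .collect()
      .toList
      .sortBy(-_._2)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("ranking-sketch"))

    // Made-up sample data standing in for the Wikipedia dump.
    val articles = sc.parallelize(Seq(
      WikipediaArticle("a", "Scala and Java run on the JVM"),
      WikipediaArticle("b", "Java is widely used"),
      WikipediaArticle("c", "Python and Java and Scala"),
      WikipediaArticle("d", "Nothing relevant here")
    ))
    val langs = List("Scala", "Java", "Python")

    println(rankNaive(langs, articles))
    println(rankUsingIndex(makeIndex(langs, articles)))
    println(rankReduceByKey(langs, articles))

    sc.stop()
  }
}
```

On the sample data all three functions produce the same ranking (Java 3, Scala 2, Python 1). They differ in cost: the naive version scans the RDD once per language, the inverted index shuffles whole articles, and the reduceByKey version shuffles only partial counts.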

Project 2: Timeusage - Using the American Time Use Survey dataset from https://www.kaggle.com/bls/american-time-use-survey, evaluate how working and unemployed individuals distribute their time among personal needs, work, and leisure. Uses Apache Spark SQL DataFrames and Datasets.
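A rough sketch of the kind of aggregation this project describes, not the repository's actual code. The `TimeUsageRow` case class, its column names, the `TimeUsageSketch` object, and the sample hours are hypothetical; the real survey data has a different schema that would first need to be mapped into these three categories. It assumes `spark-sql` is on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.{avg, round}

// Hypothetical, already-categorized row: hours per day in each category.
case class TimeUsageRow(working: String, primaryNeeds: Double, work: Double, leisure: Double)

object TimeUsageSketch {
  // Untyped DataFrame aggregation: average hours per category for each
  // employment status, rounded to one decimal place.
  def summarize(df: DataFrame): DataFrame =
    df.groupBy("working")
      .agg(
        round(avg("primaryNeeds"), 1).as("primaryNeeds"),
        round(avg("work"), 1).as("work"),
        round(avg("leisure"), 1).as("leisure"))
      .orderBy("working")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("timeusage-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data standing in for the survey.
    val rows: Dataset[TimeUsageRow] = Seq(
      TimeUsageRow("working", 10.0, 8.0, 6.0),
      TimeUsageRow("working", 11.0, 9.0, 4.0),
      TimeUsageRow("not working", 12.0, 0.0, 12.0),
      TimeUsageRow("not working", 13.0, 1.0, 10.0)
    ).toDS()

    // Convert the result back to a typed Dataset once the columns match the case class.
    val summary: Dataset[TimeUsageRow] = summarize(rows.toDF()).as[TimeUsageRow]
    summary.show()

    spark.stop()
  }
}
```

The DataFrame API keeps the aggregation concise, while the final `.as[TimeUsageRow]` step recovers compile-time field names for any downstream code.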
