Programs I wrote for CS 4240 (Large Scale Parallel Data Processing) in Scala and Apache Spark
Project 1: Wikipedia - Find and rank the most commonly mentioned programming languages in an Apache Spark RDD of WikipediaArticles using three ranking techniques: naive ranking, ranking using an inverted index, and ranking using reduceByKey.
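The three ranking techniques could look roughly like the sketch below. It is not the project code: the `WikipediaArticle` fields, the `mentions` helper, the `langs` list, and the three sample articles are all assumptions made up for illustration, and it runs against a local SparkContext.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Assumed shape of an article; the class used in the project may differ.
case class WikipediaArticle(title: String, text: String) {
  // True if the language name appears as a space-separated word in the text.
  def mentions(lang: String): Boolean = text.split(' ').contains(lang)
}

object RankingSketch {
  // Hypothetical list of languages to rank.
  val langs: List[String] = List("Scala", "Java", "Python")

  // 1. Naive ranking: one full pass over the RDD per language.
  def rankNaive(rdd: RDD[WikipediaArticle]): List[(String, Int)] =
    langs.map(l => (l, rdd.filter(_.mentions(l)).count().toInt)).sortBy(-_._2)

  // 2. Inverted index: map each language to the articles that mention it.
  //    Languages with no mentions do not appear in the index.
  def makeIndex(rdd: RDD[WikipediaArticle]): RDD[(String, Iterable[WikipediaArticle])] =
    rdd.flatMap(a => langs.filter(a.mentions).map(l => (l, a))).groupByKey()

  def rankWithIndex(index: RDD[(String, Iterable[WikipediaArticle])]): List[(String, Int)] =
    index.mapValues(_.size).collect().toList.sortBy(-_._2)

  // 3. reduceByKey: emit (language, 1) pairs and sum them per key,
  //    without materializing the article groups.
  def rankReduceByKey(rdd: RDD[WikipediaArticle]): List[(String, Int)] =
    rdd.flatMap(a => langs.filter(a.mentions).map(l => (l, 1)))
      .reduceByKey(_ + _)
      .collect().toList.sortBy(-_._2)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("ranking-sketch")
    val sc = new SparkContext(conf)
    // Made-up sample articles, only to exercise the three functions.
    val rdd = sc.parallelize(Seq(
      WikipediaArticle("A", "Scala and Java"),
      WikipediaArticle("B", "Java is old"),
      WikipediaArticle("C", "Python Java Scala")
    ))
    println(rankNaive(rdd))
    println(rankWithIndex(makeIndex(rdd)))
    println(rankReduceByKey(rdd))
    sc.stop()
  }
}
```

On this sample all three functions agree on the ordering Java, Scala, Python. The naive version scans the RDD once per language, while the reduceByKey version combines counts on each partition before shuffling, which is usually why it is preferred over groupByKey for counting.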
Project 2: Timeusage - Using the American Time Use Survey dataset from https://www.kaggle.com/bls/american-time-use-survey, evaluate how working and unemployed individuals distribute their time among personal needs, work, and leisure. Uses Apache Spark SQL DataFrames and Datasets.
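A minimal sketch of the kind of aggregation this project performs with Spark SQL is shown below. The column names (`working`, `primaryNeeds`, `work`, `other`), the `TimeUsageRow` case class, and the four sample rows are hypothetical placeholders, not the survey's real schema or results; the actual dataset needs its activity columns classified and summed before this step.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.{avg, col, round}

// Hypothetical typed row for the summarized result.
case class TimeUsageRow(working: String, primaryNeeds: Double, work: Double, other: Double)

object TimeUsageSketch {
  // Average hours per category for each employment status, rounded to 1 decimal.
  def summarize(df: DataFrame): DataFrame =
    df.groupBy(col("working"))
      .agg(
        round(avg(col("primaryNeeds")), 1).as("primaryNeeds"),
        round(avg(col("work")), 1).as("work"),
        round(avg(col("other")), 1).as("other")
      )
      .orderBy(col("working"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("timeusage-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up rows (hours per day), only to exercise the aggregation.
    val df = Seq(
      ("working", 10.0, 8.0, 6.0),
      ("working", 11.0, 7.0, 6.0),
      ("not working", 12.0, 0.0, 12.0),
      ("not working", 13.0, 1.0, 10.0)
    ).toDF("working", "primaryNeeds", "work", "other")

    val summary = summarize(df)
    summary.show()

    // The same result as a typed Dataset, using the encoder from spark.implicits.
    val typed: Dataset[TimeUsageRow] = summary.as[TimeUsageRow]
    typed.collect().foreach(println)

    spark.stop()
  }
}
```

The DataFrame version refers to columns by name and is checked only at run time; converting to a `Dataset[TimeUsageRow]` gives compile-time checked field access at the cost of defining the case class up front.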