Apache Spark - From installation to performing awesome operations in Apache Spark Stack
👨‍🎓 Learn to design data models, build data warehouses and data lakes, automate data pipelines, and work with massive datasets.
Group 10 Project, Fall 2020, CS 6240: Large-Scale Parallel Data Processing, Khoury College of Computer Sciences, Northeastern University
Assignment 2 of the course 'Distributed Systems Programming' by Meni Adler. In the assignment we build an application that calculates, for any pair of words in the Google n-gram corpus, the probability of each word that can follow that pair.
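A minimal local sketch of the probability the assignment computes, P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2), estimated from trigram counts. The function name and the toy counts are illustrative assumptions; the real project computes this at scale over the Google n-gram corpus.

```python
from collections import defaultdict

def trigram_probabilities(trigram_counts):
    """Estimate P(w3 | w1, w2) from raw trigram counts.

    trigram_counts maps (w1, w2, w3) -> count. The conditional
    probability is the trigram count divided by the total count
    of all trigrams sharing the (w1, w2) prefix.
    """
    pair_totals = defaultdict(int)
    for (w1, w2, _w3), c in trigram_counts.items():
        pair_totals[(w1, w2)] += c
    return {
        (w1, w2, w3): c / pair_totals[(w1, w2)]
        for (w1, w2, w3), c in trigram_counts.items()
    }

# Toy corpus: after "new york", "city" appears 3 times and "times" once.
counts = {("new", "york", "city"): 3, ("new", "york", "times"): 1}
probs = trigram_probabilities(counts)
# probs[("new", "york", "city")] == 0.75
```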
How to: find acute exacerbation of COPD events in UK primary care electronic healthcare records
Start and monitor jobs on EMR cluster
ETL logic is written in Spark for transforming the given data set in S3, and queries on the transformed data are run using AWS Redshift. The data sets are in JSON format. All the raw JSON data has to first be uploaded to an S3 source bucket. Using EMR, a Spark job is executed, which fetches the source data from the S3 source bu…
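A minimal local sketch of the transform step this pipeline describes: parse raw JSON records and reshape them into query-ready rows. The field names and the transformation are assumptions for illustration; the actual project runs this as a Spark job on EMR against S3, with Redshift querying the output.

```python
import json

def transform(raw_line):
    """Parse one raw JSON record and reshape the fields of interest
    (field names here are illustrative assumptions)."""
    rec = json.loads(raw_line)
    return {"user_id": rec["userId"], "event": rec["event"].lower()}

# Stand-in for raw JSON lines fetched from the S3 source bucket.
raw = ['{"userId": 1, "event": "PLAY"}', '{"userId": 2, "event": "PAUSE"}']
rows = [transform(line) for line in raw]
# rows[0] == {"user_id": 1, "event": "play"}
```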
Streaming pipeline using AWS MSK and AWS EMR with Spark, retrieving the data from Twitter Streams API
An implementation, in C using the PBC library, of Efficient Secure-Channel Free Public Key Encryption with Keyword Search for EMRs in Cloud Storage
Parsing the common crawl database using Scala and Spark