Big Data Analytics

CIS-5450 - University of Pennsylvania (Fall 2022)

Overview

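This repository showcases the five homework assignments from the CIS-5450 Big Data Analytics course. Each assignment exercises a different part of the data science stack: data wrangling with Pandas, SQL querying, distributed processing with Spark SQL, predictive modeling with Spark ML and Scikit-learn, and deep learning with PyTorch.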

Table of Contents

  1. Homework 1: Data Wrangling with Pandas
  2. Homework 2: SQL with Spotify Data
  3. Homework 3: Spark SQL and Amazon Reviews
  4. Homework 4: Machine Learning with Apache Spark ML
  5. Homework 5: Deep Learning with PyTorch

Homework 1: Data Wrangling with Pandas

  • Skills Demonstrated: Data cleaning, aggregation, and visualization using Pandas.
  • Project Summary:
    Cleaned and wrangled raw data to analyze the performance of various airline companies.
  • File: homework1.ipynb

Homework 2: SQL with Spotify Data

  • Skills Demonstrated: SQL querying with pandasql, text analysis.
  • Project Summary:
    Explored a Spotify dataset containing song reviews and statistics to uncover trends and insights.
  • File: homework2.ipynb

Homework 3: Spark SQL and Amazon Reviews

  • Skills Demonstrated: Big data processing with Spark SQL, cluster computing with AWS EMR.
  • Project Summary:
    Queried and transformed datasets of Amazon products and their reviews using Spark SQL on an EMR cluster.
  • File: homework3.ipynb

Homework 4: Machine Learning with Apache Spark ML

  • Skills Demonstrated: Predictive modeling with Apache Spark ML and Scikit-learn.
  • Project Summary:
    Built predictive models to estimate ratings of new Airbnb properties.
  • File: homework4.ipynb

Homework 5: Deep Learning with PyTorch

  • Skills Demonstrated: Neural network modeling, image classification with PyTorch.
  • Project Summary:
    Designed a deep learning model to classify images from the CIFAR-10 dataset.
  • File: homework5.ipynb

Skills Gained

  • Data wrangling and visualization with Pandas.
  • SQL querying with pandasql.
  • Big data processing with Apache Spark SQL and EMR.
  • Predictive modeling using Spark ML and Scikit-learn.
  • Deep learning with PyTorch for image classification.