Extract, transform, and load (ETL) processes combine data from multiple sources into a large, central repository called a data warehouse. This project uses a toy pipeline to model a real-world ETL job working on different file formats using PySpark (the Python API for Apache Spark).
This project creates an ETL (extract, transform, load) pipeline that:
- Imports CSV, JSON, and Parquet data into PySpark
- Cleans and transforms the data as required
- Joins the required columns from each source into a final table
- Saves the joined dataset in different formats (CSV, JSON, Parquet)
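The sketch below illustrates these steps end to end. It is not the notebook's exact code: the file paths, the `state` join key, and the column being trimmed are assumptions made for illustration, so adjust them to match the actual data.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, trim

spark = SparkSession.builder.appName("file_format_etl").getOrCreate()

# Extract: read each source format into a DataFrame (paths are assumed)
complaints_df = spark.read.csv("Consumer_Complaints.csv", header=True, inferSchema=True)
population_df = spark.read.json("us-states-population.json")
longlat_df = spark.read.parquet("US States Long_Lat.parquet")

# Transform: example cleanup -- trim whitespace from an assumed shared "state" column
complaints_df = complaints_df.withColumn("state", trim(col("state")))

# Join the required columns from each source into a single table (join key assumed)
joined_df = (
    complaints_df
    .join(population_df, on="state", how="inner")
    .join(longlat_df, on="state", how="inner")
)

# Load: save the joined dataset in each target format
joined_df.write.mode("overwrite").csv("output/joined_csv", header=True)
joined_df.write.mode("overwrite").json("output/joined_json")
joined_df.write.mode("overwrite").parquet("output/joined_parquet")
```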
For a detailed explanation of the ETL processes used, check out the accompanying article on Medium.
You can find the code for this project here.
File overview:
- File_Format_Spark_ETL.ipynb - the full code from this project
To follow this project, please install the following locally:
- Python 3.8+
- Spark 3.2.1
- Python packages:
  - pyspark (including pyspark.sql.functions)
- Standard-library modules os and sys (these ship with Python and need no separate install)
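As a quick sanity check of the local setup, you can start a SparkSession and print its version. This is a minimal sketch, not part of the notebook:

```python
from pyspark.sql import SparkSession

# Start a local SparkSession to confirm PySpark is installed correctly
spark = SparkSession.builder.appName("install_check").getOrCreate()
print(spark.version)  # should print 3.2.1 (or whichever Spark version you installed)
spark.stop()
```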
You can download the exact files used in this project here:
- Consumer_Complaints - CSV file used
- us-states-population - JSON file used
- US States Long_Lat - Parquet file used
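Once downloaded, each file can be loaded with the reader that matches its format and inspected before running the pipeline. The paths and extensions below are assumptions based on the names above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inspect_sources").getOrCreate()

# Load each source with the matching reader and print basic information
sources = {
    "Consumer_Complaints (CSV)": spark.read.csv("Consumer_Complaints.csv", header=True, inferSchema=True),
    "us-states-population (JSON)": spark.read.json("us-states-population.json"),
    "US States Long_Lat (Parquet)": spark.read.parquet("US States Long_Lat.parquet"),
}

for name, df in sources.items():
    print(name, "-", df.count(), "rows")
    df.printSchema()
```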