Extract, transform, and load (ETL) processes combine data from multiple sources into a large, central repository called a data warehouse. This project uses a toy pipeline to model a real-world ETL job working on different file formats using PySpark (the Python API for Apache Spark).
This project creates an ETL (extract, transform, load) pipeline that:
- Imports CSV, JSON, and Parquet data into PySpark
- Cleans and transforms the data as required
- Joins the required columns from each source into a final table
- Saves the joined dataset in different formats (CSV, JSON, Parquet)
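The sketch below illustrates these steps end to end. It is not the notebook's exact code: the file paths, the `state` join key, and the column being trimmed are assumptions made for illustration, so adjust them to match the actual data.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, trim

spark = SparkSession.builder.appName("file_format_etl").getOrCreate()

# Extract: read each source format into a DataFrame (paths are assumed)
complaints_df = spark.read.csv("Consumer_Complaints.csv", header=True, inferSchema=True)
population_df = spark.read.json("us-states-population.json")
longlat_df = spark.read.parquet("US States Long_Lat.parquet")

# Transform: example cleanup -- trim whitespace from an assumed shared "state" column
complaints_df = complaints_df.withColumn("state", trim(col("state")))

# Join the required columns from each source into a single table (join key assumed)
joined_df = (
    complaints_df
    .join(population_df, on="state", how="inner")
    .join(longlat_df, on="state", how="inner")
)

# Load: save the joined dataset in each target format
joined_df.write.mode("overwrite").csv("output/joined_csv", header=True)
joined_df.write.mode("overwrite").json("output/joined_json")
joined_df.write.mode("overwrite").parquet("output/joined_parquet")
```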
For a detailed explanation of the ETL processes used, check out the accompanying article on Medium.
You can find the code for this project here.
File overview:
- File_Format_Spark_ETL.ipynb - the full code from this project
To follow this project, please install the following locally:
- Python 3.8+
- Spark 3.2.1
- Python packages:
  - pyspark (including pyspark.sql.functions)
- Standard-library modules os and sys (these ship with Python and need no separate install)
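As a quick sanity check of the local setup, you can start a SparkSession and print its version. This is a minimal sketch, not part of the notebook:

```python
from pyspark.sql import SparkSession

# Start a local SparkSession to confirm PySpark is installed correctly
spark = SparkSession.builder.appName("install_check").getOrCreate()
print(spark.version)  # should print 3.2.1 (or whichever Spark version you installed)
spark.stop()
```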
You can download the exact files used in this project here:
- Consumer_Complaints - CSV file used
- us-states-population - JSON file used
- US States Long_Lat - Parquet file used
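Once downloaded, each file can be loaded with the reader that matches its format and inspected before running the pipeline. The paths and extensions below are assumptions based on the names above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inspect_sources").getOrCreate()

# Load each source with the matching reader and print basic information
sources = {
    "Consumer_Complaints (CSV)": spark.read.csv("Consumer_Complaints.csv", header=True, inferSchema=True),
    "us-states-population (JSON)": spark.read.json("us-states-population.json"),
    "US States Long_Lat (Parquet)": spark.read.parquet("US States Long_Lat.parquet"),
}

for name, df in sources.items():
    print(name, "-", df.count(), "rows")
    df.printSchema()
```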