- Apache Spark: Batch-process and merge data from GCS buckets into BigQuery via Dataproc.
- BigQuery: Hosts the materialized tables with aggregated data.
- Dataproc: GCP's managed Spark service for batch processing from GCS buckets to BigQuery tables.
- GCS buckets: For storing datasets and Spark scripts.
- Mage: For orchestrating the ETL from API calls to GCS buckets.
- PySpark: Library for implementing the Spark scripts.
- Terraform: For creating the GCS buckets, BigQuery dataset, and Dataproc clusters.
An educational project to build an end-to-end pipeline for near real-time and batch processing of data and visualisation.
The project is designed to produce an analytical summary of how METAR weather reports for airports in European countries vary over the years.
The dataset consists of Meteorological Aerodrome Reports (METAR), which are observations of current surface weather reported in a standard international format.
Here is the Looker Studio demo of the analysis using a few stations: CA_BC_ASOS
Setup overview:
We will make API calls and save the data to GCS buckets in Parquet format, using the following file path: `<network_name>/<station_name>/<station_name>.parquet`.
Here is a sample API call:
https://mesonet.agron.iastate.edu/cgi-bin/request/asos.py?station=EPKK&data=all&year1=2023&month1=1&day1=1&year2=2023&month2=3&day2=26&tz=Etc%2FUTC&format=onlycomma&latlon=no&elev=no&missing=null&trace=T&direct=no&report_type=3&report_type=4
More details are in doc/Dataset.md.
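For illustration, a minimal Python sketch of this ingestion step (outside of Mage) might look like the following. The query parameters mirror the sample API call above; the bucket and network names are assumptions, and writing to `gs://` paths with pandas requires the `gcsfs` package.

```python
from io import StringIO

import pandas as pd
import requests

# Query parameters mirror the sample API call above (EPKK, Jan-Mar 2023).
params = [
    ("station", "EPKK"), ("data", "all"),
    ("year1", "2023"), ("month1", "1"), ("day1", "1"),
    ("year2", "2023"), ("month2", "3"), ("day2", "26"),
    ("tz", "Etc/UTC"), ("format", "onlycomma"),
    ("latlon", "no"), ("elev", "no"), ("missing", "null"),
    ("trace", "T"), ("direct", "no"),
    ("report_type", "3"), ("report_type", "4"),
]
resp = requests.get(
    "https://mesonet.agron.iastate.edu/cgi-bin/request/asos.py",
    params=params,
    timeout=300,
)
resp.raise_for_status()

df = pd.read_csv(StringIO(resp.text), na_values=["null"])

# Save in the <network_name>/<station_name>/<station_name>.parquet layout;
# the bucket and network names below are placeholders.
network, station, bucket = "PL__ASOS", "EPKK", "metar-data-lake"
df.to_parquet(f"gs://{bucket}/{network}/{station}/{station}.parquet", index=False)
```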
The project is developed in the cloud using scalable infrastructure provided by Google Cloud Platform. Infrastructure as Code (IaC) tools such as Terraform are utilized to provision and manage the cloud resources efficiently.
Data ingestion involves batch processing, where data is collected, processed, and uploaded to the data lake periodically and subsequently to the data warehouse. This ensures that the latest station observations are readily available for analysis.
An end-to-end pipeline is orchestrated using Mage to automate data workflows. This pipeline efficiently manages multiple steps in a Directed Acyclic Graph (DAG), ensuring seamless execution and reliability of the data processing tasks.
In this project, Google Cloud Storage is used as the data lake where the data is initially stored after ingestion from the source. Google BigQuery is used as the data warehouse, storing and optimizing structured data for analysis.
Data transformations are performed using Apache Spark via Dataproc. The transformation logic is defined in a PySpark script and executed as a job on a Dataproc cluster.
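As a rough sketch of what such a transformation could look like (column names follow the IEM ASOS export; the bucket, dataset, and table names are assumptions):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("metar-aggregation").getOrCreate()

# Read every station's Parquet files for a network from the data lake.
df = spark.read.parquet("gs://metar-data-lake/CA_BC_ASOS/*/*.parquet")

# Example aggregation: yearly averages of temperature and wind speed per station.
agg = (
    df.withColumn("year", F.year(F.to_timestamp("valid")))
      .groupBy("station", "year")
      .agg(
          F.avg("tmpf").alias("avg_temp_f"),
          F.avg("sknt").alias("avg_wind_knots"),
      )
)

# Write to BigQuery via the spark-bigquery connector available on Dataproc;
# a temporary GCS bucket is required for the indirect write method.
(
    agg.write.format("bigquery")
    .option("table", "metar_dataset.station_yearly_summary")
    .option("temporaryGcsBucket", "metar-temp-bucket")
    .mode("overwrite")
    .save()
)
```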
Finally, a dashboard is created using Looker Studio to visualize key insights derived from the processed data. The dashboard comprises tiles that highlight trends in the weather observations across stations and years.
Clone this repo: `git clone git@github.com:prantoran/METER-Weather-reports.git`.
- Set up a Google Cloud Platform account.
- Create a GCP project and set up a service account and authentication as per these instructions.
- Set up Terraform in the local environment. Check out the Terraform installation instructions here.
- See terraform/README.md for instructions on where to copy the service account JSON credentials.
- Rename the JSON to `service-acc-cred.json`.
- Create a `keys` folder in the root directory and copy the JSON into it.
Check out terraform/README.md.
Check out mage/README.md.
The contents of the Mage folder were copied from github.com/mage-ai/mage-zoomcamp. In addition, dynamic blocks are used to trigger multiple instances of child blocks based on the number of stations returned by the API calls (see the sketch below). After running `docker-compose up`, the Mage UI will be accessible at localhost:6789.
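As a sketch of the dynamic-block pattern, a data loader like the following could emit one entry per station so that Mage spawns a child block for each. The station-listing endpoint, network name, and field names are assumptions; the two-list return value (items plus per-child metadata) follows Mage's documented dynamic-block convention.

```python
import requests

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_stations(*args, **kwargs):
    # Fetch the list of stations for a network (endpoint and network name are illustrative).
    network = kwargs.get('network', 'CA_BC_ASOS')
    resp = requests.get(
        f'https://mesonet.agron.iastate.edu/geojson/network/{network}.geojson',
        timeout=60,
    )
    resp.raise_for_status()
    stations = [f['properties']['sid'] for f in resp.json()['features']]

    # Dynamic blocks return two lists: one data item per child block, plus
    # metadata giving each child a unique block_uuid.
    data = [{'network': network, 'station': s} for s in stations]
    metadata = [{'block_uuid': f'{network}_{s}'} for s in stations]
    return [data, metadata]
```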
Note: If there are existing Docker containers (e.g. Postgres) already running:
- Remove the existing containers:
docker stop $(docker ps -aq)
docker rm $(docker ps -aq)
- Stop the PostgreSQL service (e.g. on Ubuntu):
sudo systemctl stop postgresql
Note: Adjust the bucket paths in scripts/upload_pyspark_script_sql_to_gcs.sh and script/submit_dataproc_job.sh to match the config in terraform/dev.auto.tfvars.
./scripts/upload_pyspark_script_sql_to_gcs.sh
./script/submit_dataproc_job.sh
- The desired station network (e.g. `CA_BC_ASOS`) can be changed in script/submit_dataproc_job.sh.
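For reference, an equivalent job submission using the google-cloud-dataproc Python client (rather than the shell script) might look roughly like this; the project, region, cluster name, and GCS path are assumptions:

```python
from google.cloud import dataproc_v1

project_id = "my-gcp-project"    # assumption
region = "europe-west1"          # assumption
cluster_name = "metar-cluster"   # assumption

job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": cluster_name},
    "pyspark_job": {
        # Path where upload_pyspark_script_sql_to_gcs.sh is assumed to have put the script.
        "main_python_file_uri": "gs://metar-code-bucket/spark_job.py",
        "args": ["--network", "CA_BC_ASOS"],
    },
}

operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
response = operation.result()
print(f"Job finished: {response.reference.job_id}")
```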
Here is a sample Looker dashboard: CA_BC_ASOS.
- The plots are created using the generated BigQuery tables as the data source; these tables are produced by running the PySpark script as a job on the Dataproc cluster.
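To sanity-check the aggregated tables that back the dashboard, a quick query with the google-cloud-bigquery client could look like this; the project, dataset, table, and column names are assumptions matching the transformation sketch above:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # assumption

sql = """
    SELECT station, year, avg_temp_f
    FROM `my-gcp-project.metar_dataset.station_yearly_summary`
    ORDER BY station, year
    LIMIT 20
"""

# Print a small sample of the aggregated rows used by the Looker Studio tiles.
for row in client.query(sql).result():
    print(row.station, row.year, row.avg_temp_f)
```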