This project analyzes and visualizes FAANG (Facebook, Amazon, Apple, Netflix, Google) stock data, using a data pipeline to process historical stock data sourced from Kaggle. Apache Spark ingests the third-party data, applies initial processing, and loads it into a data lake. Apache Airflow and dbt then orchestrate the transformation and calculation steps, which check data sanity and compute metrics such as MACD and EMA20. The end goal is a dynamic dashboard that presents these financial metrics, showing stock performance trends and supporting informed investment decisions.
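For readers unfamiliar with the metrics named above, here is a minimal Python sketch of how EMA20 and MACD are commonly computed. It is not the project's dbt implementation. The 12/26/9 MACD periods are the conventional defaults and are an assumption here, as are the function names `ema` and `macd` and the practice of seeding the EMA with the first price.

```python
from typing import List, Sequence, Tuple


def ema(values: Sequence[float], span: int) -> List[float]:
    """Exponential moving average with smoothing factor 2 / (span + 1).

    Assumption: the series is seeded with its first value.
    """
    if span < 1:
        raise ValueError("span must be at least 1")
    if not values:
        return []
    alpha = 2.0 / (span + 1)
    result = [float(values[0])]
    for value in values[1:]:
        # Blend the new observation with the previous EMA value.
        result.append(alpha * value + (1.0 - alpha) * result[-1])
    return result


def macd(
    closes: Sequence[float], fast: int = 12, slow: int = 26, signal: int = 9
) -> Tuple[List[float], List[float], List[float]]:
    """Return (MACD line, signal line, histogram) for closing prices.

    Assumption: the conventional 12/26/9 periods.
    """
    fast_ema = ema(closes, fast)
    slow_ema = ema(closes, slow)
    macd_line = [f - s for f, s in zip(fast_ema, slow_ema)]
    signal_line = ema(macd_line, signal)
    histogram = [m - s for m, s in zip(macd_line, signal_line)]
    return macd_line, signal_line, histogram


# Made-up, steadily rising closing prices, for illustration only.
closes = [100.0 + i for i in range(40)]
ema20 = ema(closes, 20)
macd_line, signal_line, histogram = macd(closes)
```

On a steadily rising series the fast EMA lags less than the slow one, so the MACD line is positive. In the pipeline itself the equivalent logic would live in SQL models rather than Python.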
- Cloud: Google Cloud Platform (GCP)
- Data Ingestion: Apache Spark
- Data Lake Storage: Google Cloud Storage (GCS)
- Data Warehousing: BigQuery
- ETL/ELT Process: dbt (data build tool)
- Workflow Orchestration: Apache Airflow
- Analytics and Visualization: Looker
- Programming Languages: SQL, Python
- Version Control: Git
This diagram illustrates the flow of data from source to visualization, showcasing how each technology is utilized within the pipeline.
Before you begin setting up this project, ensure you have the following:
- A Google Cloud account with billing enabled.
- Access to Google Cloud services like BigQuery and Google Cloud Storage.
- Apache Spark and Apache Airflow installed either locally or in a cloud environment.
- Looker or another compatible visualization tool set up to connect to your BigQuery datasets.
Follow the instructions in Setup.md.
Here is the link.
Tests have been added to the dbt models. As further improvements, tests for Airflow and a CI/CD process should be added.
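The specific dbt tests are not listed here. As an illustration of the kind of row-level sanity checks such tests typically express, the sketch below does the same in plain Python. The column names (`ticker`, `date`, `high`, `low`, `close`, `volume`), the function `find_invalid_rows`, and the sample rows are all hypothetical and not taken from the project.

```python
from typing import Dict, List, Tuple

# Hypothetical columns that every stock row is assumed to carry.
REQUIRED_COLUMNS = ("ticker", "date", "high", "low", "close", "volume")


def find_invalid_rows(rows: List[Dict]) -> List[Tuple[int, str]]:
    """Return (row index, reason) pairs for rows that fail a sanity check."""
    problems = []
    for index, row in enumerate(rows):
        missing = [col for col in REQUIRED_COLUMNS if row.get(col) is None]
        if missing:
            problems.append((index, "missing: " + ", ".join(missing)))
            continue  # Remaining checks need all values present.
        if row["low"] > row["high"]:
            problems.append((index, "low exceeds high"))
        if row["close"] < 0 or row["volume"] < 0:
            problems.append((index, "negative close or volume"))
    return problems


# Made-up rows for illustration only.
rows = [
    {"ticker": "AAPL", "date": "2020-01-02", "high": 75.2, "low": 73.8,
     "close": 75.1, "volume": 1000},
    {"ticker": "AAPL", "date": "2020-01-03", "high": 74.0, "low": 75.0,
     "close": 74.4, "volume": 1200},
    {"ticker": "NFLX", "date": "2020-01-02", "high": 330.0, "low": 325.0,
     "close": None, "volume": 500},
]
problems = find_invalid_rows(rows)
```

In dbt, comparable checks would normally be declared as tests on the models rather than written as Python functions.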