layout | title | subtitle | categories | tags | |
---|---|---|---|---|---|
post |
Data Warehouse |
Replicate, transform, and analyze data about book sales and orders |
Data-Circle |
|
- What is the project content about? You will replicate, transform, and analyze data about book sales and orders.
- What is the problem we are solving with the project? Data warehouses are the backbone of company analytics. This project will familiarize you with the design and architecture of data warehouses. If you are interested in a data analyst, analytics engineer, or data engineer role, this project will give you an introduction to how these roles interact with data in production.
- How is the student going to work on the project? You will collaborate with others delivering different parts of the same project.
- What is the major challenge of the project? You will implement a data pipeline and a star schema data model, then use it for dashboards.
List the components / frameworks which the student is going to use / learn. Example below:
- concepts: data warehouses, Extract-Transform-Load, data modeling
- BigQuery - massively parallel processing (MPP) data warehouse
- airbyte - ETL tool
- dbt - transformation tool
- Metabase - BI tool
Add a few extra challenges for the more ambitious students.
- Create additional fact table:
orders
that summarizes all order line items and adds shipping costs. Users can analyze sales and shipping. - Create additional dimensional tables:
publishers
andlanguages
. Users can analyze sales per publisher and languages. - Create additional fact tables:
order_fulfillment
. Users can analyze orders by status.
How can the student get started with the project? Example:
- Fork boilerplate repo
- Set up the project with version control (e.g. Git)
- Read the README.md file and have a look around the project
- Get colors, fonts etc from the style-guide.md file
- Set up your project/file architecture however you want
- Start coding!
- (free book) The Analytics Setup Guidebook from Holistics.io: https://www.holistics.io/books/setup-analytics/start-here-introduction/
- (free course) dbt Fundamentals Course from dbt Labs: https://courses.getdbt.com/courses/fundamentals
- (blog series) Three-Part Data Modeling Series from airbyte: https://airbyte.com/blog/data-modeling-unsung-hero-data-engineering-introduction https://airbyte.com/blog/data-modeling-unsung-hero-data-engineering-approaches-and-techniques https://airbyte.com/blog/data-modeling-unsung-hero-data-engineering-architecture-pattern-tools
-
Airbyte: An open-source data integration platform that extracts and loads data from various sources.
-
Airflow: A platform to programmatically author, schedule, and monitor workflows.
-
BI (Business Intelligence): A technology-driven process for analyzing data and presenting actionable information to help executives, managers and other corporate end users make informed business decisions.
-
BigQuery: A web service from Google that is used for handling and analyzing big data.
-
Data Modelling: The process of creating a data model for an information system by applying certain formal techniques.
-
Data Stack: A collection of software tools, platforms, and languages to systematically manage and perform tasks with large sets of data.
Learning Resources
-
Data Warehouse: A technology that aggregates structured data from one or more sources so that it can be compared and analyzed for greater business intelligence.
-
Database Normalization: The process of structuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity.
Learning Resources
-
DBT (Data Build Tool): A tool that allows data analysts and engineers to transform data in their warehouses by simply writing select statements.
Learning Resources
-
Docker: An open-source platform that automates the deployment, scaling, and management of applications.
Learning Resources
-
Docker-compose: A tool for defining and managing multi-container Docker applications.
Learning Resources
-
ETL (Extract, Transform, Load): The process of extracting data from outside sources, transforming it to fit operational needs, then loading it into the end target.
Learning Resources
-
More details at Wikipedia
-
GCP (Google Cloud Platform): A suite of cloud computing services that runs on the same infrastructure that Google uses internally for its end-user products.
Learning Resources
-
Metabase: An open-source business intelligence tool that lets you ask questions about your data, and displays answers in formats that make sense, either as a chart, dashboard, or detailed data table.
Learning Resources
-
pgAdmin: An open-source, full-featured and web-based PostgreSQL management tool.
Learning Resources
-
Postgres: An open-source, object-relational database system that uses and extends the SQL language combined with many features that safely store and scale the most complicated data workloads.
Learning Resources
-
Star Schema: A relational database schema for representing multidimensional data.
Learning Resources
-
Superset: An open-source data exploration and visualization platform designed to be intuitive and interactive.
Learning Resources