This project is an independent educational resource and is not endorsed by Databricks, Inc. "Databricks" is a registered trademark of Databricks, Inc.
Follow me on LinkedIn for useful Databricks projects and tips. Training materials are also available on my website: dataengineer.wiki.
Welcome to the Apparel Retail 360 project! In today's data-driven world, retail companies rely on timely and accurate data to understand customer behavior, manage inventory, and optimize sales strategies. The goal of this project is to build a robust, multi-layered data processing pipeline that simulates this real-world challenge.
You will take on the role of a Data Engineer tasked with building an end-to-end analytics platform. Using Delta Live Tables, you will ingest raw data, progressively clean and transform it through a medallion architecture (Bronze, Silver, and Gold layers), and ultimately produce curated datasets ready for business intelligence and reporting.
Feel free to reach out to me if you have any questions. My contact details are available on dataengineer.wiki.
- Ingesting and processing continuous data streams.
- Applying and managing data quality expectations.
- Implementing a medallion architecture in DLT.
- Handling historical data changes using Slowly Changing Dimensions (SCD Type 2).
- Creating business-ready, aggregated tables for analytics.
- Databricks Free Edition
- Synthetic streaming data (generated by `dlt/data_generator.py`)
  - It imitates real-world data.
  - It generates 4 tables (a fact sales table and 3 lookup tables: stores, customers, products).
- DLT pipeline
  - You'll create a DLT pipeline to ingest, clean, and aggregate raw data.
  - The pipeline is organized into layers (a minimal code sketch of these layers appears below):
    - Bronze layer (`dlt/01_bronze.py`) - Raw data ingestion
    - Silver layer (`dlt/02A_silver.py`, `dlt/02B_silver.py`, `dlt/02C_silver.py`, `dlt/02D_silver.py`) - Cleaned and transformed data
    - Gold layer (`dlt/03_gold.py`) - Business-ready aggregations
- At the end, your pipeline will look like this:

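For orientation, here is a compact sketch of what a three-layer DLT pipeline can look like in Python. It is not the project's actual code: the table names, the volume path, and the columns (`sale_timestamp`, `quantity`, `unit_price`, `store_id`) are illustrative placeholders; the real definitions live in the `dlt/` task files.

```python
# A minimal, illustrative DLT medallion skeleton (placeholder names, path, and columns).
import dlt
from pyspark.sql import functions as F

RAW_PATH = "/Volumes/retail/raw/sales"  # hypothetical source volume

@dlt.table(comment="Bronze: raw sales records ingested as-is")
def sales_bronze():
    # Stream raw JSON files from the volume; the schema is inferred.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load(RAW_PATH)
    )

@dlt.table(comment="Silver: cleaned sales records")
@dlt.expect_or_drop("valid_quantity", "quantity > 0")
def sales_silver():
    # Read the bronze table as a stream and apply light cleaning.
    return (
        dlt.read_stream("sales_bronze")
        .withColumn("sale_date", F.to_date("sale_timestamp"))
    )

@dlt.table(comment="Gold: daily revenue per store")
def daily_store_revenue():
    # Aggregate the silver table into a business-ready metric.
    return (
        dlt.read("sales_silver")
        .groupBy("store_id", "sale_date")
        .agg(F.sum(F.col("quantity") * F.col("unit_price")).alias("revenue"))
    )
```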
- Intermediate Python (or SQL, if you prefer). The project focuses on using Python for the DLT pipeline.
- Intermediate Databricks knowledge. While you may be able to complete this project with junior-level experience, it may be more difficult to follow the tasks independently. If you need help, feel free to copy code snippets from the solution file (`final_code/final_dlt.py`) to run and observe the DLT pipeline.
Read `ProjectPlan.md`
The project is organized as follows:
This folder contains everything you need to complete the lab:
- Environment Setup:
  - `environment_setup.ipynb` - Notebook to set up Unity Catalog (catalogs, schemas, volumes)
  - `environment_maintenance.ipynb` - Maintenance utilities for the environment
  - `variables.py` - Configuration file for catalog names, paths, and other settings
- Data Generation:
  - `data_generator.py` - Synthetic data generator that creates realistic streaming data
- DLT Pipeline Files (Your Tasks):
  - `01_bronze.py` - Bronze layer: Raw data ingestion from source files
  - `02A_silver.py` - Silver layer: Sales data cleaning and transformation
  - `02B_silver.py` - Silver layer: Customer dimension with SCD Type 2
  - `02C_silver.py` - Silver layer: Product dimension with SCD Type 2
  - `02D_silver.py` - Silver layer: Store dimension with SCD Type 2 (a minimal SCD Type 2 sketch appears after this list)
  - `03_gold.py` - Gold layer: Business-ready aggregations and analytics

  Note: Each file contains tasks with requirements and a "Solution Is Below" section for reference.

  - `final_dlt.py` - Complete solution for the entire pipeline when you need a full reference

Learning Tip: Each task file (`dlt/0*.py`) includes solution code in a "Solution Is Below" section. Try solving tasks yourself first, then check the solution if needed!

- `README.md` - This file; provides a project overview
- `ProjectPlan.md` - Step-by-step instructions and tasks
- `SynteticDataGenerator.md` - Details about the data generator and schemas
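As a preview of the SCD Type 2 tasks (`02B_silver.py`–`02D_silver.py`), the usual DLT pattern combines `dlt.create_streaming_table` with `dlt.apply_changes` and `stored_as_scd_type=2`. The table names, key, and sequencing column below are placeholder assumptions, not the project's exact schema.

```python
# Illustrative SCD Type 2 pattern for a customer dimension (placeholder names).
import dlt

# Target streaming table that will hold the full change history.
dlt.create_streaming_table("customers_silver")

dlt.apply_changes(
    target="customers_silver",    # the streaming table created above
    source="customers_bronze",    # assumed upstream bronze feed
    keys=["customer_id"],         # business key identifying a customer
    sequence_by="updated_at",     # column that orders changes over time
    stored_as_scd_type=2,         # keep history rows instead of overwriting
)
```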
This hands-on project is designed to closely mirror the real-world skills and knowledge areas assessed in the Databricks Certified Data Engineer Associate exam. By completing this project, you will gain practical experience with the Databricks Data Intelligence Platform, Delta Live Tables (DLT), and the medallion architecture, all of which are core to the certification exam. Here’s how the project aligns with the exam outline:
Section 1: Databricks Intelligence Platform
- You will work directly in the Databricks workspace, learning to manage data layout, optimize query performance, and select appropriate compute resources for streaming and batch workloads.
- The project demonstrates the value of the Data Intelligence Platform by showing how it simplifies ETL, governance, and analytics.
Section 2: Development and Ingestion
- You will use Notebooks and Python scripts to develop and orchestrate data pipelines, similar to real Databricks workflows.
- The project’s raw data ingestion leverages Delta Lake and streaming, exposing you to Auto Loader-like patterns and troubleshooting data ingestion issues.
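For reference, an Auto Loader-style ingestion in Databricks reads files with the `cloudFiles` source; the path and schema hints below are assumptions for illustration, not the project's actual configuration.

```python
# Illustrative Auto Loader (cloudFiles) streaming read (placeholder path and hints).
raw_sales = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Schema hints steer inference toward the expected types; fields that do not
    # fit the inferred schema land in the _rescued_data column, which is useful
    # when troubleshooting ingestion issues.
    .option("cloudFiles.schemaHints", "quantity INT, unit_price DOUBLE")
    .load("/Volumes/apparel_retail/bronze/raw_files")
)
```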
Section 3: Data Processing & Transformations
- The pipeline implements the three layers of the Medallion Architecture (Bronze, Silver, Gold), giving you hands-on experience with their purposes and best practices.
- You will use DLT to build ETL pipelines, apply data quality expectations, and perform complex aggregations with PySpark DataFrames.
- The project covers DDL/DML operations and demonstrates how to manage schema evolution and data transformations.
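As a concrete reference for expectations, DLT offers three enforcement modes that differ only in how violating rows are handled. The rule names and conditions below are made-up examples, not the project's actual rules.

```python
# Illustrative DLT expectations showing the three enforcement modes (made-up rules).
import dlt

@dlt.table(comment="Sales rows that pass basic quality rules")
@dlt.expect("has_customer", "customer_id IS NOT NULL")       # track violations, keep rows
@dlt.expect_or_drop("positive_price", "unit_price > 0")      # drop violating rows
@dlt.expect_or_fail("known_store", "store_id IS NOT NULL")   # fail the update on violation
def sales_clean():
    return dlt.read_stream("sales_bronze")  # assumed upstream bronze table
```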
Section 4: Productionizing Data Pipelines
- You will learn about deploying and orchestrating pipelines, handling failures, and rerunning tasks, which are key for production workflows.
- The project encourages you to analyze Spark UI and optimize queries for performance.
- You will see the difference between serverless and cluster-based compute, and understand Databricks Asset Bundles (DAB) concepts through pipeline configuration.
Section 5: Data Governance & Quality
- The project uses Unity Catalog concepts (catalogs, schemas, volumes) and demonstrates the difference between managed and external tables.
- You will practice setting up permissions, understanding roles, and using data lineage features.
- The pipeline’s data quality checks and expectations prepare you for questions on governance, audit logging, and Delta Sharing.
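For orientation, the Unity Catalog objects the setup notebook creates (catalogs, schemas, volumes) come down to a few DDL statements; the object names below are placeholders, since the project takes its actual names from `variables.py`.

```python
# Illustrative Unity Catalog setup (placeholder object names).
spark.sql("CREATE CATALOG IF NOT EXISTS apparel_retail")
spark.sql("CREATE SCHEMA IF NOT EXISTS apparel_retail.bronze")
spark.sql("CREATE VOLUME IF NOT EXISTS apparel_retail.bronze.raw_files")
```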
Recommended Preparation
- This project complements Databricks Academy’s self-paced and instructor-led courses, providing the hands-on experience recommended for the exam.
- By following the project plan and checklist, you will cover all major exam topics, from ingestion to governance.
Exam Details
- 45 multiple-choice questions, 90 minutes, online proctored.
- No prerequisites, but hands-on experience (like this project) is highly recommended.
- For the latest exam guide and recommended training, visit the official Databricks certification page.