This project is an independent educational resource and is not endorsed by Databricks, Inc. "Databricks" is a registered trademark of Databricks, Inc.
Follow me on LinkedIn for useful Databricks projects and tips. Training materials are also available on my website: dataengineer.wiki.
Welcome to the Apparel Retail 360 project! In today's data-driven world, retail companies rely on timely and accurate data to understand customer behavior, manage inventory, and optimize sales strategies. The goal of this project is to build a robust, multi-layered data processing pipeline that simulates this real-world challenge.
You will take on the role of a Data Engineer tasked with building an end-to-end analytics platform. Using Delta Live Tables, you will ingest raw data, progressively clean and transform it through a medallion architecture (Bronze, Silver, and Gold layers), and ultimately produce curated datasets ready for business intelligence and reporting.
Feel free to reach out to me if you have any questions. My contact details are available on dataengineer.wiki.
- Ingesting and processing continuous data streams.
- Applying and managing data quality expectations.
- Implementing a medallion architecture in DLT.
- Handling historical data changes using Slowly Changing Dimensions (SCD Type 2).
- Creating business-ready, aggregated tables for analytics.
- Databricks Free Edition
- Synthetic streaming data (generated by `dlt/data_generator.py`)
  - It imitates real-world data.
  - It generates 4 tables (a fact sales table and 3 lookup tables: stores, customers, products).
- DLT pipeline
  - You'll create a DLT pipeline to ingest, clean, and aggregate raw data.
  - The pipeline is organized into layers (a minimal code sketch of these layers appears below):
    - Bronze layer (`dlt/01_bronze.py`) - Raw data ingestion
    - Silver layer (`dlt/02A_silver.py`, `dlt/02B_silver.py`, `dlt/02C_silver.py`, `dlt/02D_silver.py`) - Cleaned and transformed data
    - Gold layer (`dlt/03_gold.py`) - Business-ready aggregations
- At the end, your pipeline will look like this:

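For orientation, here is a compact sketch of what a three-layer DLT pipeline can look like in Python. It is not the project's actual code: the table names, the volume path, and the columns (`sale_timestamp`, `quantity`, `unit_price`, `store_id`) are illustrative placeholders; the real definitions live in the `dlt/` task files.

```python
# A minimal, illustrative DLT medallion skeleton (placeholder names, path, and columns).
import dlt
from pyspark.sql import functions as F

RAW_PATH = "/Volumes/retail/raw/sales"  # hypothetical source volume

@dlt.table(comment="Bronze: raw sales records ingested as-is")
def sales_bronze():
    # Stream raw JSON files from the volume; the schema is inferred.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load(RAW_PATH)
    )

@dlt.table(comment="Silver: cleaned sales records")
@dlt.expect_or_drop("valid_quantity", "quantity > 0")
def sales_silver():
    # Read the bronze table as a stream and apply light cleaning.
    return (
        dlt.read_stream("sales_bronze")
        .withColumn("sale_date", F.to_date("sale_timestamp"))
    )

@dlt.table(comment="Gold: daily revenue per store")
def daily_store_revenue():
    # Aggregate the silver table into a business-ready metric.
    return (
        dlt.read("sales_silver")
        .groupBy("store_id", "sale_date")
        .agg(F.sum(F.col("quantity") * F.col("unit_price")).alias("revenue"))
    )
```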
- Intermediate Python (or SQL, if you prefer). The project focuses on using Python for the DLT pipeline.
- Intermediate Databricks knowledge. While you may be able to complete this project with junior-level experience, it may be more difficult to follow the tasks independently. If you need help, feel free to copy code snippets from the solution file (`final_code/final_dlt.py`) to run and observe the DLT pipeline.
Read `ProjectPlan.md`
The project is organized as follows:
This folder contains everything you need to complete the lab:
- Environment Setup:
  - `environment_setup.ipynb` - Notebook to set up Unity Catalog (catalogs, schemas, volumes)
  - `environment_maintenance.ipynb` - Maintenance utilities for the environment
  - `variables.py` - Configuration file for catalog names, paths, and other settings
- Data Generation:
  - `data_generator.py` - Synthetic data generator that creates realistic streaming data
- DLT Pipeline Files (Your Tasks):
  - `01_bronze.py` - Bronze layer: Raw data ingestion from source files
  - `02A_silver.py` - Silver layer: Sales data cleaning and transformation
  - `02B_silver.py` - Silver layer: Customer dimension with SCD Type 2
  - `02C_silver.py` - Silver layer: Product dimension with SCD Type 2
  - `02D_silver.py` - Silver layer: Store dimension with SCD Type 2 (a minimal SCD Type 2 sketch appears after this list)
  - `03_gold.py` - Gold layer: Business-ready aggregations and analytics

  Note: Each file contains tasks with requirements and a "Solution Is Below" section for reference.

  - `final_dlt.py` - Complete solution for the entire pipeline when you need a full reference

Learning Tip: Each task file (`dlt/0*.py`) includes solution code in a "Solution Is Below" section. Try solving tasks yourself first, then check the solution if needed!

- `README.md` - This file; provides a project overview
- `ProjectPlan.md` - Step-by-step instructions and tasks
- `SynteticDataGenerator.md` - Details about the data generator and schemas
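As a preview of the SCD Type 2 tasks (`02B_silver.py`–`02D_silver.py`), the usual DLT pattern combines `dlt.create_streaming_table` with `dlt.apply_changes` and `stored_as_scd_type=2`. The table names, key, and sequencing column below are placeholder assumptions, not the project's exact schema.

```python
# Illustrative SCD Type 2 pattern for a customer dimension (placeholder names).
import dlt

# Target streaming table that will hold the full change history.
dlt.create_streaming_table("customers_silver")

dlt.apply_changes(
    target="customers_silver",    # the streaming table created above
    source="customers_bronze",    # assumed upstream bronze feed
    keys=["customer_id"],         # business key identifying a customer
    sequence_by="updated_at",     # column that orders changes over time
    stored_as_scd_type=2,         # keep history rows instead of overwriting
)
```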
This hands-on project is designed to closely mirror the real-world skills and knowledge areas assessed in the Databricks Certified Data Engineer Associate exam. By completing this project, you will gain practical experience with the Databricks Data Intelligence Platform, Delta Live Tables (DLT), and the medallion architecture, all of which are core to the certification exam. Here’s how the project aligns with the exam outline:
Section 1: Databricks Intelligence Platform
- You will work directly in the Databricks workspace, learning to manage data layout, optimize query performance, and select appropriate compute resources for streaming and batch workloads.
- The project demonstrates the value of the Data Intelligence Platform by showing how it simplifies ETL, governance, and analytics.
Section 2: Development and Ingestion
- You will use Notebooks and Python scripts to develop and orchestrate data pipelines, similar to real Databricks workflows.
- The project’s raw data ingestion leverages Delta Lake and streaming, exposing you to Auto Loader-like patterns and troubleshooting data ingestion issues.
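For reference, an Auto Loader-style ingestion in Databricks reads files with the `cloudFiles` source; the path and schema hints below are assumptions for illustration, not the project's actual configuration.

```python
# Illustrative Auto Loader (cloudFiles) streaming read (placeholder path and hints).
raw_sales = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Schema hints steer inference toward the expected types; fields that do not
    # fit the inferred schema land in the _rescued_data column, which is useful
    # when troubleshooting ingestion issues.
    .option("cloudFiles.schemaHints", "quantity INT, unit_price DOUBLE")
    .load("/Volumes/apparel_retail/bronze/raw_files")
)
```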
Section 3: Data Processing & Transformations
- The pipeline implements the three layers of the Medallion Architecture (Bronze, Silver, Gold), giving you hands-on experience with their purposes and best practices.
- You will use DLT to build ETL pipelines, apply data quality expectations, and perform complex aggregations with PySpark DataFrames.
- The project covers DDL/DML operations and demonstrates how to manage schema evolution and data transformations.
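As a concrete reference for expectations, DLT offers three enforcement modes that differ only in how violating rows are handled. The rule names and conditions below are made-up examples, not the project's actual rules.

```python
# Illustrative DLT expectations showing the three enforcement modes (made-up rules).
import dlt

@dlt.table(comment="Sales rows that pass basic quality rules")
@dlt.expect("has_customer", "customer_id IS NOT NULL")       # track violations, keep rows
@dlt.expect_or_drop("positive_price", "unit_price > 0")      # drop violating rows
@dlt.expect_or_fail("known_store", "store_id IS NOT NULL")   # fail the update on violation
def sales_clean():
    return dlt.read_stream("sales_bronze")  # assumed upstream bronze table
```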
Section 4: Productionizing Data Pipelines
- You will learn about deploying and orchestrating pipelines, handling failures, and rerunning tasks, which are key for production workflows.
- The project encourages you to analyze Spark UI and optimize queries for performance.
- You will see the difference between serverless and cluster-based compute, and understand Databricks Asset Bundles (DAB) concepts through pipeline configuration.
Section 5: Data Governance & Quality
- The project uses Unity Catalog concepts (catalogs, schemas, volumes) and demonstrates the difference between managed and external tables.
- You will practice setting up permissions, understanding roles, and using data lineage features.
- The pipeline’s data quality checks and expectations prepare you for questions on governance, audit logging, and Delta Sharing.
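For orientation, the Unity Catalog objects the setup notebook creates (catalogs, schemas, volumes) come down to a few DDL statements; the object names below are placeholders, since the project takes its actual names from `variables.py`.

```python
# Illustrative Unity Catalog setup (placeholder object names).
spark.sql("CREATE CATALOG IF NOT EXISTS apparel_retail")
spark.sql("CREATE SCHEMA IF NOT EXISTS apparel_retail.bronze")
spark.sql("CREATE VOLUME IF NOT EXISTS apparel_retail.bronze.raw_files")
```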
Recommended Preparation
- This project complements Databricks Academy’s self-paced and instructor-led courses, providing the hands-on experience recommended for the exam.
- By following the project plan and checklist, you will cover all major exam topics, from ingestion to governance.
Exam Details
- 45 multiple-choice questions, 90 minutes, online proctored.
- No prerequisites, but hands-on experience (like this project) is highly recommended.
- For the latest exam guide and recommended training, visit the official Databricks certification page.