This project is part of a data engineering solution for Tasty Truck Treats (T3), a catering company specializing in food trucks. The objective is to build a data pipeline that collects, processes, and analyses transaction-level data from each truck, enabling T3 to make data-driven decisions to optimise their business operations.
The extract script connects to T3’s S3 bucket to retrieve all relevant data files. The bucket contains:
- Historical transaction data for each truck in `.parquet` format within the `historical/` folder.
- Metadata about each truck in an `.xlsx` file within the `metadata/` folder.
Steps in the Extract Script:
- Initialises an S3 client using credentials from an `.env` file.
- Downloads transaction data (`.parquet` files) from the `historical/` folder.
- Downloads metadata (an `.xlsx` file) from the `metadata/` folder.
- Filters files based on file extension to ensure only `.parquet` or `.xlsx` files are downloaded.
Script Location: `extract_data.py`
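The script itself is the source of truth; as a rough sketch, the core download logic might look something like the following, where the environment variable names come from the setup section below and the function name, local download paths, and single-page listing are illustrative assumptions:

```python
"""Sketch of the extract step: download .parquet and .xlsx files from S3."""
import os

import boto3
from dotenv import load_dotenv


def download_s3_files(prefix: str, allowed_extensions: tuple[str, ...]) -> None:
    """Download files under a given S3 prefix, keeping only allowed extensions."""
    load_dotenv()
    s3 = boto3.client(
        "s3",
        aws_access_key_id=os.environ["ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["SECRET_ACCESS_KEY"],
    )
    bucket = os.environ["BUCKET"]
    # First page of results only, which is enough for a sketch.
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    for obj in response.get("Contents", []):
        key = obj["Key"]
        if key.endswith(allowed_extensions):
            s3.download_file(bucket, key, os.path.basename(key))


if __name__ == "__main__":
    download_s3_files("historical/", (".parquet",))  # truck transaction data
    download_s3_files("metadata/", (".xlsx",))       # truck metadata
```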
The transform script loads the downloaded transaction data files and prepares them for analysis. Specifically, it:
- Combines all transaction data files into a single CSV file to facilitate cross-truck analysis.
- Extracts the truck ID from each file’s name and adds it as a new column in the data.
- Cleans the data by:
  - Removing rows where `total` is zero, blank, NULL, or marked as "VOID".
  - Converting columns to appropriate data types (e.g., `total` to numeric, `timestamp` to datetime).
  - Dropping any unnecessary columns.
The resulting data is saved to `data/cleaned_combined_transactions.csv`, a consolidated and cleaned dataset ready for analysis.
Script Location: `transform_data.py`
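A condensed, non-authoritative sketch of this cleaning logic is shown below; the input directory, the filename convention used to derive the truck ID, and the exact column names are assumptions for illustration:

```python
"""Sketch of the transform step: combine, clean, and save the transaction data."""
from pathlib import Path

import pandas as pd


def transform(input_dir: str = "data",
              output_path: str = "data/cleaned_combined_transactions.csv") -> None:
    frames = []
    for parquet_file in Path(input_dir).glob("*.parquet"):
        df = pd.read_parquet(parquet_file)
        # Assumes the truck ID is the trailing number in the filename,
        # e.g. "TRUCK_DATA_3.parquet" -> truck_id "3".
        df["truck_id"] = parquet_file.stem.split("_")[-1]
        frames.append(df)

    combined = pd.concat(frames, ignore_index=True)

    # Convert to appropriate types; blanks, NULLs, and "VOID" become NaN/NaT.
    combined["total"] = pd.to_numeric(combined["total"], errors="coerce")
    combined["timestamp"] = pd.to_datetime(combined["timestamp"], errors="coerce")

    # Drop rows where total or timestamp is missing, or total is zero.
    combined = combined.dropna(subset=["total", "timestamp"])
    combined = combined[combined["total"] != 0]

    combined.to_csv(output_path, index=False)
```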
The load script uploads the cleaned and transformed data into a Redshift database for further analysis and reporting. It:
- Connects to the Redshift database using credentials in the `.env` file.
- Inserts data from the combined CSV file into the `FACT_Transaction` table in Redshift.
Script Location: `load.py`
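A minimal sketch of the load step is shown below, assuming the `FACT_Transaction` table has `timestamp`, `total`, `type` (payment type), and `truck_id` columns; the real table definition may differ:

```python
"""Sketch of the load step: insert the cleaned CSV into Redshift."""
import os

import pandas as pd
import redshift_connector
from dotenv import load_dotenv


def load_to_redshift(csv_path: str = "data/cleaned_combined_transactions.csv") -> None:
    load_dotenv()
    df = pd.read_csv(csv_path)
    schema = os.environ["SCHEMA"]

    conn = redshift_connector.connect(
        host=os.environ["HOST"],
        port=int(os.environ["PORT"]),
        database=os.environ["DATABASE_NAME"],
        user=os.environ["USERNAME"],
        password=os.environ["PASSWORD"],
    )
    cursor = conn.cursor()
    # Insert each cleaned row; the column list here is an assumed schema.
    cursor.executemany(
        f"INSERT INTO {schema}.FACT_Transaction (timestamp, total, type, truck_id) "
        "VALUES (%s, %s, %s, %s);",
        df[["timestamp", "total", "type", "truck_id"]].values.tolist(),
    )
    conn.commit()
    conn.close()
```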
The `etl_pipeline.py` script orchestrates the full ETL process by running the extract, transform, and load steps in sequence. This script:
- Uses the functions in `extract_data.py`, `transform_data.py`, and `load.py`.
- Ensures that data is extracted from S3, transformed for consistency, and loaded into the Redshift database in one seamless process.
Script Location: `etl_pipeline.py`
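The orchestration itself is short; a sketch is below, assuming each script exposes a single entry-point function (the function names `extract`, `transform`, and `load_to_redshift` are illustrative, not necessarily the actual names):

```python
"""Sketch of the pipeline orchestration."""
from extract_data import extract
from transform_data import transform
from load import load_to_redshift


def run_pipeline() -> None:
    extract()            # download .parquet and .xlsx files from S3
    transform()          # combine and clean into a single CSV
    load_to_redshift()   # insert the cleaned data into Redshift


if __name__ == "__main__":
    run_pipeline()
```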
The data exploration notebook (`data_analysis.ipynb`) loads the cleaned transaction data and performs initial analysis to gain insights into T3’s operations. The notebook answers the following key questions:
- Which truck has the highest number of transactions?
- Which truck has the lowest total transaction value?
- What is the average transaction value?
- What is the average transaction value for each truck?
- What proportion of transactions use cash versus card?
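Each of these questions reduces to a short pandas aggregation; for example (column names such as `truck_id` and the payment-type column `type` are assumptions about the cleaned CSV):

```python
import pandas as pd

df = pd.read_csv("data/cleaned_combined_transactions.csv")

transactions_per_truck = df["truck_id"].value_counts()            # highest transaction count
total_value_per_truck = df.groupby("truck_id")["total"].sum()     # lowest total transaction value
average_transaction_value = df["total"].mean()                    # overall average
average_value_per_truck = df.groupby("truck_id")["total"].mean()  # average per truck
payment_split = df["type"].value_counts(normalize=True)           # cash vs card proportion
```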
The notebook also includes visualisations created using the Vega-Altair library:
- Bar Chart: Displays the average transaction total for each truck.
- Pie Chart: Shows the proportion of transactions made using cash or card.
- Line Chart: Illustrates daily total transaction values to identify trends.
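As an illustration of the Vega-Altair approach, the bar chart could be built roughly like this (again, the column names are assumptions):

```python
import altair as alt
import pandas as pd

df = pd.read_csv("data/cleaned_combined_transactions.csv")

# Altair embeds at most 5,000 rows by default; lift that limit for larger datasets.
alt.data_transformers.disable_max_rows()

# Average transaction total per truck, aggregated by Altair itself.
bar_chart = (
    alt.Chart(df)
    .mark_bar()
    .encode(
        x=alt.X("truck_id:N", title="Truck"),
        y=alt.Y("mean(total):Q", title="Average transaction total"),
    )
    .properties(title="Average Transaction Total per Truck")
)
bar_chart
```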
The T3 Food Trucks Dashboard provides an interactive way to explore transaction data for T3’s food trucks. Built with Streamlit, it includes:
- Visualisations of total revenue, average transaction value, transaction volume by truck, peak transaction times, and payment method distribution.
- Sidebar filters for date range, truck selection, and payment type.
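A heavily reduced sketch of the dashboard's filter-then-display pattern is shown below; the widget labels and column names are illustrative, and the date-range filter is omitted for brevity:

```python
import pandas as pd
import streamlit as st

# Load the cleaned transaction data produced by the transform step.
df = pd.read_csv("data/cleaned_combined_transactions.csv", parse_dates=["timestamp"])

# Sidebar filters for truck selection and payment type.
trucks = st.sidebar.multiselect("Truck", sorted(df["truck_id"].unique()))
payment = st.sidebar.multiselect("Payment type", sorted(df["type"].unique()))

filtered = df.copy()
if trucks:
    filtered = filtered[filtered["truck_id"].isin(trucks)]
if payment:
    filtered = filtered[filtered["type"].isin(payment)]

# Headline metrics and a simple transaction-volume chart.
st.metric("Total revenue", f"{filtered['total'].sum():,.2f}")
st.metric("Average transaction value", f"{filtered['total'].mean():,.2f}")
st.bar_chart(filtered["truck_id"].value_counts())
```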
To make deployment easier, the ETL pipeline is containerised with Docker. Docker allows for consistent environments and easy deployment on ECS or other cloud platforms.
Dockerfile for ETL Pipeline (`Dockerfile`):
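The `Dockerfile` in the repository is the source of truth; a minimal sketch, assuming a slim Python base image and `etl_pipeline.py` as the entry point, might look like:

```dockerfile
# Illustrative sketch only: base image, copy strategy, and entry point are assumptions.
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so Docker can cache this layer.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the ETL scripts and run the full pipeline on container start.
COPY . .
CMD ["python", "etl_pipeline.py"]
```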
To build and run the Docker container locally:
docker build -t t3-etl-pipeline .
docker run --env-file .env t3-etl-pipeline
Based on the analysis, the following recommendations have been made to help T3 achieve its business objectives:
- Introduce Dynamic Pricing Based on Demand: Analyse peak times, locations, and customer demographics to implement variable pricing strategies, potentially raising prices slightly during high-demand periods or locations.
- Focus on High-Performing Trucks for Marketing: Use insights from high-performing trucks to boost profitability and apply successful strategies to other trucks.
- Experiment with Daily Promotions Based on Revenue Trends: Implement promotions on days with lower sales to drive traffic and stabilise weekly revenue.
- Clone the Repository:
  - Clone this repository to your local machine:

    git clone https://github.com/ebradley12/Coursework-Data-Engineering-Week-3.git
    cd dashboard
- Set Up Environment Variables:
  - Create a `.env` file in the root directory with the following variables to securely connect to your AWS and Redshift instances:

    ACCESS_KEY_ID=<your-access-key-id>
    SECRET_ACCESS_KEY=<your-secret-access-key>
    BUCKET=<your-s3-bucket-name>
    HOST=<your-redshift-host>
    PORT=<your-redshift-port>
    USERNAME=<your-redshift-username>
    PASSWORD=<your-redshift-password>
    DATABASE_NAME=<your-database-name>
    SCHEMA=<your-database-schema>
- Install Dependencies:
  - Install all required dependencies from `requirements.txt`:

    pip install -r requirements.txt
- Run the Extract Script: `extract_data.py`
  - Ensure your AWS credentials are in an `.env` file in the root directory.
  - Execute the script to download data files from the S3 bucket.
- Run the Transform Script: `transform_data.py`
  - This script combines, cleans, and saves transaction data to a single CSV file, ready for analysis.
- Run the Load Script: `load.py`
  - Use this script to upload the cleaned and combined transaction data to your Redshift database.
- Run the ETL Pipeline Script: `etl_pipeline.py`
  - Alternatively, you can execute the full ETL pipeline, which runs the extract, transform, and load steps sequentially.
- Explore Data in the Notebook: `data_analysis.ipynb`
  - Open the notebook in Jupyter and follow the steps to analyse and visualise the data.
  - Review the recommendations section to gain insights for T3’s business strategy.
- Python 3+
- AWS S3 Access: Ensure your `.env` file contains `ACCESS_KEY_ID` and `SECRET_ACCESS_KEY`.
- Libraries: `pandas`, `boto3`, `dotenv`, `redshift_connector`, `streamlit`, `altair`
Install required packages with:
pip3 install pandas boto3 python-dotenv redshift_connector streamlit altair