This project implements a complete ETL (Extract-Transform-Load) pipeline in Bash and AWK, designed to clean and prepare delivery dataset files, convert them to JSON, and optionally visualize them in Amazon QuickSight (QS) using a manifest.
This project processes raw delivery data (dataset.csv) through:
- Extraction from remote source (Kaggle)
- Transformation into a clean, validated dataset
- Loading to an AWS S3 bucket
- Visualization via Amazon QuickSight with a prepared manifest.json
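To give a feel for the transformation stage, here is a minimal sketch of cleaning a CSV and converting it to JSON with AWK. It is not the code in 2_transform.sh: the function name `csv_to_json` is hypothetical, and it assumes a simple CSV with a header row, no quoted fields or embedded commas, and missing values written as an empty string or `NaN`.

```shell
#!/usr/bin/env bash
# Sketch only: clean a simple CSV read from stdin and print a JSON array.
# Assumptions (not taken from 2_transform.sh): header row present, no quoted
# fields, missing values appear as "" or "NaN".
csv_to_json() {
  awk -F',' '
    # Strip leading/trailing spaces, tabs and carriage returns.
    function trim(s) { gsub(/^[ \t\r]+|[ \t\r]+$/, "", s); return s }

    NR == 1 {                      # header row: remember the column names
      ncols = NF
      for (i = 1; i <= NF; i++) name[i] = trim($i)
      next
    }
    {
      if (NF != ncols) next        # validation: wrong number of columns
      for (i = 1; i <= NF; i++) {
        val[i] = trim($i)
        if (val[i] == "" || val[i] == "NaN") next   # validation: missing value
      }
      row = ""
      for (i = 1; i <= ncols; i++) {
        v = val[i]
        # Quote everything that is not a plain JSON number.
        if (v !~ /^-?(0|[1-9][0-9]*)(\.[0-9]+)?$/) v = "\"" v "\""
        row = row (i > 1 ? "," : "") "\"" name[i] "\":" v
      }
      printf "%s{%s}", (kept++ ? ",\n" : "[\n"), row
    }
    END { print (kept ? "\n]" : "[]") }
  '
}
```

With the layout shown in this README, it could be tried from the scripts/ folder as `csv_to_json < ../data/raw/dataset.csv > /tmp/dataset.json`.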
FD_Bash_ETL/
├── assets/                         # Supporting files
│   ├── delivery_dashboard.pdf      # Exported QuickSight dashboard
│   ├── manifest.json               # Used to load JSON into QuickSight
│   ├── sample_output_console.png
│   └── sample_result_aws_s3.png
├── data/
│   ├── processed/                  # Cleaned output
│   │   ├── dataset.json
│   │   └── dataset.csv
│   └── raw/                        # Original raw input
│       └── dataset.csv
├── scripts/
│   ├── 1_extract.sh                # Downloads & unzips dataset
│   ├── 2_transform.sh              # Cleans & validates data using AWK
│   ├── 3_load.sh                   # Uploads cleaned files to S3
│   └── main.sh                     # Runs full ETL pipeline
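As a rough illustration of how a runner like main.sh could chain the three stage scripts listed above, the sketch below calls them in order and stops at the first failure. The function name `run_pipeline` and the log messages are hypothetical; the real main.sh may be organized differently.

```shell
#!/usr/bin/env bash
# Sketch only: run the three stage scripts in order, aborting on the first
# failure. The function name and messages are illustrative, not from main.sh.
run_pipeline() {
  local scripts_dir="$1" stage
  for stage in 1_extract.sh 2_transform.sh 3_load.sh; do
    echo "==> Running ${stage}"
    if ! bash "${scripts_dir}/${stage}"; then
      echo "Stage ${stage} failed; aborting." >&2
      return 1
    fi
  done
  echo "ETL pipeline finished."
}
```

Called as `run_pipeline .` from inside scripts/, this would run extraction, transformation, and loading in sequence, so a failed download never leads to an upload of stale files.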
Run all stages in order with:
cd scripts && ./main.sh
- 🪣 S3 Upload
Ensure you've configured the AWS CLI:
aws configure
Your cleaned dataset will be uploaded to:
- s3://<your-bucket-name>/<your-bucket-path>/dataset.csv
- s3://<your-bucket-name>/<your-bucket-path>/dataset.json
where <your-bucket-name> and <your-bucket-path> are constants defined inside 3_load.sh, alongside the script's other settings.
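The upload itself can be sketched with `aws s3 cp`. The variable names `BUCKET_NAME`, `BUCKET_PATH`, `PROCESSED_DIR`, and `DRY_RUN`, as well as the function name `upload_processed`, are assumptions made for this example; the constants in 3_load.sh may be named differently.

```shell
#!/usr/bin/env bash
# Sketch only: copy both cleaned files to S3.
# Hypothetical constants; the names used in 3_load.sh may differ.
BUCKET_NAME="${BUCKET_NAME:-your-bucket-name}"
BUCKET_PATH="${BUCKET_PATH:-your-bucket-path}"
PROCESSED_DIR="${PROCESSED_DIR:-../data/processed}"

upload_processed() {
  local file dest
  for file in dataset.csv dataset.json; do
    dest="s3://${BUCKET_NAME}/${BUCKET_PATH}/${file}"
    if [ "${DRY_RUN:-0}" = "1" ]; then
      # Print the command instead of running it (no AWS access needed).
      echo "aws s3 cp ${PROCESSED_DIR}/${file} ${dest}"
    else
      aws s3 cp "${PROCESSED_DIR}/${file}" "${dest}" || return 1
    fi
  done
}
```

Running `DRY_RUN=1 upload_processed` first prints the two copy commands, which is a cheap way to confirm the destination paths before anything touches the bucket.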
- 📊 Amazon QuickSight
To visualize the data in QuickSight:
- Use assets/manifest.json for importing the JSON dataset.
- In QuickSight, go to New Dataset → S3 → Upload a manifest file.
- Ensure your S3 bucket permissions allow QuickSight to access the data.
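For orientation, a QuickSight S3 manifest for a single JSON file typically lists the file under `fileLocations` and sets the format under `globalUploadSettings`. The sketch below generates such a file from the shell; the function name `write_manifest` is hypothetical, and the actual assets/manifest.json in this repository may contain different or additional settings.

```shell
#!/usr/bin/env bash
# Sketch only: print a minimal QuickSight manifest for one JSON file in S3.
# Arguments: bucket name, path inside the bucket (both supplied by the caller).
write_manifest() {
  local bucket="$1" path="$2"
  cat <<EOF
{
  "fileLocations": [
    { "URIs": ["s3://${bucket}/${path}/dataset.json"] }
  ],
  "globalUploadSettings": {
    "format": "JSON"
  }
}
EOF
}
```

For example, `write_manifest <your-bucket-name> <your-bucket-path> > manifest.json` writes a file that can be uploaded in the manifest step above.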
Browse visual assets in the assets/ folder:
- 📊 delivery_dashboard.pdf: Full dashboard export
- ✅ sample_result_aws_s3.png: AWS upload confirmation
- 🧪 sample_output_console.png: CLI output of transformation stage
After running the full ETL pipeline and visualizing the data in Amazon QuickSight, several meaningful patterns started to emerge from the delivery dataset:
- Semi-Urban areas had the slowest delivery times, likely due to longer routes or less optimized infrastructure compared to urban centers.
- The dataset covers over 38,000 successful deliveries — all cleaned and validated through our Bash + AWK pipeline.
- 9 PM turned out to be the busiest hour for deliveries, reflecting a typical late dinner or snack-time spike.
- Deliveries were noticeably slower on foggy days, which isn’t too surprising — visibility and road conditions play a big role.
- The worst-case scenario? Fog combined with traffic jams. It had a dramatic impact on delivery times.
- During festival periods, delivery times increased.
- Also, the more deliveries a person handled at once, the longer each one took — as expected.
- There were consistent order spikes every week, and an unusual jump in early March 2022 — possibly tied to a local event or holiday.
- Electric scooters seemed to perform slightly better than other vehicles — maybe because they’re easier to maneuver in traffic-heavy areas.
- The best-rated delivery workers tended to fall within the 25–35 age range, possibly reflecting a balance of experience and energy.
- By mapping the locations, we spotted dense clusters of restaurant origins and delivery zones — showing us the busiest areas and outer edges where delays were more common.
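Group-level findings like the ones above can be spot-checked outside QuickSight with a few lines of AWK. The sketch below averages one numeric column per value of another column; the function name `mean_by` is hypothetical, the column names are whatever the caller passes in, and the CSV is assumed to be simple (header row, no quoted fields).

```shell
#!/usr/bin/env bash
# Sketch only: mean of a numeric column grouped by another column.
# Usage: mean_by <group-column> <value-column> < file.csv
mean_by() {
  awk -F',' -v g="$1" -v v="$2" '
    NR == 1 {                      # locate both columns by header name
      for (i = 1; i <= NF; i++) { if ($i == g) gi = i; if ($i == v) vi = i }
      if (!gi || !vi) { print "column not found" > "/dev/stderr"; bad = 1; exit 1 }
      next
    }
    { sum[$gi] += $vi; n[$gi]++ }  # accumulate per group
    END {
      if (bad) exit 1
      for (k in sum) printf "%s,%.2f\n", k, sum[k] / n[k]
    }
  ' | sort
}
```

Run against data/processed/dataset.csv with the dataset's own area and delivery-time column names, this prints one `group,mean` line per area type, which makes it easy to compare against the dashboard.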
This project is licensed under the MIT License.
