Skip to content

Latest commit

 

History

History
93 lines (63 loc) · 2.32 KB

README.md

File metadata and controls

93 lines (63 loc) · 2.32 KB

Data Engineering Sandbox

The Data Engineering Sandbox is an environment designed for experimentation with data engineering tools. It provides a pre-configured setup for running data pipelines, working with databases, and performing data transformations.

Features

  • Easy setup and initialization using Python 3.11 and Poetry.
  • Local environment variables management with direnv.
  • Convenient task execution with the just command.
  • Container management with nerdctl (a drop-in replacement for docker-compose).

Prerequisites

Make sure you have the following dependencies installed on your system:

Getting Started

Follow the steps below to set up the Data Engineering Sandbox:

  1. Clone this repository:
git clone <repository-url>
cd data-engineering-sandbox
  1. Install the project dependencies using Poetry:
poetry install
  1. Create a .envrc file in the project root directory and define your local environment variables. Example:
export POSTGRES_USER="myuser"
export POSTGRES_PASSWORD=mypassword
export POSTGRES_DB=mydatabase
  1. Enable direnv to load the environment variables automatically:
direnv allow
  1. Execute tasks using the just command. Some available tasks include:
  • Start containers: just start-postgres
  • Load data into databases: just load-sample-data postgres
  • Clean up containers: just clean
  • List all commands available: just

Datasets sources

Datasets comes from this repo https://github.com/neelabalan/mongodb-sample-dataset.

In your ./data folder, you should have

  • sample_airbnb
  • sample_analytics
  • sample_geospatial
  • sample_mflix
  • sample_supplies
  • sample_weatherdata

Note: json file are in a special format for mongodb import. So you need to transform it to load them using pandas.

Loading sample data in a database

Loading it in a postgres database

just start-postgres
just load-sample-data postgres --without airbnb
just clean

Loading it in a mongodb

just start-mongo
just load-sample-data mongo
just clean