Laurans/data-engineering-sandbox

Data Engineering Sandbox

The Data Engineering Sandbox is an environment designed for experimentation with data engineering tools. It provides a pre-configured setup for running data pipelines, working with databases, and performing data transformations.

Features

  • Easy setup and initialization using Python 3.11 and Poetry.
  • Local environment variables management with direnv.
  • Convenient task execution with the just command.
  • Container management with nerdctl (a Docker-compatible CLI for containerd that also handles Compose files through nerdctl compose).

Prerequisites

Make sure you have the following dependencies installed on your system:

  • Python 3.11
  • Poetry
  • direnv
  • just
  • nerdctl

Getting Started

Follow the steps below to set up the Data Engineering Sandbox:

  1. Clone this repository:
     git clone <repository-url>
     cd data-engineering-sandbox
  2. Install the project dependencies using Poetry:
     poetry install
  3. Create a .envrc file in the project root directory and define your local environment variables. Example:
     export POSTGRES_USER=myuser
     export POSTGRES_PASSWORD=mypassword
     export POSTGRES_DB=mydatabase
  4. Enable direnv to load the environment variables automatically:
     direnv allow
  5. Execute tasks using the just command. Some available tasks include:
     • Start containers: just start-postgres
     • Load data into databases: just load-sample-data postgres
     • Clean up containers: just clean
     • List all available commands: just
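
Once direnv has exported the variables from .envrc, Python code run inside the project directory can read them from the environment. The sketch below builds a PostgreSQL connection URL from them. The function name postgres_url and the default host localhost and port 5432 are assumptions for illustration; the repository's own scripts may connect differently.

```python
import os
from collections.abc import Mapping
from urllib.parse import quote


def postgres_url(
    env: Mapping[str, str] = os.environ,
    host: str = "localhost",  # assumed host, not taken from the repository
    port: int = 5432,         # assumed port (the PostgreSQL default)
) -> str:
    """Build a connection URL from the variables defined in .envrc."""
    # Percent-encode each part so special characters in a password
    # do not break the URL structure.
    user = quote(env["POSTGRES_USER"], safe="")
    password = quote(env["POSTGRES_PASSWORD"], safe="")
    database = quote(env["POSTGRES_DB"], safe="")
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"
```

A missing variable raises KeyError, which is usually a sign that direnv allow has not been run in the current directory.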

Dataset sources

The datasets come from this repository: https://github.com/neelabalan/mongodb-sample-dataset.

Your ./data folder should contain the following directories:

  • sample_airbnb
  • sample_analytics
  • sample_geospatial
  • sample_mflix
  • sample_supplies
  • sample_weatherdata
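
A quick way to confirm that the folder layout above is in place before loading anything is a small check like the following. It is a minimal sketch: the helper name missing_datasets is hypothetical and not part of the repository.

```python
from pathlib import Path

# Folder names listed above.
EXPECTED_DATASETS = (
    "sample_airbnb",
    "sample_analytics",
    "sample_geospatial",
    "sample_mflix",
    "sample_supplies",
    "sample_weatherdata",
)


def missing_datasets(data_dir: str = "data") -> list[str]:
    """Return the expected dataset folders that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_DATASETS if not (root / name).is_dir()]
```

An empty list means every expected folder is present; otherwise the returned names show what still needs to be copied from the dataset repository.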

Note: The JSON files are in a special format intended for MongoDB import, so you need to transform them before loading them with pandas.
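
The transformation mentioned in the note could look like the sketch below. It assumes the files hold one JSON document per line and use MongoDB Extended JSON wrappers such as {"$oid": ...}, {"$date": ...}, and {"$numberInt": ...}. The function names simplify and read_export are hypothetical, and only the wrappers listed in the code are handled, so this is not necessarily how the repository's loader works.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Any

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Single-key wrappers whose payload converts with a plain callable.
_SCALAR_WRAPPERS = {
    "$oid": str,
    "$numberInt": int,
    "$numberLong": int,
    "$numberDouble": float,
    "$numberDecimal": str,  # kept as text to avoid losing precision
}


def _parse_date(payload: Any) -> datetime:
    """Convert the payload of a {"$date": ...} wrapper to an aware datetime."""
    payload = simplify(payload)  # unwraps {"$numberLong": "..."} payloads
    if isinstance(payload, (int, float)):
        # Numeric payloads are milliseconds since the Unix epoch.
        return _EPOCH + timedelta(milliseconds=payload)
    # Otherwise assume an ISO-8601 string, possibly ending in "Z".
    return datetime.fromisoformat(payload.replace("Z", "+00:00"))


def simplify(value: Any) -> Any:
    """Recursively replace Extended JSON wrappers with plain Python values."""
    if isinstance(value, list):
        return [simplify(item) for item in value]
    if isinstance(value, dict):
        if len(value) == 1:
            key, inner = next(iter(value.items()))
            if key == "$date":
                return _parse_date(inner)
            if key in _SCALAR_WRAPPERS:
                return _SCALAR_WRAPPERS[key](inner)
        return {key: simplify(inner) for key, inner in value.items()}
    return value


def read_export(path: str) -> list[dict]:
    """Read a line-delimited export file into a list of plain dicts."""
    with open(path, encoding="utf-8") as handle:
        return [simplify(json.loads(line)) for line in handle if line.strip()]
```

The list returned by read_export can then be passed to pandas.DataFrame to get a table with one row per document. Nested sub-documents remain as dicts in their column and may need further flattening.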

Loading sample data into a database

Loading into a PostgreSQL database

just start-postgres
just load-sample-data postgres --without airbnb
just clean

Loading into MongoDB

just start-mongo
just load-sample-data mongo
just clean
