lautibursese/Data_Engineering_-1

Data Engineering

Welcome to this project! Here you will take on the role of a Data Engineer.


Introduction

The goal of this project is to build the knowledge required to develop and run an API.

An Application Programming Interface (API) is an interface that allows two applications to communicate with each other, independent of the underlying infrastructure. APIs are versatile and fundamental tools for building, for example, data pipelines, because they move data and give simple access to it through different endpoints, the API's exit points.

One such tool is FastAPI, a modern, high-performance web framework for building APIs with Python.

Project proposal

The project consists of ingesting data from various sources, applying the transformations considered relevant, and making the clean data available for querying through an API. This API will be built in a Dockerized environment.

The data is provided in files with different extensions, such as csv or json. Data types, null values, and duplicate values must be corrected, among other tasks. The datasets must then be related to one another so that their information can be accessed through API queries.
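The ingestion and cleaning described above can be sketched with the Python standard library alone. This is a minimal illustration, not the project's actual pipeline: the column names (`title`, `type`, `release_year`, `duration`), the platform names, and the sample data are all assumptions, and the real files may be laid out differently.

```python
import csv
import io
import json

# Hypothetical sample inputs standing in for the provided csv and json files.
CSV_TEXT = """title,type,release_year,duration
Movie A,Movie,2019,90 min
Movie A,Movie,2019,90 min
Show B,TV Show,2020,2 Seasons
Movie C,Movie,,
"""
JSON_TEXT = (
    '[{"title": "Movie D", "type": "Movie", '
    '"release_year": "2018", "duration": "101 min"}]'
)


def clean_row(row, platform):
    """Normalize one raw record; return None when a required value is missing."""
    title = str(row.get("title") or "").strip()
    year = str(row.get("release_year") or "").strip()
    duration = str(row.get("duration") or "").strip()
    if not title or not year.isdigit() or not duration:
        return None  # drop rows with null or malformed required values
    amount, _, unit = duration.partition(" ")
    if not amount.isdigit():
        return None
    return {
        "platform": platform,
        "title": title,
        "type": str(row.get("type") or "").strip(),
        "release_year": int(year),  # type correction: text to integer
        "duration": int(amount),
        "duration_unit": "season" if unit.lower().startswith("season") else "min",
    }


def load_and_clean(sources):
    """Combine (platform, format, text) sources into one de-duplicated table."""
    table, seen = [], set()
    for platform, fmt, text in sources:
        if fmt == "csv":
            rows = csv.DictReader(io.StringIO(text))
        else:
            rows = json.loads(text)
        for raw in rows:
            row = clean_row(raw, platform)
            if row is None:
                continue
            key = (row["platform"], row["title"], row["release_year"])
            if key not in seen:  # skip duplicate records
                seen.add(key)
                table.append(row)
    return table


catalog = load_and_clean([
    ("netflix", "csv", CSV_TEXT),
    ("hulu", "json", JSON_TEXT),
])
```

Tagging every row with its platform while loading keeps the combined table queryable by platform later, which the requested queries depend on.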

The queries to be made are:

  • Maximum duration according to type of film (film/series), by platform and by year. The request should be: get_max_duration(year, platform, [min or season])

  • Number of movies and series (counted separately) by platform. The request should be: get_count_platform(platform)

  • Platform on which a given genre appears most often, with the number of occurrences. The request should be: get_listedin('genre')

  • Actor who appears most often for a given platform and year. The request should be: get_actor(platform, year)
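One possible reading of these four requests is sketched below in plain Python over a small in-memory table. The record layout (`listed_in` and `cast` as comma-separated strings, `duration_unit` as `min` or `season`), the sample rows, and the return shapes are assumptions made for illustration; only the function names and parameters come from the list above.

```python
from collections import Counter

# Hypothetical cleaned records; the real combined table may use other columns.
CATALOG = [
    {"platform": "netflix", "title": "Movie A", "type": "Movie",
     "release_year": 2019, "duration": 90, "duration_unit": "min",
     "listed_in": "Comedy, Drama", "cast": "Ann Lee, Bo Kim"},
    {"platform": "netflix", "title": "Movie B", "type": "Movie",
     "release_year": 2019, "duration": 120, "duration_unit": "min",
     "listed_in": "Comedy", "cast": "Ann Lee"},
    {"platform": "netflix", "title": "Show C", "type": "TV Show",
     "release_year": 2019, "duration": 3, "duration_unit": "season",
     "listed_in": "Drama", "cast": "Di Roy"},
    {"platform": "hulu", "title": "Movie D", "type": "Movie",
     "release_year": 2020, "duration": 95, "duration_unit": "min",
     "listed_in": "Comedy", "cast": "Cy Fox"},
]


def split_field(value):
    """Split a comma-separated field into trimmed, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def get_max_duration(year, platform, unit):
    """Title with the longest duration for a year, platform and unit (min/season)."""
    matches = [r for r in CATALOG
               if r["release_year"] == year
               and r["platform"] == platform
               and r["duration_unit"] == unit]
    if not matches:
        return None
    return max(matches, key=lambda r: r["duration"])["title"]


def get_count_platform(platform):
    """Number of movies and series, counted separately, on one platform."""
    counts = Counter(r["type"] for r in CATALOG if r["platform"] == platform)
    return {"movies": counts["Movie"], "series": counts["TV Show"]}


def get_listedin(genre):
    """Platform where the genre appears most often, with its count."""
    counts = Counter(r["platform"] for r in CATALOG
                     if genre in split_field(r["listed_in"]))
    return counts.most_common(1)[0] if counts else None


def get_actor(platform, year):
    """Most frequent actor for a platform and year, with the number of titles."""
    counts = Counter(actor for r in CATALOG
                     if r["platform"] == platform and r["release_year"] == year
                     for actor in split_field(r["cast"]))
    return counts.most_common(1)[0] if counts else None
```

Each function returns `None` when nothing matches, so the API layer can decide how to report an empty result.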

Project steps

  1. Data ingestion and normalization

  2. Relate the datasets and create the table needed to run the queries. It is recommended to first check which data the queries require, and then concatenate the 4 tables

  3. Create the API in a Docker environment

  4. Run the requested queries
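Steps 3 and 4 could be wired together roughly as follows. This is a sketch under assumptions: it presumes `fastapi` and `uvicorn` are installed in the container, and the route path, the application title, and the in-memory `CATALOG` are hypothetical placeholders for the real combined table.

```python
from collections import Counter

# Hypothetical in-memory table; in the project this would be the cleaned,
# combined dataset produced in steps 1 and 2.
CATALOG = [
    {"platform": "netflix", "title": "Movie A", "type": "Movie"},
    {"platform": "netflix", "title": "Show B", "type": "TV Show"},
    {"platform": "hulu", "title": "Movie C", "type": "Movie"},
]


def get_count_platform(platform: str) -> dict:
    """Count movies and series separately for one platform."""
    counts = Counter(r["type"] for r in CATALOG if r["platform"] == platform)
    return {
        "platform": platform,
        "movies": counts["Movie"],
        "series": counts["TV Show"],
    }


def create_app():
    """Build the FastAPI application.

    FastAPI is imported here rather than at module level so the query
    logic above can be tested without the framework installed.
    """
    from fastapi import FastAPI

    app = FastAPI(title="Streaming catalog API")  # title is an assumption
    # Register the plain function as a GET endpoint with a path parameter.
    app.get("/get_count_platform/{platform}")(get_count_platform)
    return app
```

Assuming the file is saved as `main.py`, the container could start the server with `uvicorn main:create_app --factory --host 0.0.0.0 --port 8000`, after which a request to `/get_count_platform/netflix` would return the counts as JSON.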

Concepts of interest

  • Docker is a complete solution for the production, distribution and use of containers.
     - A container is a software abstraction layer that packages code together with its libraries and dependencies in a partially isolated environment.
     - An image is a Docker executable package that contains everything needed to run an application, including a configuration file, environment and runtime variables, and libraries.
     - A Dockerfile is a text file with instructions for building an image. It can be thought of as a way to automate image creation.

Resources and links

Docker image with Uvicorn/Gunicorn for high-performance web applications:

FastAPI documentation:
