
car-sales-etl



An ETL pipeline that extracts car sales data from a CSV file, transforms it, and loads it into a PostgreSQL database.

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Contributing
  5. License
  6. Contact

About the project

Project

This project is about building a data pipeline to extract, transform, and load (ETL) data from a source to a target. The data source is a CSV file containing information about car sales. The target is a PostgreSQL database table.
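
Conceptually, the pipeline reduces to three steps. The sketch below is illustrative only; the repo's actual function and table names may differ, and transform here is a stand-in for the steps listed under Transformations:

    import pandas as pd
    from sqlalchemy import create_engine

    def transform(df: pd.DataFrame) -> pd.DataFrame:
        """Stand-in for the full transformations described below."""
        return df.dropna()

    def run_pipeline(csv_path: str, db_url: str) -> None:
        """Extract the CSV, transform it, and load it into PostgreSQL."""
        df = pd.read_csv(csv_path)  # extract
        df = transform(df)  # transform
        engine = create_engine(db_url)  # e.g. postgresql://user:pass@host/db
        df.to_sql("car_sales", engine, if_exists="append", index=False)  # load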

PostgreSQL was preferred for its richer data handling with multiple data types, its transaction management, and its high scalability, which together give strong performance for CRUD operations.
The project follows an SQLAlchemy model scheme based on OOP concepts, which provides an excellent abstraction when working with multiple datasets in future processing. This high-level abstraction gives greater control over the data being inserted, since the table structure can be defined with multiple constraints and relationships.
For more advanced requirements, transactions, migrations, and other complex operations can be performed through the ORM, so managing large amounts of data won't be an issue.
The project also follows the PEP 8 style guide, checked with Pylint, and includes type hints for variables, function arguments, and more.
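
As an illustration of that model scheme, a minimal SQLAlchemy model for the target table might look like the following (CarSale and its columns are assumptions, not the repo's actual model):

    from sqlalchemy import Column, Date, Float, Integer
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class CarSale(Base):
        """Hypothetical car sales record loaded from the CSV source."""
        __tablename__ = "car_sales"

        id = Column(Integer, primary_key=True)
        car_model = Column(Integer, nullable=False)  # encoded categorical value
        sale_date = Column(Date, nullable=False)
        sale_year = Column(Integer, nullable=False)
        price = Column(Float, nullable=False)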

If performance is critical, consider using Python 3.11, whose zero-cost exception handling makes raising and re-raising exceptions faster.
Assets are also included, with future consideration for HTML and CSS files.
Testing could be done using unittest (to be implemented in a future release).

Transformations

  • Remove any rows with missing values.
  • Convert the date columns to a standard format.
  • Create a new column to store the year of the sale.
  • Replace the categorical values in the "Car Model" column with numerical values.
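
A sketch of these transformations with pandas (the "Car Model" column name comes from the list above; "Date" and "Year" are assumptions):

    import pandas as pd

    def transform(df: pd.DataFrame) -> pd.DataFrame:
        """Apply the transformations listed above to the raw sales data."""
        df = df.dropna()  # remove rows with missing values
        df["Date"] = pd.to_datetime(df["Date"])  # standard datetime format
        df["Year"] = df["Date"].dt.year  # new column with the year of the sale
        # replace the categorical "Car Model" values with numerical codes
        df["Car Model"] = df["Car Model"].astype("category").cat.codes
        return df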

Requirements

  • The target database should be either PostgreSQL or MySQL.
  • The pipeline should be runnable using a command-line interface.
  • The pipeline should have error handling and logging capabilities.
  • The pipeline should be modular and easily extendable to handle additional data sources and transformations.
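
A minimal sketch of a CLI entry point with logging and error handling (the --csv flag and run_pipeline are illustrative, not the project's actual interface):

    import argparse
    import logging
    import sys

    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")
    logger = logging.getLogger(__name__)

    def run_pipeline(csv_path: str) -> None:
        """Stand-in for the pipeline sketched in the About section."""

    def main() -> None:
        """Run the ETL pipeline from the command line."""
        parser = argparse.ArgumentParser(description="Car sales ETL pipeline")
        parser.add_argument("--csv", default="data/car_sales.csv",
                            help="path to the source CSV file")
        args = parser.parse_args()
        try:
            run_pipeline(args.csv)
        except Exception:
            logger.exception("Pipeline failed")  # logs the full traceback
            sys.exit(1)

    if __name__ == "__main__":
        main()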

(back to top)

Built with

  • Python

(back to top)

Getting started

Prerequisites

Installation

  1. Clone the repository
    git clone https://github.com/jpcadena/car-sales-etl.git
    
  2. Change directory to the project root
    cd car-sales-etl
    
  3. Create a virtual environment venv
    python3 -m venv venv
    
  4. Activate the environment on Windows
    .\venv\Scripts\activate
    
  5. Or on Unix/macOS
    source venv/bin/activate
    
  6. Install requirements with pip
    pip install -r requirements.txt
    

(back to top)

Usage

  1. Rename the file sample.env to .env.
  2. Add your credentials to the .env file.
  3. Run from the console:
    python main.py
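
For reference, a PostgreSQL-flavoured .env might contain entries like the following; the key names here are illustrative, so keep the ones already defined in sample.env:

    POSTGRES_USER=your_username
    POSTGRES_PASSWORD=your_password
    POSTGRES_HOST=localhost
    POSTGRES_PORT=5432
    POSTGRES_DB=car_sales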
    

(back to top)

Contributing

If you have a suggestion that would make this better, please fork the repo and create a pull request.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Use docstrings in reStructuredText format by adding triple double quotes """ after the function definition.
Add a brief description of the function and of each parameter, including the return value and its data type.
Please use linting to check your code quality against PEP 8.
Check the documentation for Visual Studio Code or JetBrains PyCharm.
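
For example, a docstring in that style (a generic illustration, not a function from this repo):

    import pandas as pd

    def extract(csv_path: str) -> pd.DataFrame:
        """
        Extract raw car sales data from a CSV file.

        :param csv_path: Path to the source CSV file
        :type csv_path: str
        :return: Raw sales data as a dataframe
        :rtype: pd.DataFrame
        """
        return pd.read_csv(csv_path)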

Recommended plugin for autocompletion: Tabnine

(back to top)

License

Distributed under the MIT License.

(back to top)

Contact

LinkedIn: Juan Pablo Cadena Aguilar

E-mail: Juan Pablo Cadena Aguilar

(back to top)