Skip to content

Repository pour le projet de Data Science de la LP SID

License

Notifications You must be signed in to change notification settings

cmnemoi/NYCTaxiFareLPSID

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

56 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

NYCTaxiFareLPSID

Repository for the first Data Science project of Lille's Bachelor of Economics, which consists of participating in the Kaggle competition New York City Taxi Fare Prediction by Enzo Risbetz and Charles-Meldhine Madi Mnemoi.

Open in Streamlit

Installation

You need Python 3.9+ to run this project. If you have Anaconda or Miniconda, create a virtual environment with the following commands and then install the required packages:

conda create -n NYCTaxiFareLPSID python=3.9 -y
conda activate NYCTaxiFareLPSID
pip install -r requirements.txt

To run the app locally : streamlit run src/app/main.py

You need a Google Maps API key to run the app locally. You can get one here. Then create a .streamlit/secrets.toml file with the following content: GOOGLE_MAPS_API="xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

Tools used: Python (Pandas, Seaborn, Scikit-learn, Intel Extension for Scikit-learn, tune-sklearn, Streamlit), Google Maps API

Our approach was as follows:

  • Development of a Data Science project roadmap
  • Data cleaning
    • Handling missing values
    • Handling outliers
  • Exploratory analysis
    • Study of explanatory variables distributions
    • Identification of interesting variables for prediction through correlation and statistical tests
    • Creation of new variables
  • Modeling
    • Preprocessing
      • OrdinalEncoding, Standard scaling, and Trimming of extreme values with a custom FunctionTransformer
    • Model training
      • Cross-validation to observe learning curves and find optimal hyperparameters using GridSearchCV
      • Model selection based on comparison of metrics (MSE and MAE) from cross-validation results
    • Evaluation of the final model on test data
  • Deployment
    • Creation of Scikit-learn pipelines for automated preprocessing and model export using joblib
    • Creation of a web application using Streamlit

Experience feedback:

  • Application of our data manipulation and modeling skills
  • Deepening of our knowledge in handling Scikit-learn pipelines
  • Encountered a new problem: computation time for very large datasets
    • Discovery of libraries to accelerate it (somewhat): sklearnex and tune-sklearn

About

Repository pour le projet de Data Science de la LP SID

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages