This is an end-to-end ML project that predicts calories burnt from Fitbit health data. The project covers both training and prediction pipelines, and uses a regression model to predict calories burnt from the indicators given in the training data.
The dataset is the Fitbit dataset from Kaggle. The data contains the following features:
1. Id: The customer ID.
2. ActivityDate: The date for which the activity was tracked.
3. TotalSteps: Total steps taken on that day.
4. TotalDistance: Total distance covered.
5. TrackerDistance: Distance as recorded by the tracker.
6. LoggedActivitiesDistance: Distance from activities logged manually by the user.
7. VeryActiveDistance: The distance for which the user was very active.
8. ModeratelyActiveDistance: The distance for which the user was moderately active.
9. LightActiveDistance: The distance for which the user was lightly active.
10. SedentaryActiveDistance: The distance for which the user was almost inactive.
11. VeryActiveMinutes: The number of minutes of intense activity.
12. FairlyActiveMinutes: The number of minutes of moderate activity.
13. LightlyActiveMinutes: The number of minutes of light activity.
14. SedentaryMinutes: The number of minutes of almost no activity.
15. Calories (Target): The calories burnt.
Training pipeline:

- Data Capture: Data is captured from the files inside the _Training_Batch_Files_ directory.
- Data Validation: rawValidation.py inside Training_Raw_files_Validated validates the captured data against the schema defined in schema_training.json. Files that satisfy the schema conditions are saved in Training_Raw_files_validated/Good_Raw, and files that violate the schema are saved in Training_Raw_files_validated/Bad_Raw (see the validation sketch after this list).
- Data Transformation: DataTransform.py inside DataTransform_Training performs transformations on the data in Training_Raw_files_validated/Good_Raw, such as adding double quotes around string values in columns.
- Data insertion into Database: DataTypeValidation.py inside the DataTypeValidation_Insertion_Training directory saves the transformed data in Training.db inside Training_Database.
- Export data from DB to CSV format: DataTypeValidation.py reads the data back from Training.db and creates InputFile.csv inside Training_FileFromDB, which is later used for model training (see the database sketch after this list).
- Data Pre-processing: preprocessing.py inside data_preprocessing performs the necessary pre-processing steps, such as removing unnecessary columns, separating the label feature, replacing null values with a KNN Imputer, and encoding categorical values (sketched below).
- Data Clustering: The project takes a customized ML approach: clusters are created from the data using the KNN algorithm, and an ML model is later trained on the data in each cluster to prevent overfitting.
- Model Selection & Hyperparameter tuning: tuner.py inside best_model_finder performs Grid Search CV hyperparameter optimization for an XGBoost Regressor and a Random Forest Regressor, selects the best model with the best hyperparameters, and saves it to the models directory (a tuning sketch follows this list).
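A minimal sketch of the validation step, assuming schema_training.json maps column names under a "ColName" key (the actual key names in the schema file may differ):

```python
# Hedged sketch of rawValidation.py's core check -- the "ColName" schema key
# is an assumption; adjust to the actual layout of schema_training.json.
import json
import shutil
from pathlib import Path

import pandas as pd

def validate_raw_files(batch_dir="Training_Batch_Files",
                       schema_path="schema_training.json"):
    """Copy each raw CSV into Good_Raw or Bad_Raw based on the schema."""
    with open(schema_path) as f:
        schema = json.load(f)
    expected_cols = list(schema["ColName"].keys())  # assumed schema layout

    good = Path("Training_Raw_files_validated/Good_Raw")
    bad = Path("Training_Raw_files_validated/Bad_Raw")
    good.mkdir(parents=True, exist_ok=True)
    bad.mkdir(parents=True, exist_ok=True)

    for csv_file in Path(batch_dir).glob("*.csv"):
        df = pd.read_csv(csv_file)
        # A file is "good" only when its columns match the schema exactly.
        target = good if list(df.columns) == expected_cols else bad
        shutil.copy(csv_file, target / csv_file.name)

validate_raw_files()
```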
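The database insert and CSV export steps can be summarized with pandas and the standard-library sqlite3 module; the Good_Raw_Data table name below is an assumption for illustration:

```python
# Sketch of the DB insert/export steps; the Good_Raw_Data table name is assumed.
import sqlite3
from pathlib import Path

import pandas as pd

Path("Training_Database").mkdir(exist_ok=True)
conn = sqlite3.connect("Training_Database/Training.db")

# Append every validated batch file into a single table.
for csv_file in Path("Training_Raw_files_validated/Good_Raw").glob("*.csv"):
    pd.read_csv(csv_file).to_sql("Good_Raw_Data", conn,
                                 if_exists="append", index=False)

# Export the combined table as the single training input file.
Path("Training_FileFromDB").mkdir(exist_ok=True)
pd.read_sql("SELECT * FROM Good_Raw_Data", conn) \
  .to_csv("Training_FileFromDB/InputFile.csv", index=False)
conn.close()
```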
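For the pre-processing and clustering steps, a minimal sketch follows. Since KNN itself is a supervised method, this sketch uses scikit-learn's KMeans as a stand-in for cluster creation, alongside KNNImputer for null replacement; the repo's own implementation may differ:

```python
# Sketch of preprocessing.py plus cluster creation. KMeans stands in for the
# clustering step; the choice of 3 clusters is illustrative only.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.impute import KNNImputer

df = pd.read_csv("Training_FileFromDB/InputFile.csv")

# Remove identifier columns that carry no predictive signal.
df = df.drop(columns=["Id", "ActivityDate"])

# Separate the label feature from the predictors.
y = df.pop("Calories")
X = df

# Replace null values using the KNN Imputer.
X = pd.DataFrame(KNNImputer(n_neighbors=3).fit_transform(X),
                 columns=X.columns)

# Partition the rows into clusters; one model is trained per cluster later.
X["Cluster"] = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X)
```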
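Model selection per cluster then reduces to two grid searches; the parameter grids below are illustrative, not the ones in tuner.py:

```python
# Sketch of tuner.py's idea: grid-search both regressors on one cluster's
# data and keep whichever cross-validates better.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

def best_model(X_train, y_train):
    rf = GridSearchCV(RandomForestRegressor(),
                      {"n_estimators": [50, 100], "max_depth": [4, 8, None]},
                      cv=5)
    xgb = GridSearchCV(XGBRegressor(objective="reg:squarederror"),
                       {"n_estimators": [50, 100], "learning_rate": [0.1, 0.5]},
                       cv=5)
    rf.fit(X_train, y_train)
    xgb.fit(X_train, y_train)
    # Keep the search with the higher cross-validated R^2 score.
    return rf if rf.best_score_ > xgb.best_score_ else xgb
```

The winning search's best_estimator_ is what gets serialized into the models directory, once per cluster.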
Prediction pipeline:

- Data Capture: Data is captured from the files inside the _Prediction_Batch_Files_ directory.
- Data Validation: rawValidation.py inside Prediction_Raw_files_Validated validates the captured data against the schema defined in schema_prediction.json. Files that satisfy the schema conditions are saved in Prediction_Raw_files_validated/Good_Raw, and files that violate the schema are saved in Prediction_Raw_files_validated/Bad_Raw.
- Data Transformation: DataTransform.py inside DataTransform_Prediction performs transformations on the data in Prediction_Raw_files_validated/Good_Raw, such as adding double quotes around string values in columns.
- Data insertion into Database: DataTypeValidation.py inside the DataTypeValidation_Insertion_Prediction directory saves the transformed data in Prediction.db inside Prediction_Database.
- Export data from DB to CSV format: DataTypeValidation.py reads the data back from Prediction.db and creates InputFile.csv inside Prediction_FileFromDB, which is later used for model prediction.
- Data Pre-processing: preprocessing.py inside data_preprocessing performs the necessary pre-processing steps, such as removing unnecessary columns, separating the label feature, replacing null values with a KNN Imputer, and encoding categorical values.
- Data Cluster identification: The prediction pipeline in predictFromModel.py determines which cluster each given record belongs to.
- Model Prediction: The prediction pipeline in predictFromModel.py predicts the calorie value using the model trained for that cluster (see the sketch after this list).
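A sketch of that per-cluster prediction flow, with hypothetical pickle file names (the actual file names saved under models/ may differ):

```python
# Sketch of predictFromModel.py's flow: assign rows to clusters, then predict
# with that cluster's model. kmeans.pkl / model_cluster_<n>.pkl are assumed names.
import pickle

import pandas as pd

X = pd.read_csv("Prediction_FileFromDB/InputFile.csv")  # already pre-processed

with open("models/kmeans.pkl", "rb") as f:              # hypothetical path
    kmeans = pickle.load(f)
clusters = kmeans.predict(X)

predictions = pd.Series(index=X.index, dtype=float)
for c in sorted(set(clusters)):
    with open(f"models/model_cluster_{c}.pkl", "rb") as f:  # hypothetical name
        model = pickle.load(f)
    mask = clusters == c
    predictions[mask] = model.predict(X[mask])
```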
Tools and technologies used:

- Flask - Web framework used to develop the APIs
- Scikit-learn - To create the machine learning models for the KNN and Random Forest algorithms
- XGBoost - To create the XGBoost-based model for calorie prediction
- SQLite - Database to store the validated raw data and the data submitted for prediction
- Python 3.6 - Programming language
Installation and usage:

1. Clone the repo using the following command:

   ```
   $ git clone https://github.com/coolmunzi/restaurant_bot.git
   ```
2. Run the Flask app by executing the main.py file:

   ```
   $ python main.py
   ```
3. To train the models, use an API testing tool like Postman: create a POST request to 'http://127.0.0.1:5000/train' with the JSON body {"folderPath": "Training_Batch_Files"} (see the example request after these steps).
4. Once the model is trained, you can perform batch prediction from a web browser by opening 'http://127.0.0.1:5000/' and pasting the absolute path of the Prediction_Batch_files folder (which is inside the project directory).
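The training call from step 3 can also be made from Python; a minimal example, assuming the app is running locally on port 5000 as described above:

```python
# Equivalent of the Postman request to the /train endpoint.
import requests

resp = requests.post("http://127.0.0.1:5000/train",
                     json={"folderPath": "Training_Batch_Files"})
print(resp.status_code, resp.text)
```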
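For orientation, the /train endpoint exposed by main.py might look roughly like the sketch below. This is a hedged reconstruction from the steps above, not the repo's actual code, and run_training_pipeline is a hypothetical stand-in for the training pipeline entry point:

```python
# Hypothetical shape of the /train route in main.py -- not the actual source.
from flask import Flask, request

app = Flask(__name__)

def run_training_pipeline(folder):
    # Stand-in for the validation -> transform -> DB -> train steps above.
    print(f"Training on batch files in {folder}")

@app.route("/train", methods=["POST"])
def train():
    folder = request.json["folderPath"]  # e.g. "Training_Batch_Files"
    run_training_pipeline(folder)
    return "Training successful!"

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000)
```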