💎 This project aims to predict the prices of diamonds based on their features using machine learning models. The project demonstrates a complete machine learning pipeline, including data ingestion, preprocessing, transformation, and model training.
The dataset is sourced from Kaggle's Gem Price Prediction. It contains the following features:
- Categorical Features: cut, color, clarity
- Numerical Features: carat, depth, table, x, y, z
- Target Feature: price
The dataset was split into training (70%) and testing (30%) datasets.
📂 The following artifacts are generated and used in the project:
- raw.csv: Raw dataset.
- train.csv and test.csv: Training and testing datasets.
- preprocessor.pkl: Preprocessing pipeline for data transformation.
- model.pkl: Trained machine learning model.
- Programming Language: Python 🔰
- Libraries:
  - Data Manipulation: pandas, numpy
  - Preprocessing: scikit-learn
  - Logging and Exception Handling: logging, sys
- Reads the raw dataset (gemstone.csv).
- Saves raw, training, and testing data into the artifacts/ directory.
- Splits the dataset into training and testing sets with a 70-30 split.
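The ingestion steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual module: the function name `ingest`, its parameters, and the fixed random seed are assumptions made for the example.

```python
import os

import pandas as pd
from sklearn.model_selection import train_test_split


def ingest(raw_path, artifacts_dir="artifacts", test_size=0.30, seed=42):
    """Read the raw CSV, save a copy, and write train/test splits (hypothetical helper)."""
    df = pd.read_csv(raw_path)

    # Keep an untouched copy of the input next to the derived files.
    os.makedirs(artifacts_dir, exist_ok=True)
    df.to_csv(os.path.join(artifacts_dir, "raw.csv"), index=False)

    # 70-30 split; the seed is an assumption added for reproducibility.
    train_df, test_df = train_test_split(df, test_size=test_size, random_state=seed)

    train_path = os.path.join(artifacts_dir, "train.csv")
    test_path = os.path.join(artifacts_dir, "test.csv")
    train_df.to_csv(train_path, index=False)
    test_df.to_csv(test_path, index=False)
    return train_path, test_path
```

Calling `ingest("gemstone.csv")` would then leave raw.csv, train.csv, and test.csv in the artifacts/ directory.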
- Handles missing values using SimpleImputer.
- Scales numerical features using StandardScaler.
- Encodes categorical features (cut, color, clarity) using OrdinalEncoder with predefined category rankings.
- Combines pipelines into a ColumnTransformer for seamless preprocessing.
- Saves the preprocessor object as preprocessor.pkl.
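One way the transformation steps above might fit together is sketched below. The category rankings (ordered worst to best), the imputation strategies, and the name `build_preprocessor` are assumptions for illustration; the project's own rankings and settings may differ.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

NUMERICAL = ["carat", "depth", "table", "x", "y", "z"]
CATEGORICAL = ["cut", "color", "clarity"]

# Assumed rankings, worst to best (standard diamond grading scales).
CUT = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
COLOR = ["J", "I", "H", "G", "F", "E", "D"]
CLARITY = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]


def build_preprocessor():
    """Build a ColumnTransformer covering numerical and categorical columns."""
    # Numerical columns: fill gaps with the median (assumed), then standardize.
    num_pipe = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ])
    # Categorical columns: fill gaps with the mode (assumed), then rank-encode.
    cat_pipe = Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OrdinalEncoder(categories=[CUT, COLOR, CLARITY])),
    ])
    return ColumnTransformer([
        ("num", num_pipe, NUMERICAL),
        ("cat", cat_pipe, CATEGORICAL),
    ])
```

Fitting this object on the training data only, and reusing the fitted object on the test data, keeps the test set from influencing the imputation and scaling statistics. The fitted object can then be written to preprocessor.pkl with `pickle.dump`.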
- Models Used:
- Linear Regression
- Lasso Regression
- Ridge Regression
- ElasticNet
- Decision Tree Regressor
- Evaluates models using R² score.
- Selects the best model based on the highest R² score.
- Saves the best model as model.pkl.
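The evaluate-and-select loop described above could look something like this. It is a sketch under assumptions: the function name `select_best_model`, the default hyperparameters, and the fixed tree seed are illustrative and not taken from the project.

```python
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor


def select_best_model(X_train, y_train, X_test, y_test):
    """Fit each candidate and return (best name, best model, all R² scores)."""
    # Default hyperparameters are an assumption; the project may tune them.
    models = {
        "LinearRegression": LinearRegression(),
        "Lasso": Lasso(),
        "Ridge": Ridge(),
        "ElasticNet": ElasticNet(),
        "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    }

    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        # R² is computed on the held-out test set.
        scores[name] = r2_score(y_test, model.predict(X_test))

    best_name = max(scores, key=scores.get)
    return best_name, models[best_name], scores
```

The returned `scores` dictionary is what would be logged, and the returned model is what would be pickled as model.pkl.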
The prediction pipeline (PredictPipeline):
- Loads the saved preprocessor and model objects.
- Transforms input features with the preprocessor and makes predictions.
- Accepts both manually entered values and data from a DataFrame.
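In outline, prediction amounts to unpickling the two saved objects and chaining them. The helpers below (`load_object`, `predict_price`) are hypothetical names used for illustration and are not the project's PredictPipeline class; the default paths assume the artifacts/ layout described earlier.

```python
import pickle

import pandas as pd


def load_object(path):
    """Unpickle a saved object (hypothetical helper)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def predict_price(features: pd.DataFrame,
                  preprocessor_path="artifacts/preprocessor.pkl",
                  model_path="artifacts/model.pkl"):
    """Apply the saved preprocessor, then the saved model, to a DataFrame."""
    preprocessor = load_object(preprocessor_path)
    model = load_object(model_path)

    # Use transform, not fit_transform: the statistics come from training.
    transformed = preprocessor.transform(features)
    return model.predict(transformed)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.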
- The best-performing model was identified based on R² score.
- The R² scores for all models are logged and reported.
- Clone the repository:
git clone https://github.com/AryanDhanuka10/Diamond_Price_Prediction
- Navigate to the project directory:
cd Diamond_Price_Prediction
- Install required dependencies:
pip install -r requirements.txt
- Run the data ingestion module to generate train and test datasets:
python src/data_ingestion.py
- Run the data transformation module to preprocess data:
python src/data_transformation.py
- Train models and select the best model:
python src/model_trainer.py
- Use the saved model (model.pkl) for predictions:
from src.prediction_training import PredictPipeline, CustomData
custom_data = CustomData(carat=0.5, depth=61.5, table=55, x=4.5, y=4.6, z=2.9, cut="Ideal", color="E", clarity="VVS1")
df = custom_data.get_data_as_dataframe()
predict_pipeline = PredictPipeline()
prediction = predict_pipeline.predict(df)
print("Predicted Price:", prediction)
- Challenge: Handling categorical data encoding with ordinal features.
  - Solution: Used OrdinalEncoder with custom category rankings for categorical features.
- Challenge: Ensuring robust preprocessing for both training and testing datasets.
  - Solution: Implemented a unified ColumnTransformer pipeline.
Access the web application for real-time price prediction: Diamond Price Prediction App
We welcome contributions to improve this project! Follow these steps:
- Fork the repository.
- Create a new branch:
git checkout -b feature-name
- Make your changes and commit them:
git commit -m "Add detailed description of your changes"
- Push to your branch:
git push origin feature-name
- Create a pull request describing your changes.
- Extend the project to include additional models such as Random Forest or Gradient Boosting.
- Develop a web application for real-time price prediction.
- Kaggle Dataset for providing the diamond dataset.
- Various Python libraries and tools used in the project.
This project is licensed under the MIT License.