Project: Module 10 Practical Application Example 2

Author: Vijay Chaganti

Dataset information

The provided dataset contains information on 426K cars to ensure speed of processing. Your goal is to understand what factors make a car more or less expensive. As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

Given dataset has 426880 entries with 18 columns

Data Source: https://github.com/chagantvj/PracticalApplicationM10/blob/main/vehicles.csv.zip

Python Code: https://github.com/chagantvj/PracticalApplicationM10/blob/main/VijayChaganti_Module10_Practical_Example2.ipynb

Note: There are no errors in the code, but FutureWarning due to usage of replace=inplace option.

Date Understanding and Cleaning

Total entries for each column of the data frame is 12684.

There is some data missing for Columsn car, Bar, CoffeeHouse, CarryAway, RestaurantLessThan20 and Restaurant20To50.

Many columns have missing data up to 40+ pergentage and
many columns are irrelevant to the price of car such as VIN, drive, paint_color, state, id etc

df.isna().mean().round(4)*100
id               0.00
region           0.00
price            0.00
year             0.28
manufacturer     4.13
model            1.24
condition       40.79
cylinders       41.62
fuel             0.71
odometer         1.03
title_status     1.93
transmission     0.60
VIN             37.73
drive           30.59
size            71.77
type            21.75
paint_color     30.50
state            0.00
dtype: float64

p = [0.05, 0.85, 0.95, 0.99]
df[['price']].describe(p)
price
count	426880.00
mean	75199.03
std	12182282.17
min	0.00
5%	0.00
50%	13950.00
85%	32995.00
95%	44500.00
99%	66995.00
max	3736928711.00

Replaced NaN and INF values in price column with mean of price! and no duplicate data found

has_nan = df['price'].isna().any().sum() 
has_inf = np.isinf(df['price']).any().sum()
df_cleaned = df[~np.isinf(df['price'])]
has_inf = np.isinf(df_cleaned['price']).any().sum()
print(has_nan)
#  0
print(has_inf)
#  0

sum(df.duplicated()) ## O duplicates found
#  0

Boxplot of price log

Histogram plot of log of price

Lost data with Z-SCORE (< 1%) vs IRQ (98%)

zscore_data_lost = 1 - (df_zscore.shape[0]/df.shape[0])
print("We lost {:.6%} of the data by the z-score method" .format(zscore_data_lost))
# We lost 0.007262% of the data by the z-score method <---- We lost only less than 1% of data using zscore
df_zscore['price'].describe()
count    426849.00
mean      17552.14
std       20667.53
min           0.00
25%        5900.00
50%       13950.00
75%       26455.00
max     5000000.00

irq_data_lost = 1 - (df_irq.shape[0]/df.shape[0])
print("We lost {:.2%} of the data by the IRQ method" .format(irq_data_lost))
# We lost 98.08% of the data by the IRQ method <--- We lost almost 98% of data with IRQ method of eliminating data which is not good.
df_irq['price'].describe()
count         8177.00
mean       3088930.26
std       87973256.90
min          57400.00
25%          61000.00
50%          67995.00
75%          77999.00
max     3736928711.00

Violin plot of title_status vs price log

Violin plot of condition vs price log

Train & Test data with 20% Test Size and random state as 23

Applied One-Hot Encoding on Nominal category of columns like title_status & condition

dfc = df[['condition', 'title_status', 'odometer’]]
X = dfc[['condition', 'title_status', 'odometer']]
y = df_numeric['price']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.20,random_state=23)

from sklearn import set_config
set_config(transform_output="pandas")
from sklearn.impute import SimpleImputer

si = SimpleImputer(strategy = "most_frequent")
X_train = si.fit_transform(X_train)
X_test = si.transform(X_test)

pd.get_dummies(dfc.select_dtypes("object"), dtype = int, drop_first = True)
# 426880 rows × 10 columns

y.mean().round(1)
# 75199.0

Linear Regression

Losso Regression

Ridge Regression

Conclusion & Recommendation

Out of all the columns gien in dataset, columns named 'region', 'manufacturer', 'model', 'drive', 'size', 'type', 'paint_color' & 'state' as too many categorical variables that will be overfitting the model and hence ignored those column data. If we apply One-Hot encoding on those columns, its going to generate 100s of columns which will overload the model to process.

Column named 'id', 'VIN', 'fuel', & 'cylinders' does not play a role on car price and hence dropped from the DataSet

Column named 'odometer', 'title_status' & 'condition' is going to play a role as per domain knowledge that I have and hence considered for modeling. Applied one-hot encoding on catorigical data like 'title_status' & 'condition' and applied three different models like Linear-Regression, Losso-Regression and Ridge-Regression and all these three models gives almost ~0 on Train_R2_Score and Test_R2_Score. This shows the the models are under-fit and hence not able to predict actual price of the model

This under-fitting could be due to the 'most_frequent' imputer method used to fillin missing data for columns 'title_status' & 'condition'.

Violin plot on 'title_status' VS PriceLog & 'condition' VS PriceLog shows that the dependency is almost close to mean of the given data and hence the model. This data suggests that the model might be able to predict actual value given the data considered for models designed.

Overall the suggestions would be that the 'odometer', 'title_status' & 'condition' is going to play a role on price of a car!

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
README.md		README.md
VijayChaganti_Module10_Practical_Example2.ipynb		VijayChaganti_Module10_Practical_Example2.ipynb
vehicles.csv.zip		vehicles.csv.zip

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Project: Module 10 Practical Application Example 2

Author: Vijay Chaganti

Dataset information

Date Understanding and Cleaning

Replaced NaN and INF values in price column with mean of price! and no duplicate data found

Boxplot of price log

Histogram plot of log of price

Lost data with Z-SCORE (< 1%) vs IRQ (98%)

Violin plot of title_status vs price log

Violin plot of condition vs price log

Train & Test data with 20% Test Size and random state as 23

Applied One-Hot Encoding on Nominal category of columns like title_status & condition

Linear Regression

Losso Regression

Ridge Regression

Conclusion & Recommendation

About

Releases

Packages

Languages

chagantvj/PracticalApplicationM10

Folders and files

Latest commit

History

Repository files navigation

Project: Module 10 Practical Application Example 2

Author: Vijay Chaganti

Dataset information

Date Understanding and Cleaning

Replaced NaN and INF values in price column with mean of price! and no duplicate data found

Boxplot of price log

Histogram plot of log of price

Lost data with Z-SCORE (< 1%) vs IRQ (98%)

Violin plot of title_status vs price log

Violin plot of condition vs price log

Train & Test data with 20% Test Size and random state as 23

Applied One-Hot Encoding on Nominal category of columns like title_status & condition

Linear Regression

Losso Regression

Ridge Regression

Conclusion & Recommendation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages