flipr-hackathon-7.0

Which model have you used for Total IPL 2020 Runs prediction for each player? Explain your model.

Solution:

1. For the highest scores marked with ‘*’ (not out), we create an additional binary column (0 or 1) indicating whether the ‘*’ is present.
2. The ‘*’ is then removed from those strings using regular-expression (regex) string manipulation.
3. In the ‘Avg’ column, the ‘-’ entries for players who never got out were replaced with the player's total runs.
4. We then measure the significance of each feature for ‘2019_Runs’ using ‘SelectKBest’ with the chi-square score of each feature.
5. We drop the features with low chi-square scores and keep only the significant ones.
6. We use label encoding to encode the dataset.
7. After preprocessing, a RandomForestRegressor was used with the hyperparameters n_trees = 100 (the number of trees, `n_estimators` in scikit-learn) and max_features='auto'.
8. Other machine learning algorithms such as AdaBoost, XGBoost, and LightGBM (lgbm) were also tried in order to achieve the highest score on the metric stated below.
9. After training the model, we predict 2019_Runs.
10. The metric used was r2_score.
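
The cleaning steps at the top of the list (not-out flag, ‘*’ removal, ‘Avg’ fill, label encoding) can be sketched roughly as follows. This is a minimal standard-library illustration, not the code from this repository: the helper names and the column names `HS`, `Avg`, `Runs`, and `Team` are assumptions made for the example, and the two rows are made up.

```python
import re

NOT_OUT = re.compile(r"\*")


def split_high_score(raw):
    """Return (score, not_out_flag) for a highest-score string such as '100*'."""
    text = str(raw).strip()
    not_out = 1 if NOT_OUT.search(text) else 0
    score = int(NOT_OUT.sub("", text))
    return score, not_out


def fill_average(avg, total_runs):
    """Replace the '-' placeholder (player never dismissed) with total runs."""
    if str(avg).strip() == "-":
        return float(total_runs)
    return float(avg)


def label_encode(values):
    """Map each distinct label to an integer code, assigned in sorted order."""
    codes = {label: i for i, label in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]


# Made-up rows; the column names are assumptions, not the real Data.xlsx schema.
rows = [
    {"HS": "100*", "Avg": "45.5", "Runs": 455, "Team": "MI"},
    {"HS": "37", "Avg": "-", "Runs": 52, "Team": "CSK"},
]

for row in rows:
    row["HS"], row["HS_not_out"] = split_high_score(row["HS"])
    row["Avg"] = fill_average(row["Avg"], row["Runs"])

team_codes = label_encode([row["Team"] for row in rows])
for row, code in zip(rows, team_codes):
    row["Team"] = code
```

With a pandas DataFrame the same helpers could be applied column by column, for example through `Series.apply`.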

The same RandomForestRegressor model was then trained using the features and 2019_Runs instead of 2018_Runs.
This model then predicts 2020_Runs.
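
The modelling steps described above (chi-square feature selection, random forest, R² score, then the 2020 prediction) might look roughly like the sketch below. It runs on synthetic data and makes several assumptions: the feature names are invented, `k=2` is arbitrary, and the 2020 step is read as feeding 2019_Runs into the input position that 2018_Runs occupied during training. scikit-learn's `chi2` needs non-negative features and treats the target as class labels. Because `max_features='auto'` has been removed from recent scikit-learn releases, the sketch uses `max_features=1.0`, which is what 'auto' meant for regressors.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n_players = 60

# Synthetic, non-negative stand-ins for the cleaned features (chi2 requires X >= 0).
runs_2018 = rng.integers(0, 600, size=n_players)
matches = rng.integers(1, 17, size=n_players)
noise_feature = rng.integers(0, 5, size=n_players)
X_2018 = np.column_stack([runs_2018, matches, noise_feature])

# Synthetic target that depends mostly on the previous season's runs.
runs_2019 = (
    0.8 * runs_2018 + 5 * matches + rng.integers(0, 30, size=n_players)
).astype(int)

# Score every feature against the target with chi-square and keep the best k.
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X_2018, runs_2019)

# Random forest with 100 trees; max_features=1.0 considers all features per split.
model = RandomForestRegressor(n_estimators=100, max_features=1.0, random_state=0)
model.fit(X_selected, runs_2019)

# In-sample R^2; a held-out split would give a less optimistic estimate.
train_r2 = r2_score(runs_2019, model.predict(X_selected))

# Assumed reading of the last step: put 2019_Runs where 2018_Runs was, then predict.
X_2019 = np.column_stack([runs_2019, matches, noise_feature])
runs_2020_pred = model.predict(selector.transform(X_2019))
```

`selector.scores_` holds the per-feature chi-square scores, which is the kind of ranking a plot such as `feature selection.png` would typically show.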
