Solution:
- For highest scores marked with ‘*’ (not out), we create an additional binary column (0 or 1) indicating whether the ‘*’ is present.
- The ‘*’ is then stripped from these strings using regular expressions (regex).
- In the ‘Avg’ column, the ‘-’ entries for players who never got out are replaced with their total runs.
- We then measure the significance of each feature for ‘2019_Runs’ using ‘SelectKBest’ with the chi-square statistic as the scoring function.
- We drop the features with low chi-square scores and keep only the significant ones.
- We use label encoding to encode the categorical columns of the dataset.
- After preprocessing, a RandomForestRegressor was used with 100 trees (n_estimators=100 in scikit-learn) and max_features='auto'.
- Other machine learning algorithms, such as AdaBoost, XGBoost, and LightGBM (lgbm), were also tried in order to achieve the highest score on the metric stated below.
- After training the model, we predict the 2019_Runs.
- The metric employed was r2_score.
- The same RandomForestRegressor model was then retrained using the features and 2019_Runs in place of 2018_Runs.
- This model then predicts 2020_Runs.
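
The steps above can be sketched roughly as follows. This is a minimal illustration on made-up data, not the original notebook: the toy rows, the column names `Team`, `HS`, and `Runs`, the helper `preprocess`, and the choice of `k=3` are all assumptions. Only `Avg`, `2019_Runs`, `SelectKBest`, chi-square, `RandomForestRegressor`, and `r2_score` come from the text. It shows a single training stage; the second stage would repeat the same fit with the later season's runs.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import r2_score
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; every value here is invented for illustration.
raw = pd.DataFrame({
    "Team": ["A", "B", "A", "C", "B", "C"],
    "HS": ["100*", "75", "52*", "33", "120", "48*"],
    "Avg": ["45.2", "-", "30.1", "12.5", "50.0", "-"],
    "Runs": [450, 75, 301, 125, 500, 48],
    "2019_Runs": [480, 90, 310, 100, 520, 60],
})


def preprocess(df):
    """Apply the cleaning steps from the list above (hypothetical helper)."""
    out = df.copy()
    # Binary flag: 1 if the highest score carries the not-out marker '*'.
    out["HS_not_out"] = out["HS"].str.contains(r"\*", regex=True).astype(int)
    # Strip the '*' with a regex so the highest score parses as a number.
    out["HS"] = out["HS"].str.replace(r"\*", "", regex=True).astype(int)
    # '-' in Avg (player never got out) is replaced by the total runs.
    avg = pd.to_numeric(out["Avg"], errors="coerce")
    out["Avg"] = avg.fillna(out["Runs"])
    # Label-encode the categorical column.
    out["Team"] = LabelEncoder().fit_transform(out["Team"])
    return out


data = preprocess(raw)
X = data.drop(columns=["2019_Runs"])
y = data["2019_Runs"]

# chi2 needs non-negative features and treats each distinct target value
# as a class, so this mirrors the described approach rather than endorsing it.
selector = SelectKBest(score_func=chi2, k=3).fit(X, y)
selected = X.columns[selector.get_support()]

# max_features='auto' from the text is omitted: recent scikit-learn releases
# no longer accept it. For regressors it meant "use all features", which is
# the current default.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[selected], y)

predictions = model.predict(X[selected])
score = r2_score(y, predictions)  # computed on the training rows only
print(list(selected), score)
```

On real data the score should be computed on a held-out split rather than on the training rows, which is done here only to keep the sketch short.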