This repository acts as a central hub for all the Kaggle competitions I compete in. A template branch, which I clone for each new competition, contains my preferred file structure for data science projects. The competitions fall broadly into two groups: those with monetary rewards and the Playground Series (no monetary reward).
Playground:
- S2E6 - Imputation (132/884)
- S2E7 - Clustering (42/1253)
- S2E8 - Regularization (519/1888)
- S3E10 - Nonlinearity (6/807)
Monetary:
- RSNA - Breast Cancer (962/1687)
- Women in Data Science
- BirdCLEF 2023
- IceCube - Neutrinos in Deep Ice
- Google - Isolated Sign Language Recognition
S2E6 - Imputation (132/884)
- Missing value prediction on a synthetic dataset of 1 million samples and 81 features
- Leveraged EDA to reduce training time and apply feature engineering
- Designed a multi-head multilayer perceptron (MLP) with skip connections and Mish activations
- Things I learned:
- Many different techniques to impute missing data
- How imperative a thorough EDA is
- How different MLP architectures can reduce training time while maintaining the same accuracy
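The multi-head MLP idea above can be sketched as a NumPy forward pass. This is a minimal illustration, not the competition model: the layer sizes, weights, and two-head split are hypothetical, chosen only to show the Mish activation and a skip connection.

```python
import numpy as np

def mish(x):
    # Mish activation: x * tanh(softplus(x))
    return x * np.tanh(np.log1p(np.exp(x)))

rng = np.random.default_rng(0)

def dense(x, w, b):
    return x @ w + b

# Toy dimensions; the real model was much larger
n_features, hidden = 8, 16
x = rng.normal(size=(4, n_features))

w1 = rng.normal(size=(n_features, hidden)) * 0.1
b1 = np.zeros(hidden)
w2 = rng.normal(size=(hidden, hidden)) * 0.1
b2 = np.zeros(hidden)

h1 = mish(dense(x, w1, b1))
# Skip connection: add the block's input back to its output
h2 = mish(dense(h1, w2, b2)) + h1

# Two heads sharing the same trunk, e.g. one per predicted column
w_head_a = rng.normal(size=(hidden, 1)) * 0.1
w_head_b = rng.normal(size=(hidden, 1)) * 0.1
out_a = dense(h2, w_head_a, np.zeros(1))
out_b = dense(h2, w_head_b, np.zeros(1))
print(out_a.shape, out_b.shape)  # (4, 1) (4, 1)
```

Sharing a trunk across heads is what lets one network impute several target columns at once instead of training a model per column.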
S2E7 - Clustering (42/1253)
- Predict clusters on a synthetic dataset of 98,000 samples and 29 features
- With no ground-truth labels available and Adjusted Rand Score as the evaluation metric, a brute-force search was used instead of the usual cross-validation
- Two-stage approach to predictions:
- Use a clustering model to predict clusters
- Train classifiers on the high-confidence predictions from the clustering model and ensemble them
- Things I learned:
- Some functionality of the scikit-lego library, which was necessary for the score I achieved
- How to implement the pseudo-labelling technique
- How to make a custom soft voting ensemble
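The pseudo-labelling and soft-voting steps above can be sketched in NumPy. The probability matrices, threshold, and cluster count here are hypothetical placeholders, not the competition settings:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical cluster-membership probabilities from a clustering model
# (5 samples, 3 clusters); each row sums to 1
p = rng.dirichlet(np.ones(3), size=5)

# Pseudo-labelling: keep only high-confidence predictions as training labels
threshold = 0.8
confident = p.max(axis=1) >= threshold
pseudo_labels = p.argmax(axis=1)[confident]

# Custom soft vote: average each model's class probabilities, then argmax
def soft_vote(prob_list, weights=None):
    stacked = np.stack(prob_list)  # (n_models, n_samples, n_classes)
    avg = np.average(stacked, axis=0, weights=weights)
    return avg.argmax(axis=1)

# A second hypothetical model's probabilities, to make the vote non-trivial
q = rng.dirichlet(np.ones(3), size=5)
preds = soft_vote([p, q], weights=[0.6, 0.4])
print(preds.shape)  # (5,)
```

Averaging probabilities rather than hard labels lets a confident model outvote an uncertain one, which is the point of soft over hard voting.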
S2E8 - Regularization (519/1888)
- Predict failure on a synthetic dataset of 26,500 samples and 26 features that mimicked a real-world product test
- A tricky competition with many missing values; some features were strongly correlated with the target yet did not produce great results
- The final model was untuned logistic regression with a few engineered features
- Things I learned:
- The importance of creating a robust cross-validation method, tricky here because the test set contained categorical values and groups that were not in the training data
- Missing data can be engineered into features
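Turning missingness itself into features can be sketched as follows. The toy matrix and the mean imputation are illustrative assumptions, not the competition pipeline:

```python
import numpy as np

# Hypothetical feature matrix with NaNs standing in for missing measurements
X = np.array([[1.0, np.nan, 3.0],
              [np.nan, 2.0, 3.0],
              [4.0, 5.0, np.nan]])

# One binary indicator column per feature: was this value missing?
indicators = np.isnan(X).astype(float)

# A row-level feature: how many measurements are missing per sample
n_missing = np.isnan(X).sum(axis=1, keepdims=True)

# Simple mean imputation so the original columns stay usable downstream
col_means = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), col_means, X)

X_engineered = np.hstack([X_imputed, indicators, n_missing])
print(X_engineered.shape)  # (3, 7)
```

If missingness correlates with the target (e.g. a sensor that fails under stress), the indicator columns carry signal that plain imputation would throw away.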
S3E10 - Nonlinearity (6/807)
RSNA - Breast Cancer (962/1687)
- A 314 GB dataset of ~55,000 images from ~12,000 patients in the form of DICOM files, with an accompanying auxiliary CSV