World Bank Pover-T Tests: Predicting Poverty
world-bank-pover-t-tests-solution
├── Background and Submission Overview.md
├── data
│ └── get_data.sh
├── README.md
├── requirements.txt
├── src
│ ├── bayesian-opts-res
│ │ └── bayesian-opt-test-preds
│ ├── Data Processor Original Dataset.ipynb
│ ├── Full Bayesian Model Training and Predictions.ipynb
│ └── modules
│ ├── __init__.py
│ ├── training_models.py
│ ├── training_optimizers.py
│ └── training_utils.py
└── submission
6 directories, 10 files
Place the training data inside the data/ directory of the project. This can also be done automatically (assuming you're in the root directory) by running:
$ cd data/
$ bash get_data.sh
The following files must be present inside the data/ directory before proceeding to the next step, which generates the transformed dataset for training.
│ ├── A_hhold_test.csv
│ ├── A_hhold_train.csv
│ ├── A_indiv_test.csv
│ ├── A_indiv_train.csv
│ ├── B_hhold_test.csv
│ ├── B_hhold_train.csv
│ ├── B_indiv_test.csv
│ ├── B_indiv_train.csv
│ ├── C_hhold_test.csv
│ ├── C_hhold_train.csv
│ ├── C_indiv_test.csv
│ ├── C_indiv_train.csv
Assuming you're in the root directory, navigate to the src/ directory and open the Data Processor Original Dataset.ipynb notebook. The notebook applies the following transformations to the hhold and indiv datasets for each country.
Process to generate indiv_cat:
- Take only categorical features
- One-hot-encode the features
- Summarize the encoded features to represent a household using:
  - mean
  - median
  - all
  - any
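The indiv_cat steps above can be sketched with pandas as follows. This is a minimal illustration, not the notebook's actual code: the toy frame, the `id` household key, the `job` column, and the `_mean`/`_median`/`_all`/`_any` suffixes are all assumptions made for the example, and it targets a current Python 3 environment rather than the pinned Python 2.7 stack.

```python
import pandas as pd

# Toy stand-in for one country's indiv file (hypothetical column names).
indiv = pd.DataFrame({
    "id": [1, 1, 2, 2, 2],            # household identifier (assumed name)
    "job": ["a", "b", "a", "a", "a"], # a categorical feature
})

# One-hot-encode the categorical features (everything except the household id).
encoded = pd.get_dummies(indiv.drop(columns="id")).astype(int)
encoded["id"] = indiv["id"]

# Summarize the encoded features per household with the four aggregates.
grouped = encoded.groupby("id")
parts = {
    "mean": grouped.mean(),
    "median": grouped.median(),
    "all": grouped.all().astype(int),
    "any": grouped.any().astype(int),
}

# One row per household, one column per (encoded feature, aggregate) pair.
indiv_cat = pd.concat(
    [frame.add_suffix("_" + name) for name, frame in parts.items()], axis=1
)
print(indiv_cat)
```

Each household ends up as a single row, so the result can be joined to the household-level data on the household key.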
Process to generate hhold-transformed:
- Take numeric and categorical data
- For numeric, transform data using:
  - MinMax scaler: mx_
  - Standard scaler: sc_
- For categorical, encode data:
  - Use label encoding
  - Use the label encoded data to perform one-hot-encoding
  - Retain the label encoding
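A minimal sketch of the hhold transformation described above, using scikit-learn. It assumes that mx_ and sc_ are column-name prefixes for the scaled copies of each numeric feature; the toy columns `rooms` and `roof` are hypothetical, and the code targets a current Python 3 environment rather than the pinned versions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler

# Toy stand-in for one country's hhold file (hypothetical column names).
hhold = pd.DataFrame({
    "rooms": [1.0, 2.0, 3.0],       # numeric feature
    "roof": ["tin", "tile", "tin"], # categorical feature
})

numeric_cols = hhold.select_dtypes(include="number").columns
categorical_cols = hhold.select_dtypes(exclude="number").columns

out = pd.DataFrame(index=hhold.index)

# Numeric: keep a MinMax-scaled and a standard-scaled copy of each column.
for col in numeric_cols:
    values = hhold[[col]]
    out["mx_" + col] = MinMaxScaler().fit_transform(values).ravel()
    out["sc_" + col] = StandardScaler().fit_transform(values).ravel()

# Categorical: label-encode, retain the labels, then one-hot-encode them.
for col in categorical_cols:
    labels = pd.Series(LabelEncoder().fit_transform(hhold[col]), index=hhold.index)
    out[col] = labels
    dummies = pd.get_dummies(labels, prefix=col).astype(int)
    out = pd.concat([out, dummies], axis=1)

print(out)
```

In practice the scalers and encoders would be fit on the training file and reused on the test file so that both share the same columns and scaling.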
The above process will generate these additional files inside the data/ directory. These will be used by the models.
│ ├── A-hhold-transformed-test.csv
│ ├── A-hhold-transformed-train.csv
│ ├── B-hhold-transformed-test.csv
│ ├── B-hhold-transformed-train.csv
│ ├── C-hhold-transformed-test.csv
│ ├── C-hhold-transformed-train.csv
│ ├── indiv_cat_train.hdf
│ ├── indiv_cat_test.hdf
For each country, the final model blends the meta-predictions from 20 variations of 5 base models. The following base models are used:
- Logistic Regression with L1 regularization
- Neural Network (3 hidden layers)
- Random Forest
- LightGBM
- XGBoost
Each variation is produced by running Bayesian optimization over a base model, given a range of parameter values. The Bayesian optimization optimizes the prediction score on an optimization fold. The optimization fold is varied at random to make the model mixture more robust: using only a single optimization fold would likely lead to overfitting.
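To make the randomly varying optimization fold concrete, here is a sketch of an objective function that an optimizer could call repeatedly. It is not the project's code: `make_objective`, the 5-fold split, the tuned parameter `C`, and the use of negative log loss as the score are all assumptions for illustration, and the Bayesian optimizer itself is left out. It targets a current Python 3 environment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold


def make_objective(X, y, n_folds=5, seed=0):
    """Build a score function that evaluates on a randomly chosen fold."""
    rng = np.random.RandomState(seed)
    splitter = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    folds = list(splitter.split(X))

    def objective(C):
        # Pick a different optimization fold on each call.
        train_idx, opt_idx = folds[rng.randint(n_folds)]
        model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        model.fit(X[train_idx], y[train_idx])
        preds = model.predict_proba(X[opt_idx])[:, 1]
        # Higher is better, so return the negative log loss (assumed score).
        return -log_loss(y[opt_idx], preds, labels=[0, 1])

    return objective


# Synthetic data, purely for demonstration.
data_rng = np.random.RandomState(0)
X = data_rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * data_rng.normal(size=200) > 0).astype(int)

objective = make_objective(X, y)
score = objective(1.0)
print(score)
```

An optimizer would then search over `C` (and, for other base models, their own parameters) by calling `objective` many times and keeping the best-scoring settings.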
The 20 meta-models with the highest optimization-fold scores are included in the blending model. The blending model is trained by minimizing the log loss of the out-of-fold (OOF) predictions against the actual values. The optimization variables are the weights that each meta-model contributes to the final prediction.
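The weight fitting described above might look like the following sketch with scipy. The constraints (weights between 0 and 1 that sum to 1), the SLSQP solver, and the function name `fit_blend_weights` are assumptions, not details confirmed by the project; the data is synthetic. It targets a current Python 3 environment.

```python
import numpy as np
from scipy.optimize import minimize


def log_loss(y_true, p, eps=1e-15):
    """Binary log loss with clipping to avoid log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))


def fit_blend_weights(oof_preds, y_true):
    """Find per-model weights that minimize log loss of the blended OOF predictions."""
    n_models = oof_preds.shape[1]
    start = np.full(n_models, 1.0 / n_models)  # begin from an equal-weight blend
    result = minimize(
        lambda w: log_loss(y_true, oof_preds @ w),
        start,
        method="SLSQP",
        bounds=[(0.0, 1.0)] * n_models,                            # assumed constraint
        constraints={"type": "eq", "fun": lambda w: w.sum() - 1},  # assumed constraint
    )
    return result.x


# Synthetic OOF predictions: one informative meta-model and one pure-noise one.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200).astype(float)
good = np.clip(0.8 * y + 0.1 + rng.normal(0, 0.05, 200), 0.01, 0.99)
noise = rng.uniform(0.01, 0.99, 200)
oof = np.column_stack([good, noise])

weights = fit_blend_weights(oof, y)
print(weights, log_loss(y, oof @ weights))
```

With 20 meta-models, `oof` would have 20 columns, and the fitted weights would be applied to the corresponding test-set predictions to produce the final prediction.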
- Python version 2.7.12
This project depends on the following Python modules:
- Standard:
  - os
  - datetime
  - glob
  - cPickle
  - time
  - warnings
  - hashlib
  - contextlib
- Packages:
  - numpy==1.14.0
  - pandas==0.20.2
  - joblib==0.11
  - bayesian-optimization==0.6.0
  - scikit-learn==0.19.0
  - xgboost==0.7
  - lightgbm==2.1.0
  - scipy==1.0.0
  - matplotlib==2.0.0
  - tqdm==4.11.2
Install the needed modules by running the command below in the project root directory:
$ pip install -r requirements.txt
Assuming you're in the root directory, navigate to the src/ directory and open the Full Bayesian Model Training and Predictions.ipynb notebook. Run all cells. This will take a while to complete.
The submission file will be generated and stored in the submission/ directory in the project root.
Logs from model training can be found in the output.logs file.
Please check the Background and Submission Overview for more details.