This repository presents the approach I used for the Kaggle Categorical Feature Encoding Challenge II.
To validate the results, I split the train dataset (600000 rows) into two sets of 300000 rows each. I repeated this split 4 times with a different random seed each time and calculated the CV score as the mean score over the 4 iterations.
from sklearn.metrics import roc_auc_score

from cafeen import config, steps

scores = []

for seed in [0, 1, 2, 3]:
    # read data from files
    train_x, test_x, train_y, test_y, test_id = steps.make_data(
        path_to_train=config.path_to_train,
        seed=seed,
        drop_features=['bin_3'])

    # apply encoders
    train_x, test_x = steps.encode(train_x, train_y, test_x, is_val=True)

    # apply estimator
    predicted = steps.train_predict(train_x, train_y, test_x)

    # compute ROC AUC score
    scores += [roc_auc_score(test_y.values, predicted)]
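The CV score quoted in each step below is then the mean of these four scores:

cv_score = sum(scores) / len(scores)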
The full encoding pipeline can be seen here.
As a baseline model, I used logistic regression with default parameters and the liblinear solver. All features in the dataset are one-hot encoded.
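A minimal sketch of such a baseline, assuming the one-hot encoding is done with scikit-learn's OneHotEncoder (the actual pipeline in this repository uses its own encoding steps):

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# one-hot encode every feature, then fit logistic regression
# with default parameters and the liblinear solver
baseline = make_pipeline(
    OneHotEncoder(handle_unknown='ignore'),
    LogisticRegression(solver='liblinear'))

baseline.fit(train_x, train_y)
predicted = baseline.predict_proba(test_x)[:, 1]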
CV: 0.78130, private score: 0.78527
After hyperparameter optimization, I found that the following configuration yields the highest CV score.
from sklearn.linear_model import LogisticRegression

estimator = LogisticRegression(
    C=0.049,
    class_weight={0: 1, 1: 1.42},
    solver='liblinear',
    fit_intercept=True,
    penalty='l2')
CV: 0.78519, private score: 0.78704
I dropped the bin_3 feature, as it does not seem to be important and keeping it in the dataset does not improve the score.
CV: 0.78520, private score: 0.78704
I used ordinal encoding for ord_0, ord_1, ord_4 and ord_5, approximating the categories' target means with a linear function. For ord_4 and ord_5 I removed outliers (categories with a small number of observations) before fitting the linear regression.
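A rough sketch of this idea, not the exact code from the repository: compute the target mean per category, drop rare categories, fit a linear regression of the mean on the category's ordinal rank, and encode each category with the fitted value (the min_count threshold is a hypothetical parameter):

import numpy as np
from sklearn.linear_model import LinearRegression

def encode_ordinal(x, y, feature, min_count=1):
    # target mean and observation count per category;
    # categories are assumed to already sort into their ordinal order
    stats = y.groupby(x[feature]).agg(['mean', 'count']).sort_index()
    # drop rare categories (outliers) before fitting the line
    stats = stats[stats['count'] >= min_count]
    # fit a linear function of the ordinal rank to the target means
    ranks = np.arange(len(stats)).reshape(-1, 1)
    line = LinearRegression().fit(ranks, stats['mean'])
    # encode each kept category with the value predicted by the line
    encoding = dict(zip(stats.index, line.predict(ranks)))
    return x[feature].map(encoding)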
CV: 0.78582, private score: 0.78727
For the nom_6 feature I removed all categories with fewer than 90 observations (replacing them with NaN). Then, using K-Fold target encoding, I converted the feature to numeric values and binned it into three groups with qcut.
import pandas as pd
x['nom_6'] = pd.qcut(x['nom_6'], 3, labels=False, duplicates='drop')
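The K-Fold target encoding step is not shown above; a hedged sketch of the idea (out-of-fold target means, with the number of folds being an assumption) might look like this:

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(x, y, feature, n_splits=5):
    # encode each row with the target mean computed on the other folds
    encoded = pd.Series(np.nan, index=x.index)
    for fit_idx, enc_idx in KFold(n_splits=n_splits, shuffle=True).split(x):
        means = y.iloc[fit_idx].groupby(x[feature].iloc[fit_idx]).mean()
        encoded.iloc[enc_idx] = x[feature].iloc[enc_idx].map(means).values
    return encoded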
CV: 0.78691, private score: 0.78796
For the nom_9 feature I removed all categories with fewer than 60 observations (replacing them with NaN) and combined categories that have equal target averages.
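One possible way to do the merging, as a hypothetical helper rather than the repository code: map each category to its rounded target mean, so categories with equal averages collapse into the same value.

def merge_equal_categories(x, y, feature, decimals=3):
    # categories sharing the same (rounded) target mean end up
    # with the same encoded value, i.e. they are merged
    means = y.groupby(x[feature]).mean().round(decimals)
    return x[feature].map(means)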
CV: 0.78691, private score: 0.78797
For the one-hot encoded features (all features except ord_0, ord_1, ord_4 and ord_5), I replaced missing values with -1. For the ordinal-encoded features, I replaced missing values with the target probability, 0.18721.
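A minimal sketch of this imputation, assuming plain object or numeric columns; the constant 0.18721 is presumably the overall positive rate of the training target:

ordinal_features = ['ord_0', 'ord_1', 'ord_4', 'ord_5']

# assumed to equal the quoted target probability, 0.18721
target_probability = train_y.mean()

for column in train_x.columns:
    if column in ordinal_features:
        # ordinal-encoded features: fill with the target probability
        train_x[column] = train_x[column].fillna(target_probability)
    else:
        # features to be one-hot encoded: fill with -1
        train_x[column] = train_x[column].fillna(-1)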
That's it, though I did not choose the best submission for the final score, so the official results are a bit worse.
Private score: 0.78795 (110th place), public score: 0.78669 (22nd place)