This package is a lightweight implementation of Bayesian target encoding. This implementation is based on Slakey et al., with the ensemble methodology from Larionov.
The encoding proceeds as follows:
- The user observes the target variable and chooses a likelihood for it (e.g. Bernoulli for a binary classification problem),
- Using Fink's Compendium of Priors, derive the conjugate prior for the likelihood (e.g. Beta),
- Use the training data to initialize the hyperparameters for the prior distribution,
  - NOTE: This process generally relies on common interpretations of the hyperparameters.
- Using Fink's Compendium, derive the methodology for generating the posterior distribution,
- For each level in the categorical variable,
  - Generate the posterior distribution using the observed target values for the categorical level,
  - Set the encoding value to a sample from the posterior distribution.
- If a new level appears in the dataset, the encoding will be sampled from the prior distribution. To disable this behaviour, initialize the encoder with `handle_unknown="error"`.
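For the Bernoulli/Beta pairing, the posterior update has a closed form: a Beta(a, b) prior updated with s successes out of n trials becomes Beta(a + s, b + n - s). The per-level encoding step can be sketched in NumPy as follows (an illustration of the procedure, not bayte's internal implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary target and categorical feature
y = rng.integers(0, 2, size=1000)
levels = rng.choice(["red", "green", "blue"], size=1000)

# Initialize Beta(a0, b0) hyperparameters from the training data,
# interpreting a0 as prior successes and b0 as prior failures
a0, b0 = y.sum(), len(y) - y.sum()

encodings = {}
for level in np.unique(levels):
    obs = y[levels == level]
    # Conjugate Beta-Bernoulli posterior update
    a, b = a0 + obs.sum(), b0 + len(obs) - obs.sum()
    # Encode the level as a draw from the posterior
    encodings[level] = rng.beta(a, b)
```

Each level's encoding is a single float in (0, 1), drawn from that level's posterior over the success probability.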
Then, we repeat the sampling step a total of `n_estimators` times, generating `n_estimators` training datasets with unique encodings. The final prediction is a vote across the models fit to each sampled dataset.
For reproducibility, you can set the encoding value to the mean of the posterior distribution instead.
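The ensemble procedure can be sketched as follows. This is a simplified illustration (a flat Beta(1, 1) prior, a logistic-regression base model, and a hand-rolled majority vote), not the bayte implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_estimators = 10

# Toy binary target and categorical feature
y = rng.integers(0, 2, size=500)
levels = rng.choice(["red", "green", "blue"], size=500)

# Beta posterior parameters per level, starting from a flat Beta(1, 1) prior
posteriors = {
    lvl: (
        1 + y[levels == lvl].sum(),
        1 + (levels == lvl).sum() - y[levels == lvl].sum(),
    )
    for lvl in np.unique(levels)
}

models = []
for _ in range(n_estimators):
    # Re-sample an encoding for every level, producing a unique training set
    enc = {lvl: rng.beta(a, b) for lvl, (a, b) in posteriors.items()}
    X_enc = np.array([enc[lvl] for lvl in levels]).reshape(-1, 1)
    models.append(LogisticRegression().fit(X_enc, y))

# At prediction time, encode with the posterior mean a / (a + b)
# and take a majority vote over the fitted models
mean_enc = {lvl: a / (a + b) for lvl, (a, b) in posteriors.items()}
X_new = np.array([mean_enc[lvl] for lvl in levels]).reshape(-1, 1)
votes = np.mean([m.predict(X_new) for m in models], axis=0)
pred = (votes >= 0.5).astype(int)
```

Because prediction uses the posterior mean rather than a fresh sample, repeated calls on the same data are deterministic once the models are fit.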
Install from PyPI:
```console
python -m pip install bayte
```
Let's create a binary classification dataset.
```python
import numpy as np
import pandas as pd

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=5, n_informative=2)
X = pd.DataFrame(X)
X[5] = np.random.choice(["red", "green", "blue"], size=1000)  # categorical data
```
Import and fit the encoder:
```python
import bayte as bt

encoder = bt.BayesianTargetEncoder(dist="bernoulli")
encoder.fit(X[[5]], y)
```
To encode your categorical data,
```python
X[5] = encoder.transform(X[[5]])
```
If you want to utilize the ensemble methodology described above, construct the same dataset,
```python
import numpy as np
import pandas as pd

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=5, n_informative=2)
X = pd.DataFrame(X)
X[5] = np.random.choice(["red", "green", "blue"], size=1000)  # categorical data
```
and import a classifier to supply to the ensemble class:
```python
from sklearn.svm import SVC

import bayte as bt

ensemble = bt.BayesianTargetClassifier(
    base_estimator=SVC(kernel="linear"),
    encoder=bt.BayesianTargetEncoder(dist="bernoulli"),
)
```
Fit the ensemble. NOTE: either supply an explicit list of categorical features to `categorical_feature`, or use a DataFrame with categorical data types.
```python
ensemble.fit(X, y, categorical_feature=[5])
```
When you call `predict` on a novel dataset, note that the encoder will transform your data at runtime, encoding based on the mean of the posterior distribution:
```python
ensemble.predict(X)
```