"That's the worst name I ever heard."
Shabadoo is the worst kind of machine learning. It automates nothing; your models will not perform well and it will be your own fault.
BEWARE. Shabadoo is in an open alpha phase. It is authored by someone who does not know how to manage open source projects. Things will change as the author identifies mistakes and corrects (?) them.
Shabadoo is for people who want to do Bayesian regression but who do not want to write probabilistic programming code. You only need to assign priors to features and pass your pandas dataframe to a .fit() / .predict() API.
Shabadoo runs on numpyro and is basically a wrapper around the numpyro Bayesian regression tutorial.
pip install shabadoo
or
pip install git+https://github.com/nolanbconaway/shabadoo
Shabadoo was designed to make it as easy as possible to test ideas about features and their priors. Models are defined using a class which contains configuration specifying how the model should behave.
You need to define a new class which inherits from one of the Shabadoo models. Currently, Normal, Poisson, and Bernoulli are implemented.
import numpy as np
import pandas as pd
from numpyro import distributions as dist
from shabadoo import Normal
# random number generator seed, to reproduce exactly.
RNG_KEY = np.array([0, 0])
class Model(Normal):
    dv = "y"
    features = dict(
        const=dict(transformer=1, prior=dist.Normal(0, 1)),
        x=dict(transformer=lambda df: df.x, prior=dist.Normal(0, 1)),
    )
df = pd.DataFrame(dict(x=[1, 2, 2, 3, 4, 5], y=[1, 2, 3, 4, 3, 5]))
The dv attribute specifies the variable you are predicting. features is a dictionary of dictionaries, with one item per feature. Above, two features are defined (const and x). Each feature needs a transformer and a prior.
The transformer specifies how to obtain the feature given a source dataframe. The prior specifies your beliefs about the model's coefficient for that feature.
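To make the role of a transformer concrete, here is a rough sketch of how a features dictionary could be turned into a design matrix. This is not Shabadoo's actual code: the helper build_design_matrix is hypothetical, and treating a non-callable transformer (such as transformer=1 for const) as a constant column is an assumption based on the example above. Priors are left out so the sketch only needs pandas.

```python
import pandas as pd

# Same shape as the `features` config above, minus the priors.
features = dict(
    const=dict(transformer=1),
    x=dict(transformer=lambda df: df.x),
)


def build_design_matrix(df, features):
    """Hypothetical helper: apply each transformer to the source dataframe."""
    columns = {}
    for name, config in features.items():
        transformer = config["transformer"]
        # Assumption: callables receive the dataframe, anything else is a constant.
        columns[name] = transformer(df) if callable(transformer) else transformer
    # Passing the index lets pandas broadcast constants to every row.
    return pd.DataFrame(columns, index=df.index)


df = pd.DataFrame(dict(x=[1, 2, 2, 3, 4, 5], y=[1, 2, 3, 4, 3, 5]))
design = build_design_matrix(df, features)
print(design)
```

Each column of the resulting frame lines up with one coefficient, which is why every feature carries its own prior.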
Shabadoo models implement the well-known .fit / .predict API pattern.
model = Model().fit(df, rng_key=RNG_KEY)
# sample: 100%|██████████| 1500/1500 [00:04<00:00, 308.01it/s, 7 steps of size 4.17e-01. acc. prob=0.89]
model.predict(df)
"""
0 1.351874
1 2.219510
2 2.219510
3 3.087146
4 3.954782
5 4.822418
"""
Use model.predict(df, ci=True) to obtain a credible interval around the model's prediction. This interval accounts for error in estimating the model's coefficients but does not account for the error around the model's point estimate (PRs welcome, y'all!).
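The idea of an interval that reflects only coefficient uncertainty can be sketched in plain Python: push each posterior draw of the coefficients through the linear predictor and look at the spread, without adding any observation noise. The draws below are the five shown by model.samples_df.head() later in this README, which is far too few for real inference. The min-to-max range is a crude stand-in chosen for illustration; how Shabadoo actually defines its interval is not something this sketch claims to reproduce.

```python
from statistics import mean

# Five coefficient draws, copied from the samples_df.head() output in this README.
draws = [
    {"const": 0.074572, "x": 0.944344},
    {"const": 0.214246, "x": 1.021556},
    {"const": -0.172168, "x": 1.040136},
    {"const": 0.440978, "x": 1.176814},
    {"const": 0.454463, "x": 1.175237},
]

x_new = 2  # the feature value to predict at

# One prediction per draw: only the coefficients vary, no noise term is added.
per_draw = [d["const"] + d["x"] * x_new for d in draws]

point = mean(per_draw)
lower, upper = min(per_draw), max(per_draw)
print(f"{point:.3f} in [{lower:.3f}, {upper:.3f}]")
```

With the full set of draws you would normally use a percentile or highest-density interval rather than the raw range.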
model.predict(df, ci=True)
"""
y ci_lower ci_upper
0 1.351874 0.730992 1.946659
1 2.219510 1.753340 2.654678
2 2.219510 1.753340 2.654678
3 3.087146 2.663617 3.526434
4 3.954782 3.401837 4.548420
5 4.822418 4.047847 5.578753
"""
Shabadoo's model classes come with a number of model inspection methods. It should be easy to understand your model's composition and with Shabadoo it is!
Printing model.formula summarizes each coefficient as the mean of its MCMC samples, with the standard deviation in parentheses, to give a rough sense of its size and uncertainty.
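One way to read the formula printed just below is to plug its coefficient means back in by hand. The sketch assumes that, for the Normal model, the point prediction is each feature times its mean coefficient, summed. That is an inference from the outputs shown in this README, not a statement about Shabadoo's internals.

```python
# Coefficient means copied from the printed model.formula output.
coef_means = {"const": 0.48424, "x": 0.86764}

x_values = [1, 2, 2, 3, 4, 5]

# const is a column of ones, so its coefficient acts as the intercept.
by_hand = [coef_means["const"] * 1 + coef_means["x"] * x for x in x_values]

# Predictions copied from the model.predict(df) output earlier.
reported = [1.351874, 2.219510, 2.219510, 3.087146, 3.954782, 4.822418]

max_gap = max(abs(a - b) for a, b in zip(by_hand, reported))
print(f"largest gap: {max_gap:.6f}")
```

The two agree to about four decimal places, with the remainder explained by the rounding of the printed coefficients.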
print(model.formula)
"""
y = (
const * 0.48424(+-0.64618)
+ x * 0.86764(+-0.21281)
)
"""
Samples from fitted models can be accessed using model.samples (for raw device arrays) and model.samples_df (for a tidy DataFrame).
model.samples['x']
"""
DeviceArray([[0.9443443 , 1.0215557 , 1.0401363 , 1.1768144 , 1.1752374 ,
...
"""
model.samples_df.head()
"""
                 const         x
chain sample
0     0       0.074572  0.944344
      1       0.214246  1.021556
      2      -0.172168  1.040136
      3       0.440978  1.176814
      4       0.454463  1.175237
"""
The Model.metrics() method is packed with functionality. You should not have to write a lot of code to evaluate your model's prediction accuracy!
Obtaining aggregate statistics is as easy as:
model.metrics(df)
{'r': 0.8646920305474705,
'rsq': 0.7476923076923075,
'mae': 0.5661819464378061,
'mape': 0.21729708806356265}
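The mae and mape values above can be reproduced by hand from the predictions shown earlier. Judging from the numbers in this README, mae is the mean absolute residual (actual minus predicted) and mape is the mean absolute percentage error expressed as a fraction. These definitions are inferred from the output rather than taken from Shabadoo's source.

```python
# Actual y values and the model.predict(df) output from earlier.
actual = [1, 2, 3, 4, 3, 5]
predicted = [1.351874, 2.219510, 2.219510, 3.087146, 3.954782, 4.822418]

# Per-point errors: residual, percentage error, absolute percentage error.
residual = [a - p for a, p in zip(actual, predicted)]
pe = [100 * r / a for r, a in zip(residual, actual)]
ape = [abs(v) for v in pe]

# Aggregates: mean absolute error and mean absolute percentage error.
mae = sum(abs(r) for r in residual) / len(residual)
mape = sum(ape) / len(ape) / 100

print(round(mae, 4), round(mape, 4))  # 0.5662 0.2173
```

The intermediate residual, pe, and ape lists correspond to the per-point columns returned when aggerrs=False.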
For per-point errors, use aggerrs=False. A pandas dataframe will be returned that you can join on your source data using its index.
model.metrics(df, aggerrs=False)
"""
residual pe ape
0 -0.351874 -35.187366 35.187366
1 -0.219510 -10.975488 10.975488
2 0.780490 26.016341 26.016341
3 0.912854 22.821353 22.821353
4 -0.954782 -31.826066 31.826066
5 0.177582 3.551638 3.551638
"""
You can use grouped_metrics to understand within-group errors. Under the hood, the predicted and actual dv are groupby-aggregated (default sum) and metrics are computed within each group.
df["group"] = [1, 1, 1, 2, 2, 2]
model.grouped_metrics(df, 'group')
{'r': 1.0,
'rsq': 1.0,
'mae': 0.17238043177407247,
'mape': 0.023077819594065668}
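To see what the groupby aggregation does, the sketch below sums actual and predicted values within each group and only then computes errors. It reproduces the grouped mae and mape above from the per-row predictions. It works with absolute errors only, so it makes no claim about the sign convention Shabadoo uses for grouped residuals.

```python
from collections import defaultdict

actual = [1, 2, 3, 4, 3, 5]
predicted = [1.351874, 2.219510, 2.219510, 3.087146, 3.954782, 4.822418]
group = [1, 1, 1, 2, 2, 2]

# Aggregate first (sum is the stated default), then compare the totals.
actual_sum = defaultdict(float)
predicted_sum = defaultdict(float)
for g, a, p in zip(group, actual, predicted):
    actual_sum[g] += a
    predicted_sum[g] += p

abs_err = {g: abs(actual_sum[g] - predicted_sum[g]) for g in actual_sum}
ape = {g: 100 * abs_err[g] / actual_sum[g] for g in actual_sum}

mae = sum(abs_err.values()) / len(abs_err)
mape = sum(ape.values()) / len(ape) / 100

print(round(mae, 4), round(mape, 4))  # 0.1724 0.0231
```

Because over- and under-predictions cancel inside a group, grouped errors are much smaller here than the per-point ones.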
model.grouped_metrics(df, "group", aggerrs=False)
"""
residual pe ape
group
1 -0.209107 -3.485113 3.485113
2 -0.135654 -1.130450 1.130450
"""
Shabadoo models have to_json and from_dict methods which allow models to be saved and recovered exactly.
import json
# export to a JSON string
model_json = model.to_json()
# recover the model
model_recovered = Model.from_dict(json.loads(model_json))
# check the predictions are the same
model_recovered.predict(df).equals(model.predict(df))
True
To get a development installation going, set up a Python 3.6 or 3.7 virtualenv however you'd like and set up an editable installation of Shabadoo like so:
$ git clone https://github.com/nolanbconaway/shabadoo.git
$ cd shabadoo
$ pip install -e .[test]
You should be able to run the full test suite via:
$ tox -e py36 # or py37 if that's what you installed