Merge pull request #80 from jrzaurin/pmulinka/dir
Pmulinka/dir
jrzaurin committed Mar 10, 2022
2 parents 8b4c3a8 + 108cebc commit 923011c
Showing 189 changed files with 16,189 additions and 161,921 deletions.
2 changes: 2 additions & 0 deletions .coveragerc
@@ -2,8 +2,10 @@
parallel = True
omit =
pytorch_widedeep/optim/*
pytorch_widedeep/bayesian_models/bayesian_nn/modules/*

[report]
omit =
pytorch_widedeep/optim/*
pytorch_widedeep/bayesian_models/bayesian_nn/modules/*
precision = 2
13 changes: 12 additions & 1 deletion .gitignore
@@ -11,6 +11,10 @@ __pycache__*
.ipynb_checkpoints
Untitled*.ipynb

# sublime debugger
*.sublime-project
*.sublime-workspace

# data related dirs
tmp_data/
model_weights/
@@ -27,6 +31,9 @@ htmlcov*/
.cache
.hypothesis/

# vscode
.vscode

# sublime
*.sublime-workspace
sftp*-config.json
@@ -44,4 +51,8 @@ _build
_templates

# test checkpoints
checkpoints
checkpoints

# wnb
wandb/
wandb_api.key
134 changes: 59 additions & 75 deletions README.md
@@ -15,14 +15,14 @@

# pytorch-widedeep

A flexible package to use Deep Learning with tabular data, text and images
using wide and deep models.
A flexible package for multimodal deep learning, combining tabular data with
text and images using Wide and Deep models in PyTorch.

**Documentation:** [https://pytorch-widedeep.readthedocs.io](https://pytorch-widedeep.readthedocs.io/en/latest/index.html)

**Companion posts and tutorials:** [infinitoml](https://jrzaurin.github.io/infinitoml/)

**Experiments and comparisson with `LightGBM`**: [TabularDL vs LightGBM](https://github.com/jrzaurin/tabulardl-benchmark)
**Experiments and comparison with `LightGBM`**: [TabularDL vs LightGBM](https://github.com/jrzaurin/tabulardl-benchmark)

The content of this document is organized as follows:

@@ -33,7 +33,8 @@ The content of this document is organized as follows:

### Introduction

``pytorch-widedeep`` is based on Google's [Wide and Deep Algorithm](https://arxiv.org/abs/1606.07792)
``pytorch-widedeep`` is based on Google's [Wide and Deep Algorithm](https://arxiv.org/abs/1606.07792),
adjusted for multi-modal datasets.

In general terms, `pytorch-widedeep` is a package to use deep learning with
tabular data. In particular, it is intended to facilitate the combination of text
@@ -89,15 +90,11 @@ into:
<img width="300" src="docs/figures/architecture_2_math.png">
</p>

I recommend using the ``wide`` and ``deeptabular`` models in
``pytorch-widedeep``. However it is very likely that users will want to use
their own models for the ``deeptext`` and ``deepimage`` components. That is
perfectly possible as long as the the custom models have an attribute called
It is perfectly possible to use custom models (and not necessarily those in
the library) as long as the custom models have an attribute called
``output_dim`` with the size of the last layer of activations, so that
``WideDeep`` can be constructed. Again, examples on how to use custom
components can be found in the Examples folder. Just in case
``pytorch-widedeep`` includes standard text (stack of LSTMs) and image
(pre-trained ResNets or stack of CNNs) models.
``WideDeep`` can be constructed. Examples on how to use custom components can
be found in the Examples folder.
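
For illustration, here is a minimal sketch of what such a custom component might look like. The class name, layer choices and sizes are made up; the only requirement taken from the text above is the ``output_dim`` attribute:

```python
import torch
from torch import nn


class MyDeepText(nn.Module):
    """Hypothetical custom text component for the ``deeptext`` slot."""

    def __init__(self, vocab_size: int, embed_dim: int = 64, hidden_dim: int = 32):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # the one attribute WideDeep relies on: size of the last layer of activations
        self.output_dim = hidden_dim

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        # X: (batch_size, seq_len) of token indices
        _, (h_n, _) = self.lstm(self.embedding(X))
        return h_n[-1]
```

Such a module could then be passed as the ``deeptext`` argument of ``WideDeep``, alongside the ``wide`` and ``deeptabular`` components shown in the quick start below.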

### The ``deeptabular`` component

@@ -110,15 +107,17 @@ its own, i.e. what one might normally refer to as Deep Learning for Tabular
Data. Currently, ``pytorch-widedeep`` offers the following different models
for that component:


0. **Wide**: a simple linear model where the nonlinearities are captured via
cross-product transformations, as explained before.
1. **TabMlp**: a simple MLP that receives embeddings representing the
categorical features, concatenated with the continuous features.
categorical features, concatenated with the continuous features, which can
also be embedded.
2. **TabResnet**: similar to the previous model but the embeddings are
passed through a series of ResNet blocks built with dense layers.
3. **TabNet**: details on TabNet can be found in
[TabNet: Attentive Interpretable Tabular Learning](https://arxiv.org/abs/1908.07442)

And the ``Tabformer`` family, i.e. Transformers for Tabular data:
The ``Tabformer`` family, i.e. Transformers for Tabular data:

4. **TabTransformer**: details on the TabTransformer can be found in
[TabTransformer: Tabular Data Modeling Using Contextual Embeddings](https://arxiv.org/pdf/2012.06678.pdf).
@@ -133,12 +132,19 @@ on the FastFormer can be found in
the Perceiver can be found in
[Perceiver: General Perception with Iterative Attention](https://arxiv.org/abs/2103.03206)

And probabilistic DL models for tabular data based on
[Weight Uncertainty in Neural Networks](https://arxiv.org/abs/1505.05424):

9. **BayesianWide**: Probabilistic adaptation of the `Wide` model.
10. **BayesianTabMlp**: Probabilistic adaptation of the `TabMlp` model

Note that while there are scientific publications for the TabTransformer,
SAINT and FT-Transformer, the TabFastFormer and TabPerceiver are our own
adaptations of those algorithms for tabular data.

For details on these models and their options please see the examples in the
Examples folder and the documentation.
For details on these models (and all the other models in the library for the
different data modes) and their corresponding options please see the examples
in the Examples folder and the documentation.
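
As a rough sketch of how one tabular model is swapped for another, the snippet below uses ``TabResnet`` in the ``deeptabular`` role. The constructor arguments are assumed to mirror ``TabMlp``'s (as used in the quick start below); check the documentation for the exact signature:

```python
from pytorch_widedeep.preprocessing import TabPreprocessor
from pytorch_widedeep.models import TabResnet, WideDeep

# df_train, cat_embed_cols and continuous_cols as defined in the quick start below
tab_preprocessor = TabPreprocessor(
    cat_embed_cols=cat_embed_cols, continuous_cols=continuous_cols
)
X_tab = tab_preprocessor.fit_transform(df_train)

# any of the tabular models listed above can play the deeptabular role
deeptabular = TabResnet(
    column_idx=tab_preprocessor.column_idx,
    cat_embed_input=tab_preprocessor.cat_embed_input,
    continuous_cols=continuous_cols,
)
model = WideDeep(deeptabular=deeptabular)
```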

### Installation

@@ -165,27 +171,6 @@ cd pytorch-widedeep
pip install -e .
```

**Important note for Mac users**: at the time of writing the latest `torch`
release is `1.9`. Some past [issues](https://stackoverflow.com/questions/64772335/pytorch-w-parallelnative-cpp206)
when running on Mac, present in previous versions, persist on this release
and the data-loaders will not run in parallel. In addition, since `python
3.8`, [the `multiprocessing` library start method changed from `'fork'` to`'spawn'`](https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods).
This also affects the data-loaders (for any `torch` version) and they will
not run in parallel. Therefore, for Mac users I recommend using `python 3.7`
and `torch <= 1.6` (with the corresponding, consistent
version of `torchvision`, e.g. `0.7.0` for `torch 1.6`). I do not want to
force this versioning in the `setup.py` file since I expect that all these
issues are fixed in the future. Therefore, after installing
`pytorch-widedeep` via pip or directly from github, downgrade `torch` and
`torchvision` manually:

```bash
pip install pytorch-widedeep
pip install torch==1.6.0 torchvision==0.7.0
```

None of these issues affect Linux users.

### Quick start

Binary classification with the [adult
Expand All @@ -195,7 +180,6 @@ using `Wide` and `DeepDense` and defaults settings.
Building a wide (linear) and deep model with ``pytorch-widedeep``:

```python

import pandas as pd
import numpy as np
import torch
@@ -205,16 +189,15 @@ from pytorch_widedeep import Trainer
from pytorch_widedeep.preprocessing import WidePreprocessor, TabPreprocessor
from pytorch_widedeep.models import Wide, TabMlp, WideDeep
from pytorch_widedeep.metrics import Accuracy
from pytorch_widedeep.datasets import load_adult


# the following 4 lines are not directly related to ``pytorch-widedeep``. I
# assume you have downloaded the dataset and place it in a dir called
# data/adult/
df = pd.read_csv("data/adult/adult.csv.zip")
df = load_adult(as_frame=True)
df["income_label"] = (df["income"].apply(lambda x: ">50K" in x)).astype(int)
df.drop("income", axis=1, inplace=True)
df_train, df_test = train_test_split(df, test_size=0.2, stratify=df.income_label)

# prepare wide, crossed, embedding and continuous columns
# Define the 'column set up'
wide_cols = [
"education",
"relationship",
@@ -223,49 +206,53 @@ wide_cols = [
"native-country",
"gender",
]
cross_cols = [("education", "occupation"), ("native-country", "occupation")]
embed_cols = [
("education", 16),
("workclass", 16),
("occupation", 16),
("native-country", 32),
]
cont_cols = ["age", "hours-per-week"]
target_col = "income_label"
crossed_cols = [("education", "occupation"), ("native-country", "occupation")]

# target
target = df_train[target_col].values
cat_embed_cols = [
"workclass",
"education",
"marital-status",
"occupation",
"relationship",
"race",
"gender",
"capital-gain",
"capital-loss",
"native-country",
]
continuous_cols = ["age", "hours-per-week"]
target = "income_label"
target = df_train[target].values

# wide
wide_preprocessor = WidePreprocessor(wide_cols=wide_cols, crossed_cols=cross_cols)
# prepare the data
wide_preprocessor = WidePreprocessor(wide_cols=wide_cols, crossed_cols=crossed_cols)
X_wide = wide_preprocessor.fit_transform(df_train)
wide = Wide(wide_dim=np.unique(X_wide).shape[0], pred_dim=1)

# deeptabular
tab_preprocessor = TabPreprocessor(embed_cols=embed_cols, continuous_cols=cont_cols)
tab_preprocessor = TabPreprocessor(
cat_embed_cols=cat_embed_cols, continuous_cols=continuous_cols # type: ignore[arg-type]
)
X_tab = tab_preprocessor.fit_transform(df_train)
deeptabular = TabMlp(
mlp_hidden_dims=[64, 32],

# build the model
wide = Wide(input_dim=np.unique(X_wide).shape[0], pred_dim=1)
tab_mlp = TabMlp(
column_idx=tab_preprocessor.column_idx,
embed_input=tab_preprocessor.embeddings_input,
continuous_cols=cont_cols,
cat_embed_input=tab_preprocessor.cat_embed_input,
continuous_cols=continuous_cols,
)
model = WideDeep(wide=wide, deeptabular=tab_mlp)

# wide and deep
model = WideDeep(wide=wide, deeptabular=deeptabular)

# train the model
# train and validate
trainer = Trainer(model, objective="binary", metrics=[Accuracy])
trainer.fit(
X_wide=X_wide,
X_tab=X_tab,
target=target,
n_epochs=5,
batch_size=256,
val_split=0.1,
)

# predict
# predict on test
X_wide_te = wide_preprocessor.transform(df_test)
X_tab_te = tab_preprocessor.transform(df_test)
preds = trainer.predict(X_wide=X_wide_te, X_tab=X_tab_te)
@@ -282,14 +269,11 @@ torch.save(model.state_dict(), "model_weights/wd_model.pt")
# From here in advance, Option 1 or 2 are the same. I assume the user has
# prepared the data and defined the new model components:
# 1. Build the model
model_new = WideDeep(wide=wide, deeptabular=deeptabular)
model_new = WideDeep(wide=wide, deeptabular=tab_mlp)
model_new.load_state_dict(torch.load("model_weights/wd_model.pt"))

# 2. Instantiate the trainer
trainer_new = Trainer(
model_new,
objective="binary",
)
trainer_new = Trainer(model_new, objective="binary")

# 3. Either start the fit or directly predict
preds = trainer_new.predict(X_wide=X_wide, X_tab=X_tab)
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
1.0.14
1.1.0
15 changes: 15 additions & 0 deletions docs/bayesian_models.rst
@@ -0,0 +1,15 @@
The ``bayesian models`` module
==============================

This module contains the two Bayesian models available in this library, namely
the Bayesian versions of the ``Wide`` and ``TabMlp`` models, referred to as
``BayesianWide`` and ``BayesianTabMlp``.


.. autoclass:: pytorch_widedeep.bayesian_models.tabular.bayesian_linear.bayesian_wide.BayesianWide
:exclude-members: forward
:members:

.. autoclass:: pytorch_widedeep.bayesian_models.tabular.bayesian_mlp.bayesian_tab_mlp.BayesianTabMlp
:exclude-members: forward
:members:
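
A purely illustrative instantiation sketch follows; the import path and the constructor arguments are assumptions made to mirror the non-Bayesian ``TabMlp``, so consult the class documentation above for the real signature:

```python
from pytorch_widedeep.bayesian_models import BayesianTabMlp  # import path assumed

# tab_preprocessor: a fitted TabPreprocessor, as in the README quick start
bayesian_tab_mlp = BayesianTabMlp(
    column_idx=tab_preprocessor.column_idx,
    cat_embed_input=tab_preprocessor.cat_embed_input,
    continuous_cols=["age", "hours-per-week"],
    mlp_hidden_dims=[64, 32],
)
```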
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -103,7 +103,7 @@
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = [u"_build", "Thumbs.db", ".DS_Store"]
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]

# The name of the Pygments (syntax highlighting) style to use.
pygments_style = "sphinx"
27 changes: 14 additions & 13 deletions docs/examples.rst
@@ -5,16 +5,17 @@ This section provides links to example notebooks that may be helpful to better
understand the functionalities within ``pytorch-widedeep`` and how to use
them to address different problems.

* `Preprocessors and Utils <https://github.com/jrzaurin/pytorch-widedeep/blob/master/examples/01_Preprocessors_and_utils.ipynb>`__
* `Model Components <https://github.com/jrzaurin/pytorch-widedeep/blob/master/examples/02_1_Model_Components.ipynb>`__
* `deeptabular Models <https://github.com/jrzaurin/pytorch-widedeep/blob/master/examples/02_2_deeptabular_models.ipynb>`__
* `Binary Classification with default parameters <https://github.com/jrzaurin/pytorch-widedeep/blob/master/examples/03_Binary_Classification_with_Defaults.ipynb>`__
* `Binary Classification with varying parameters <https://github.com/jrzaurin/pytorch-widedeep/blob/master/examples/04_Binary_Classification_Varying_Parameters.ipynb>`__
* `Regression with Images and Text <https://github.com/jrzaurin/pytorch-widedeep/blob/master/examples/05_Regression_with_Images_and_Text.ipynb>`__
* `FineTune routines <https://github.com/jrzaurin/pytorch-widedeep/blob/master/examples/06_FineTune_and_WarmUp_Model_Components.ipynb>`__
* `Custom Components <https://github.com/jrzaurin/pytorch-widedeep/blob/master/examples/07_Custom_Components.ipynb>`__
* `Save and Load Model and Artifacts <https://github.com/jrzaurin/pytorch-widedeep/blob/master/examples/08_save_and_load_model_and_artifacts.ipynb>`__
* `Using Custom DataLoaders and Torchmetrics <https://github.com/jrzaurin/pytorch-widedeep/blob/master/examples/09_Custom_DataLoader_Imbalanced_dataset.ipynb>`__
* `The Transformer Family <https://github.com/jrzaurin/pytorch-widedeep/blob/master/examples/10_The_Transformer_Family.ipynb>`__
* `Extracting Embeddings <https://github.com/jrzaurin/pytorch-widedeep/blob/master/examples/11_Extracting_Embeddings.ipynb>`__
* `HyperParameter Tuning With RayTune <https://github.com/jrzaurin/pytorch-widedeep/blob/master/examples/12_HyperParameter_tuning_w_RayTune.ipynb>`__
* `Preprocessors and Utils <https://github.com/jrzaurin/pytorch-widedeep/blob/master/examples/notebooks/01_Preprocessors_and_utils.ipynb>`__
* `Model Components <https://github.com/jrzaurin/pytorch-widedeep/blob/master/examples/notebooks/02_model_components.ipynb>`__
* `Binary Classification with default parameters <https://github.com/jrzaurin/pytorch-widedeep/blob/master/examples/notebooks/03_Binary_Classification_with_Defaults.ipynb>`__
* `Regression with Images and Text <https://github.com/jrzaurin/pytorch-widedeep/blob/master/examples/notebooks/04_regression_with_images_and_text.ipynb>`__
* `Save and Load Model and Artifacts <https://github.com/jrzaurin/pytorch-widedeep/blob/master/examples/notebooks/05_save_and_load_model_and_artifacts.ipynb>`__
* `FineTune routines <https://github.com/jrzaurin/pytorch-widedeep/blob/master/examples/notebooks/06_fineTune_and_warmup.ipynb>`__
* `Custom Components <https://github.com/jrzaurin/pytorch-widedeep/blob/master/examples/notebooks/07_Custom_Components.ipynb>`__
* `Using Custom DataLoaders and Torchmetrics <https://github.com/jrzaurin/pytorch-widedeep/blob/master/examples/notebooks/08_custom_dataLoader_imbalanced_dataset.ipynb>`__
* `Extracting Embeddings <https://github.com/jrzaurin/pytorch-widedeep/blob/master/examples/notebooks/09_extracting_embeddings.ipynb>`__
* `HyperParameter Tuning With RayTune <https://github.com/jrzaurin/pytorch-widedeep/blob/master/examples/notebooks/10_hyperParameter_tuning_w_raytune_n_wnb.ipynb>`__
* `Model Uncertainty Prediction <https://github.com/jrzaurin/pytorch-widedeep/blob/master/examples/notebooks/13_Model_Uncertainty_prediction.ipynb>`__
* `Bayesian Models <https://github.com/jrzaurin/pytorch-widedeep/blob/master/examples/notebooks/14_bayesian_models.ipynb>`__
* `Deep Imbalanced Regression <https://github.com/jrzaurin/pytorch-widedeep/blob/master/examples/notebooks/15_DIR-LDS_and_FDS.ipynb>`__

Binary file removed docs/figures/01_Preprocessors_and_utils_40_0.png
Binary file not shown.
Binary file removed docs/figures/01_Preprocessors_and_utils_43_0.png
Binary file not shown.
Binary file removed docs/figures/01_Preprocessors_and_utils_46_0.png
Binary file not shown.
Binary file removed docs/figures/ft_transformer_arch.png
Binary file not shown.
Binary file removed docs/figures/resnet_block.png
Binary file not shown.
Binary file removed docs/figures/saint_arch.png
Binary file not shown.
Binary file removed docs/figures/tabmlp_arch.png
Binary file not shown.
Binary file removed docs/figures/tabnet_arch_1.png
Binary file not shown.
Binary file removed docs/figures/tabnet_arch_2.png
Binary file not shown.
Binary file removed docs/figures/tabresnet_arch.png
Binary file not shown.
Binary file removed docs/figures/tabtransformer_arch.png
Binary file not shown.
Binary file removed docs/figures/transformer_block.png
Binary file not shown.
