From 108cebca69bc07b0ebf28982c30f3146947a46c4 Mon Sep 17 00:00:00 2001
From: jrzaurin
Date: Thu, 10 Mar 2022 10:31:56 +0100
Subject: [PATCH] updated README and docs

---
 README.md            | 120 +++++++++++++++++++++----------------------
 docs/index.rst       |  40 +++++++++------
 docs/quick_start.rst |  80 +++++++++++++----------------
 pypi_README.md       |  96 +++++++++++++++------------------
 4 files changed, 159 insertions(+), 177 deletions(-)

diff --git a/README.md b/README.md
index a49517a5..b02dab58 100644
--- a/README.md
+++ b/README.md
@@ -15,14 +15,14 @@

 # pytorch-widedeep

-A flexible package to use Deep Learning with tabular data, text and images
-using wide and deep models.
+A flexible package for multimodal deep learning, combining tabular data with
+text and images using Wide and Deep models in PyTorch.

 **Documentation:** [https://pytorch-widedeep.readthedocs.io](https://pytorch-widedeep.readthedocs.io/en/latest/index.html)

 **Companion posts and tutorials:** [infinitoml](https://jrzaurin.github.io/infinitoml/)

-**Experiments and comparisson with `LightGBM`**: [TabularDL vs LightGBM](https://github.com/jrzaurin/tabulardl-benchmark)
+**Experiments and comparison with `LightGBM`**: [TabularDL vs LightGBM](https://github.com/jrzaurin/tabulardl-benchmark)

 The content of this document is organized as follows:

@@ -33,7 +33,8 @@ The content of this document is organized as follows:

 ### Introduction

-``pytorch-widedeep`` is based on Google's [Wide and Deep Algorithm](https://arxiv.org/abs/1606.07792)
+``pytorch-widedeep`` is based on Google's [Wide and Deep Algorithm](https://arxiv.org/abs/1606.07792),
+adjusted for multi-modal datasets.

 In general terms, `pytorch-widedeep` is a package to use deep learning with
 tabular data. In particular, is intended to facilitate the combination of text

@@ -89,15 +90,11 @@ into:
-I recommend using the ``wide`` and ``deeptabular`` models in
-``pytorch-widedeep``. However it is very likely that users will want to use
-their own models for the ``deeptext`` and ``deepimage`` components. That is
-perfectly possible as long as the the custom models have an attribute called
+It is perfectly possible to use custom models (and not necessarily those in
+the library) as long as the custom models have an attribute called
 ``output_dim`` with the size of the last layer of activations, so that
-``WideDeep`` can be constructed. Again, examples on how to use custom
-components can be found in the Examples folder. Just in case
-``pytorch-widedeep`` includes standard text (stack of LSTMs) and image
-(pre-trained ResNets or stack of CNNs) models.
+``WideDeep`` can be constructed. Examples of how to use custom components can
+be found in the Examples folder.

 ### The ``deeptabular`` component

@@ -110,15 +107,17 @@ its own, i.e. what one might normally refer as Deep Learning for Tabular
 Data. Currently, ``pytorch-widedeep`` offers the following different models
 for that component:

+0. **Wide**: a simple linear model where the nonlinearities are captured via
+cross-product transformations, as explained before.
 1. **TabMlp**: a simple MLP that receives embeddings representing the
-categorical features, concatenated with the continuous features.
+categorical features, concatenated with the continuous features, which can
+also be embedded.
 2. **TabResnet**: similar to the previous model but the embeddings are
 passed through a series of ResNet blocks built with dense layers.
 3. **TabNet**: details on TabNet can be found in [TabNet: Attentive
 Interpretable Tabular Learning](https://arxiv.org/abs/1908.07442)

-And the ``Tabformer`` family, i.e. Transformers for Tabular data:
+The ``Tabformer`` family, i.e. Transformers for Tabular data:

 4. **TabTransformer**: details on the TabTransformer can be found in
 [TabTransformer: Tabular Data Modeling Using Contextual
 Embeddings](https://arxiv.org/pdf/2012.06678.pdf).
@@ -133,12 +132,19 @@ on the Fasformer can be found in
 the Perceiver can be found in [Perceiver: General Perception with Iterative
 Attention](https://arxiv.org/abs/2103.03206)

+And probabilistic DL models for tabular data based on
+[Weight Uncertainty in Neural Networks](https://arxiv.org/abs/1505.05424):
+
+9. **BayesianWide**: Probabilistic adaptation of the `Wide` model.
+10. **BayesianTabMlp**: Probabilistic adaptation of the `TabMlp` model.
+
 Note that while there are scientific publications for the TabTransformer,
 SAINT and FT-Transformer, the TabFasfFormer and TabPerceiver are our own
 adaptation of those algorithms for tabular data.

-For details on these models and their options please see the examples in the
-Examples folder and the documentation.
+For details on these models (and all the other models in the library for the
+different data modes) and their corresponding options please see the examples
+in the Examples folder and the documentation.

 ### Installation

 Install using pip:

 ```bash
 pip install pytorch-widedeep
 ```

 Or install directly from github

 ```bash
 pip install git+https://github.com/jrzaurin/pytorch-widedeep.git
 ```

 **Developer Install**

 ```bash
 # Clone the repo
 git clone https://github.com/jrzaurin/pytorch-widedeep
 cd pytorch-widedeep
 pip install -e .
 ```

-**Important note for Mac users**: Since `python
-3.8`, [the `multiprocessing` library start method changed from `'fork'` to`'spawn'`](https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods) which affects the data-loaders.
-For the time being, `pytorch-widedeep` sets the `num_workers` to 0 when using
-Mac and python version 3.8+.
-
-Note that this issue does not affect Linux users.
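To make the custom-component contract described above concrete, here is a
minimal sketch. ``MyDeepText`` is a hypothetical module written for this
illustration, not a model shipped with the library: any ``nn.Module`` can
serve as the ``deeptext`` (or ``deepimage``) component provided it exposes an
``output_dim`` attribute with the size of its last layer of activations.

```python
import torch
from torch import nn


class MyDeepText(nn.Module):
    """Hypothetical custom text component for ``WideDeep``."""

    def __init__(self, vocab_size: int, embed_dim: int = 32, hidden_dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # the only requirement: tell WideDeep the size of the last layer
        self.output_dim = hidden_dim

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        # X is a (batch, seq_len) tensor of token ids
        _, h = self.rnn(self.embed(X))
        return h[-1]  # (batch, hidden_dim)


deeptext = MyDeepText(vocab_size=2000)
# combined with tabular components built as in the Quick start below, e.g.
# model = WideDeep(wide=wide, deeptabular=tab_mlp, deeptext=deeptext)
```

The corresponding text array would of course need its own preprocessing; see
the Examples folder for full, runnable versions of this pattern.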
-
 ### Quick start

 Binary classification with the [adult
 dataset](https://www.kaggle.com/wenruliu/adult-income-dataset)
 using `Wide` and `DeepDense` and defaults settings.

 Building a wide (linear) and deep model with ``pytorch-widedeep``:

 ```python
-
 import pandas as pd
 import numpy as np
 import torch
 from sklearn.model_selection import train_test_split

 from pytorch_widedeep import Trainer
 from pytorch_widedeep.preprocessing import WidePreprocessor, TabPreprocessor
 from pytorch_widedeep.models import Wide, TabMlp, WideDeep
 from pytorch_widedeep.metrics import Accuracy
+from pytorch_widedeep.datasets import load_adult
+
-# the following 4 lines are not directly related to ``pytorch-widedeep``. I
-# assume you have downloaded the dataset and place it in a dir called
-# data/adult/
-df = pd.read_csv("data/adult/adult.csv.zip")
+df = load_adult(as_frame=True)
 df["income_label"] = (df["income"].apply(lambda x: ">50K" in x)).astype(int)
 df.drop("income", axis=1, inplace=True)
 df_train, df_test = train_test_split(df, test_size=0.2, stratify=df.income_label)

-# prepare wide, crossed, embedding and continuous columns
+# Define the 'column setup'
 wide_cols = [
     "education",
     "relationship",
@@ -209,38 +206,43 @@ wide_cols = [
     "native-country",
     "gender",
 ]
-cross_cols = [("education", "occupation"), ("native-country", "occupation")]
-embed_cols = [
-    ("education", 16),
-    ("workclass", 16),
-    ("occupation", 16),
-    ("native-country", 32),
-]
-cont_cols = ["age", "hours-per-week"]
-target_col = "income_label"
+crossed_cols = [("education", "occupation"), ("native-country", "occupation")]

-# target
-target = df_train[target_col].values
+cat_embed_cols = [
+    "workclass",
+    "education",
+    "marital-status",
+    "occupation",
+    "relationship",
+    "race",
+    "gender",
+    "capital-gain",
+    "capital-loss",
+    "native-country",
+]
+continuous_cols = ["age", "hours-per-week"]
+target = "income_label"
+target = df_train[target].values

-# wide
-wide_preprocessor = WidePreprocessor(wide_cols=wide_cols, crossed_cols=cross_cols)
+# prepare the data
+wide_preprocessor = WidePreprocessor(wide_cols=wide_cols, crossed_cols=crossed_cols)
 X_wide = wide_preprocessor.fit_transform(df_train)
-wide = Wide(wide_dim=np.unique(X_wide).shape[0], pred_dim=1)

-# deeptabular
-tab_preprocessor = TabPreprocessor(cat_embed_cols=embed_cols, continuous_cols=cont_cols)
+tab_preprocessor = TabPreprocessor(
+    cat_embed_cols=cat_embed_cols, continuous_cols=continuous_cols  # type: ignore[arg-type]
+)
 X_tab = tab_preprocessor.fit_transform(df_train)
-deeptabular = TabMlp(
-    mlp_hidden_dims=[64, 32],
+
+# build the model
+wide = Wide(input_dim=np.unique(X_wide).shape[0], pred_dim=1)
+tab_mlp = TabMlp(
     column_idx=tab_preprocessor.column_idx,
-    embed_input=tab_preprocessor.cat_embed_input,
-    continuous_cols=cont_cols,
+    cat_embed_input=tab_preprocessor.cat_embed_input,
+    continuous_cols=continuous_cols,
 )
+model = WideDeep(wide=wide, deeptabular=tab_mlp)

-# wide and deep
-model = WideDeep(wide=wide, deeptabular=deeptabular)
-
-# train the model
+# train and validate
 trainer = Trainer(model, objective="binary", metrics=[Accuracy])
 trainer.fit(
     X_wide=X_wide,
     X_tab=X_tab,
     target=target,
     n_epochs=5,
     batch_size=256,
-    val_split=0.1,
 )

-# predict
+# predict on test
 X_wide_te = wide_preprocessor.transform(df_test)
 X_tab_te = tab_preprocessor.transform(df_test)
 preds = trainer.predict(X_wide=X_wide_te, X_tab=X_tab_te)

 torch.save(model.state_dict(), "model_weights/wd_model.pt")

 # From here in advance, Option 1 or 2 are the same. I assume the user has
 # prepared the data and defined the new model components:
 # 1. Build the model
-model_new = WideDeep(wide=wide, deeptabular=deeptabular)
+model_new = WideDeep(wide=wide, deeptabular=tab_mlp)
 model_new.load_state_dict(torch.load("model_weights/wd_model.pt"))

 # 2. Instantiate the trainer
-trainer_new = Trainer(
-    model_new,
-    objective="binary",
-)
+trainer_new = Trainer(model_new, objective="binary")

 # 3. Either start the fit or directly predict
 preds = trainer_new.predict(X_wide=X_wide, X_tab=X_tab)

diff --git a/docs/index.rst b/docs/index.rst
index 2e573f66..32c3a33a 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -31,7 +31,8 @@ Documentation
 Introduction
 ------------
 ``pytorch-widedeep`` is based on Google's `Wide and Deep Algorithm
-<https://arxiv.org/abs/1606.07792>`_.
+<https://arxiv.org/abs/1606.07792>`_, adjusted for multi-modal datasets.
+

 In general terms, ``pytorch-widedeep`` is a package to use deep learning with
 tabular and multimodal data. In particular, is intended to facilitate the

@@ -97,9 +98,12 @@ own, i.e. what one might normally refer as Deep Learning for Tabular
 Data. Currently, ``pytorch-widedeep`` offers the following different models
 for that component:

+0. **Wide**: a simple linear model where the nonlinearities are captured via
+cross-product transformations, as explained before.
 1. **TabMlp**: a simple MLP that receives embeddings representing the
-categorical features, concatenated with the continuous features.
+categorical features, concatenated with the continuous features, which can
+also be embedded.

 2. **TabResnet**: similar to the previous model but the embeddings are
 passed through a series of ResNet blocks built with dense layers.

@@ -107,7 +111,7 @@ passed through a series of ResNet blocks built with dense layers.
 3. **TabNet**: details on TabNet can be found in `TabNet: Attentive
 Interpretable Tabular Learning <https://arxiv.org/abs/1908.07442>`_

-And the ``Tabformer`` family, i.e. Transformers for Tabular data:
+The ``Tabformer`` family, i.e. Transformers for Tabular data:

 4. **TabTransformer**: details on the TabTransformer can be found in
 `TabTransformer: Tabular Data Modeling Using Contextual Embeddings
 <https://arxiv.org/pdf/2012.06678.pdf>`_

@@ -130,22 +134,24 @@ Models for Natural Language Understanding
 the Perceiver can be found in `Perceiver: General Perception with Iterative
 Attention <https://arxiv.org/abs/2103.03206>`_

+And probabilistic DL models for tabular data based on
+`Weight Uncertainty in Neural Networks <https://arxiv.org/abs/1505.05424>`_:
+
+9. **BayesianWide**: Probabilistic adaptation of the `Wide` model.
+
+10. **BayesianTabMlp**: Probabilistic adaptation of the `TabMlp` model.
+
 Note that while there are scientific publications for the TabTransformer,
 SAINT and FT-Transformer, the TabFasfFormer and TabPerceiver are our own
-adaptation of those algorithms for tabular data.
-
-For details on these models and their options please see the examples in the
-Examples folder and the documentation.
-
-Finally, while I recommend using the ``wide`` and ``deeptabular`` models in
-``pytorch-widedeep`` it is very likely that users will want to use their own
-models for the ``deeptext`` and ``deepimage`` components. That is perfectly
-possible as long as the the custom models have an attribute called
-``output_dim`` with the size of the last layer of activations, so that
-``WideDeep`` can be constructed. Again, examples on how to use custom
-components can be found in the Examples folder. Just in case
-``pytorch-widedeep`` includes standard text (stack of LSTMs or GRUs) and
-image(pre-trained ResNets or stack of CNNs) models.
+adaptation of those algorithms for tabular data. For details on these models
+and their options please see the examples in the Examples folder and the
+documentation.
+
+Finally, it is perfectly possible to use custom models as long as the
+custom models have an attribute called ``output_dim`` with the size of the
+last layer of activations, so that ``WideDeep`` can be constructed. Again,
+examples of how to use custom components can be found in the Examples
+folder.

 Indices and tables
 ==================

diff --git a/docs/quick_start.rst b/docs/quick_start.rst
index 60718364..e21d618e 100644
--- a/docs/quick_start.rst
+++ b/docs/quick_start.rst
@@ -15,8 +15,9 @@ Read and split the dataset
     import pandas as pd
     import numpy as np
     from sklearn.model_selection import train_test_split
+    from pytorch_widedeep.datasets import load_adult

-    df = pd.read_csv("data/adult/adult.csv.zip")
+    df = load_adult(as_frame=True)
     df["income_label"] = (df["income"].apply(lambda x: ">50K" in x)).astype(int)
     df.drop("income", axis=1, inplace=True)
     df_train, df_test = train_test_split(df, test_size=0.2, stratify=df.income_label)

@@ -28,13 +29,12 @@ Prepare the wide and deep columns

 .. code-block:: python

-    import torch
     from pytorch_widedeep import Trainer
     from pytorch_widedeep.preprocessing import WidePreprocessor, TabPreprocessor
     from pytorch_widedeep.models import Wide, TabMlp, WideDeep
     from pytorch_widedeep.metrics import Accuracy

-    # prepare wide, crossed, embedding and continuous columns
+    # Define the 'column setup'
     wide_cols = [
         "education",
         "relationship",
@@ -43,41 +43,45 @@ Prepare the wide and deep columns
         "native-country",
         "gender",
     ]
-    cross_cols = [("education", "occupation"), ("native-country", "occupation")]
-    embed_cols = [
-        ("education", 16),
-        ("workclass", 16),
-        ("occupation", 16),
-        ("native-country", 32),
-    ]
-    cont_cols = ["age", "hours-per-week"]
-    target_col = "income_label"
+    crossed_cols = [("education", "occupation"), ("native-country", "occupation")]

-    # target
-    target = df_train[target_col].values
+    cat_embed_cols = [
+        "workclass",
+        "education",
+        "marital-status",
+        "occupation",
+        "relationship",
+        "race",
+        "gender",
+        "capital-gain",
+        "capital-loss",
+        "native-country",
+    ]
+    continuous_cols = ["age", "hours-per-week"]
+    target = "income_label"
+    target = df_train[target].values


 Preprocessing and model components definition
 ---------------------------------------------

 .. code-block:: python

-    # wide
-    wide_preprocessor = WidePreprocessor(wide_cols=wide_cols, crossed_cols=cross_cols)
+    wide_preprocessor = WidePreprocessor(wide_cols=wide_cols, crossed_cols=crossed_cols)
     X_wide = wide_preprocessor.fit_transform(df_train)
-    wide = Wide(input_dim=np.unique(X_wide).shape[0], pred_dim=1)

-    # deeptabular
-    tab_preprocessor = TabPreprocessor(cat_embed_cols=embed_cols, continuous_cols=cont_cols)
+    tab_preprocessor = TabPreprocessor(
+        cat_embed_cols=cat_embed_cols, continuous_cols=continuous_cols  # type: ignore[arg-type]
+    )
     X_tab = tab_preprocessor.fit_transform(df_train)
-    deeptabular = TabMlp(
+
+    # build the model
+    wide = Wide(input_dim=np.unique(X_wide).shape[0], pred_dim=1)
+    tab_mlp = TabMlp(
         column_idx=tab_preprocessor.column_idx,
         cat_embed_input=tab_preprocessor.cat_embed_input,
-        continuous_cols=cont_cols,
-        mlp_hidden_dims=[64, 32],
+        continuous_cols=continuous_cols,
     )
-
-    # wide and deep
-    model = WideDeep(wide=wide, deeptabular=deeptabular)
+    model = WideDeep(wide=wide, deeptabular=tab_mlp)


 Fit and predict
 ---------------

 .. code-block:: python

-    # train the model
+    # train and validate
     trainer = Trainer(model, objective="binary", metrics=[Accuracy])
     trainer.fit(
         X_wide=X_wide,
         X_tab=X_tab,
         target=target,
         n_epochs=5,
         batch_size=256,
-        val_split=0.1,
     )

-    # predict
+    # predict on test
     X_wide_te = wide_preprocessor.transform(df_test)
     X_tab_te = tab_preprocessor.transform(df_test)
     preds = trainer.predict(X_wide=X_wide_te, X_tab=X_tab_te)

@@ -109,34 +112,23 @@ Save and load

     # Option 1: this will also save training history and lr history if the
     # LRHistory callback is used
-
-    # Day 0, you have trained your model, save it using the trainer.save
-    # method
     trainer.save(path="model_weights", save_state_dict=True)

     # Option 2: save as any other torch model
-
-    # Day 0, you have trained your model, save as any other torch model
     torch.save(model.state_dict(), "model_weights/wd_model.pt")

-    # From here in advance, Option 1 or 2 are the same
-
-    # Few days have passed...I assume the user has prepared the data and
-    # defined the model components:
+    # From here in advance, Option 1 or 2 are the same. I assume the user has
+    # prepared the data and defined the new model components:
     # 1. Build the model
-    model_new = WideDeep(wide=wide, deeptabular=deeptabular)
+    model_new = WideDeep(wide=wide, deeptabular=tab_mlp)
     model_new.load_state_dict(torch.load("model_weights/wd_model.pt"))

     # 2. Instantiate the trainer
-    trainer_new = Trainer(
-        model_new,
-        objective="binary",
-    )
+    trainer_new = Trainer(model_new, objective="binary")

-    # 3. Either fit or directly predict
+    # 3. Either start the fit or directly predict
     preds = trainer_new.predict(X_wide=X_wide, X_tab=X_tab)

-
 Of course, one can do **much more**. See the Examples folder in the repo, this
 documentation or the companion posts for a better understanding of the content
 of the package and its functionalities.

diff --git a/pypi_README.md b/pypi_README.md
index 90af089f..a83248c1 100644
--- a/pypi_README.md
+++ b/pypi_README.md
@@ -11,8 +11,8 @@

 # pytorch-widedeep

-A flexible package to use Deep Learning with tabular data, text and images
-using wide and deep models.
+A flexible package for multimodal deep learning, combining tabular data with
+text and images using Wide and Deep models in PyTorch.

 **Documentation:** [https://pytorch-widedeep.readthedocs.io](https://pytorch-widedeep.readthedocs.io/en/latest/index.html)

@@ -24,7 +24,8 @@

 ### Introduction

-``pytorch-widedeep`` is based on Google's [Wide and Deep Algorithm](https://arxiv.org/abs/1606.07792)
+``pytorch-widedeep`` is based on Google's [Wide and Deep Algorithm](https://arxiv.org/abs/1606.07792),
+adjusted for multi-modal datasets.

 In general terms, `pytorch-widedeep` is a package to use deep learning with
 tabular data. In particular, is intended to facilitate the combination of text
 and images with corresponding tabular data using wide and deep models. With

@@ -35,7 +36,7 @@ architectures please visit the
 [repo](https://github.com/jrzaurin/pytorch-widedeep).

-### Installation
+### Installation

 Install using pip:

 ```bash
 pip install pytorch-widedeep
 ```

 Or install directly from github

 ```bash
 pip install git+https://github.com/jrzaurin/pytorch-widedeep.git
 ```

 **Developer Install**

 ```bash
 # Clone the repo
 git clone https://github.com/jrzaurin/pytorch-widedeep
 cd pytorch-widedeep
 pip install -e .
 ```

-**Important note for Mac users**: Since `python
-3.8`, [the `multiprocessing` library start method changed from `'fork'` to`'spawn'`](https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods) which affects the data-loaders.
-For the time being, `pytorch-widedeep` sets the `num_workers` to 0 when using
-Mac and python version 3.8+.
-
-Note that this issue does not affect Linux users.
-
-```bash
-pip install pytorch-widedeep
-pip install torch==1.6.0 torchvision==0.7.0
-```
-
-None of these issues affect Linux users.
-
 ### Quick start

 Binary classification with the [adult
 dataset](https://www.kaggle.com/wenruliu/adult-income-dataset)
 using `Wide` and `DeepDense` and defaults settings.

 Building a wide (linear) and deep model with ``pytorch-widedeep``:

 ```python
-
 import pandas as pd
 import numpy as np
 import torch
 from sklearn.model_selection import train_test_split

 from pytorch_widedeep import Trainer
 from pytorch_widedeep.preprocessing import WidePreprocessor, TabPreprocessor
 from pytorch_widedeep.models import Wide, TabMlp, WideDeep
 from pytorch_widedeep.metrics import Accuracy
+from pytorch_widedeep.datasets import load_adult
+
-# the following 4 lines are not directly related to ``pytorch-widedeep``. I
-# assume you have downloaded the dataset and place it in a dir called
-# data/adult/
-df = pd.read_csv("data/adult/adult.csv.zip")
+df = load_adult(as_frame=True)
 df["income_label"] = (df["income"].apply(lambda x: ">50K" in x)).astype(int)
 df.drop("income", axis=1, inplace=True)
 df_train, df_test = train_test_split(df, test_size=0.2, stratify=df.income_label)

-# prepare wide, crossed, embedding and continuous columns
+# Define the 'column setup'
 wide_cols = [
     "education",
     "relationship",
@@ -111,38 +96,43 @@ wide_cols = [
     "native-country",
     "gender",
 ]
-cross_cols = [("education", "occupation"), ("native-country", "occupation")]
-embed_cols = [
-    ("education", 16),
-    ("workclass", 16),
-    ("occupation", 16),
-    ("native-country", 32),
-]
-cont_cols = ["age", "hours-per-week"]
-target_col = "income_label"
+crossed_cols = [("education", "occupation"), ("native-country", "occupation")]

-# target
-target = df_train[target_col].values
+cat_embed_cols = [
+    "workclass",
+    "education",
+    "marital-status",
+    "occupation",
+    "relationship",
+    "race",
+    "gender",
+    "capital-gain",
+    "capital-loss",
+    "native-country",
+]
+continuous_cols = ["age", "hours-per-week"]
+target = "income_label"
+target = df_train[target].values

-# wide
-wide_preprocessor = WidePreprocessor(wide_cols=wide_cols, crossed_cols=cross_cols)
+# prepare the data
+wide_preprocessor = WidePreprocessor(wide_cols=wide_cols, crossed_cols=crossed_cols)
 X_wide = wide_preprocessor.fit_transform(df_train)
-wide = Wide(wide_dim=np.unique(X_wide).shape[0], pred_dim=1)

-# deeptabular
-tab_preprocessor = TabPreprocessor(cat_embed_cols=embed_cols, continuous_cols=cont_cols)
+tab_preprocessor = TabPreprocessor(
+    cat_embed_cols=cat_embed_cols, continuous_cols=continuous_cols  # type: ignore[arg-type]
+)
 X_tab = tab_preprocessor.fit_transform(df_train)
-deeptabular = TabMlp(
-    mlp_hidden_dims=[64, 32],
+
+# build the model
+wide = Wide(input_dim=np.unique(X_wide).shape[0], pred_dim=1)
+tab_mlp = TabMlp(
     column_idx=tab_preprocessor.column_idx,
-    embed_input=tab_preprocessor.cat_embed_input,
-    continuous_cols=cont_cols,
+    cat_embed_input=tab_preprocessor.cat_embed_input,
+    continuous_cols=continuous_cols,
 )
+model = WideDeep(wide=wide, deeptabular=tab_mlp)

-# wide and deep
-model = WideDeep(wide=wide, deeptabular=deeptabular)
-
-# train the model
+# train and validate
 trainer = Trainer(model, objective="binary", metrics=[Accuracy])
 trainer.fit(
     X_wide=X_wide,
     X_tab=X_tab,
     target=target,
     n_epochs=5,
     batch_size=256,
-    val_split=0.1,
 )

-# predict
+# predict on test
 X_wide_te = wide_preprocessor.transform(df_test)
 X_tab_te = tab_preprocessor.transform(df_test)
 preds = trainer.predict(X_wide=X_wide_te, X_tab=X_tab_te)

 torch.save(model.state_dict(), "model_weights/wd_model.pt")

 # From here in advance, Option 1 or 2 are the same. I assume the user has
 # prepared the data and defined the new model components:
 # 1. Build the model
-model_new = WideDeep(wide=wide, deeptabular=deeptabular)
+model_new = WideDeep(wide=wide, deeptabular=tab_mlp)
 model_new.load_state_dict(torch.load("model_weights/wd_model.pt"))

 # 2. Instantiate the trainer
-trainer_new = Trainer(
-    model_new,
-    objective="binary",
-)
+trainer_new = Trainer(model_new, objective="binary")

 # 3. Either start the fit or directly predict
 preds = trainer_new.predict(X_wide=X_wide, X_tab=X_tab)