Some additional self-supervised features #216
Hey @s9021025292140, sorry for the late reply. I will try to keep up with the conversation this week if you reply :) Let's go:
So in summary: a pipeline where you use SSL, then extract the learned embeddings and do whatever you would like with them, is possible :) You just have to train using either of the two training classes. And regarding images, widedeep at the moment does not support SSL for images and tabular data. One thing you could do is to encode/embed the images separately (using a ResNet architecture or something lighter) and then treat the encoded image as categorical cols. Let me know if this helps and I will try to add a bit more information during the day. And thanks for using the library! (or considering it :) )
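To make the "encode the images separately, then feed the result alongside the tabular features" idea concrete, here is a minimal sketch. The small convolutional encoder below is a hypothetical stand-in for a ResNet backbone, and all shapes and names are made up for illustration; the point is only the workflow: embed each image to a fixed-size vector, then concatenate those values as extra columns next to the tabular features.

```python
import torch
import torch.nn as nn

# Hypothetical lightweight image encoder (a stand-in for a real ResNet
# backbone): maps each RGB image to an 8-dimensional embedding.
image_encoder = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 8),
)

images = torch.randn(4, 3, 64, 64)  # a batch of 4 dummy RGB images
tabular = torch.randn(4, 5)         # 5 dummy tabular features per row

with torch.no_grad():
    img_emb = image_encoder(images)  # shape (4, 8)

# Treat the embedding dimensions as additional columns of the tabular input
combined = torch.cat([tabular, img_emb], dim=1)  # shape (4, 13)
```

In a real setup you would run the embeddings through the same preprocessor as the rest of your tabular data before SSL training.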
Take a look here:
or here:
on how to extract the encoder, and from there on you can do anything you wanted with it.
Oh! Thank you very much for your reply!
Finally, thank you very much for answering each of my questions with care😊.
here you have a fully functioning example for what you want:

```python
import torch
import pandas as pd

from pytorch_widedeep.models import TabMlp
from pytorch_widedeep.datasets import load_adult
from pytorch_widedeep.preprocessing import TabPreprocessor
from pytorch_widedeep.self_supervised_training import EncoderDecoderTrainer

use_cuda = torch.cuda.is_available()

if __name__ == "__main__":
    # load the data and some preprocessing you probably don't need
    df: pd.DataFrame = load_adult(as_frame=True)
    df.columns = [c.replace("-", "_") for c in df.columns]
    df["income_label"] = (df["income"].apply(lambda x: ">50K" in x)).astype(int)
    df.drop("income", axis=1, inplace=True)

    # define the categorical and continuous columns, as well as the target
    cat_embed_cols = [
        "workclass",
        "education",
        "marital_status",
        "occupation",
        "relationship",
        "race",
        "gender",
        "capital_gain",
        "capital_loss",
        "native_country",
    ]
    continuous_cols = ["age", "hours_per_week"]
    target_col = "income_label"

    # instantiate the TabPreprocessor that will be used throughout the experiment
    tab_preprocessor = TabPreprocessor(
        cat_embed_cols=cat_embed_cols, continuous_cols=continuous_cols, scale=True
    )
    X_tab = tab_preprocessor.fit_transform(df)
    target = df[target_col].values

    # We define a model that will act as the encoder in the encoder/decoder
    # architecture. This could be any of: TabMlp, TabResnet or TabNet
    tab_mlp = TabMlp(
        column_idx=tab_preprocessor.column_idx,
        cat_embed_input=tab_preprocessor.cat_embed_input,
        continuous_cols=tab_preprocessor.continuous_cols,
    )

    # If we do not pass a custom decoder, which is perfectly possible via the
    # decoder param (see the docs or the example notebooks), the
    # EncoderDecoderTrainer will automatically build a decoder which will be
    # the 'mirror' image of the encoder
    encoder_decoder_trainer = EncoderDecoderTrainer(encoder=tab_mlp)
    encoder_decoder_trainer.pretrain(X_tab, n_epochs=5, batch_size=256)

    # New data comes in
    new_data = df.sample(32)

    # Preprocess the new data in the exact same way as the data used during
    # the pre-training (note: transform, not fit_transform, so the fitted
    # encodings and scaler are reused)
    new_X_tab_arr = tab_preprocessor.transform(new_data)

    # Normally, the transformation to tensor happens inside the Trainer.
    # However, for what you want you just have to do it here
    new_X_tab_tnsr = torch.tensor(new_X_tab_arr).float()

    # And pass the tensor to the encoder (in eval mode) to get the embeddings.
    # Here 'ed_model' stands for 'encoder_decoder_model'
    encoder = encoder_decoder_trainer.ed_model.encoder.eval()

    # # If you choose to save the pretrained model then
    # encoder_decoder_trainer.save(
    #     path="pretrained_weights", model_filename="encoder_decoder_model.pt"
    # )
    # # some time has passed; we load the model with torch as usual:
    # encoder_decoder_model = torch.load("pretrained_weights/encoder_decoder_model.pt")
    # encoder = encoder_decoder_model.encoder

    embeddings_1 = encoder(new_X_tab_tnsr)

    # or simply use tab_mlp, since, as you remember, it was our encoder:
    # 'encoder=tab_mlp'
    embeddings_2 = tab_mlp.eval()(new_X_tab_tnsr)
```
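Since your use case is catching outliers and retrieving similar rows from unlabeled data, here is a small, hedged sketch of what you could do once you have the embeddings. The embedding tensor below is random dummy data standing in for the encoder output, and the shapes are arbitrary; the technique (cosine-similarity nearest-neighbour lookup) is a generic one, not a widedeep API.

```python
import torch
import torch.nn.functional as F

# Dummy embeddings standing in for the output of the pretrained encoder:
# 32 rows, each embedded into a 16-dimensional vector
embeddings = torch.randn(32, 16)

# Pick one row as the query and score it against every row
query = embeddings[0:1]                              # shape (1, 16)
sims = F.cosine_similarity(query, embeddings, dim=1) # shape (32,)

# Indices of the 5 most similar rows (the query itself ranks first,
# since its similarity with itself is 1.0)
top5 = torch.topk(sims, k=5).indices
```

The same embeddings could feed any off-the-shelf outlier detector (e.g. distance to the k-th nearest neighbour) instead of a similarity lookup.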
And regarding your point 2 on
Thank you for creating such a wonderful open-source project. I have a few questions:
In my past projects, I didn't have labeled data, so I used TabNet (https://github.com/dreamquark-ai/tabnet) for self-supervised learning on my tabular data. I made some modifications to the open-source code to allow the SSL-trained TabNet to output only the encoded latent space during prediction (the embeddings obtained after the original data passes through the encoder, without going through the decoder). This helps us capture outliers or retrieve similar data from unlabeled tabular data in real-world datasets. I would like to ask whether the TabNet in this project can also use the encoder to directly predict data without labels after SSL training. Additionally, can the SSL-trained TabNet provide an "explain" function? (Currently, it seems only the supervised TabNet supports the explain function and can obtain embeddings from the original data predictions.)
In your SSL tabular model, you also provide the SAINT architecture. Can SAINT, like the description above, be used to train unlabeled data with SSL and directly use the SSL-trained SAINT to predict the original data to obtain embeddings? Currently, it seems that only the supervised SAINT can output embeddings.
In your provided examples, you can train different modalities of data such as tabular, text, and image in a multimodal manner, but it seems to require labeled data (as indicated by the target). My application requires training tabular and image data without labels, with the aim of finding the latent space between tabular and image data through SSL. Is it possible to provide a version for unlabeled multimodal training?
```python
model = WideDeep(
    deeptabular=tab_mlp,
    deeptext=models_fuser,
    deepimage=vision,
    deephead=deephead,
)
trainer = Trainer(model, objective="binary")
trainer.fit(
    X_tab=X_tab,
    X_text=[X_text_1, X_text_2],
    X_img=X_img,
    target=df["target"].values,
    n_epochs=1,
    batch_size=32,
)
```
I apologize for requesting so much. Since I mostly deal with unlabeled data in practical applications, meeting the above requirements would greatly benefit many people. Once again, thank you for providing such a great project! :)