Is it possible to use sparse matrix as input? #297

Closed

albert-ying opened this issue May 15, 2021 · 7 comments
Labels: enhancement (New feature or request)

@albert-ying

Hi, I want to train TabNet on a very large, sparse dataset. I noticed that TabNet only takes dense arrays as input. However, the dataset is too big to be converted to a dense format; my RAM can only hold about 10% of the samples in dense form.

I'm wondering if there is or will be any workaround for this kind of problem?

Thank you so much!

albert-ying added the enhancement (New feature or request) label on May 15, 2021
@Optimox
Collaborator

Optimox commented May 15, 2021

hello @albert-ying,

TabNet can't handle sparse matrices as input.

The thing is, you would have to densify your data before it reaches the dataloader and the network. That is feasible, but the dataloader is not easily accessible in the current implementation of the library, and I'm not sure how efficient it is to randomly sample from a sparse matrix.
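
A minimal sketch of what on-the-fly densification could look like, assuming a scipy CSR matrix and a plain PyTorch `Dataset` (this is not part of pytorch-tabnet's own dataloader):

```python
# Sketch only: densify one row at a time from a scipy CSR matrix, so only a
# single batch ever lives in RAM as a dense float32 array.
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class SparseRowDataset(Dataset):
    def __init__(self, X_csr, y):
        self.X = X_csr.tocsr()          # row slicing is cheap on CSR
        self.y = np.asarray(y, dtype=np.float32)

    def __len__(self):
        return self.X.shape[0]

    def __getitem__(self, idx):
        # densify just this row (shape: [n_features])
        x = self.X[idx].toarray().ravel().astype(np.float32)
        return torch.from_numpy(x), torch.tensor(self.y[idx])

# Example usage, with random sampling handled by the DataLoader:
# from scipy import sparse
# X = sparse.random(100_000, 20_000, density=0.01, format="csr")
# y = np.random.rand(100_000)
# loader = DataLoader(SparseRowDataset(X, y), batch_size=1024, shuffle=True)
```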

Why do you need a sparse matrix? For large files that don't fit in RAM, we should probably implement loading from disk.
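
For the from-disk idea, one workaround available today outside the library is a NumPy memmap, so rows are paged in from disk instead of held in RAM; a rough sketch (the file name and sizes are placeholders, and this is not a pytorch-tabnet feature):

```python
# Rough sketch of "from disk loading" with a NumPy memmap: the dense array
# lives on disk and rows are read lazily on access.
import numpy as np

n_samples, n_features = 100_000, 20_000   # placeholder sizes; ~8 GB on disk at float32
X_disk = np.memmap("X_dense.dat", dtype=np.float32, mode="w+",
                   shape=(n_samples, n_features))

# Fill it chunk by chunk from the sparse source so RAM stays bounded
# (X_sparse is assumed to be a scipy CSR matrix):
# for start in range(0, n_samples, 10_000):
#     X_disk[start:start + 10_000] = X_sparse[start:start + 10_000].toarray()
# X_disk.flush()

# Rows can then be indexed like a normal ndarray without loading the whole file:
# batch = np.asarray(X_disk[row_indices])
```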

@albert-ying
Author

Thank you for the quick response, @Optimox!

The dataset I'm using is very sparse, and in glmnet (which I used before) preserving the sparsity uses less RAM and improves training speed. Maybe that's not the case for a deep learning method, and disk loading is a better workaround here.

I'm also wondering if you have any suggestions on hyperparameter settings for a sparse dataset. It is a regression task with one numeric outcome and ~20,000 numeric input columns. Currently I use the defaults and they work okay, but it would be impractical to fine-tune them because the dataset is so big. It would be great if you could point me in a direction, based on your experience, that I could try.

Thank you so much!

@IchiruTake

You can try the uint8 dtype to lower the memory burden.
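
For scale, the effect of the dtype alone looks roughly like this (toy sizes, assuming a plain NumPy array; note the float32 caveat in the next reply):

```python
# Sketch: memory effect of the dtype alone (values must fit in 0..255 for uint8).
import numpy as np

X = np.random.randint(0, 2, size=(1_000, 20_000), dtype=np.int64)  # toy dense matrix
print(X.nbytes / 1e6)                 # ~160 MB at int64
X_small = X.astype(np.uint8)
print(X_small.nbytes / 1e6)           # ~20 MB, 8x smaller
```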

@Optimox
Collaborator

Optimox commented May 20, 2021

@albert-ying it's hard to give parameters in the dark.

If you want to train faster, I would advise using a GPU. A one-cycle learning-rate scheduler can also help reduce the number of epochs required for convergence, so you'll be able to iterate faster on your parameter tuning.

About the memory burden, @IchiruTake, I think it's not that simple, since in the end you need to feed float32 to the network. I'll probably also implement the half-precision feature #150 in order to reduce the size by half (and speed up training as well).
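
A rough sketch of the one-cycle idea with the model's `scheduler_fn` / `scheduler_params` arguments (all hyperparameter values below are placeholders, not recommendations; whether the scheduler is stepped per epoch or per batch depends on the installed pytorch-tabnet version, so `total_steps` here assumes one step per epoch and should be checked against your version):

```python
# Sketch only: TabNetRegressor on GPU with a OneCycleLR scheduler.
import torch
from pytorch_tabnet.tab_model import TabNetRegressor

MAX_EPOCHS = 100

model = TabNetRegressor(
    optimizer_fn=torch.optim.Adam,
    optimizer_params={"lr": 2e-2},
    scheduler_fn=torch.optim.lr_scheduler.OneCycleLR,
    scheduler_params={"max_lr": 2e-2, "total_steps": MAX_EPOCHS},
    device_name="cuda",          # use the GPU if one is available
)

# model.fit(
#     X_train, y_train.reshape(-1, 1),
#     eval_set=[(X_valid, y_valid.reshape(-1, 1))],
#     max_epochs=MAX_EPOCHS,
#     batch_size=1024,
#     virtual_batch_size=128,
# )
```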

@IchiruTake

Yes, I agree that uint8 alone won't solve the whole problem. In fact, I once handled an input matrix of 581k observations and 7,465 features, 7,421 of which were sparse (2.02% density), with dtype=np.uint8. You can make your network semi-supervised: build two or more sub-models with different network structures (different numbers of layers and neurons), merge them with a concatenation layer, and standardize your predictions. Then use TabNet to fit your target on the output of that last concatenation layer. That was much easier than searching for the associated "importance" features.
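
One possible reading of this idea, sketched with plain PyTorch modules (the module names, sizes, and depths are made up for illustration, and the sub-models would first need to be trained on some auxiliary objective; this is not something pytorch-tabnet provides):

```python
# Sketch: several sub-encoders with different structures, merged by concatenation.
import torch
import torch.nn as nn

class SubEncoder(nn.Module):
    """A small MLP; several of these with different depths/widths act as the sub-models."""
    def __init__(self, n_features, hidden, out_dim):
        super().__init__()
        layers, prev = [], n_features
        for h in hidden:
            layers += [nn.Linear(prev, h), nn.ReLU()]
            prev = h
        layers.append(nn.Linear(prev, out_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

n_features = 20_000
encoders = nn.ModuleList([
    SubEncoder(n_features, hidden=[512, 128], out_dim=32),  # deeper sub-model
    SubEncoder(n_features, hidden=[256], out_dim=32),        # shallower sub-model
])

def concatenated_features(x):
    # "merge them with a concatenation layer": stack the sub-model outputs side by side
    return torch.cat([enc(x) for enc in encoders], dim=1)    # shape (batch, 64)

# The concatenated (and standardized) 64-dim output would then replace the raw
# 20,000 sparse columns as the dense input given to TabNet.
```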

@albert-ying
Author

@IchiruTake Wow, that sounds like a good way to do it and might solve my problem! I'm wondering if there are any examples (a tutorial or notebook) of this that I could refer to? Thank you so much!

@Optimox
Collaborator

Optimox commented May 27, 2021

I'm closing this, as we are not going to allow sparse matrices; the solution might come from #143.

@Optimox Optimox closed this as completed May 27, 2021