
Loading large datasets that do not fit into the ram #143

Open
nicolasj92 opened this issue Jun 13, 2020 · 18 comments

@nicolasj92

Feature request

What is the expected behavior?
More flexibility in supplying data to the fit() function. How can I use this on a dataset that does not fit into the RAM of my PC? Can I supply my own dataloader function?

What is the motivation or use case for adding/changing the behavior?
The dataset I want to use TabNet on does not fit into my RAM.

How should this be implemented in your opinion?
E.g. as a modular dataloader

Are you willing to work on this yourself?
yes

@nicolasj92 nicolasj92 added the enhancement New feature or request label Jun 13, 2020
@Optimox
Collaborator

Optimox commented Jun 13, 2020

hello @nicolasj92,

I do think this would be an interesting feature, especially since deep learning easily allows batch training.

One of my concerns is keeping the library easy to use, so I'm not sure that directly exposing dataloaders would be the best way, but we could easily provide support for common dataset types (like parquet files, hdf5 files or some others?) just by providing the path.

What is your current need?
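For instance, a large parquet file can already be streamed in record batches with pyarrow; this is only an illustration of the idea (the path and the "target" label column are placeholders), not something the library supports out of the box:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("train.parquet")              # placeholder path
for record_batch in pf.iter_batches(batch_size=20_000):
    df = record_batch.to_pandas()
    X = df.drop(columns="target").to_numpy()      # "target" is a placeholder label column
    y = df["target"].to_numpy()
    # ... feed (X, y) to the model batch by batch
```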

@athewsey
Contributor

I agree this library has been very quick & easy to get started with for me - but I don't think expanding the API to accept either array-like or generator-like inputs needs to have that much effect on usability, right? E.g. it would be a familiar pattern to those who've worked with Keras before...

Trying to handle the end-to-end file loading within the library could raise all sorts of edge cases like "I have a CSV file but it's in ANSI/windows-1255, rather than UTF-8" - whereas just accepting loaders/generators keeps that complexity out of the library and keeps the interface simple to understand but powerful.

One way I'd like to be able to use the library if possible is with Amazon SageMaker's Pipe Mode to speed up my training job start-up time... It's almost like local files, except the file can only be read sequentially through exactly once - and whenever you need to read through the data again (e.g. another epoch) you move on to the next copy e.g. starting with train_0 on to train_1 and so on.

^ I would for sure not blame you for not wanting to add either this kind of file handling complexity or a SageMaker-specific extension to the API... But if fit() could accept dataloaders then I'd be free to do the whacky stuff in my code :-)

@nicolasj92
Author

Hi Optimox, thanks for the quick reply!
My current project involves a dataset containing 1.2M samples with ca. 4,000 features. It is still just a CSV file, but I cannot load it into RAM completely.

I agree that ease of use should be a major factor. A solution could be to provide a set of standard dataloaders (e.g. np.array, hdf5, ...) but also to allow the API user to write custom dataloaders that inherit from a default dataloader class, as sketched below.

What do you think?
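As a rough sketch of what that could look like (the class and method names here are purely hypothetical, not an existing pytorch-tabnet API), a minimal base loader plus one CSV implementation built on pandas' chunked reader:

```python
import pandas as pd


class BaseBatchLoader:
    """Hypothetical base class: yields (X, y) numpy batches without loading everything into RAM."""

    def __iter__(self):
        raise NotImplementedError


class CsvBatchLoader(BaseBatchLoader):
    """Streams a CSV file in chunks; assumes the last column holds the label."""

    def __init__(self, path, batch_size=20_000):
        self.path = path
        self.batch_size = batch_size

    def __iter__(self):
        # pandas reads the file lazily, one chunk of batch_size rows at a time
        for chunk in pd.read_csv(self.path, chunksize=self.batch_size):
            X = chunk.iloc[:, :-1].to_numpy(dtype="float32")
            y = chunk.iloc[:, -1].to_numpy()
            yield X, y
```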

@Optimox
Collaborator

Optimox commented Jun 14, 2020

I have to take a closer look at the problem. Currently, with very few changes, the code could accept in place of X_train any object that supports indexing (like numpy arrays), whether it's stored in RAM or anywhere else. This would probably allow reading directly from hdf5 files, parquet files... I don't see many applications where you would need more than that.

I will also take a closer look to see how easily we could expose the dataloaders.
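For illustration, an h5py dataset already behaves like that: it exposes .shape and numpy-style slicing while keeping the data on disk (the file and dataset names below are placeholders):

```python
import h5py

# open the file lazily; nothing is loaded into RAM yet
f = h5py.File("train.h5", "r")
X_train = f["X"]      # h5py Dataset: supports X_train[start:end] and .shape
y_train = f["y"][:]   # labels are usually small enough to load fully

print(X_train.shape)       # e.g. (1200000, 4000), data stays on disk
batch = X_train[0:1024]    # only this slice is actually read from disk
```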

@Optimox
Collaborator

Optimox commented Dec 13, 2020

@nicolasj92 @athewsey anyone willing to discuss more in depth how this could be done?

@showkeyjar

@Optimox thanks for your great plan! Would you please release a dataloader example for batch training once this request has been done?

@nicolasj92
Author

nicolasj92 commented Jan 20, 2021

> @nicolasj92 @athewsey anyone willing to discuss more in depth how this could be done?

Sure, I would be interested and willing to contribute. Sorry for the late reply

@cadama

cadama commented Jul 13, 2021

+1 for this. I would also be happy with any hack that makes this possible.

@fcdalgic

fcdalgic commented Jul 25, 2021

Hello @Optimox

I'm sorry to bring this topic back. Similar to @nicolasj92's problem, I need to train the model batch by batch. To do this, I wrote the following code and started the training phase. However, I've noticed that when the current batch iteration is completed, instead of the loss decreasing, the model tends to start all over again, as also attached.
TabNet-Incremental-Learning.txt

```python
# train incrementally: call fit() once per batch yielded by the iterator
for i, (x_train, y_train) in enumerate(train_iterator):
    x_train = np.asarray(x_train, dtype='float64')
    y_train = np.asarray(y_train, dtype='float64')
    classifier.fit(x_train, y_train)
```

I've also tried to save and reload the model on every batch iteration, but it gave me the same results.

The dataset and its batches have a very balanced and homogeneous distribution, and the same training approach has been tested with different methods.

Do you have any suggestions for coping with this problem?

Thanks in advance,

Fırat

@Optimox
Collaborator

Optimox commented Jul 26, 2021

@bytesandwines what's your on-disk format for train_iterator? Do you have several csv files? Do you have one large parquet file? hdf5?

What is the size of your dataset? How many rows and columns?

About the loss going back up: it might be the learning rate starting from a high value again at every new call to fit; maybe you could try to manually lower your learning rate.

I can start working on a way to train directly from a parquet or hdf5 file, but I'm not sure it will meet everyone's needs.

@fcdalgic

fcdalgic commented Jul 26, 2021

Hi @Optimox,

Thank you for your quick response. My code is running on a portable SSD which has the exFAT disk format. The iterator code is given below, and it only processes one CSV file batch by batch, where the dataset consists of:

  • Rows: 800k
  • Columns: 2k, 4k, 6k, 8k and 10k features (num. of features may vary for each file)

I've tested it with the following LR values: 0.001, 0.01 and 0.05, but nothing changed in the behaviour. Moreover, the loss jumps from 0.046 to 0.25 with learning rate 0.05 almost every time I call the fit method with a new batch.

Do you think any pre-check or setup method could cause the model's weights to be reset (not all the weights but the higher layers; I've read about that kind of limitation in a different classification setting when I was working on another project)?

```python
import itertools

def ReadBatchGenerator(file_path, batch_size, p_val=1, n_val=0):
    num_lines = sum(1 for line in open(file_path))
    print("Number of lines", num_lines)
    with open(file_path, encoding="utf8", errors='ignore') as f:
        f.readline()  # skip the header line
        while True:
            # read the next batch_size lines; an empty list means the end of the file
            line_list = list(itertools.islice(f, batch_size))
            labels = []
            inputs = []
            if not line_list:
                break
            for line in line_list:
                parts = line.split(',')
                label = parts[len(parts) - 1].replace('\n', '')
                # Since our classifier's .score method requires numeric input we need to convert
                # the string labels first. Moreover, following the common approach in the phishing
                # world, legitimates are the actual targets, therefore legitimates are mapped to 1
                # and phishes are mapped to 0.
                if label == 'P':
                    label = n_val
                elif label == 'L':
                    label = p_val
                else:
                    print("Mismatched label, given one is:", label)
                    label = -1

                features = parts[:len(parts) - 1 - 1]
                labels.append(label)
                inputs.append(features)
            yield inputs, labels

train_iterator = ReadBatchGenerator(opt.train, opt.batch_size)
```



@Optimox
Collaborator

Optimox commented Jul 26, 2021

Can you please update your code above by wrapping it in triple backquotes like this ```; it makes things easier to read.

I think your problem is different from the one in this thread, which is: how do you train a single model on a large dataset that can't fit into RAM?

It seems that you are trying to train the same model batch by batch on different datasets, not on parts of one large dataset. This is not feasible; you can't train the same model with 200 features and then with 400 features, they must be different models.

I think your code doesn't reuse the previous model and simply creates a new TabNet model which is trained on a new dataset each time. There is no solution to your problem to my knowledge; you can't simply add features to an existing model and retrain it as if nothing changed.

@fcdalgic

fcdalgic commented Jul 26, 2021

Hi @Optimox ,

My previous code sample has been edited according to your suggestion, thank you for that information.

Sorry for misleading you, I am facing the same problem (training on a large dataset that can't fit into RAM); the different column sizes belong to different experiments (let's assume we always have 10k). In my experiment, I iterate through the 800k samples by taking a 20k batch at each step, then use the fit method to train my model. I always use the same TabNet model: I create a single instance before entering the loop and do not call any other methods until the loop ends. (Note that I also edited my first comment and the code sample in it; the whole iteration code is given there.)

@Optimox
Collaborator

Optimox commented Jul 27, 2021

@bytesandwines ok so it seems that you are trying to do proper training by batch.

I just gave it a try with the census example notebook by replacing the fit cell with this:

```python
for loop in range(3):
    clf.fit(
        X_train=X_train, y_train=y_train,
        eval_set=[(X_train, y_train), (X_valid, y_valid)],
        eval_name=['train', 'valid'],
        eval_metric=['auc'],
        max_epochs=max_epochs, patience=20,
        batch_size=1024, virtual_batch_size=128,
        num_workers=0,
        weights=1,
        drop_last=False
    )
```

And here are the scores I get:

epoch 0  | loss: 0.66829 | train_auc: 0.75687 | valid_auc: 0.75707 |  0:00:02s
epoch 1  | loss: 0.51272 | train_auc: 0.81261 | valid_auc: 0.82081 |  0:00:05s
epoch 2  | loss: 0.46456 | train_auc: 0.85292 | valid_auc: 0.85174 |  0:00:08s
epoch 3  | loss: 0.44343 | train_auc: 0.87331 | valid_auc: 0.87174 |  0:00:12s
epoch 4  | loss: 0.42012 | train_auc: 0.88464 | valid_auc: 0.87953 |  0:00:15s
epoch 5  | loss: 0.40948 | train_auc: 0.89248 | valid_auc: 0.88777 |  0:00:18s
epoch 6  | loss: 0.40122 | train_auc: 0.90027 | valid_auc: 0.89367 |  0:00:22s
epoch 7  | loss: 0.39694 | train_auc: 0.90486 | valid_auc: 0.89908 |  0:00:25s
epoch 8  | loss: 0.38862 | train_auc: 0.90813 | valid_auc: 0.90367 |  0:00:29s
epoch 9  | loss: 0.36885 | train_auc: 0.91031 | valid_auc: 0.90283 |  0:00:33s
epoch 10 | loss: 0.37079 | train_auc: 0.91271 | valid_auc: 0.906   |  0:00:36s
epoch 11 | loss: 0.35614 | train_auc: 0.91258 | valid_auc: 0.90759 |  0:00:40s
epoch 12 | loss: 0.35444 | train_auc: 0.91465 | valid_auc: 0.90989 |  0:00:44s
epoch 13 | loss: 0.35157 | train_auc: 0.91577 | valid_auc: 0.9084  |  0:00:47s
epoch 14 | loss: 0.34683 | train_auc: 0.91823 | valid_auc: 0.91253 |  0:00:51s
epoch 15 | loss: 0.34771 | train_auc: 0.92352 | valid_auc: 0.91518 |  0:00:54s
epoch 16 | loss: 0.34263 | train_auc: 0.92569 | valid_auc: 0.92015 |  0:00:58s
epoch 17 | loss: 0.33663 | train_auc: 0.92415 | valid_auc: 0.91692 |  0:01:01s
epoch 18 | loss: 0.3416  | train_auc: 0.93034 | valid_auc: 0.92445 |  0:01:05s
epoch 19 | loss: 0.34263 | train_auc: 0.93099 | valid_auc: 0.92433 |  0:01:08s
Stop training because you reached max_epochs = 20 with best_epoch = 18 and best_valid_auc = 0.92445
Best weights from best epoch are automatically used!
epoch 0  | loss: 0.33913 | train_auc: 0.92537 | valid_auc: 0.92273 |  0:00:03s
epoch 1  | loss: 0.33651 | train_auc: 0.92182 | valid_auc: 0.91776 |  0:00:07s
epoch 2  | loss: 0.33323 | train_auc: 0.93207 | valid_auc: 0.92457 |  0:00:10s
epoch 3  | loss: 0.3351  | train_auc: 0.93314 | valid_auc: 0.92731 |  0:00:14s
epoch 4  | loss: 0.3254  | train_auc: 0.9337  | valid_auc: 0.92793 |  0:00:17s
epoch 5  | loss: 0.32266 | train_auc: 0.93351 | valid_auc: 0.92761 |  0:00:21s
epoch 6  | loss: 0.32471 | train_auc: 0.93276 | valid_auc: 0.9273  |  0:00:25s
epoch 7  | loss: 0.32707 | train_auc: 0.9359  | valid_auc: 0.92818 |  0:00:28s
epoch 8  | loss: 0.32245 | train_auc: 0.93518 | valid_auc: 0.92804 |  0:00:32s
epoch 9  | loss: 0.3218  | train_auc: 0.93559 | valid_auc: 0.92917 |  0:00:37s
epoch 10 | loss: 0.31865 | train_auc: 0.93564 | valid_auc: 0.92751 |  0:00:41s
epoch 11 | loss: 0.31857 | train_auc: 0.93575 | valid_auc: 0.92784 |  0:00:44s
epoch 12 | loss: 0.32027 | train_auc: 0.93655 | valid_auc: 0.92779 |  0:00:48s
epoch 13 | loss: 0.32318 | train_auc: 0.93156 | valid_auc: 0.92378 |  0:00:52s

So everything seems to be running as expected. Could you share your training loop?

@fcdalgic

Hello @Optimox ,

I'm still testing my code with your suggestion, sorry for the late response.

At first, I just copied your fit example and only changed the loop part to use my train_iterator, trained the model, and it seemed to work. Then I started to comment out parameters to detect which parameter might produce this problem. I will let you know if I find a clue, or once I'm sure it fixes my problem.

I do appreciate your help,

@salman1993

salman1993 commented Mar 18, 2022

@Optimox I have a large dataset (100 chunks in parquet) - each chunk fits in memory but not the entire dataset. What would be the best way to train a TabNet model on such a dataset? From your example above, it seems like we can call clf.fit(...) multiple times on different chunks, i.e. the behaviour is similar to fit_partial in other frameworks - is this correct? I would really appreciate it if you could provide guidance on the easiest way to do this.

For the census example, this seems to be working:

```python
save_history = []

clf = TabNetClassifier(**tabnet_params)

max_epochs = 10
num_chunks = 5
chunk_size = X_train.shape[0] // num_chunks

for epoch in range(max_epochs):
    for chunk_idx in range(num_chunks):
        start = chunk_idx * chunk_size
        end = (chunk_idx * chunk_size) + chunk_size
        clf.fit(
            X_train=X_train[start:end], y_train=y_train[start:end],
            eval_set=[(X_train[start:end], y_train[start:end]), (X_valid, y_valid)],
            eval_name=['train', 'valid'],
            eval_metric=['auc'],
            max_epochs=1, patience=20,
            batch_size=1024, virtual_batch_size=128,
            num_workers=0,
            weights=1,
            drop_last=False
        )
        save_history.append(clf.history["valid_auc"])
```

@Optimox
Collaborator

Optimox commented Mar 21, 2022

@salman1993,

You can indeed train with large chunks that fit into your memory.
The model automatically starts from a warm state each time you call fit, so you can call fit successively and train on your entire dataset.

As you can see this is not the most elegant solution, and you'll probably need to decay the learning rate by hand in your outer for loop, but I think it should work. Reading directly from a pointer to a large parquet file would be better, but it's currently not available.
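Reusing the names from the example above, decaying the learning rate by hand could look roughly like this; it assumes the optimizer is re-created from clf.optimizer_params on each fit() call, which is worth double-checking for your version, and the 0.9 factor is an arbitrary choice:

```python
initial_lr, decay = 2e-2, 0.9

for epoch in range(max_epochs):
    # assumed: fit() rebuilds its optimizer from optimizer_params on each call
    clf.optimizer_params = dict(lr=initial_lr * (decay ** epoch))
    for chunk_idx in range(num_chunks):
        start = chunk_idx * chunk_size
        end = start + chunk_size
        clf.fit(
            X_train=X_train[start:end], y_train=y_train[start:end],
            eval_set=[(X_valid, y_valid)],
            eval_name=['valid'],
            eval_metric=['auc'],
            max_epochs=1, patience=20,
            batch_size=1024, virtual_batch_size=128,
            drop_last=False
        )
```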

@shongscience

[1] Supporting a custom Dataset and DataLoader could fix this issue easily, but it seems that is not going to happen.
[2] Could np.memmap help solve this kind of large-dataset issue more gracefully, especially since the input format for X, y is fixed to numpy.array (not torch.tensor)?
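As a rough illustration of the np.memmap idea (the path, dtype and shape are placeholders, and whether the library's input validation keeps the memmap lazy is worth verifying):

```python
import numpy as np

n_rows, n_cols = 1_200_000, 4_000                 # placeholder shape
X_train = np.memmap("X_train.float32.bin", dtype="float32", mode="r",
                    shape=(n_rows, n_cols))       # data stays on disk, paged in on demand
y_train = np.load("y_train.npy")                  # labels usually fit in RAM

batch = X_train[0:1024]                           # only the touched pages are read
```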
