
[QST] Data loader with EmbeddingOperator using pretrained embeddings is very slow #1244

Open
CarloNicolini opened this issue Jul 4, 2024 · 0 comments

CarloNicolini commented Jul 4, 2024

❓ Questions & Help

I am seeing a large degradation in the performance of the Loader when I add an EmbeddingOperator transform that looks up rows in a pretrained-embeddings NumPy array.
I have been following the approach shown in this tutorial notebook.

Without the transforms argument the entire dataset is consumed in about 6 seconds, while with the pretrained-embeddings lookup it takes almost 40 minutes!
My "validation.parquet" is a small NVTabular dataset with 16 partitions, totalling almost 200 MB.
With the transform enabled I also see very low CPU and GPU utilization and close to zero GPU memory consumption; neither the CPU nor the GPU ever exceeds about 6% utilization.
It seems very strange that simply gathering batch_size rows from a NumPy array takes that long, even accounting for moving them to the GPU.
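
For context, a plain NumPy row gather at this scale is on the order of microseconds. A standalone sketch (not using Merlin, same shapes as the reproducer below) to illustrate the expected cost of the lookup itself:

import time

import numpy as np

# Same shapes as in the reproducer below: a 1M x 2 float32 table, batches of 4096 keys.
pretrained_array = np.zeros((1_000_000, 2), dtype=np.float32)
keys = np.random.randint(0, 1_000_000, size=4096)

start = time.perf_counter()
batch_embeddings = pretrained_array[keys]  # fancy-indexed row gather
elapsed_ms = (time.perf_counter() - start) * 1e3
print(f"Gathered batch of shape {batch_embeddings.shape} in {elapsed_ms:.3f} ms")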

Details

Here is a minimal working example to reproduce this degradation.

from __future__ import annotations

from pathlib import Path

import numpy as np
from merlin.dataloader.ops.embeddings import EmbeddingOperator
from merlin.io.dataset import Dataset
from merlin.loader.tensorflow import Loader
from tqdm.auto import tqdm


def test_pretrained_loader():
    data_path = Path("validation.parquet")
    X = Dataset(data_path, engine="parquet")

    # Dummy pretrained embedding table: one 2-dimensional vector per id.
    pretrained_array = np.zeros((1_000_000, 2), dtype=np.float32)

    loader = Loader(
        X,
        batch_size=4096,
        shuffle=True,
        transforms=[
            # Look up each row's "recruitment_id" in the pretrained array
            # and attach the result as an "embeddings" column.
            EmbeddingOperator(
                pretrained_array,
                lookup_key="recruitment_id",
                embedding_name="embeddings",
            )
        ],
        device="gpu",
    )

    # Simply consume all batches to measure throughput.
    for batch in tqdm(loader, desc="Iterating batches..."):
        pass


if __name__ == "__main__":
    test_pretrained_loader()
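
For reference, the roughly 6-second baseline is the same loader with the transforms argument simply omitted (trimmed from the example above):

from pathlib import Path

from merlin.io.dataset import Dataset
from merlin.loader.tensorflow import Loader
from tqdm.auto import tqdm


def test_baseline_loader():
    # Identical setup to the reproducer above, but without the EmbeddingOperator transform.
    X = Dataset(Path("validation.parquet"), engine="parquet")
    loader = Loader(X, batch_size=4096, shuffle=True, device="gpu")
    for batch in tqdm(loader, desc="Iterating batches (no transforms)..."):
        pass


if __name__ == "__main__":
    test_baseline_loader()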

Question

Is this behaviour intended? What are the likely bottlenecks here? Would something like data prefetching or asynchronous loading be applicable in this case?
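
To clarify what I mean by prefetching: something that keeps a few batches ready on a background thread while the previous batch is being consumed. A generic, standard-library-only sketch of the idea (not a Merlin API):

import queue
import threading


def prefetch(iterable, buffer_size=4):
    """Iterate `iterable` on a background thread, keeping up to
    `buffer_size` items ready while the consumer works."""
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()

    def producer():
        for item in iterable:
            q.put(item)
        q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while (item := q.get()) is not sentinel:
        yield item


# e.g. wrapping the loader from the reproducer above:
# for batch in tqdm(prefetch(loader), desc="Iterating batches..."):
#     pass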
