Make datapipe.py compatible with torchdata==0.11 or later #10080

Merged
merged 18 commits into from
Mar 6, 2025

Conversation

drivanov (Contributor)

With the torch 2.7 / torchdata 0.11 releases, the datapipe.py example fails:

python3 /workspace/examples/datapipe.py 
Traceback (most recent call last):
  File "/workspace/examples/datapipe.py", line 17, in <module>
    from torchdata.datapipes.iter import FileLister, FileOpener, IterDataPipe
ModuleNotFoundError: No module named 'torchdata.datapipes'

It fails because some classes have been moved from torchdata to torch, and a few methods, including in_memory_cache, have been removed.

This PR resolves the issue in that example.

NOTES:
Running this example with the --task mesh option was failing even with previous versions of the torch/torchdata packages:

  • because 3991 out of the 7981 mesh files are binary, not text files in the required "OFF" format;
  • if the meshio and torch_cluster packages are not installed.

The proposed changes address both of these issues.

@drivanov drivanov requested a review from wsad1 as a code owner February 28, 2025 03:35
@akihironitta akihironitta changed the title Fixing datapipe.py example. Make datapipe.py compatible with torchdata==0.11 or later Mar 1, 2025

_, cached_datapipe = tee(datapipe)
Member
How is it different from this?

Suggested change:
-    _, cached_datapipe = tee(datapipe)
+    cached_datapipe = tee(datapipe, 1)

Also, can you remind me of why we need tee?

Contributor, Author

  1. With the change you proposed, this example fails:
<itertools._tee object at 0x7f3512666b80>
Iterating over all data...
Traceback (most recent call last):
  File "/workspace/examples/datapipe.py", line 131, in <module>
    for batch in datapipe:
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/datapipes/_hook_iterator.py", line 204, in wrap_generator
    response = gen.send(None)
               ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/datapipes/iter/grouping.py", line 97, in __iter__
    yield self.wrapper_class(batch)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch_geometric/data/batch.py", line 97, in from_data_list
    batch, slice_dict, inc_dict = collate(
                                  ^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch_geometric/data/collate.py", line 56, in collate
    out = cls(_base_cls=data_list[0].__class__)  # type: ignore
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch_geometric/data/batch.py", line 36, in __call__
    globals()[name] = MetaResolver(name, (cls, base_cls), {})
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: type 'itertools._tee' is not an acceptable base type
This exception is thrown by __iter__ of Batcher(batch_size=32, drop_last=False)
  2. I tested both approaches:

(a) With tee:

    _, cached_datapipe = tee(datapipe)
    datapipe = IterableWrapper(cached_datapipe)

(b) Without tee:

    datapipe = IterableWrapper(datapipe)

for both molecule and mesh values of the --task option.

  • For molecule, there is no significant difference.
  • For mesh, using version (a) results in:
Iterating over all data...
Done! [68.57s]
Iterating over all data a second time...
Done! [0.00s]

Whereas version (b) results in:

Iterating over all data...
Done! [68.50s]
Iterating over all data a second time...
Done! [67.46s]

Since version (a) significantly improves performance on the second iteration for mesh, using it for both cases would be more appropriate for consistency.
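For reference, the replay behavior that makes the second pass near-instant can be sketched with plain itertools, independent of torchdata. This is a minimal illustration; `expensive_gen` is a hypothetical stand-in for the real datapipe, and it relies on `itertools._tee` supporting `copy.copy()` (which is how a wrapper can obtain a fresh iterator over the buffered items):

```python
import copy
from itertools import tee

def expensive_gen():
    # Hypothetical stand-in for a datapipe whose items are costly to produce.
    for i in range(3):
        yield i * i

# tee() buffers every item pulled from the underlying iterator.
cached, = tee(expensive_gen(), 1)

# _tee objects can be shallow-copied; the copy shares the buffer, so it
# can replay items without re-running the generator.
snapshot = copy.copy(cached)

first_pass = list(cached)     # consumes the generator, filling the buffer
second_pass = list(snapshot)  # served from the buffer, no recomputation
print(first_pass, second_pass)
```

This mirrors why version (a) finishes the second iteration in ~0 s: the items are replayed from tee's internal buffer rather than recomputed.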

Member

With the change you proposed, this example fails:

Sorry for the typo! What I wanted to ask is, what the difference is between:

    _, cached_datapipe = tee(datapipe)

and

    cached_datapipe, = tee(datapipe, 1)

To me, these are exactly the same, but your way just creates an extra unnecessary object.
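A quick standalone check confirms the two spellings yield the same items, with n=1 avoiding the unused copy (the sample data here is illustrative, not from the example):

```python
from itertools import tee

data = [1, 2, 3]  # illustrative sample; any iterable works

_, a = tee(iter(data))   # default n=2: creates an extra, unused tee object
b, = tee(iter(data), 1)  # n=1: returns a 1-tuple, no wasted copy

same = list(a) == list(b)
print(same)
```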

Contributor, Author

Yes, it works. Thank you for the great suggestion!

@akihironitta akihironitta enabled auto-merge (squash) March 6, 2025 07:17
@akihironitta akihironitta disabled auto-merge March 6, 2025 07:18
@akihironitta akihironitta merged commit 08697a7 into pyg-team:master Mar 6, 2025
16 checks passed