Make datapipe.py compatible with torchdata==0.11 or later #10080

Conversation
Co-authored-by: Serge Panev <[email protected]>
Make example.datapipe.py compatible with torchdata==0.11 or later

examples/datapipe.py (Outdated)
```python
_, cached_datapipe = tee(datapipe)
```
How is it different from this?

```diff
- _, cached_datapipe = tee(datapipe)
+ cached_datapipe = tee(datapipe, 1)
```

Also, can you remind me of why we need `tee`?
- With the change you proposed, this example fails:
```
<itertools._tee object at 0x7f3512666b80>
Iterating over all data...
Traceback (most recent call last):
  File "/workspace/examples/datapipe.py", line 131, in <module>
    for batch in datapipe:
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/datapipes/_hook_iterator.py", line 204, in wrap_generator
    response = gen.send(None)
               ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/datapipes/iter/grouping.py", line 97, in __iter__
    yield self.wrapper_class(batch)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch_geometric/data/batch.py", line 97, in from_data_list
    batch, slice_dict, inc_dict = collate(
                                  ^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch_geometric/data/collate.py", line 56, in collate
    out = cls(_base_cls=data_list[0].__class__)  # type: ignore
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch_geometric/data/batch.py", line 36, in __call__
    globals()[name] = MetaResolver(name, (cls, base_cls), {})
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: type 'itertools._tee' is not an acceptable base type
This exception is thrown by __iter__ of Batcher(batch_size=32, drop_last=False)
```
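The final `TypeError` can be reproduced in isolation. A plausible reading of the traceback: `tee(datapipe, 1)` returns a 1-tuple, so without unpacking the single branch, the tee iterator itself flows through the pipeline as a data item (hence the `<itertools._tee object ...>` printed above), and PyG's `collate` then tries to build a dynamic class from that item's class. CPython refuses because `itertools._tee` cannot be subclassed. A minimal sketch:

```python
from itertools import tee

branch, = tee(range(3), 1)

# collate builds a dynamic class from the first item's class (see the
# traceback above); with a raw tee iterator as the base, this fails:
try:
    type("Dynamic", (type(branch),), {})
except TypeError as exc:
    print(exc)  # type 'itertools._tee' is not an acceptable base type
```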
- I tested both approaches, for both the `molecule` and `mesh` values of the `--task` option:

  (a) With `tee`:

  ```python
  _, cached_datapipe = tee(datapipe)
  datapipe = IterableWrapper(cached_datapipe)
  ```

  (b) Without `tee`:

  ```python
  datapipe = IterableWrapper(datapipe)
  ```
- For `molecule`, there is no significant difference.
- For `mesh`, using version (a) results in:

  ```
  Iterating over all data...
  Done! [68.57s]
  Iterating over all data a second time...
  Done! [0.00s]
  ```

  whereas version (b) results in:

  ```
  Iterating over all data...
  Done! [68.50s]
  Iterating over all data a second time...
  Done! [67.46s]
  ```

  Since version (a) significantly improves performance on the second iteration for `mesh`, using it for both cases would be more appropriate for consistency.
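The near-instant second pass in version (a) is consistent with `itertools.tee`'s buffering: items pulled through one branch are retained as long as any other branch (even one that is never consumed) still references them, so the data can be replayed without re-running the upstream pipeline. A minimal sketch of that buffering, with a counter standing in for the expensive mesh pipeline (all names here are illustrative):

```python
from itertools import tee
import copy

produced = []

def expensive_source():
    # stands in for the costly datapipe; records every item it yields
    for i in range(3):
        produced.append(i)
        yield i

branch, = tee(expensive_source(), 1)
snapshot = copy.copy(branch)  # tee iterators support cheap copies that
                              # share the underlying buffer

print(list(branch))    # [0, 1, 2]  -- first (expensive) pass
print(list(snapshot))  # [0, 1, 2]  -- replayed from tee's buffer
print(produced)        # [0, 1, 2]  -- the source ran only once
```

Presumably something equivalent to the `copy.copy` step happens when the wrapped tee branch is re-iterated in the example, which would explain why only the first pass pays the full cost.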
> With the change you proposed, this example fails:

Sorry for the typo! What I wanted to ask is what the difference is between:

```python
_, cached_datapipe = tee(datapipe)
```

and

```python
cached_datapipe, = tee(datapipe, 1)
```

To me, these are exactly the same, but your way just creates an extra unnecessary object.
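For what it's worth, the two spellings can be checked side by side; the branches yield identical contents, and the only difference is the extra never-consumed branch in the two-branch form:

```python
from itertools import tee

_, cached_a = tee(range(5))   # default n=2: the first branch is discarded
cached_b, = tee(range(5), 1)  # n=1: only the branch that is needed

print(list(cached_a))  # [0, 1, 2, 3, 4]
print(list(cached_b))  # [0, 1, 2, 3, 4]
```

Note that as long as the discarded `_` branch stays referenced, every item consumed through `cached_a` is kept in tee's buffer for it, so the two-branch form can also cost memory, not just an extra object.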
Yes, it works. Thank you for the great suggestion!
for more information, see https://pre-commit.ci
This reverts commit 597b2f5.
With the torch 2.7 / torchdata 0.11 releases, the `datapipe.py` example fails because some classes have been moved from `torchdata` to `torch`, and a few methods, including `in_memory_cache`, have been removed. The current PR resolves this issue in that example.
NOTES:

- Running this example with the option `--task mesh` was failing even with previous versions of the `torch`/`torchdata` packages when the `meshio` and `torch_cluster` packages are not installed.

Proposed changes address both these issues.