🚀 The feature
An IterDataPipe which can consume from stdin and automatically re-cycle each epoch.
Motivation, pitch
I'd like to push data augmentation and preprocessing upstream so model training/inference can operate directly on tokens streamed over stdin. This allows for tremendous flexibility without a user needing to hard-code a preprocessing pipeline in userland code. For an NLP use-case, I imagine something like...
with some code similar to
Alternatives
The preprocessed text could be written to a file which native torchdata constructs could operate on directly. This is fine, but requires a copy of the data to be written to disk.
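For reference, a minimal sketch of this alternative, assuming the upstream output is line-oriented text. The helper names `spool_to_disk` and `read_epoch` are hypothetical and not part of torchdata; `io.StringIO` stands in for a piped stdin so the snippet runs on its own.

```python
import io
import os
import shutil
import tempfile
from typing import Iterator, TextIO


def spool_to_disk(stream: TextIO) -> str:
    """Copy everything from `stream` into a temporary file and return its path.

    Hypothetical helper: this is the extra on-disk copy the alternative requires.
    """
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as tmp:
        shutil.copyfileobj(stream, tmp)
        return tmp.name


def read_epoch(path: str) -> Iterator[str]:
    """Yield one stripped line per sample; reopening the file restarts the epoch."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")


# In real use the argument would be sys.stdin; StringIO is an assumption for the demo.
path = spool_to_disk(io.StringIO("a b c\nd e f\n"))
epochs = [list(read_epoch(path)) for _ in range(2)]
os.remove(path)  # clean up the temporary copy
```

Every epoch sees the full data because the file can be reopened, at the cost of the disk copy noted above.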
Additional context
The current code doesn't work because sys.stdin closes when it reaches EOF, so the dataloader only sees a single epoch's worth of data.
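The single-epoch behaviour can be reproduced without torchdata. The sketch below is not the code from this issue: `StreamLines` is a hypothetical stand-in for an IterDataPipe-style iterable, and `io.StringIO` is assumed as a substitute for a piped stdin.

```python
import io
import sys
from typing import Iterator, TextIO


class StreamLines:
    """Hypothetical iterable that yields stripped lines from a text stream."""

    def __init__(self, stream: TextIO = sys.stdin) -> None:
        self.stream = stream

    def __iter__(self) -> Iterator[str]:
        # Each call resumes from the stream's current position; nothing rewinds it.
        for line in self.stream:
            yield line.rstrip("\n")


# StringIO stands in for stdin here; both are exhausted after a single pass.
fake_stdin = io.StringIO("the quick brown fox\njumps over the lazy dog\n")
pipe = StreamLines(fake_stdin)

epoch_1 = list(pipe)  # first pass reads both lines
epoch_2 = list(pipe)  # second pass starts at EOF and yields nothing
```

Unlike `StringIO`, a real pipe cannot be rewound with `seek(0)`, so a second epoch would need either the upstream command to be re-run or the data to be buffered somewhere.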