-
Notifications
You must be signed in to change notification settings - Fork 152
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DataPipe] extract keys #406
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This DataPipe looks useful for other cases as well if me modify it a bit.
Could you please separate it by two pipes:
- 1st one to filter dict keys based on pattern:
stage1 = IterableWrapper([
{"1.txt": "1", "1.bin": "1b", "3.jpg":"foo"},
{"2.txt": "2", "2.bin": "2b"},
])
stage2 = ExtractKeys(stage1, "*.txt", "*.bin")
output = list(iter(stage2))
self.assertEqual({"1.txt": "1", "1.bin": "1b"}, output[0])
- Second is simple map, to drop keys:
dp = dp.map(lambda x: x.values())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @tmbdev,
Thanks for your commits on these PRs. Let us know if these are ready for review (but no rush at all!). @VitalyFedyunin and I will be happy to have a look.
Again, thanks for contributing to our library!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments. Feel free to not accept every requested change. Can you rebase as well?
Again, thank you so much for your contribution!
|
||
|
||
@functional_datapipe("extract_keys") | ||
class ExtractKeysIterDataPipe(IterDataPipe[Dict]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename this to KeyExtractor
to follow our naming convention? Thanks.
We can still keep "extract_keys"
as the functional name.
@@ -951,6 +952,30 @@ def test_mux_longest_iterdatapipe(self): | |||
with self.assertRaises(TypeError): | |||
len(output_dp) | |||
|
|||
def test_extractor(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def test_extractor(self): | |
def test_key_extractor(self): |
nit: We used to have a different extractor
duplicate_is_error: it is an error if the same key is selected twice (True) | ||
ignore_missing: skip any dictionaries where one or more patterns don't match (False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
duplicate_is_error: it is an error if the same key is selected twice (True) | |
ignore_missing: skip any dictionaries where one or more patterns don't match (False) |
Duplicate lines of descriptions
*args: list of glob patterns or list of glob patterns | ||
duplicate_is_error: it is an error if the same key is selected twice (True) | ||
ignore_missing: allow patterns not to match (i.e., incomplete outputs) | ||
as_tuple: return a tuple instead of a dictionary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as_tuple: return a tuple instead of a dictionary | |
as_tuple: return a tuple instead of a dictionary (True or False here) |
""" | ||
|
||
def __init__( | ||
self, source_datapipe: IterDataPipe[Dict], *args, duplicate_is_error=True, ignore_missing=False, as_tuple=False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to default as_tuple=False
? Based on the docstring I would've guessed you wanted True
instead.
self, source_datapipe: IterDataPipe[Dict], *args, duplicate_is_error=True, ignore_missing=False, as_tuple=False | |
self, source_datapipe: IterDataPipe[Dict], *args, duplicate_is_error: bool = True, ignore_missing: bool = False, as_tuple: bool = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: allow_duplicate
might be a better name than duplicate_is_error
def __len__(self) -> int: | ||
return len(self.source_datapipe) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: A sample will always be yielded even if nothing matches right?
duplicate_is_error: it is an error if the same key is selected twice (True) | ||
ignore_missing: skip any dictionaries where one or more patterns don't match (False) | ||
*args: list of glob patterns or list of glob patterns | ||
duplicate_is_error: it is an error if the same key is selected twice (True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
duplicate_is_error: it is an error if the same key is selected twice (True) | |
duplicate_is_error: it is an error if the same key is selected twice (True), otherwise returns the first matched value |
if len(matches) > 1 and self.duplicate_is_error: | ||
raise ValueError(f"extract_keys: multiple sample keys {sample.keys()} match {pattern}.") | ||
if matches[0] in used and self.duplicate_is_error: | ||
raise ValueError(f"extract_keys: key {matches[0]} is selected twice.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
raise ValueError(f"extract_keys: key {matches[0]} is selected twice.") | |
raise ValueError(f"extract_keys: key {matches[0]} is selected twice by multiple patterns.") |
nit
@functional_datapipe("extract_keys") | ||
class ExtractKeysIterDataPipe(IterDataPipe[Dict]): | ||
r""" | ||
Given a stream of dictionaries, return a stream of tuples by selecting keys using glob patterns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given a stream of dictionaries, return a stream of tuples by selecting keys using glob patterns. | |
Given a stream of dictionaries, return a stream of dicts (or tuples) by selecting keys using glob patterns. |
>>> dp = FileLister(...).load_from_tar().webdataset().decode(...).extract_keys(["*.jpg", "*.png"], "*.gt.txt") | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to the one example with webdataset, please add an example with sample outputs here. Copying from the test cases is totally fine to me.
This PR adds an ExtractKeys filter that turns samples represented as dictionaries into tuples. Tuples are constructed by selecting values from the dictionaries by matching the key against a given set of patterns.