[DataPipe] extract keys #406

tmbdev · 2022-05-13T20:05:29Z

This PR adds an ExtractKeys filter that turns samples represented as dictionaries into tuples. Tuples are constructed by selecting values from the dictionaries by matching the key against a given set of patterns.
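To make the description concrete, here is a minimal pure-Python sketch of the idea, not the code in this PR. The helper name `extract_as_tuple` is hypothetical, and it assumes that the first matching key wins and that the values come back in pattern order.

```python
from fnmatch import fnmatchcase


def extract_as_tuple(sample, *patterns):
    """Return one value per pattern, in pattern order (hypothetical helper)."""
    values = []
    for pattern in patterns:
        # Keys are scanned in the dictionary's insertion order.
        matches = [key for key in sample if fnmatchcase(key, pattern)]
        if not matches:
            raise KeyError(f"no key matches {pattern!r}")
        # Assumption: when several keys match, the first one is used.
        values.append(sample[matches[0]])
    return tuple(values)


sample = {"1.txt": "1", "1.bin": "1b", "3.jpg": "foo"}
result = extract_as_tuple(sample, "*.bin", "*.txt")
print(result)
```

Under these assumptions the tuple follows the order of the patterns rather than the order of the keys in the sample.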

torchdata/datapipes/iter/util/extractkeys.py

VitalyFedyunin

This DataPipe looks useful for other cases as well if we modify it a bit.

Could you please split it into two pipes?

The first one filters dict keys based on patterns:

        stage1 = IterableWrapper([
            {"1.txt": "1", "1.bin": "1b", "3.jpg":"foo"},
            {"2.txt": "2", "2.bin": "2b"},
        ])
        stage2 = ExtractKeys(stage1, "*.txt", "*.bin")
        output = list(iter(stage2))
        self.assertEqual({"1.txt": "1", "1.bin": "1b"}, output[0])

The second is a simple map that drops the keys:

        dp = dp.map(lambda x: x.values())
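The two-stage split proposed above can be sketched without torchdata as follows. The generator `filter_keys` is a hypothetical stand-in for the first pipe, and a plain list comprehension stands in for the `map` step; neither is the library's API.

```python
from fnmatch import fnmatchcase


def filter_keys(samples, *patterns):
    """Stage 1 (hypothetical): keep only the entries whose key matches a pattern."""
    for sample in samples:
        yield {
            key: value
            for key, value in sample.items()
            if any(fnmatchcase(key, pattern) for pattern in patterns)
        }


stage1 = [
    {"1.txt": "1", "1.bin": "1b", "3.jpg": "foo"},
    {"2.txt": "2", "2.bin": "2b"},
]
filtered = list(filter_keys(stage1, "*.txt", "*.bin"))

# Stage 2: drop the keys. dict.values() is a view, so it is wrapped in tuple().
as_tuples = [tuple(sample.values()) for sample in filtered]
print(filtered[0], as_tuples)
```

One consequence of this split is that the tuple order follows the key order of each sample, not the order in which the patterns were given.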

test/test_iterdatapipe.py

NivekT

Hi @tmbdev,

Thanks for your commits on these PRs. Let us know if these are ready for review (but no rush at all!). @VitalyFedyunin and I will be happy to have a look.

Again, thanks for contributing to our library!

NivekT

A few comments. Feel free to not accept every requested change. Can you rebase as well?

Again, thank you so much for your contribution!

NivekT · 2022-09-13T20:18:49Z

torchdata/datapipes/iter/util/extractkeys.py

+
+
+@functional_datapipe("extract_keys")
+class ExtractKeysIterDataPipe(IterDataPipe[Dict]):


Can we rename this to KeyExtractor to follow our naming convention? Thanks.

We can still keep "extract_keys" as the functional name.

NivekT · 2022-09-13T20:23:09Z

test/test_iterdatapipe.py

@@ -951,6 +952,30 @@ def test_mux_longest_iterdatapipe(self):
        with self.assertRaises(TypeError):
            len(output_dp)

+    def test_extractor(self):


Suggested change

-    def test_extractor(self):
+    def test_key_extractor(self):

nit: We used to have a different extractor

NivekT · 2022-09-13T20:24:47Z

torchdata/datapipes/iter/util/extractkeys.py

+        duplicate_is_error: it is an error if the same key is selected twice (True)
+        ignore_missing: skip any dictionaries where one or more patterns don't match (False)


Suggested change

-        duplicate_is_error: it is an error if the same key is selected twice (True)
-        ignore_missing: skip any dictionaries where one or more patterns don't match (False)

Duplicate lines of descriptions

NivekT · 2022-09-13T20:25:31Z

torchdata/datapipes/iter/util/extractkeys.py

+        *args: list of glob patterns or list of glob patterns
+        duplicate_is_error: it is an error if the same key is selected twice (True)
+        ignore_missing: allow patterns not to match (i.e., incomplete outputs)
+        as_tuple: return a tuple instead of a dictionary


Suggested change

-        as_tuple: return a tuple instead of a dictionary
+        as_tuple: return a tuple instead of a dictionary (True or False here)

NivekT · 2022-09-13T20:27:07Z

torchdata/datapipes/iter/util/extractkeys.py

+    """
+
+    def __init__(
+        self, source_datapipe: IterDataPipe[Dict], *args, duplicate_is_error=True, ignore_missing=False, as_tuple=False


Do we want to default as_tuple=False? Based on the docstring I would've guessed you wanted True instead.

Suggested change

-        self, source_datapipe: IterDataPipe[Dict], *args, duplicate_is_error=True, ignore_missing=False, as_tuple=False
+        self, source_datapipe: IterDataPipe[Dict], *args, duplicate_is_error: bool = True, ignore_missing: bool = False, as_tuple: bool = False

nit: allow_duplicate might be a better name than duplicate_is_error

NivekT · 2022-09-13T20:28:35Z

torchdata/datapipes/iter/util/extractkeys.py

+    def __len__(self) -> int:
+        return len(self.source_datapipe)


Question: A sample will always be yielded even if nothing matches, right?
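The question matters because `__len__` forwards the source length. As a hedged illustration only, the hypothetical generator below assumes the "skip any dictionaries where one or more patterns don't match" reading of `ignore_missing`; under that reading the source length is only an upper bound on the number of yielded samples.

```python
from fnmatch import fnmatchcase


def extract_or_skip(samples, *patterns):
    """Yield filtered dicts, skipping samples that miss any pattern (hypothetical)."""
    for sample in samples:
        picked = {}
        for pattern in patterns:
            matches = [key for key in sample if fnmatchcase(key, pattern)]
            if not matches:
                break  # a pattern found nothing: drop the whole sample
            picked[matches[0]] = sample[matches[0]]
        else:
            # Runs only when the inner loop finished without `break`.
            yield picked


source = [
    {"1.txt": "1", "1.bin": "1b"},
    {"2.txt": "2"},  # has no "*.bin" key
]
output = list(extract_or_skip(source, "*.txt", "*.bin"))
print(len(source), len(output))
```

If a sample is always yielded, even when incomplete, the forwarded length stays exact; if samples can be dropped, it does not.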

NivekT · 2022-09-13T20:52:32Z

torchdata/datapipes/iter/util/extractkeys.py

+        duplicate_is_error: it is an error if the same key is selected twice (True)
+        ignore_missing: skip any dictionaries where one or more patterns don't match (False)
+        *args: list of glob patterns or list of glob patterns
+        duplicate_is_error: it is an error if the same key is selected twice (True)


Suggested change

-        duplicate_is_error: it is an error if the same key is selected twice (True)
+        duplicate_is_error: it is an error if the same key is selected twice (True), otherwise returns the first matched value

NivekT · 2022-09-13T20:54:30Z

torchdata/datapipes/iter/util/extractkeys.py

+                if len(matches) > 1 and self.duplicate_is_error:
+                    raise ValueError(f"extract_keys: multiple sample keys {sample.keys()} match {pattern}.")
+                if matches[0] in used and self.duplicate_is_error:
+                    raise ValueError(f"extract_keys: key {matches[0]} is selected twice.")


Suggested change

-                    raise ValueError(f"extract_keys: key {matches[0]} is selected twice.")
+                    raise ValueError(f"extract_keys: key {matches[0]} is selected twice by multiple patterns.")

nit
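For readers following the two checks in this hunk, here is a standalone sketch of how they could behave. The function `select_keys` and the wrapper `error_or_result` are hypothetical; only the two `if` conditions mirror the quoted lines, and the first-match fallback when `duplicate_is_error=False` is an assumption based on the docstring suggestion above.

```python
from fnmatch import fnmatchcase


def select_keys(sample, patterns, duplicate_is_error=True):
    """Pick one key per pattern, applying the two duplicate checks (hypothetical)."""
    used = set()
    selected = []
    for pattern in patterns:
        matches = [key for key in sample if fnmatchcase(key, pattern)]
        if not matches:
            continue  # missing-key handling is out of scope for this sketch
        if len(matches) > 1 and duplicate_is_error:
            raise ValueError(f"multiple sample keys {matches} match {pattern}.")
        if matches[0] in used and duplicate_is_error:
            raise ValueError(f"key {matches[0]} is selected twice by multiple patterns.")
        used.add(matches[0])
        selected.append(matches[0])
    return selected


def error_or_result(*args, **kwargs):
    """Return the selected keys, or the error message as a string."""
    try:
        return select_keys(*args, **kwargs)
    except ValueError as exc:
        return f"ValueError: {exc}"


# One pattern, two matching keys.
ambiguous = error_or_result({"a.txt": 1, "b.txt": 2}, ["*.txt"])
# Two patterns that both select the same key.
repeated = error_or_result({"a.txt": 1}, ["*.txt", "a.*"])
# With the check disabled, the first match is taken (assumption).
lenient = error_or_result({"a.txt": 1, "b.txt": 2}, ["*.txt"], duplicate_is_error=False)
print(ambiguous, repeated, lenient, sep="\n")
```

The two errors describe different situations, which is why the more specific wording "by multiple patterns" helps distinguish the second from the first.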

NivekT · 2022-09-13T20:55:24Z

torchdata/datapipes/iter/util/extractkeys.py

+@functional_datapipe("extract_keys")
+class ExtractKeysIterDataPipe(IterDataPipe[Dict]):
+    r"""
+    Given a stream of dictionaries, return a stream of tuples by selecting keys using glob patterns.


Suggested change

-    Given a stream of dictionaries, return a stream of tuples by selecting keys using glob patterns.
+    Given a stream of dictionaries, return a stream of dicts (or tuples) by selecting keys using glob patterns.

NivekT · 2022-09-13T20:57:13Z

torchdata/datapipes/iter/util/extractkeys.py

+        >>> dp = FileLister(...).load_from_tar().webdataset().decode(...).extract_keys(["*.jpg", "*.png"], "*.gt.txt")
+    """


In addition to the one example with webdataset, please add an example with sample outputs here. Copying from the test cases is totally fine with me.
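As a rough idea of what an example with sample outputs could look like, here is a pure-Python sketch rather than the DataPipe itself. The functions `first_match` and `extract` are hypothetical, and the sketch assumes that a list argument such as `["*.jpg", "*.png"]` means "any of these patterns" and that `as_tuple` switches between the dict and tuple forms.

```python
from fnmatch import fnmatchcase


def first_match(sample, alternatives):
    """Return the first key matching any of the alternative patterns, else None."""
    for key in sample:
        if any(fnmatchcase(key, pattern) for pattern in alternatives):
            return key
    return None


def extract(sample, *args, as_tuple=False):
    """Select one entry per argument; each argument is a pattern or a list of patterns."""
    picked = {}
    for arg in args:
        # Assumption: a list of patterns is treated as a set of alternatives.
        alternatives = [arg] if isinstance(arg, str) else list(arg)
        key = first_match(sample, alternatives)
        if key is not None:
            picked[key] = sample[key]
    return tuple(picked.values()) if as_tuple else picked


sample = {"a.png": "IMG", "a.gt.txt": "cat", "a.json": "{}"}
as_dict = extract(sample, ["*.jpg", "*.png"], "*.gt.txt")
as_tup = extract(sample, ["*.jpg", "*.png"], "*.gt.txt", as_tuple=True)
print(as_dict)
print(as_tup)
```

Showing both return forms side by side would also make the `as_tuple` default easier to judge.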

Tom added 2 commits May 13, 2022 12:27

merged

751da99

added extractkeys

5dc2a89

facebook-github-bot added the CLA Signed label May 13, 2022

VitalyFedyunin reviewed May 19, 2022

torchdata/datapipes/iter/util/extractkeys.py

VitalyFedyunin reviewed May 19, 2022

test/test_iterdatapipe.py (outdated)

VitalyFedyunin reviewed May 19, 2022

test/test_iterdatapipe.py

VitalyFedyunin changed the title ~~extract keys~~ [DataPipe] extract keys May 19, 2022

VitalyFedyunin mentioned this pull request May 20, 2022

[DataPipe] key renamer #402

Open

tmbdev and others added 4 commits August 31, 2022 12:51

added as_tuple option, better testing, duplicate detection

59298b7

Merge branch 'main' into wdsextractkeys

45ae754

fixed type errors

b31d721

improved documentation in extract_keys

ba9b5a4

NivekT reviewed Sep 7, 2022

NivekT reviewed Sep 13, 2022
