Handle training-validation-test splits in NESMusicDatabase #29

salu133445 · 2020-12-25T19:41:59Z

The current implementation of NESMusicDatabase does not handle the training-validation-test splits provided in the original dataset. To avoid changing the base Dataset class too much, we could add a subset method and achieve something like the following.

nes = muspy.NESMusicDatabase("data/nes/")

training_set = nes.subset("training")  # also a Dataset object
validation_set = nes.subset("validation")
test_set = nes.subset("test")

The text was updated successfully, but these errors were encountered:

cifkao · 2020-12-25T22:25:41Z

I also need to load a dataset that has splits. I added a part parameter to my constructor:

class Groove2GrooveDataset(muspy.RemoteFolderDataset):

    ...
    
    def __init__( 
        self,
        root: Union[str, Path],
        download_and_extract: bool = False,
        cleanup: bool = False,
        convert: bool = False,
        kind: str = "json",
        n_jobs: int = 1,
        ignore_exceptions: bool = True,
        use_converted: Optional[bool] = None,
        part: str = "train"
    ):
        muspy.RemoteFolderDataset.__init__(
            self, root=root, download_and_extract=download_and_extract,
            cleanup=cleanup, convert=convert, kind=kind, n_jobs=n_jobs,
            ignore_exceptions=ignore_exceptions, use_converted=use_converted)

        path = self.root / 'groove2groove-data-v1.0.0' / 'midi' / part / 'fixed'
        self.raw_filenames = sorted(
            (
                filename
                for filename in path.rglob("*." + self._extension)
            )
        )
        self._filenames = self.raw_filenames

However, this doesn't work:

>>> test_data = Groove2GrooveDataset('/tmp/groove2groove-data', part='test', download_and_extract=True)  # OK
>>> test_data.convert()  # OK
>>> val_data = Groove2GrooveDataset('/tmp/groove2groove-data', part='val')  # OK, reuses downloaded data
>>> val_data.convert()  # not OK, skips conversion as '_converted' already contains the test data

Edit: Overriding converted_dir fixed it:

    @property
    def converted_dir(self):
        return self.root / "_converted_{}".format(self.part)

salu133445 · 2020-12-27T20:02:32Z

I see your point. We could have a Subset class for this, which does not have a convert method but can be iterated over just like a regular dataset. The key is that the subset always share the data with its parent dataset and the only difference is in the filenames to look for.

nes = muspy.NESMusicDatabase("data/nes/")
nes.convert()
training_set = nes.subset("training")

And this won't work

nes = muspy.NESMusicDatabase("data/nes/")
training_set = nes.subset("training")
training_set.convert()  # error

salu133445 added enhancement New feature or request good first issue Good for newcomers labels Dec 25, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle training-validation-test splits in NESMusicDatabase #29

Handle training-validation-test splits in NESMusicDatabase #29

salu133445 commented Dec 25, 2020

cifkao commented Dec 25, 2020 •

edited

Loading

salu133445 commented Dec 27, 2020

Handle training-validation-test splits in NESMusicDatabase #29

Handle training-validation-test splits in NESMusicDatabase #29

Comments

salu133445 commented Dec 25, 2020

cifkao commented Dec 25, 2020 • edited Loading

salu133445 commented Dec 27, 2020

cifkao commented Dec 25, 2020 •

edited

Loading