Handle training-validation-test splits in NESMusicDatabase #29

Open
salu133445 opened this issue Dec 25, 2020 · 2 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@salu133445
Owner

The current implementation of NESMusicDatabase does not handle the training-validation-test splits provided in the original dataset. To avoid changing the base Dataset class too much, we could add a subset method that allows something like the following:

nes = muspy.NESMusicDatabase("data/nes/")

training_set = nes.subset("training")  # also a Dataset object
validation_set = nes.subset("validation")
test_set = nes.subset("test")
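
As a rough illustration of the proposal above, here is a minimal sketch of what a subset method could look like. It does not use MusPy: `ToyDataset` is a hypothetical stand-in class, and the assumption that each split lives in its own folder under the root (`training`, `validation`, `test`) is made up for the example and may not match the real NES Music Database layout.

```python
import tempfile
from pathlib import Path


class ToyDataset:
    """Hypothetical stand-in for a folder-based dataset (not the MusPy class)."""

    def __init__(self, root, filenames=None):
        self.root = Path(root)
        # By default, pick up every file below the root directory.
        if filenames is None:
            filenames = sorted(p for p in self.root.rglob("*") if p.is_file())
        self.filenames = list(filenames)

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, index):
        return self.filenames[index]

    def subset(self, split):
        """Return a new dataset limited to the files under ``root/split``."""
        split_dir = self.root / split
        selected = [f for f in self.filenames if split_dir in f.parents]
        return ToyDataset(self.root, selected)


# Build a throwaway directory tree with one folder per split (assumed layout).
with tempfile.TemporaryDirectory() as tmp:
    for split, count in [("training", 3), ("validation", 1), ("test", 2)]:
        (Path(tmp) / split).mkdir()
        for i in range(count):
            (Path(tmp) / split / f"song_{i}.mid").touch()

    full = ToyDataset(tmp)
    split_sizes = {
        name: len(full.subset(name))
        for name in ("training", "validation", "test")
    }
    # The result of subset() is itself a dataset object.
    is_dataset = isinstance(full.subset("training"), ToyDataset)

print(split_sizes)
```

Because the subset is the same kind of object as its parent, anything that accepts the full dataset would also accept a split.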
@salu133445 salu133445 added enhancement New feature or request good first issue Good for newcomers labels Dec 25, 2020
@cifkao
Contributor

cifkao commented Dec 25, 2020

I also need to load a dataset that has splits. I added a part parameter to my constructor:

from pathlib import Path
from typing import Optional, Union

import muspy


class Groove2GrooveDataset(muspy.RemoteFolderDataset):

    ...

    def __init__(
        self,
        root: Union[str, Path],
        download_and_extract: bool = False,
        cleanup: bool = False,
        convert: bool = False,
        kind: str = "json",
        n_jobs: int = 1,
        ignore_exceptions: bool = True,
        use_converted: Optional[bool] = None,
        part: str = "train",
    ):
        # Remember the requested split so other members can refer to it.
        self.part = part
        muspy.RemoteFolderDataset.__init__(
            self, root=root, download_and_extract=download_and_extract,
            cleanup=cleanup, convert=convert, kind=kind, n_jobs=n_jobs,
            ignore_exceptions=ignore_exceptions, use_converted=use_converted)

        # Only pick up the files that belong to the requested split.
        path = self.root / "groove2groove-data-v1.0.0" / "midi" / part / "fixed"
        self.raw_filenames = sorted(path.rglob("*." + self._extension))
        self._filenames = self.raw_filenames

However, this doesn't work:

>>> test_data = Groove2GrooveDataset('/tmp/groove2groove-data', part='test', download_and_extract=True)  # OK
>>> test_data.convert()  # OK
>>> val_data = Groove2GrooveDataset('/tmp/groove2groove-data', part='val')  # OK, reuses downloaded data
>>> val_data.convert()  # not OK, skips conversion as '_converted' already contains the test data

Edit: Overriding converted_dir fixed it:

    @property
    def converted_dir(self):
        return self.root / "_converted_{}".format(self.part)
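
To make the failure and the fix concrete, here is a small self-contained sketch. It does not use MusPy: `ToySplitDataset` and its `per_part_dir` flag are hypothetical, and the "skip conversion when the converted directory already exists" rule is a simplified assumption about the behavior described above, not the library's actual check.

```python
import tempfile
from pathlib import Path


class ToySplitDataset:
    """Hypothetical stand-in that mimics the skip-if-converted behavior."""

    def __init__(self, root, part, per_part_dir):
        self.root = Path(root)
        self.part = part
        self.per_part_dir = per_part_dir

    @property
    def converted_dir(self):
        # With per_part_dir set, every split gets its own output directory.
        if self.per_part_dir:
            return self.root / f"_converted_{self.part}"
        return self.root / "_converted"

    def convert(self):
        """Return True if conversion ran, False if it was skipped."""
        if self.converted_dir.is_dir():
            return False  # an earlier run already claimed this directory
        self.converted_dir.mkdir(parents=True)
        return True


def run(per_part_dir):
    """Convert the test split, then the val split, under one root."""
    with tempfile.TemporaryDirectory() as tmp:
        test_ran = ToySplitDataset(tmp, "test", per_part_dir).convert()
        val_ran = ToySplitDataset(tmp, "val", per_part_dir).convert()
    return test_ran, val_ran


shared = run(per_part_dir=False)
separate = run(per_part_dir=True)
print(shared)    # (True, False): the val split is skipped
print(separate)  # (True, True): each split is converted
```

The trade-off of this approach is that each split keeps its own converted copy on disk, which is fine as long as the splits do not overlap.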

@salu133445
Owner Author

I see your point. We could have a Subset class for this, which does not have a convert method but can be iterated over just like a regular dataset. The key is that a subset always shares its data with its parent dataset; the only difference is which filenames it looks for.

So this would work:

nes = muspy.NESMusicDatabase("data/nes/")
nes.convert()
training_set = nes.subset("training")

And this won't work, because a subset has no convert method:

nes = muspy.NESMusicDatabase("data/nes/")
training_set = nes.subset("training")
training_set.convert()  # error
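
A minimal sketch of the Subset idea, under stated assumptions: `ToyDataset`, its `splits` mapping, and the index-based `Subset` below are all hypothetical and not part of MusPy. The subset holds a reference to its parent plus the indices it covers, so no data is copied, and since it defines no convert method, calling one raises an `AttributeError`.

```python
class ToyDataset:
    """Hypothetical parent dataset that owns the (converted) data."""

    def __init__(self, items, splits):
        self.items = list(items)
        self.splits = splits  # maps a split name to a list of indices
        self.converted = False

    def convert(self):
        self.converted = True
        return self

    def __len__(self):
        return len(self.items)

    def __getitem__(self, index):
        return self.items[index]

    def subset(self, name):
        return Subset(self, self.splits[name])


class Subset:
    """View over a parent dataset; deliberately has no convert method."""

    def __init__(self, parent, indices):
        self.parent = parent
        self.indices = list(indices)

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, index):
        # Delegate to the parent so the data is shared, not copied.
        return self.parent[self.indices[index]]

    def __iter__(self):
        return (self.parent[i] for i in self.indices)


nes = ToyDataset(
    ["a", "b", "c", "d"],
    {"training": [0, 1], "validation": [2], "test": [3]},
)
nes.convert()
training_set = nes.subset("training")
training_items = list(training_set)

try:
    training_set.convert()
    convert_failed = False
except AttributeError:
    convert_failed = True
```

Keeping conversion on the parent only means there is a single converted copy of the data, and every split sees it as soon as the parent has been converted.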


2 participants