[Bug] Using multiple datasets with custom formatter fails after the first dataset. #290

Open
ivuorio opened this issue Feb 6, 2025 · 1 comment
Labels: bug (Something isn't working)

ivuorio commented Feb 6, 2025

Describe the bug

I am training a YourTTS model using multiple internal datasets that all use the same custom formatter. I hit two problems:

  1. The embedding computation compute_embeddings() does not accept a custom formatter as an input. Solving this seems trivial: just add a formatter kwarg to the method and pass it onwards to load_tts_samples().
  2. load_tts_samples() only uses the given custom formatter for the first dataset, because the formatter is reset to None at the end of each loop iteration (see the sketch below). Thus only the first dataset uses the custom formatter, and the second one falls back to a formatter looked up by the name in the config.
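
For illustration, the per-dataset loop in load_tts_samples() behaves roughly like this (a simplified paraphrase of the loop in TTS/tts/datasets/__init__.py, not the exact upstream code):

# Simplified sketch of the per-dataset loop in load_tts_samples().
for dataset in datasets:
    formatter_name = dataset["formatter"]
    if formatter is None:
        # fall back to the formatter named in the dataset config
        formatter = _get_formatter_by_name(formatter_name)
    meta_data_train = formatter(root_path, meta_file_train, ignored_speakers=ignored_speakers)
    # ... eval split handling ...
    formatter = None  # <-- resets the custom formatter, so every later dataset loses it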

To Reproduce

  1. Define multiple datasets with a custom formatter in your training script (for me, the YourTTS recipe from the VCTK directory), e.g. (a fuller formatter sketch follows after these steps):
from TTS.tts.configs.shared_configs import BaseDatasetConfig

def custom_formatter(root_path, meta_file, **kwargs):
    """Format the dataset to coqui format"""
    pass


dataset_config = BaseDatasetConfig(
    formatter='custom', meta_file_train="", language="fi-fi", path='./data/data_custom.json'
)

dataset_config_2 = BaseDatasetConfig(
    formatter='custom', meta_file_train="", language="fi-fi", path='./data/data_custom2.json' 
) 

DATASETS_CONFIG_LIST = [dataset_config, dataset_config_2]
  2. Run uv run python recipes/vctk/yourtts/train_yourtts_jackson_finetune.py; computing the embeddings fails with a missing formatter.
  3. Fix compute_embeddings() to support a custom formatter; formatting of the second dataset still fails.
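
For reference, a complete custom formatter for a JSON manifest like the one above could look as follows. This is only a sketch: the manifest layout (a list of entries with "audio", "text", and "speaker" keys) is a hypothetical example, but the returned list of dicts with "text", "audio_file", "speaker_name", and "root_path" keys matches what the built-in formatters produce. Note that with the config above, load_tts_samples() passes path (the JSON file itself) to the formatter as root_path.

import json
import os

def custom_formatter(root_path, meta_file, **kwargs):
    """Load a JSON manifest and return samples in coqui format."""
    # root_path is './data/data_custom.json' with the config above;
    # the entry keys below are a hypothetical manifest layout.
    with open(root_path, encoding="utf-8") as f:
        entries = json.load(f)
    samples = []
    for entry in entries:
        samples.append({
            "text": entry["text"],
            "audio_file": entry["audio"],
            "speaker_name": entry["speaker"],
            "root_path": os.path.dirname(root_path),
        })
    return samples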

Expected behavior

No response

Logs

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 4090"
        ],
        "available": true,
        "version": "12.4"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.5.1+cu124",
        "TTS": "0.25.3",
        "numpy": "1.26.4"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.12.5",
        "version": "#98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023"
    }
}

Additional context

Problem 1, as said, should be trivial, and adding the kwarg with a default of None should not cause any side effects.
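
A minimal sketch of that fix, assuming compute_embeddings() builds its dataset configs and calls load_tts_samples() internally (parameter names other than formatter are illustrative, not the actual signature in TTS/bin/compute_embeddings.py):

# Hedged sketch: pass a formatter kwarg through to load_tts_samples().
def compute_embeddings(model_path, config_path, output_path, formatter=None, **kwargs):
    ...
    meta_data_train, meta_data_eval = load_tts_samples(
        dataset_configs,        # built from the config as before
        eval_split=False,
        formatter=formatter,    # forwarded instead of being unsupported
    )
    ...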

Problem 2 might require some discussion. For our usage I made a fix where, if the formatter argument is given, it is used for all the datasets, but I started wondering whether this is actually the best solution: it would not work in a situation where a set of different formatters is needed. For example, you might want to fine-tune a model that was base-trained on the VCTK dataset with a custom-formatted dataset, and include some of the original VCTK data to avoid overfitting to the new speaker.

Another way to solve this would be to change the custom dataset formatter documentation to suggest monkey patching the dataset object, but to my understanding that is not considered best practice in the Python community. Or add an API for adding and getting formatters (sketched below).
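
To make the registry idea concrete, a hypothetical API could look like the following; none of these names (register_formatter, get_formatter_by_name, _CUSTOM_FORMATTERS) exist in the library today:

from typing import Callable

from TTS.tts.datasets import formatters

# Hypothetical registry; nothing like this exists upstream yet.
_CUSTOM_FORMATTERS: dict[str, Callable] = {}

def register_formatter(name: str, formatter: Callable) -> None:
    """Register a user-defined formatter under a name usable in BaseDatasetConfig."""
    _CUSTOM_FORMATTERS[name] = formatter

def get_formatter_by_name(name: str) -> Callable:
    """Prefer user-registered formatters, then fall back to the built-ins."""
    if name in _CUSTOM_FORMATTERS:
        return _CUSTOM_FORMATTERS[name]
    return getattr(formatters, name.lower())

With that, calling register_formatter("custom", custom_formatter) before building the configs would let formatter="custom" resolve without monkey patching.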

I would love to contribute on this, but I think input from others would help to reach a desired outcome. Any takes on this? Would it be easier to discuss if I open a pull request with an example fix?

ivuorio added the bug label Feb 6, 2025
eginhard (Member) commented Feb 7, 2025

Yes, we have also encountered this. PRs in that direction would be very welcome, maybe separately for these two issues.

For the second issue, it's been a while since I looked into it, but maybe optionally accepting a list of custom formatters is feasible, matching the length of datasets, to allow setting a different formatter for each:

datasets: list[dict] | dict,
eval_split=True,
formatter: Callable = None,
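
A hedged sketch of how that could look inside load_tts_samples() (the loop body is paraphrased; only the per-dataset formatter selection matters here):

# Sketch: accept a single formatter or a list matching `datasets`.
def load_tts_samples(datasets, eval_split=True, formatter=None):
    if not isinstance(datasets, list):
        datasets = [datasets]
    if formatter is not None and not isinstance(formatter, list):
        formatter = [formatter] * len(datasets)  # reuse one formatter everywhere
    if formatter is not None and len(formatter) != len(datasets):
        raise ValueError("Number of formatters must match number of datasets.")
    for idx, dataset in enumerate(datasets):
        dataset_formatter = (
            formatter[idx]
            if formatter is not None
            else _get_formatter_by_name(dataset["formatter"])
        )
        # ... load this dataset's samples with dataset_formatter,
        # without ever resetting it to None ...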
