[assistance] Confirmation on Data Format and Structure for Fine-Tuning #141

@IrisSally

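Confirmation checklist

  • I have read the README.md and dependencies.md files
  • I have confirmed that no previous issue or discussion covers this bug
  • I have confirmed that the problem occurs in the latest code or a stable version
  • I have confirmed that the problem is not related to the API
  • I have confirmed that the problem is not related to the WebUI
  • I have confirmed that the problem is not related to Finetune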

Your issue

Hi,

I am planning to fine-tune ChatTTS using my own dataset, and I would like to confirm a few details regarding the data format and requirements.

1. Data Structure and .list File Format

Based on the documentation and examples, I have organized my data as follows:

File Structure

datasets/
└── data_speaker_a/
    ├── speaker_a/
    │   ├── 1.wav
    │   ├── 2.wav
    │   └── ... (more audio files)
    └── speaker_a.list

.list File Format

Each line in the .list file is formatted as filepath|speaker|lang|text, where:

  • filepath: Relative path to the audio file (relative to the directory containing the .list file).
  • speaker: Name of the speaker.
  • lang: Language code (e.g., ZH for Chinese, EN for English).
  • text: Transcription of the audio content.

Example:

speaker_a/1.wav|John|ZH|你好
speaker_a/2.wav|John|EN|Hello

Could you please confirm if this structure and format are correct?
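
To make the question concrete, here is a minimal sketch of how I read the format described above. The names `ListEntry`, `parse_list_line`, and `load_list` are my own and are not taken from the ChatTTS codebase; the sketch also assumes that the text field may itself contain a `|`, so it splits on the first three separators only.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ListEntry:
    filepath: Path  # resolved against the directory holding the .list file
    speaker: str
    lang: str
    text: str


def parse_list_line(line: str, base_dir: Path = Path(".")) -> ListEntry:
    """Parse one 'filepath|speaker|lang|text' line (hypothetical helper)."""
    # Split on the first three pipes only, so a "|" inside the text survives.
    parts = line.rstrip("\n").split("|", 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {line!r}")
    filepath, speaker, lang, text = parts
    return ListEntry(base_dir / filepath, speaker, lang, text)


def load_list(list_path: Path) -> list[ListEntry]:
    """Read every non-empty line of a .list file (hypothetical helper)."""
    base_dir = list_path.parent
    with list_path.open(encoding="utf-8") as f:
        return [parse_list_line(line, base_dir) for line in f if line.strip()]
```

With the layout above, `load_list(Path("datasets/data_speaker_a/speaker_a.list"))` would resolve the first entry to `datasets/data_speaker_a/speaker_a/1.wav`. If the actual loader resolves paths differently, that is exactly the detail I would like to have confirmed.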

2. Audio Data Specifications

I am planning to use 100 audio files, each approximately 10 seconds long, with a sampling rate of 24000 Hz for training.

Is this a suitable setup for fine-tuning the model? Are there any specific recommendations or requirements?
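
For context, 100 clips of about 10 seconds each comes to roughly 1,000 seconds (about 17 minutes) of audio. Below is a minimal sketch of how I plan to verify the files before training. It uses only Python's standard `wave` module, so it assumes uncompressed PCM WAV files. The function names are my own, and the 24000 Hz default is simply the rate I intend to use, not a confirmed requirement of the project.

```python
import wave
from pathlib import Path


def wav_info(path):
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as w:
        rate = w.getframerate()
        duration = w.getnframes() / rate
    return rate, duration


def check_dataset(wav_dir, expected_rate=24000):
    """Sum clip durations and list files whose sample rate differs.

    expected_rate is an assumption taken from my planned setup.
    """
    total_seconds = 0.0
    mismatched = []  # (file name, actual sample rate)
    for p in sorted(Path(wav_dir).glob("*.wav")):
        rate, duration = wav_info(p)
        total_seconds += duration
        if rate != expected_rate:
            mismatched.append((p.name, rate))
    return total_seconds, mismatched
```

Calling `check_dataset("datasets/data_speaker_a/speaker_a")` returns the total duration in seconds and any files that would need resampling.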

Thank you for your assistance!
