@rfejgin rfejgin commented Oct 13, 2025

Added two metrics to infer_and_evaluate.py and renamed one command-line option:

  • Add MOS estimation using UTMOSv2
  • Add a metric that tracks total generated audio duration per dataset. We can use this as a lightweight indicator of speaking rate.
  • Rename a command line option for consistency: disable_fcd --> --no_fcd
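As a rough illustration of the renamed option, here is a minimal `argparse` sketch. It assumes `--no_fcd` is a boolean switch that disables the FCD metric. The parser, help text, and `compute_fcd` variable are hypothetical and are not taken from infer_and_evaluate.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build an illustrative parser; only the flag name comes from the PR."""
    parser = argparse.ArgumentParser(description="Illustrative evaluation CLI")
    # Assumption: the option is a simple on/off switch (store_true).
    parser.add_argument(
        "--no_fcd",
        action="store_true",
        help="Skip the FCD metric.",
    )
    return parser


# Passing the flag turns the metric off; omitting it leaves the metric on.
args = build_parser().parse_args(["--no_fcd"])
compute_fcd = not args.no_fcd
```

With this shape, the rest of the script would branch on `compute_fcd` rather than on the raw flag, which keeps the negation in one place.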

- Add UTMOSv2 MOS estimation
- Track total generated duration per dataset, which we'll use as an
  indicator of speech rate.
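The duration metric can be sketched as a sum over the generated files. This is only an illustration under stated assumptions: it uses the standard-library `wave` module (uncompressed PCM WAV only) and a flat directory of `.wav` files. The function names are hypothetical, and the actual script may read audio with a different library.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path: Path) -> float:
    """Return the duration of one PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def total_duration_seconds(audio_dir: Path) -> float:
    """Sum the durations of all .wav files directly inside audio_dir."""
    return sum(wav_duration_seconds(p) for p in sorted(audio_dir.glob("*.wav")))
```

Because the evaluated text is fixed for a given dataset, a larger total for the same dataset suggests slower speech and a smaller total suggests faster speech, which is why the sum can serve as a lightweight speaking-rate indicator.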

Signed-off-by: Fejgin, Roy <[email protected]>
@github-actions github-actions bot added the TTS label Oct 13, 2025
@rfejgin rfejgin changed the title from "Inference metrics improvements: UTMOSv2 and total generated duration tracking" to "Inference metrics improvements" Oct 13, 2025
@rfejgin rfejgin marked this pull request as ready for review October 13, 2025 21:27
Until we add this to our docker image, temporarily install it each time
the (relevant) CI tests are run.

Signed-off-by: Fejgin, Roy <[email protected]>
  uses: actions/checkout@v4
# Temporarily install this manually until we add it to the docker image.
- name: Install UTMOSv2  # Needed by evaluate_generated_audio.py.
  run: pip install git+https://github.com/sarulab-speech/UTMOSv2.git
Collaborator

Should we use a specific commit, or are we comfortable with using their latest main branch?

Collaborator Author

Yeah, I was wondering the same. But I looked at their recent commits and it's basically infrequent maintenance updates (e.g. compatibility with new versions of torch). Probably less than 1 commit a month on average. So I tend to think we benefit more from using latest on main branch since it should have bugfixes etc. but minimal churn from updates.

Collaborator Author

Anyways: thought about it some more and decided to update pip-requirements to pin UTMOSv2 to its latest release, v1.2.1.
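One way to confirm that a pin like this actually took effect in an environment is to compare the installed version against the expected one at startup. The sketch below uses the standard-library `importlib.metadata`. The helper name is hypothetical, and the distribution name and exact version string passed in (for example `"utmosv2"` and `"1.2.1"`) are assumptions that should be checked against the package's own metadata.

```python
from importlib.metadata import PackageNotFoundError, version


def matches_pin(distribution: str, expected: str) -> bool:
    """Return True only if the distribution is installed at exactly `expected`."""
    try:
        return version(distribution) == expected
    except PackageNotFoundError:
        # Not installed at all, so the pin cannot be satisfied.
        return False
```

A caller could log a warning when `matches_pin(...)` is false instead of failing hard, so that a drifted environment is visible without blocking evaluation.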

Signed-off-by: Fejgin, Roy <[email protected]>
And also pin to version 1.2.1.

Signed-off-by: Fejgin, Roy <[email protected]>
def compute_utmosv2_scores(audio_dir):
    print(f"\nComputing UTMOSv2 scores for files in {audio_dir}...")
    start_time = time.time()
    utmosv2_calculator = UTMOSv2Calculator()
Collaborator

Can we use the same "device" as on line 200: device = "cuda"?

Collaborator Author

Good catch! Fixed.
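The general pattern behind this fix is to choose the device once and pass it explicitly to every component that needs it. The sketch below is hedged: `StubCalculator` is a hypothetical stand-in, since the real `UTMOSv2Calculator` constructor signature is not shown in this thread, and `build_calculator` is likewise illustrative.

```python
class StubCalculator:
    """Hypothetical stand-in for a metric calculator that owns a model."""

    def __init__(self, device: str = "cpu"):
        # A default like this is what makes an omitted argument easy to miss.
        self.device = device


def build_calculator(device: str) -> StubCalculator:
    """Construct the calculator on the caller's device, never the default."""
    return StubCalculator(device=device)


device = "cuda"  # chosen once by the caller, as in the reviewed script
calculator = build_calculator(device)
```

Making `device` a required parameter of the helper means a caller cannot silently fall back to the constructor default.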

Collaborator

@blisc blisc left a comment

Left one final comment, but otherwise good to go

1. Bugfix: use correct device for UTMOSv2 model
2. Removed the restriction of running UTMOSv2 only on datasets with fewer than 200 entries. Reason: after optimizing how UTMOSv2 is run it is now much faster; adding UTMOSv2 only took a 2000-utterance dataset from 25 minutes to 28 minutes. Getting the metric is worth the small additional runtime.

Signed-off-by: Fejgin, Roy <[email protected]>
Collaborator Author

rfejgin commented Oct 20, 2025

I removed the restriction limiting UTMOSv2 to datasets with under 200 utterances. That's because it's a lot faster to run now that it is batched and the inference workers and threads are tuned. For a 2000-utterance dataset, UTMOSv2 only increased test time from ~25 min to 28 min, which seems worth it.
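The batching idea can be sketched generically: split the file list into fixed-size chunks and score each chunk in one call. This is not the PR's implementation. The `score_batch` callable is a hypothetical placeholder for whatever runs the model on several files at once, and the default batch size is arbitrary.

```python
from typing import Callable, Iterator, List, Sequence


def batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def score_in_batches(
    paths: Sequence[str],
    score_batch: Callable[[List[str]], List[float]],
    batch_size: int = 16,
) -> List[float]:
    """Score all paths, calling score_batch once per chunk, preserving order."""
    scores: List[float] = []
    for batch in batches(paths, batch_size):
        scores.extend(score_batch(batch))
    return scores
```

The benefit comes from amortizing per-call overhead across each chunk; the best batch size and worker count depend on the hardware and would need to be tuned empirically.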

Will merge once CI passes (excluding unrelated known CI issue with coverage calculation).

@rfejgin rfejgin enabled auto-merge (squash) October 20, 2025 22:38
@blisc blisc disabled auto-merge October 20, 2025 23:00
@blisc blisc merged commit 066d622 into NVIDIA-NeMo:magpietts_2508 Oct 20, 2025
63 of 66 checks passed