Add changes for uroman package to handle non-Roman characters #32404

nandwalritik · 2024-08-03T11:55:49Z

What does this PR do?

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link
to it if that's the case. [mms tts] add uroman as a soft-dependency #32387
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings. - No
Did you write any new necessary tests? - No

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@sanchit-gandhi @ylacombe

I have added the change to use uroman as an optional dependency. I just want to know whether it is implemented correctly.

sanchit-gandhi

Super first draft @nandwalritik! It's very close to being ready 🤗

My only issue with the PR is regarding backwards compatibility (explained below). Otherwise, my only request is slightly more verbose installation instructions in the docs.

Other than that, the PR looks great!

sanchit-gandhi · 2024-08-07T13:12:27Z

src/transformers/models/vits/tokenization_vits.py

@@ -172,11 +174,12 @@ def prepare_for_tokenization(
        filtered_text = self._preprocess_char(text)

        if has_non_roman_characters(filtered_text) and self.is_uroman:
-            logger.warning(


This is my only qualm with the PR: it's currently not backwards compatible.

Previously, if a user omitted the uroman step and input non-Roman characters, they would only have gotten a simple warning. Now, the code will error out and stop them from doing this (i.e. breaking backwards compatibility).

I would be in favour of keeping the logger.warning (rather than an ImportError) and advising the user to install the uroman package

    logger.warning(
        "Text to the tokenizer contains non-Roman characters. To apply the `uroman` pre-processing "
        "step automatically, ensure the `uroman` Romanizer is installed with: `pip install uroman` "
        "Otherwise, apply the Romanizer manually as per the instructions: https://github.com/isi-nlp/uroman"
    )

Sorry, I didn't notice the default behaviour; I have modified it accordingly.
Now it will only give a warning if the uroman package is not installed.
Thanks for the input.

Looks great after the changes!
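
For readers following this thread, here is a minimal sketch of the warn-instead-of-raise behaviour that was agreed on. It is not the merged code: `prepare_text` and its `romanizer` parameter are hypothetical names, and `has_non_roman_characters` is a simplified stand-in that treats any non-ASCII character as non-Roman. The Romanizer is injected as a plain callable so the sketch runs without uroman installed.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def has_non_roman_characters(text: str) -> bool:
    # Simplified stand-in (assumption): any non-ASCII character counts as non-Roman.
    return any(ord(ch) > 127 for ch in text)


def prepare_text(
    text: str,
    is_uroman: bool,
    romanizer: Optional[Callable[[str], str]] = None,
) -> str:
    """Romanize when a romanizer is available; otherwise warn and pass the text through."""
    if not (is_uroman and has_non_roman_characters(text)):
        return text
    if romanizer is None:
        # Warn rather than raise, so callers that never used uroman keep working.
        logger.warning(
            "Text to the tokenizer contains non-Roman characters. To apply the `uroman` "
            "pre-processing step automatically, install it with: `pip install uroman`. "
            "Otherwise, apply the Romanizer manually: https://github.com/isi-nlp/uroman"
        )
        return text
    return romanizer(text)
```

With `romanizer=None` the input comes back unchanged and only a warning is logged, which matches the pre-PR behaviour the review asked to preserve.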

sanchit-gandhi · 2024-08-07T13:13:15Z

docs/source/en/model_doc/vits.md

-You can then pre-process the text input using the following code snippet. You can either rely on using the bash variable 
-`UROMAN` to point to the uroman repository, or you can pass the uroman directory as an argument to the `uromaize` function:
-
+If the is_uroman attribute is True, the tokenizer will automatically apply the uroman package to your text inputs. 


Can we encourage users to install uroman here and explain why it's required?

pip install --upgrade uroman

Added the changes to the docs.

sanchit-gandhi · 2024-08-07T13:13:39Z

src/transformers/utils/import_utils.py

@@ -1054,6 +1055,10 @@ def is_phonemizer_available():
    return _phonemizer_available


+def is_uroman_available():


Awesome job here!
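
The diff above shows only the signature of `is_uroman_available`. As a rough sketch of how a soft-dependency check of this kind can work, assuming a module-level cached flag in the style of `_phonemizer_available` (the helper `is_package_available` is a hypothetical name, and the actual implementation in `import_utils.py` may differ):

```python
import importlib.util


def is_package_available(name: str) -> bool:
    """Return True if a top-level package can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


# Evaluated once at import time, so later calls are a cheap flag lookup.
_uroman_available = is_package_available("uroman")


def is_uroman_available() -> bool:
    return _uroman_available
```

Checking with `find_spec` avoids importing the package just to learn whether it is installed.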

sanchit-gandhi · 2024-08-07T13:13:52Z

docs/source/en/model_doc/vits.md

-uromaized_text = uromanize(text, uroman_path=os.environ["UROMAN"])
-
-inputs = tokenizer(text=uromaized_text, return_tensors="pt")
+inputs = tokenizer(text=text, return_tensors="pt")


So much simpler!

nandwalritik · 2024-08-07T14:31:41Z

Super first draft @nandwalritik! It's very close to being ready 🤗

My only issue with the PR is regarding backwards compatibility (explained below). Otherwise, my only request is slightly more verbose installation instructions in the docs.

Other than that, the PR looks great!

Just one question: when I installed uroman, it required Python >=3.10.
Do we need to mention that somewhere?

sanchit-gandhi

Great work @nandwalritik! Regarding the requirement of python>=3.10 - let's mention it in the documentation and logging message. If the user has python<3.10, perhaps we can advise them to use the old method of applying uroman? E.g. using the perl code snippet outside of the tokenizer?
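
One way to picture the version-dependent advice suggested here is the sketch below. The function name `uroman_install_hint` and the message wording are hypothetical; only the Python >=3.10 requirement and the fallback to the Perl script come from the discussion.

```python
import sys

# Minimum Python version reported for the uroman package in this thread.
UROMAN_MIN_PYTHON = (3, 10)


def uroman_install_hint(version_info=sys.version_info) -> str:
    """Pick the advice to show, based on the running Python version."""
    if tuple(version_info[:2]) >= UROMAN_MIN_PYTHON:
        return "Install the `uroman` package with: pip install uroman"
    # Older interpreters cannot install the package, so point to the manual route.
    return (
        "The `uroman` package requires python>=3.10. Apply the Perl version of uroman "
        "manually before tokenizing: https://github.com/isi-nlp/uroman"
    )
```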

sanchit-gandhi · 2024-08-07T15:28:01Z

Feel free to request a final review directly from @ArthurZucker when the last changes have been made @nandwalritik :)

nandwalritik · 2024-08-08T15:57:35Z

Hi @ArthurZucker, can you give the PR a final review? I have already updated the docs as per sanchit's input on backward compatibility.

ArthurZucker

Sorry for my late review, this looks great to me!

HuggingFaceDocBuilderDev · 2024-08-26T15:27:18Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ArthurZucker requested a review from sanchit-gandhi August 5, 2024 08:38

sanchit-gandhi reviewed Aug 7, 2024

sanchit-gandhi approved these changes Aug 7, 2024

nandwalritik added 6 commits August 8, 2024 21:21

Add changes for uroman package to handle non-Roman characters

f7cea15

Update docs for uroman changes

de3f1e5

Modifying error message to warning, for backward compatibility

fb03d30

Update instruction for user to install uroman

beb2236

Update docs for uroman python version dependency and backward compatibility

ac2d81c

Update warning message for python version compatibility with uroman

7a7ef2d

nandwalritik force-pushed the add_uroman branch from ebb09fd to 7a7ef2d Compare August 8, 2024 15:51

Refine docs

b413053

ArthurZucker approved these changes Aug 26, 2024

ArthurZucker merged commit a378a54 into huggingface:main Aug 26, 2024
22 checks passed
