Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add changes for uroman package to handle non-Roman characters #32404

Merged
merged 7 commits into from
Aug 26, 2024

Conversation

nandwalritik
Copy link
Contributor

@nandwalritik nandwalritik commented Aug 3, 2024

What does this PR do?

Fixes #32387

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@sanchit-gandhi @ylacombe

I have added the change for using uroman as optional dependency, just want to know if it's correctly implemented.

Copy link
Contributor

@sanchit-gandhi sanchit-gandhi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super first draft @nandwalritik! It's very close to being ready 🤗

My only issue with the PR is regarding backwards compatibility (explained below). Otherwise, my only request is slightly more verbose installation instructions in the docs.

Other than that, the PR looks great!

@@ -172,11 +174,12 @@ def prepare_for_tokenization(
filtered_text = self._preprocess_char(text)

if has_non_roman_characters(filtered_text) and self.is_uroman:
logger.warning(
Copy link
Contributor

@sanchit-gandhi sanchit-gandhi Aug 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is my only qualm with the PR: it's currently not backwards compatible.

Previously, if a user was omitting the uroman step and inputting non-roman characters, they would have only get a simple warning. Now, the code will error out and stop them from doing this (i.e. breaking backwards compatibility).

I would be in favour of keeping the logger.warning (rather than an ImportError) and advising the user to install the uroman package

logger.warning(
    "Text to the tokenizer contains non-Roman characters. To apply the `uroman` pre-processing "
    "step automatically, ensure the `uroman` Romanizer is installed with: `pip install uroman`
    "Otherwise, apply the Romanizer manually as per the instructions: https://github.com/isi-nlp/uroman
)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, didn't noticed the default behaviour, I have modified accordingly.
Now it will only give warning in case the uroman package is not installed.
Thanks for the inputs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great after the changes!

You can then pre-process the text input using the following code snippet. You can either rely on using the bash variable
`UROMAN` to point to the uroman repository, or you can pass the uroman directory as an argument to the `uromaize` function:

If the is_uroman attribute is True, the tokenizer will automatically apply the uroman package to your text inputs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we encourage users to install uroman here and explain why it's required?

pip install --upgrade uroman

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added changes for docs.

@@ -1054,6 +1055,10 @@ def is_phonemizer_available():
return _phonemizer_available


def is_uroman_available():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome job here!

uromaized_text = uromanize(text, uroman_path=os.environ["UROMAN"])

inputs = tokenizer(text=uromaized_text, return_tensors="pt")
inputs = tokenizer(text=text, return_tensors="pt")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So much simpler!

@nandwalritik
Copy link
Contributor Author

Super first draft @nandwalritik! It's very close to being ready 🤗

My only issue with the PR is regarding backwards compatibility (explained below). Otherwise, my only request is slightly more verbose installation instructions in the docs.

Other than that, the PR looks great!

Just one doubt here, when I was installing uroman, It requires python version >=3.10.
Do we need to mention it somewhere?

Copy link
Contributor

@sanchit-gandhi sanchit-gandhi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @nandwalritik! Regarding the requirement of python>=3.10 - let's mention it in the documentation and logging message. If the user has python<3.10, perhaps we can advise them to use the old method of applying uroman? E.g. using the perl code snippet outside of the tokenizer?

@sanchit-gandhi
Copy link
Contributor

Feel free to request a final review directly from @ArthurZucker when the last changes have been made @nandwalritik :)

@nandwalritik
Copy link
Contributor Author

nandwalritik commented Aug 8, 2024

Hi @ArthurZucker can you give a final review to the PR, already updated docs as per sanchit's inputs for backward compatibility.

Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for my late review, this looks great to me!

@ArthurZucker ArthurZucker merged commit a378a54 into huggingface:main Aug 26, 2024
22 checks passed
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Aug 30, 2024
…gface#32404)

* Add changes for uroman package to handle non-Roman characters

* Update docs for uroman changes

* Modifying error message to warning, for backward compatibility

* Update instruction for user to install uroman

* Update docs for uroman python version dependency and backward compatibility

* Update warning message for python version compatibility with uroman

* Refine docs
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Aug 30, 2024
…gface#32404)

* Add changes for uroman package to handle non-Roman characters

* Update docs for uroman changes

* Modifying error message to warning, for backward compatibility

* Update instruction for user to install uroman

* Update docs for uroman python version dependency and backward compatibility

* Update warning message for python version compatibility with uroman

* Refine docs
itazap pushed a commit to NielsRogge/transformers that referenced this pull request Sep 20, 2024
…gface#32404)

* Add changes for uroman package to handle non-Roman characters

* Update docs for uroman changes

* Modifying error message to warning, for backward compatibility

* Update instruction for user to install uroman

* Update docs for uroman python version dependency and backward compatibility

* Update warning message for python version compatibility with uroman

* Refine docs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[mms tts] add uroman as a soft-dependency
4 participants