
Natively support SONAR text models as M2M100 encoder and decoder models #29646

Open · wants to merge 24 commits into base: main

Conversation


@avidale avidale commented Mar 13, 2024

What does this PR do?

This PR adds native support for SONAR text encoders and decoders (https://github.com/facebookresearch/SONAR).

SONAR for text is architecturally an NLLB model, but with the encoder representations mean-pooled into a single fixed-size vector before they are passed to the decoder. Thus, the SONAR encoder works as a sentence embedder, and thanks to pretraining on translation data, it is massively multilingual and language-agnostic. And, unlike other sentence encoders, this one has a decoder that can reconstruct the original texts from their embeddings.
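
For illustration, a minimal sketch of such mean pooling (attention-mask-weighted averaging of the encoder states); the function name and the exact masking convention are assumptions for this example, not the PR's implementation.

import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # last_hidden_state: (batch, seq_len, d_model); attention_mask: (batch, seq_len), 1 for real tokens.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    # Average only over non-padding positions to get one fixed-size vector per sentence.
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)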

To support such models natively, the easiest way is to create M2M100 (NLLB) model classes that are encoder-only or decoder-only, similar to the existing T5EncoderModel or MT5EncoderModel classes.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Details

  • I add two new public model classes, M2M100EncoderModel and M2M100DecoderModel.
  • M2M100EncoderModel is a typical encoder-only model (BERT-style), with an additional option of applying mean pooling to its outputs (this is how SONAR text embeddings are computed)
  • M2M100DecoderModel is a module consisting of M2M100Decoder and an output projection layer. As input, it always expects the encoder_outputs argument to be present and ignores input_ids (see the usage sketch after this list).
  • Unlike M2M100ForConditionalGeneration, M2M100DecoderModel always has its decoder input embedding and decoder output projection layers tied, because this is how the SONAR decoder was originally implemented.
  • I add a script for creating these models from the original fairseq2 checkpoints (it doesn't require fairseq2 as a dependency; instead, it just reads and reformats the torch model state dicts).
  • I add specialized unit tests for the encoder-only model (implemented following T5EncoderOnlyModelTest, see Add T5 Encoder for Feature Extraction #8717), and for the decoder-only model (based loosely on similar ideas, but with more tweaks).
  • I add an integration test based on the checkpoints that I published to the HF hub. They reproduce the example sentence encoding and decoding from the readme in the SONAR repo: https://github.com/facebookresearch/SONAR/tree/main.
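
To make the intended workflow concrete, here is a rough usage sketch of the two classes. The checkpoint identifiers are placeholders, and the exact keyword names (pool_last_hidden_state, encoder_outputs) and output attributes follow the discussion in this thread rather than a finalized API, so treat this as an illustration only.

import torch
from transformers import AutoTokenizer, M2M100DecoderModel, M2M100EncoderModel
from transformers.modeling_outputs import BaseModelOutput

# Placeholder checkpoint ids; the real SONAR checkpoints are the ones published on the HF hub.
tokenizer = AutoTokenizer.from_pretrained("path/to/sonar-text-encoder")
encoder = M2M100EncoderModel.from_pretrained("path/to/sonar-text-encoder")
decoder = M2M100DecoderModel.from_pretrained("path/to/sonar-text-decoder")

inputs = tokenizer(["My name is SONAR."], return_tensors="pt")
with torch.no_grad():
    # pool_last_hidden_state mean-pools the token states into one fixed-size sentence embedding.
    embeddings = encoder(**inputs, pool_last_hidden_state=True).last_hidden_state

# The decoder always requires encoder_outputs, so the sentence embeddings are wrapped
# into a BaseModelOutput before generation. If the pooled output has shape (batch, d_model),
# a singleton sequence dimension (embeddings.unsqueeze(1)) may be needed before wrapping.
generated = decoder.generate(encoder_outputs=BaseModelOutput(last_hidden_state=embeddings))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))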

Testing

All the unit tests I added are run by

python -m pytest tests/models/m2m_100/test_modeling_m2m_100.py

The integration tests that I added are marked as slow, so they can be run with

RUN_SLOW=1 python -m pytest tests/models/m2m_100/test_modeling_m2m_100.py::SonarIntegrationTests

@avidale avidale changed the title Natively support SONAR text models as M2M100 enocder and decoder models Natively support SONAR text models as M2M100 encoder and decoder models Mar 13, 2024
@amyeroberts (Collaborator)

Hi @avidale, thanks for opening this PR! Let us know when it's ready for review


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this Apr 25, 2024
@avidale (Author)

avidale commented Apr 25, 2024

Hey, I want to continue this work!
How do I reopen it?

@amyeroberts (Collaborator)

@avidale I can reopen for you

@amyeroberts amyeroberts reopened this Apr 25, 2024
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


@avidale (Author)

avidale commented May 23, 2024

Commenting to avoid it getting closed

@huggingface huggingface deleted a comment from github-actions bot Jun 16, 2024

@avidale (Author)

avidale commented Jul 11, 2024

Still planning to return to it


github-actions bot commented Aug 5, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@avidale avidale marked this pull request as ready for review August 5, 2024 15:53
@avidale (Author)

avidale commented Aug 5, 2024

Hi @amyeroberts!
The PR is mostly ready for review.
I am still uncertain about several things, but I would appreciate it if you commented on the current design!

One of the main questions: is it alright to always demand encoder_outputs in M2M100DecoderModel, or would a different representation of its inputs be preferable?

@avidale (Author)

avidale commented Aug 6, 2024

Also tagging @ArthurZucker as a potential reviewer

@avidale (Author)

avidale commented Aug 22, 2024

Tagging @ArthurZucker, @younesbelkada, and @amyeroberts once more, as potential reviewers.
Could you please give some feedback?

@ArthurZucker (Collaborator)

Yes! Sorry it has been on my stack for a while, and summer vacations hit!

@ArthurZucker ArthurZucker (Collaborator) left a comment

Cool that you have trained these models! (hello to Benoit Sagot, who used to be one of my teachers at MVA!)

Collaborator

Cool! Just wondering if this is copied from / similar to other models in the lib?

@avidale avidale (Author) Aug 28, 2024

The encoder-only part is partially copied from T5 encoder-only code.
The decoder-only part is partially copied from the full M2M100 encoder-decoder model (I am not aware of other decoder-only models in the lib that are conditioned on something outside the decoder).

Collaborator

Okay, for the encoder part it means you are missing the # Copied from wrapper! 🤗

Author

the encoder part it means you are missing the # Copied from wrapper!

I tried adding this wrapper, but because I made some modifications to the code, apart from just renaming the model from T5 to M2M100, the check make repo-consistency started complaining. So I had to remove the wrapper.

ref_embeddings = torch.tensor(
    [[-0.005286, 0.002008, -0.000562, 0.006344, 0.006329], [-0.000330, -0.007055, 0.007644, 0.001841, 0.003727]]
)
assert torch.allclose(embeddings[:, :5], ref_embeddings, rtol=1e-3)
Collaborator

let's not assert on the embeddings as they are bound to change depending on the model!

Author

I used the official SONAR model https://github.com/facebookresearch/SONAR/blob/main/sonar/cards/text_sonar_basic_encoder.yaml, which is the only SONAR text encoder released so far.

I intended this test only to reproduce how this particular model is converted (and to serve as a template if anyone ever applies my conversion script to other models).

Collaborator

(let's remove this still, specifically because the conversion script should allow anyone who has trained a model with your framework to convert it without hassle! We have integration tests that make sure embeddings or outputs are valid; conversion scripts are not the place for this!)

@avidale avidale (Author) Aug 29, 2024

The function test_conversion_accuracy is not part of the conversion script; it is an integration test for the conversion script. It is optional because it requires downloading the huge original checkpoints.
If you insist, though, I can remove it or move it to another file.

tests/models/m2m_100/test_modeling_m2m_100.py (resolved)
Comment on lines +1700 to +1705
self.shared = nn.Embedding(config.vocab_size, config.d_model)

encoder_config = copy.deepcopy(config)
encoder_config.use_cache = False
encoder_config.is_encoder_decoder = False
self.encoder = M2M100Encoder(encoder_config, self.shared)
Collaborator

this is a bit weird. If you pass the shared nn.Embedding, only the weights are used, and it's otherwise a sinusoidal:

if embed_tokens is not None:
    self.embed_tokens.weight = embed_tokens.weight

which raises the question of whether this whole class is needed!?

Author

This whole M2M100EncoderModel class has several rationales to exist:

  1. The additional code in forward for pooling the token embeddings into a sentence embedding.
  2. Having the same level of nesting of the parameters (model->encoder->its layers) makes it easier to convert a trained encoder-decoder model (having the architecture of M2M100/NLLB) into a pair of separated encoder and decoder: one only needs to select the required parameters from the state dict, without having to rename them (sketched after this list).
  3. We can put an appropriate docstring on it :-)
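
A rough sketch of point 2; seq2seq_model and encoder_only_model stand for already instantiated models, and the key prefixes are assumptions based on the M2M100Model-style layout quoted in this thread, not code from the PR.

# Split an encoder-decoder checkpoint into an encoder-only state dict by key selection,
# without renaming anything, because the parameter nesting stays the same.
seq2seq_state_dict = seq2seq_model.state_dict()  # an M2M100Model-style encoder-decoder
encoder_state_dict = {
    key: value
    for key, value in seq2seq_state_dict.items()
    if key.startswith(("encoder.", "shared."))  # assumed key prefixes
}
encoder_only_model.load_state_dict(encoder_state_dict, strict=False)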

Collaborator

Okay, IMO only 1. is a strong reason; I don't know how much pool_last_hidden_state will be used, but okay.
2. is all in the conversion script, no?

@avidale avidale (Author) Oct 24, 2024

  1. Applying pool_last_hidden_state is going to be the main use case: producing pooled sentence embeddings. I am setting this parameter to False by default only for consistency with other encoders.
  2. The conversion script takes care of the case when a fairseq2-trained encoder or decoder is converted to the HF format. We could also create another conversion script for extracting an encoder-only model from a trained encoder-decoder one, but this seems like overkill to me.

src/transformers/models/m2m_100/modeling_m2m_100.py (outdated, resolved)
src/transformers/models/m2m_100/modeling_m2m_100.py (outdated, resolved)
)

if encoder_outputs is None:
    raise ValueError("M2M100DecoderModel expects the `encoder_outputs` to be always present.")
Author

I am not sure whether we should try here to interpret input_ids as a batch of sentence embeddings, instead of its original intended meaning as a batch of token id sequences.

On the one hand, it would allow the model to function with a single non-keyword argument and would remove the need to wrap the sentence embeddings into a BaseModelOutput container.

On the other hand, it would introduce a series of new problems, because there are methods inherited from the base class that are supposed to process input_ids as tokens.

@avidale (Author)

avidale commented Oct 1, 2024

@ArthurZucker any other comments?
What do you think about my responses?

@ArthurZucker ArthurZucker (Collaborator) left a comment

Sorry for coming back so late! Just a few nits and let's go 🤗

Comment on lines 1609 to 1618
if (output_attentions or self.config.output_attentions) and encoder_outputs.attentions is None:
    # just for the sake of compatibility, adding fake encoder attentions
    encoder_outputs.attentions = [
        torch.ones(
            (batch_size, self.config.encoder_attention_heads, 1, 1),
            device=embeddings.device,
            dtype=embeddings.dtype,
        )
        for layer in range(self.config.encoder_layers)
    ]
Collaborator

Let's avoid this, we should just disable the output_attentions in the inputs!

Author

Well, disabling output_attentions may not be the perfect solution, because the self-attention in the decoder is still non-trivial, and the user may want to inspect it.
But I could just remove this code and disable the corresponding unit test.

Comment on lines 1620 to 1625
if (output_hidden_states or self.config.output_hidden_states) and encoder_outputs.hidden_states is None:
    # just for the sake of compatibility, adding fake encoder hidden states
    encoder_outputs.hidden_states = [
        torch.zeros((batch_size, 1, self.config.d_model), device=embeddings.device, dtype=embeddings.dtype)
        for layer in range(self.config.encoder_layers + 1)
    ]
Collaborator

same here this does not make sense imo!

Author

Same: the hidden states of the decoder are meaningful, so disabling output_hidden_states is not a good option, but I could just return empty hidden states for the encoder.

Comment on lines 1680 to 1687
@staticmethod
def _reorder_cache(past_key_values, beam_idx):
    reordered_past = ()
    for layer_past in past_key_values:
        reordered_past += (
            tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
        )
    return reordered_past
Collaborator

Suggested change: remove the _reorder_cache method above.

no longer needed!

Author

Great! Removed two implementations of _reorder_cache.

@avidale avidale (Author) Oct 24, 2024

Apparently, _reorder_cache is needed!
My integration tests are failing without it, giving a

NotImplementedError: Make sure that a `_reorder_cache` function is correctly implemented in transformers.models.m2m_100.modeling_m2m_100 to enable beam search for <class 'transformers.models.m2m_100.modeling_m2m_100.M2M100DecoderModel'>

It is still required for beam search.

So I'm adding this method back.


@avidale (Author)

avidale commented Nov 8, 2024

Hi @ArthurZucker!
I have responded to your last comments.
Some tests are failing, but they all seem to be unrelated to my code.
