
Natively support SONAR text models as M2M100 encoder and decoder models #29646

Open · wants to merge 24 commits into base: main

Conversation


@avidale avidale commented Mar 13, 2024

What does this PR do?

This PR adds native support for SONAR text encoders and decoders (https://github.com/facebookresearch/SONAR).

SONAR for text is architecturally an NLLB model, but with the encoder representations mean-pooled into a single fixed-size vector before they are passed to the decoder. Thus, the SONAR encoder works as a sentence embedder, and thanks to pretraining on translation data, it is massively multilingual and language-agnostic. And, unlike other sentence encoders, this one has a decoder that can reconstruct the original texts from their embeddings.
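
For illustration, a minimal sketch of such mean pooling (attention-mask-weighted averaging of the encoder states); the function name and the exact masking convention are assumptions for this example, not the PR's implementation.

import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # last_hidden_state: (batch, seq_len, d_model); attention_mask: (batch, seq_len), 1 for real tokens.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    # Average only over non-padding positions to get one fixed-size vector per sentence.
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)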

To support such models natively, the easiest way is to create M2M100 (NLLB) model classes that are encoder-only or decoder-only, similar to the existing T5EncoderModel or MT5EncoderModel classes.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Details

  • I add two new public model classes, M2M100EncoderModel and M2M100DecoderModel.
  • M2M100EncoderModel is a typical encoder-only model (BERT-style), with an additional option of applying mean pooling to its outputs (this is how SONAR text embeddings are computed)
  • M2M100DecoderModel is a module consisting of M2M100Decoder and an output projection layer. As input, it always expects the encoder_outputs argument to be present and ignores input_ids (see the usage sketch after this list).
  • Unlike M2M100ForConditionalGeneration, M2M100DecoderModel always has its decoder input embedding and decoder output projection layers tied, because this is how the SONAR decoder was originally implemented.
  • I add a script for creating these models from the original fairseq2 checkpoints (it doesn't require fairseq2 as a dependency; instead, it just reads and reformats the torch model state dicts).
  • I add specialized unit tests for the encoder-only model (implemented following T5EncoderOnlyModelTest, see Add T5 Encoder for Feature Extraction #8717), and for the decoder-only model (based loosely on similar ideas, but with more tweaks).
  • I add an integration test based on the checkpoints that I published to the HF hub. They reproduce the example sentence encoding and decoding from the readme in the SONAR repo: https://github.com/facebookresearch/SONAR/tree/main.
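
To make the intended workflow concrete, here is a rough usage sketch of the two classes. The checkpoint identifiers are placeholders, and the exact keyword names (pool_last_hidden_state, encoder_outputs) and output attributes follow the discussion in this thread rather than a finalized API, so treat this as an illustration only.

import torch
from transformers import AutoTokenizer, M2M100DecoderModel, M2M100EncoderModel
from transformers.modeling_outputs import BaseModelOutput

# Placeholder checkpoint ids; the real SONAR checkpoints are the ones published on the HF hub.
tokenizer = AutoTokenizer.from_pretrained("path/to/sonar-text-encoder")
encoder = M2M100EncoderModel.from_pretrained("path/to/sonar-text-encoder")
decoder = M2M100DecoderModel.from_pretrained("path/to/sonar-text-decoder")

inputs = tokenizer(["My name is SONAR."], return_tensors="pt")
with torch.no_grad():
    # pool_last_hidden_state mean-pools the token states into one fixed-size sentence embedding.
    embeddings = encoder(**inputs, pool_last_hidden_state=True).last_hidden_state

# The decoder always requires encoder_outputs, so the sentence embeddings are wrapped
# into a BaseModelOutput before generation. If the pooled output has shape (batch, d_model),
# a singleton sequence dimension (embeddings.unsqueeze(1)) may be needed before wrapping.
generated = decoder.generate(encoder_outputs=BaseModelOutput(last_hidden_state=embeddings))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))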

Testing

All the unit tests I added are run by

python -m pytest tests/models/m2m_100/test_modeling_m2m_100.py

The integration tests that I added are marked as slow, so they can be run with

RUN_SLOW=1 python -m pytest tests/models/m2m_100/test_modeling_m2m_100.py::SonarIntegrationTests

@avidale avidale changed the title Natively support SONAR text models as M2M100 enocder and decoder models Natively support SONAR text models as M2M100 encoder and decoder models Mar 13, 2024
@amyeroberts (Collaborator)

Hi @avidale, thanks for opening this PR! Let us know when it's ready for review


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this Apr 25, 2024
@avidale (Author)

avidale commented Apr 25, 2024

Hey, I want to continue this work!
How do I reopen it?

@amyeroberts (Collaborator)

@avidale I can reopen for you

@amyeroberts amyeroberts reopened this Apr 25, 2024
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


@avidale (Author)

avidale commented May 23, 2024

Commenting to avoid it getting closed

@huggingface huggingface deleted a comment from github-actions bot Jun 16, 2024

@avidale (Author)

avidale commented Jul 11, 2024

Still planning to return to it


github-actions bot commented Aug 5, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@avidale avidale marked this pull request as ready for review August 5, 2024 15:53
@avidale (Author)

avidale commented Aug 5, 2024

Hi @amyeroberts!
The PR is mostly ready for review.
I am still uncertain about several things, but I would appreciate it if you commented on the current design!

One of the main questions: is it alright to always demand encoder_outputs in M2M100DecoderModel, or would a different representation of its inputs be preferable?

@avidale (Author)

avidale commented Aug 6, 2024

Also tagging @ArthurZucker as a potential reviewer

@avidale (Author)

avidale commented Aug 22, 2024

Tagging @ArthurZucker, @younesbelkada, and @amyeroberts once more, as potential reviewers.
Could you please give some feedback?

@ArthurZucker (Collaborator)

Yes! Sorry it has been on my stack for a while, and summer vacations hit!

@ArthurZucker ArthurZucker (Collaborator) left a comment

Cool that you have trained these models! (hello to Benoit Sagot, who used to be one of my teachers at MVA!)

Collaborator

Cool! Just wondering if this is copied from / similar to other models in the lib?

@avidale avidale (Author) Aug 28, 2024

The encoder-only part is partially copied from T5 encoder-only code.
The decoder-only part is partially copied from the full M2M100 encoder-decoder model (I am not aware of other decoder-only models in the lib that are conditioned on something outside the decoder).

Collaborator

Okay, for the encoder part it means you are missing the # Copied from wrapper! 🤗

Author

the encoder part it means you are missing the # Copied from wrapper!

I tried adding this wrapper, but because I made some modifications to the code, apart from just renaming the model from T5 to M2M100, the check make repo-consistency started complaining. So I had to remove the wrapper.

ref_embeddings = torch.tensor(
    [[-0.005286, 0.002008, -0.000562, 0.006344, 0.006329], [-0.000330, -0.007055, 0.007644, 0.001841, 0.003727]]
)
assert torch.allclose(embeddings[:, :5], ref_embeddings, rtol=1e-3)
Collaborator

let's not assert on the embeddings as they are bound to change depending on the model!

Author

I used the official SONAR model https://github.com/facebookresearch/SONAR/blob/main/sonar/cards/text_sonar_basic_encoder.yaml, which is the only SONAR text encoder released so far.

I intended this test only to reproduce how this particular model is converted (and to serve as a template if anyone ever applies my conversion script to other models).

Collaborator

(let's remove this still, specifically because the conversion script should allow anyone who has trained a model with your framework to convert it without hassle! We have integration tests that make sure embeddings or outputs are valid; conversion scripts are not the place for this!)

@avidale avidale (Author) Aug 29, 2024

The function test_conversion_accuracy is not part of the conversion script; it is an integration test for the conversion script. It is optional because it requires downloading the huge original checkpoints.
If you insist, though, I can remove it or move it to another file.

tests/models/m2m_100/test_modeling_m2m_100.py (resolved)
Comment on lines +1700 to +1705
self.shared = nn.Embedding(config.vocab_size, config.d_model)

encoder_config = copy.deepcopy(config)
encoder_config.use_cache = False
encoder_config.is_encoder_decoder = False
self.encoder = M2M100Encoder(encoder_config, self.shared)
Collaborator

this is a bit weird. If you pass the shared nn.Embedding, only the weights are used, and it's otherwise a sinusoidal:

if embed_tokens is not None:
    self.embed_tokens.weight = embed_tokens.weight

which raises the question of whether this whole class is needed!?

Author

This whole M2M100EncoderModel class has several rationales to exist:

  1. The additional code in forward for pooling the token embeddings into a sentence embedding.
  2. Having the same level of nesting of the parameters (model->encoder->its layers) makes it easier to convert a trained encoder-decoder model (having the architecture of M2M100/NLLB) into a pair of separated encoder and decoder: one only needs to select the required parameters from the state dict, without having to rename them (sketched after this list).
  3. We can put an appropriate docstring on it :-)
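
A rough sketch of point 2; seq2seq_model and encoder_only_model stand for already instantiated models, and the key prefixes are assumptions based on the M2M100Model-style layout quoted in this thread, not code from the PR.

# Split an encoder-decoder checkpoint into an encoder-only state dict by key selection,
# without renaming anything, because the parameter nesting stays the same.
seq2seq_state_dict = seq2seq_model.state_dict()  # an M2M100Model-style encoder-decoder
encoder_state_dict = {
    key: value
    for key, value in seq2seq_state_dict.items()
    if key.startswith(("encoder.", "shared."))  # assumed key prefixes
}
encoder_only_model.load_state_dict(encoder_state_dict, strict=False)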

Collaborator

Okay, IMO only 1. is a strong reason; I don't know how much pool_last_hidden_state will be used, but okay.
2. is all in the conversion script, no?

@avidale avidale (Author) Oct 24, 2024

  1. Applying pool_last_hidden_state is going to be the main use case: producing pooled sentence embeddings. I am setting this parameter to False by default only for consistency with other encoders.
  2. The conversion script takes care of the case when a fairseq2-trained encoder or decoder is converted to the HF format. We could also create another conversion script for extracting an encoder-only model from a trained encoder-decoder one, but this seems like overkill to me.

src/transformers/models/m2m_100/modeling_m2m_100.py (outdated, resolved)
src/transformers/models/m2m_100/modeling_m2m_100.py (outdated, resolved)
)

if encoder_outputs is None:
    raise ValueError("M2M100DecoderModel expects the `encoder_outputs` to be always present.")
Author

I am not sure whether we should try here to interpret input_ids as a batch of sentence embeddings, instead of its original intended meaning as a batch of token id sequences.

On the one hand, it would allow the model to function with a single non-keyword argument and would remove the need to wrap the sentence embeddings into a BaseModelOutput container.

On the other hand, it would introduce a series of new problems, because there are methods inherited from the base class that are supposed to process input_ids as tokens.

@avidale (Author)

avidale commented Oct 1, 2024

@ArthurZucker any other comments?
What do you think about my responses?

@ArthurZucker ArthurZucker (Collaborator) left a comment

Sorry for coming back so late! Just a few nits and let's go 🤗

Comment on lines 1609 to 1618
if (output_attentions or self.config.output_attentions) and encoder_outputs.attentions is None:
    # just for the sake of compatibility, adding fake encoder attentions
    encoder_outputs.attentions = [
        torch.ones(
            (batch_size, self.config.encoder_attention_heads, 1, 1),
            device=embeddings.device,
            dtype=embeddings.dtype,
        )
        for layer in range(self.config.encoder_layers)
    ]
Collaborator

Let's avoid this, we should just disable the output_attentions in the inputs!

Author

Well, disabling output_attentions may not be the perfect solution, because the self-attention in the decoder is still non-trivial, and the user may want to inspect it.
But I could just remove this code and disable the corresponding unit test.

Comment on lines 1620 to 1625
if (output_hidden_states or self.config.output_hidden_states) and encoder_outputs.hidden_states is None:
    # just for the sake of compatibility, adding fake encoder hidden states
    encoder_outputs.hidden_states = [
        torch.zeros((batch_size, 1, self.config.d_model), device=embeddings.device, dtype=embeddings.dtype)
        for layer in range(self.config.encoder_layers + 1)
    ]
Collaborator

same here this does not make sense imo!

Author

Same: the hidden states of the decoder are meaningful, so disabling output_hidden_states is not a good option, but I could just return empty hidden states for the encoder.

Comment on lines 1680 to 1687
@staticmethod
def _reorder_cache(past_key_values, beam_idx):
    reordered_past = ()
    for layer_past in past_key_values:
        reordered_past += (
            tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
        )
    return reordered_past
Collaborator

Suggested change: remove the _reorder_cache method above.

no longer needed!

Author

Great! Removed two implementations of _reorder_cache.

@avidale avidale (Author) Oct 24, 2024

Apparently, _reorder_cache is needed!
My integration tests are failing without it, giving a

NotImplementedError: Make sure that a `_reorder_cache` function is correctly implemented in transformers.models.m2m_100.modeling_m2m_100 to enable beam search for <class 'transformers.models.m2m_100.modeling_m2m_100.M2M100DecoderModel'>

It is still required for beam search.

So I'm adding this method back.


@avidale (Author)

avidale commented Nov 8, 2024

Hi @ArthurZucker!
I have responded to your last comments.
Some tests are failing, but they all seem to be unrelated to my code.
