Natively support SONAR text models as M2M100 encoder and decoder models #29646
base: main
Conversation
Hi @avidale, thanks for opening this PR! Let us know when it's ready for review.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Signed-off-by: David Dale <[email protected]>
Just update the branch
Hey, I want to continue this work!
@avidale I can reopen for you
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Commenting to avoid it getting closed
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Still planning to return to it
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi @amyeroberts! One of the main questions: is it alright to always demand …
Also tagging @ArthurZucker as a potential reviewer.
Tagging @ArthurZucker, @younesbelkada, and @amyeroberts once more, as potential reviewers.
Yes! Sorry it has been on my stack for a while, and summer vacations hit!
Cool that you have trained these models! (Hello to Benoit Sagot, who used to be one of my teachers at MVA!)
Cool! Just wondering if this is copied from / similar to other models in the lib?
The encoder-only part is partially copied from the T5 encoder-only code.
The decoder-only part is partially copied from the full M2M100 encoder-decoder model (I am not aware of other decoder-only models in the lib that are conditional on something not in the decoder).
Okay, for the encoder part it means you are missing the `# Copied from` wrapper! 🤗
> for the encoder part it means you are missing the `# Copied from` wrapper!

I tried adding this wrapper, but because I made some modifications to the code, apart from just renaming the model from T5 to M2M100, the check `make repo-consistency` started complaining. So I had to remove the wrapper.
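For readers unfamiliar with the convention being discussed, here is a minimal illustration of what such a wrapper looks like; the source path, the `T5->M2M100` rename, and the placeholder class body are assumptions for illustration, not the exact code tried in this PR:

```python
from transformers.models.m2m_100.modeling_m2m_100 import M2M100PreTrainedModel

# `make repo-consistency` (via utils/check_copies.py) verifies that any class or function
# carrying a "# Copied from" comment stays identical to the referenced source, after the
# optional "with Old->New" renaming. Any local modification breaks the check, which is
# why the wrapper had to be dropped here.


# Copied from transformers.models.t5.modeling_t5.T5EncoderModel with T5->M2M100
class M2M100EncoderModel(M2M100PreTrainedModel):
    # Illustrative placeholder body, not the PR's actual implementation.
    ...
```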
ref_embeddings = torch.tensor(
    [[-0.005286, 0.002008, -0.000562, 0.006344, 0.006329], [-0.000330, -0.007055, 0.007644, 0.001841, 0.003727]]
)
assert torch.allclose(embeddings[:, :5], ref_embeddings, rtol=1e-3)
let's not assert on the embeddings as they are bound to change depending on the model!
I used the official SONAR model https://github.com/facebookresearch/SONAR/blob/main/sonar/cards/text_sonar_basic_encoder.yaml, which is the only SONAR text encoder that has ever been released so far.
I intended this test only to reproduce how this particular model is converted (and as a template in case anyone ever applies my conversion script to other models).
(Let's remove this still, specifically because the conversion script should allow anyone who has trained a model with your framework to also convert it without hassle! We have integration tests that make sure embeddings or outputs are valid; conversion scripts are not the place for this!)
The function `test_conversion_accuracy` is not a part of the conversion script; it is an integration test for the conversion script. And it is optional, because it requires downloading the huge original checkpoints.
If you insist, though, I can remove it or move it to another file.
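As a middle ground, here is a sketch of how such a check could live as a slow integration test instead. The class name, checkpoint path, and pooling step are placeholders (this is not the PR's actual `SonarIntegrationTests`); only the reference values are taken from the conversion check quoted above:

```python
import unittest

import torch

from transformers.testing_utils import require_torch, slow


@require_torch
class SonarConversionIntegrationTest(unittest.TestCase):
    @slow
    def test_converted_encoder_embeddings(self):
        # Hypothetical: load a checkpoint produced by the conversion script.
        from transformers import AutoTokenizer
        from transformers.models.m2m_100.modeling_m2m_100 import M2M100EncoderModel  # added in this PR

        tokenizer = AutoTokenizer.from_pretrained("path/to/converted-sonar-encoder")  # placeholder
        model = M2M100EncoderModel.from_pretrained("path/to/converted-sonar-encoder")

        inputs = tokenizer(["Hello world"], return_tensors="pt")
        with torch.no_grad():
            # Simple mean pooling over tokens, standing in for the model's own pooling option.
            embeddings = model(**inputs).last_hidden_state.mean(dim=1)

        # Reference slice from the conversion check discussed in this thread.
        expected = torch.tensor([[-0.005286, 0.002008, -0.000562, 0.006344, 0.006329]])
        torch.testing.assert_close(embeddings[:, :5], expected, rtol=1e-3, atol=1e-5)
```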
self.shared = nn.Embedding(config.vocab_size, config.d_model)

encoder_config = copy.deepcopy(config)
encoder_config.use_cache = False
encoder_config.is_encoder_decoder = False
self.encoder = M2M100Encoder(encoder_config, self.shared)
This is a bit weird. If you pass the shared `nn.Embedding`, only the weights are used, and it's otherwise a sinusoidal:

transformers/src/transformers/models/m2m_100/modeling_m2m_100.py, lines 796 to 797 in 6cffc90:

if embed_tokens is not None:
    self.embed_tokens.weight = embed_tokens.weight
This whole `M2M100EncoderModel` class has several rationales to exist:

1. The additional code in `forward` for pooling the token embeddings into a sentence embedding (a sketch of this pooling is included below).
2. Having the same level of nesting of the parameters (model -> encoder -> its layers) makes it easier to convert a trained encoder-decoder model (having the architecture of M2M100/NLLB) into a pair of separated encoder and decoder: one needs only to select the required parameters from the state dict, without having to rename them.
3. We can put an appropriate docstring on it :-)
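A minimal standalone sketch of the masked mean pooling described in point 1 (plain tensors, not the PR's actual `forward` code):

```python
import torch

# Masked mean pooling: average the token-level hidden states over the
# non-padding positions to get one fixed-size vector per sentence.
# Shapes are illustrative: (batch, seq_len, d_model) and (batch, seq_len).
last_hidden_state = torch.randn(2, 7, 1024)
attention_mask = torch.tensor([[1, 1, 1, 1, 0, 0, 0],
                               [1, 1, 1, 1, 1, 1, 1]])

mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, d_model)
counts = mask.sum(dim=1).clamp(min=1e-9)                         # (batch, 1)
sentence_embeddings = summed / counts                            # (batch, d_model)

print(sentence_embeddings.shape)  # torch.Size([2, 1024])
```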
Okay, IMO only 1. is a strong reason; I don't know how much `pool_last_hidden_state` is used, but okay.
2. is all in the conversion script, no?
1. Applying `pool_last_hidden_state` is going to be the main use case: producing pooled sentence embeddings. I am setting this parameter by default to `False` only for consistency with other encoders.
2. The conversion script takes care of the case when a fairseq2-trained encoder or decoder is getting converted to the HF format. We could also create another conversion script for extracting an encoder-only model from an encoder-decoder trained one, but this seems like overkill to me (see the sketch below for how little that extraction involves).
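To illustrate point 2 of the earlier comment, here is a hedged sketch of extracting an encoder-only state dict from an encoder-decoder checkpoint when the parameter nesting matches; the file paths and the exact key prefixes are assumptions:

```python
import torch

# Hypothetical M2M100/NLLB-style encoder-decoder checkpoint, with parameters
# nested as model.shared.*, model.encoder.*, model.decoder.* (assumed prefixes).
full_state_dict = torch.load("sonar_encoder_decoder.pt", map_location="cpu")  # placeholder path

# Because the encoder-only class keeps the same model -> encoder -> layers nesting,
# building an encoder-only checkpoint is pure key selection, with no renaming.
encoder_state_dict = {
    key: value
    for key, value in full_state_dict.items()
    if key.startswith(("model.encoder.", "model.shared."))
}

torch.save(encoder_state_dict, "sonar_encoder_only.pt")  # placeholder path
```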
Co-authored-by: Arthur <[email protected]>
)

if encoder_outputs is None:
    raise ValueError("M2M100DecoderModel expects the `encoder_outputs` to be always present.")
I am not sure if we could try here to interpret `input_ids` as the batch of sentence embeddings, instead of their original intended meaning as a batch of sequences of token ids.
On the one hand, it would allow the model to function with a single non-keyword argument, and would remove the need of wrapping the sentence embeddings into a `BaseModelOutput` container.
On the other hand, it would introduce a series of new problems, because there are methods for processing input ids, inherited from the base class, which are supposed to process `input_ids` as tokens.
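For context, a sketch of the wrapping being discussed, i.e. handing precomputed sentence embeddings to the decoder via `encoder_outputs`; the decoder class and the exact call at the end come from this PR and are therefore assumptions, left commented out:

```python
import torch

from transformers.modeling_outputs import BaseModelOutput

# Precomputed SONAR-style sentence embeddings: one vector per sentence,
# presented to the decoder as a length-1 "encoder sequence".
sentence_embeddings = torch.randn(2, 1024)               # (batch, d_model)
encoder_outputs = BaseModelOutput(
    last_hidden_state=sentence_embeddings.unsqueeze(1)   # (batch, 1, d_model)
)

# Hypothetical usage, assuming the M2M100DecoderModel added in this PR:
# decoder = M2M100DecoderModel.from_pretrained("path/to/converted-sonar-decoder")
# generated_ids = decoder.generate(encoder_outputs=encoder_outputs, num_beams=5)
```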
@ArthurZucker any other comments?
Sorry for coming back so late! Just a few nits and let's go 🤗
if (output_attentions or self.config.output_attentions) and encoder_outputs.attentions is None:
    # just for the sake of compatibility, adding fake encoder attentions
    encoder_outputs.attentions = [
        torch.ones(
            (batch_size, self.config.encoder_attention_heads, 1, 1),
            device=embeddings.device,
            dtype=embeddings.dtype,
        )
        for layer in range(self.config.encoder_layers)
    ]
Let's avoid this, we should just disable the `output_attentions` in the inputs!
Well, disabling `output_attentions` may not be the perfect solution, because the self-attention in the decoder is still non-trivial, and the user may want to inspect it.
But I could just remove this code and disable the corresponding unit test.
if (output_hidden_states or self.config.output_hidden_states) and encoder_outputs.hidden_states is None:
    # just for the sake of compatibility, adding fake encoder hidden states
    encoder_outputs.hidden_states = [
        torch.zeros((batch_size, 1, self.config.d_model), device=embeddings.device, dtype=embeddings.dtype)
        for layer in range(self.config.encoder_layers + 1)
    ]
Same here, this does not make sense IMO!
Same: the hidden states of the decoder are meaningful, so disabling `output_hidden_states` is not a good option, but I could just return empty hidden states for the encoder.
@staticmethod
def _reorder_cache(past_key_values, beam_idx):
    reordered_past = ()
    for layer_past in past_key_values:
        reordered_past += (
            tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
        )
    return reordered_past
no longer needed!
Great! Removed two implementations of `_reorder_cache`.
Apparently, `_reorder_cache` is needed!
My integration tests are failing without it, giving a NotImplementedError: Make sure that a `_reorder_cache` function is correctly implemented in transformers.models.m2m_100.modeling_m2m_100 to enable beam search for <class 'transformers.models.m2m_100.modeling_m2m_100.M2M100DecoderModel'>
Apparently, it is still required for beam search, so I'm adding this method back.
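For anyone following along, a tiny standalone illustration of what `_reorder_cache` does during beam search (plain tensors, not the model's actual cache; the shapes are illustrative):

```python
import torch

# A cached key/value tensor for one layer: (batch * num_beams, heads, seq_len, head_dim).
past_state = torch.arange(4).float().view(4, 1, 1, 1)  # rows correspond to beams 0..3
beam_idx = torch.tensor([2, 2, 0, 1])                   # which beam each slot continues from

# Beam search re-selects the cache rows so every surviving hypothesis keeps the
# past keys/values of the beam it was extended from.
reordered = past_state.index_select(0, beam_idx)
print(reordered.flatten().tolist())  # [2.0, 2.0, 0.0, 1.0]
```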
Hi @ArthurZucker!
What does this PR do?
This PR adds native support for SONAR text encoders and decoders (https://github.com/facebookresearch/SONAR).
SONAR for text is architecturally an NLLB model, but with the encoder representations mean-pooled into a single fixed-size vector before passing them to the decoder. Thus, the SONAR encoder works as a sentence embedder, and thanks to pretraining on translation data, it is massively multilingual and language-agnostic. And, unlike other sentence encoders, this one has a decoder that can reconstruct the original texts back from their embeddings.
To support such models natively, the easiest way would be to create NLLB/M2M100 model classes with encoder only or decoder only, similarly to the existing classes `T5EncoderModel` or `MT5EncoderModel`.
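For comparison, this is how the existing `T5EncoderModel` analogue is used today (standard transformers API, independent of this PR):

```python
import torch

from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5EncoderModel.from_pretrained("t5-small")

inputs = tokenizer("Studies have shown that owning a dog is good for you.", return_tensors="pt")
with torch.no_grad():
    # Encoder-only forward pass: token-level hidden states of shape (batch, seq_len, d_model).
    last_hidden_state = model(**inputs).last_hidden_state
```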
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
Details
- The PR adds two new classes: `M2M100EncoderModel` and `M2M100DecoderModel`.
- `M2M100EncoderModel` is a typical encoder-only model (BERT-style), with an additional option of applying mean pooling to its outputs (this is how SONAR text embeddings are computed).
- `M2M100DecoderModel` is a module consisting of `M2M100Decoder` and an output projection layer. As the input, it always expects the `encoder_outputs` argument to be present, and ignores its `input_ids` (see the usage sketch after this list).
- Unlike `M2M100ForConditionalGeneration`, `M2M100DecoderModel` always has its decoder input embedding and decoder output projection layers tied, because this is how the SONAR decoder was originally implemented.
- The PR adds a conversion script for the original `fairseq2` checkpoints (it doesn't require `fairseq2` as a dependency; instead, it just reads and reformats the torch model state dicts).
- The PR adds tests for the encoder-only model (based on `T5EncoderOnlyModelTest`, see Add T5 Encoder for Feature Extraction #8717), and for the decoder-only model (based loosely on similar ideas, but with more tweaks).
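A hedged end-to-end sketch of how the two classes could be used together once converted checkpoints exist; the class names come from this PR, the checkpoint paths are placeholders, and the pooling is done by hand here rather than via the model's option:

```python
import torch

from transformers import AutoTokenizer
from transformers.modeling_outputs import BaseModelOutput

# Classes added by this PR (not in a released transformers version); paths are placeholders.
from transformers.models.m2m_100.modeling_m2m_100 import M2M100DecoderModel, M2M100EncoderModel

tokenizer = AutoTokenizer.from_pretrained("path/to/converted-sonar-encoder")
encoder = M2M100EncoderModel.from_pretrained("path/to/converted-sonar-encoder")
decoder = M2M100DecoderModel.from_pretrained("path/to/converted-sonar-decoder")

inputs = tokenizer(["Hello, world!"], return_tensors="pt")
with torch.no_grad():
    # Mean pooling done explicitly; the PR's pool_last_hidden_state option is meant
    # to perform this step inside M2M100EncoderModel.forward.
    token_states = encoder(**inputs).last_hidden_state  # (batch, seq_len, d_model)
    sentence_embeddings = token_states.mean(dim=1)       # (batch, d_model)

# The decoder always conditions on `encoder_outputs`; the pooled embeddings are presented
# as a length-1 "encoder sequence", as in the sketch earlier in the thread.
encoder_outputs = BaseModelOutput(last_hidden_state=sentence_embeddings.unsqueeze(1))
generated_ids = decoder.generate(encoder_outputs=encoder_outputs, num_beams=5)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
```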
Testing
All the unit tests I added are run by …
The integration tests that I added are marked as slow, so they can be run with
`RUN_SLOW=1 python -m pytest tests/models/m2m_100/test_modeling_m2m_100.py::SonarIntegrationTests`