Conversation
ddaspit
left a comment
What is the use case for multilingual inference in a single translate call? When inferencing a USFM file, it will only ever be a single source and target. For testing, I believe that the test split for each language pair is in a separate file. In which case, we should be able to use a separate pipeline for each file. Of course, I could be missing something obvious.
@ddaspit made 1 comment.
Reviewable status: 0 of 6 files reviewed, all discussions resolved.
benjaminking
left a comment
You are correct about that. I don't think we currently have a use case for it, but I had thought that if we had the ability to perform actual many-to-many training and testing, then new use cases might arise.
If you don't think we'd be likely to want to translate to/from multiple languages in one call, I think that using a separate pipeline for each language pair would work. The source iso can be configured via the tokenizer's source language and the target iso can be configured via the model's generation_config. And I believe we can update each of those before we create a new pipeline.
Maybe what I should do is
- Leave this code in a branch, in case the need ever arises for multiple language pairs in a single pipeline
- Implement the simpler one-pipeline-per-language-pair approach and create a PR for it
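To make the per-pair idea concrete, here is a rough sketch (the helper names are mine, not SILNLP's; the tokenizer/model attributes follow HuggingFace's NLLB conventions as described above): group the test files by language pair, then configure the tokenizer's source language and the model's generation config before building each pipeline.

```python
from collections import defaultdict

def group_by_language_pair(test_files):
    """Group (src_iso, tgt_iso, path) records so that each language pair
    can be served by its own translation pipeline."""
    groups = defaultdict(list)
    for src_iso, tgt_iso, path in test_files:
        groups[(src_iso, tgt_iso)].append(path)
    return dict(groups)

def configure_for_pair(model, tokenizer, src_code, tgt_code):
    """Point an NLLB-style tokenizer/model at one language pair before
    constructing a pipeline: the source code is set on the tokenizer and
    the target code on the model's generation config."""
    tokenizer.src_lang = src_code
    model.generation_config.forced_bos_token_id = (
        tokenizer.convert_tokens_to_ids(tgt_code)
    )
```

Each call to `configure_for_pair` would be followed by creating a fresh pipeline for that pair, which keeps the per-example logic out of the pipeline entirely.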
ddaspit
left a comment
Sounds like a good plan.
This is a draft PR that adds support for multilingual inference (multi-source and multi-target) in SILNLP. It turns out that the existing code already supports training with more than one language pair, so this focuses specifically on inference. There is more work still to be done before this is ready to merge, but I wanted to get feedback before I got too deep into that extra work.
The biggest changes are in `SILTranslationPipeline` in `hugging_face_config.py`. HuggingFace's `TranslationPipeline` is set up to have a fixed input language code and output language code, and it takes care of adding the various special tokens to the input and output tensors. I have overridden some of the pipeline methods to construct the tensors manually, so that we can vary the language codes on a per-instance basis. Right now, it only handles NLLB's format.

Most of the other changes simply support this new method for creating the tensors (e.g., making sure that the source and target ISO codes are always passed to the translate functions rather than derived from the model metadata). Let me know if you see any red flags in the way I've implemented things.
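As a rough illustration of what per-instance tensor construction involves for NLLB's format (the helpers and token IDs below are illustrative, not the PR's actual code): each source sequence is prefixed with its own source language token and terminated with `</s>`, and each output is forced to begin with its own target language token, so a single batch can mix language pairs.

```python
def build_nllb_input_ids(token_ids, src_code, lang_token_ids, eos_id=2):
    """Build one source sequence in NLLB's format: <src_lang> tokens </s>.
    Attaching the language token per example (instead of fixing it on the
    pipeline) is what allows mixed-language batches."""
    return [lang_token_ids[src_code]] + list(token_ids) + [eos_id]

def forced_bos_ids(tgt_codes, lang_token_ids):
    """Per-example target language tokens: each output starts with its own
    target code rather than one pipeline-wide forced_bos_token_id."""
    return [lang_token_ids[c] for c in tgt_codes]
```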