
Multilingual inference#928

Draft
benjaminking wants to merge 4 commits into master from multilingual_inference

Conversation

@benjaminking (Collaborator) commented Feb 3, 2026

This is a draft PR that adds support for multilingual inference (multi-source and multi-target) in SILNLP. It turns out that the existing code already supports training with more than one language pair, so this focuses specifically on inference. There is more work still to be done before this is ready to merge, but I wanted to get feedback before I got too deep into that extra work.

The biggest changes are in SILTranslationPipeline in hugging_face_config.py. HuggingFace's TranslationPipeline is set up to have a fixed input language code and output language code and takes care of adding the various special tokens to the input and output tensors. I have overridden some of the pipeline methods to manually construct the tensors, so that we can vary the language codes on a per-instance basis. Right now, it only handles NLLB's format.
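The per-example token construction described here could look roughly like the following. This is a minimal pure-Python sketch, not the actual SILTranslationPipeline code: the function names are hypothetical, and the exact placement of the language token has varied across HuggingFace NLLB tokenizer versions.

```python
# Hypothetical helpers illustrating NLLB-style special-token handling on a
# per-example basis (illustration only, not SILNLP's implementation).

def build_nllb_source_ids(token_ids, src_lang_id, eos_id):
    # NLLB tags each source sentence with its language token; recent
    # HuggingFace tokenizers place it before the tokens, with </s> at the end.
    return [src_lang_id] + token_ids + [eos_id]

def build_decoder_start(trg_lang_id):
    # Generation is steered toward the target language by forcing its code
    # as the first decoded token (forced_bos_token_id in HuggingFace terms).
    return [trg_lang_id]

# Each example can now carry its own language pair, instead of the pipeline
# assuming one fixed pair:
print(build_nllb_source_ids([5, 6, 7], src_lang_id=101, eos_id=2))  # [101, 5, 6, 7, 2]
```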

Almost all of the other changes simply support this new method for creating the tensors (e.g. making sure that the source and target ISO codes are always passed to the translate functions rather than derived from the model metadata). Let me know if you see any red flags in the way I've implemented things.


This change is Reviewable

@ddaspit (Collaborator) left a comment


What is the use case for multilingual inference in a single translate call? When running inference on a USFM file, there will only ever be a single source and target. For testing, I believe the test split for each language pair is in a separate file, in which case we should be able to use a separate pipeline for each file. Of course, I could be missing something obvious.

@ddaspit made 1 comment.
Reviewable status: 0 of 6 files reviewed, all discussions resolved.

@benjaminking (Collaborator, Author) left a comment


You are correct about that. I don't think we currently have a use case for it, but I had thought that if we had the ability to perform actual many-to-many training and testing, then new use cases might arise.

If you don't think we'd be likely to want to translate to/from multiple languages in one call, then using a separate pipeline for each language pair should work. The source ISO can be configured via the tokenizer's source language, and the target ISO via the model's generation_config, and I believe we can update each of those before creating a new pipeline.
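The per-pair configuration described above could be sketched as follows. The duck-typed stubs below stand in for a real HuggingFace NLLB tokenizer and generation config (so the logic is testable without downloading a model), and `configure_language_pair` is a hypothetical helper, not existing SILNLP code. With real objects this would amount to setting `tokenizer.src_lang` and `model.generation_config.forced_bos_token_id` before constructing each pipeline.

```python
class StubTokenizer:
    """Minimal stand-in for an NLLB tokenizer (illustration only)."""
    def __init__(self, lang_codes):
        self.src_lang = None
        self._lang_ids = {code: i for i, code in enumerate(lang_codes)}

    def convert_tokens_to_ids(self, token):
        return self._lang_ids[token]

class StubGenerationConfig:
    """Minimal stand-in for a HuggingFace GenerationConfig."""
    forced_bos_token_id = None

def configure_language_pair(tokenizer, generation_config, src_iso, trg_iso):
    # Source side: the tokenizer inserts this language code into the input.
    tokenizer.src_lang = src_iso
    # Target side: force the decoder to start with the target language token.
    generation_config.forced_bos_token_id = tokenizer.convert_tokens_to_ids(trg_iso)

# One pipeline per language pair: reconfigure, then build a fresh pipeline.
tok = StubTokenizer(["fra_Latn", "spa_Latn"])
cfg = StubGenerationConfig()
configure_language_pair(tok, cfg, "fra_Latn", "spa_Latn")
print(tok.src_lang, cfg.forced_bos_token_id)  # fra_Latn 1
```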

Maybe what I should do is

  1. Leave this code in a branch, in case the need ever arises for multiple language pairs in a single pipeline
  2. Implement the simpler one-pipeline-per-language-pair approach and create a PR for it

@benjaminking made 1 comment.
Reviewable status: 0 of 6 files reviewed, all discussions resolved.

@ddaspit (Collaborator) left a comment


Sounds like a good plan.

@ddaspit made 1 comment.
Reviewable status: 0 of 6 files reviewed, all discussions resolved.

