Conversation
ddaspit
left a comment
What is the use case for multilingual inference in a single translate call? When inferencing a USFM file, it will only ever be a single source and target. For testing, I believe that the test split for each language pair is in a separate file. In which case, we should be able to use a separate pipeline for each file. Of course, I could be missing something obvious.
@ddaspit made 1 comment.
Reviewable status: 0 of 6 files reviewed, all discussions resolved.
benjaminking
left a comment
You are correct about that. I don't think we currently have a use case for it, but I had thought that if we had the ability to perform actual many-to-many training and testing, then new use cases might arise.
If you don't think we'd be likely to want to translate to/from multiple languages in one call, I think that using a separate pipeline for each language pair would work. The source iso can be configured via the tokenizer's source language and the target iso can be configured via the model's generation_config. And I believe we can update each of those before we create a new pipeline.
Maybe what I should do is
- Leave this code in a branch, in case the need ever arises for multiple language pairs in a single pipeline
- Implement the simpler one-pipeline-per-language-pair approach and create a PR for it
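To make the per-pair idea concrete, here is a rough sketch (the helper names are mine, not SILNLP's; the tokenizer/model attributes follow HuggingFace's NLLB conventions as described above): group the test files by language pair, then configure the tokenizer's source language and the model's generation config before building each pipeline.

```python
from collections import defaultdict

def group_by_language_pair(test_files):
    """Group (src_iso, tgt_iso, path) records so that each language pair
    can be served by its own translation pipeline."""
    groups = defaultdict(list)
    for src_iso, tgt_iso, path in test_files:
        groups[(src_iso, tgt_iso)].append(path)
    return dict(groups)

def configure_for_pair(model, tokenizer, src_code, tgt_code):
    """Point an NLLB-style tokenizer/model at one language pair before
    constructing a pipeline: the source code is set on the tokenizer and
    the target code on the model's generation config."""
    tokenizer.src_lang = src_code
    model.generation_config.forced_bos_token_id = (
        tokenizer.convert_tokens_to_ids(tgt_code)
    )
```

Each call to `configure_for_pair` would be followed by creating a fresh pipeline for that pair, which keeps the per-example logic out of the pipeline entirely.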
ddaspit
left a comment
Sounds like a good plan.
This is a draft PR that adds support for multilingual inference (multi-source and multi-target) in SILNLP. It turns out that the existing code already supports training with more than one language pair, so this focuses specifically on inference. There is more work still to be done before this is ready to merge, but I wanted to get feedback before I got too deep into that extra work.
The biggest changes are in `SILTranslationPipeline` in `hugging_face_config.py`. HuggingFace's `TranslationPipeline` is set up to have a fixed input language code and output language code, and it takes care of adding the various special tokens to the input and output tensors. I have overridden some of the pipeline methods to construct the tensors manually, so that we can vary the language codes on a per-instance basis. Right now, it only handles NLLB's format.

Most of the other changes simply support this new method for creating the tensors (e.g., making sure that the source and target ISO codes are always passed to the translate functions rather than derived from the model metadata). Let me know if you see any red flags in the way I've implemented things.
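As a rough illustration of what per-instance tensor construction involves for NLLB's format (the helpers and token IDs below are illustrative, not the PR's actual code): each source sequence is prefixed with its own source language token and terminated with `</s>`, and each output is forced to begin with its own target language token, so a single batch can mix language pairs.

```python
def build_nllb_input_ids(token_ids, src_code, lang_token_ids, eos_id=2):
    """Build one source sequence in NLLB's format: <src_lang> tokens </s>.
    Attaching the language token per example (instead of fixing it on the
    pipeline) is what allows mixed-language batches."""
    return [lang_token_ids[src_code]] + list(token_ids) + [eos_id]

def forced_bos_ids(tgt_codes, lang_token_ids):
    """Per-example target language tokens: each output starts with its own
    target code rather than one pipeline-wide forced_bos_token_id."""
    return [lang_token_ids[c] for c in tgt_codes]
```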