[API] Public way to construct SentencePieceTokenizer (Unigram) from tokenizer.json / in-memory pieces+scores

# [API] Public way to construct `SentencePieceTokenizer` (Unigram) from `tokenizer.json` / in-memory pieces+scores

## Summary

`Microsoft.ML.Tokenizers` fully implements the SentencePiece **Unigram** model
(`SentencePieceUnigramModel`), but the only public way to obtain a
`SentencePieceTokenizer` is `SentencePieceTokenizer.Create(Stream)`, which parses a
**SentencePiece protobuf** (`.model` / `ModelProto`). There is no public API to build a
Unigram tokenizer from a Hugging Face `tokenizer.json` (or from in-memory pieces + scores).

Many modern HF models ship a JSON-only Unigram tokenizer (`tokenizer.json` with
`model.type == "Unigram"`, `model.vocab` as `[piece, score]` pairs, `model.unk_id`, and a
`normalizer`/`pre_tokenizer`) and **no** `sentencepiece.model`/`spiece.model`/`tokenizer.model`
protobuf. For these models there is currently no supported way to construct the tokenizer.
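For illustration, the `model` section described above can be read with `System.Text.Json` alone. This is a minimal sketch, not part of `Microsoft.ML.Tokenizers`: the `UnigramJsonSketch` class and `ReadUnigramModel` method are hypothetical names, and it assumes `model.unk_id` is present and non-null.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class UnigramJsonSketch
{
    // Hypothetical helper: extracts the [piece, score] pairs and unk_id from tokenizer.json text.
    public static (List<(string Piece, float Score)> Vocab, int UnkId) ReadUnigramModel(string json)
    {
        using JsonDocument doc = JsonDocument.Parse(json);
        JsonElement model = doc.RootElement.GetProperty("model");

        var type = model.GetProperty("type").GetString();
        if (type != "Unigram")
            throw new InvalidOperationException($"Expected a Unigram model, found '{type}'.");

        var vocab = new List<(string Piece, float Score)>();
        foreach (JsonElement entry in model.GetProperty("vocab").EnumerateArray())
        {
            // Each vocab entry is a two-element array: [piece, score].
            vocab.Add((entry[0].GetString() ?? "", entry[1].GetSingle()));
        }

        // Assumption: unk_id is a number; some files may omit it or set it to null.
        int unkId = model.GetProperty("unk_id").GetInt32();
        return (vocab, unkId);
    }
}
```

The piece index in the returned list is the token id, so the list order must be preserved.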

## Current behavior

- `SentencePieceTokenizer`'s only constructor is `internal SentencePieceTokenizer(ModelProto modelProto, ...)`.
- The `Sentencepiece.*` generated protobuf types (`ModelProto`, `TrainerSpec`,
  `NormalizerSpec`, …) are `internal`, so callers can't build a `ModelProto` directly.
- The public factory `SentencePieceTokenizer.Create(Stream, bool addBeginOfSentence, bool addEndOfSentence, ...)`
  requires a serialized SentencePiece protobuf stream.
- By contrast, `BpeTokenizer` / `WordPieceTokenizer` expose vocab-file/stream factories, so the
  asymmetry is Unigram-specific.

## Request

A public factory to construct a Unigram `SentencePieceTokenizer` without a protobuf, e.g. one of:

1. **From `tokenizer.json`** — `SentencePieceTokenizer.CreateFromTokenizerJson(Stream json, ...)`
   (parse `model.vocab` pieces+scores, `model.unk_id`, and the `normalizer`/`pre_tokenizer`
   precompiled charsmap + metaspace settings).
2. **From in-memory pieces+scores** —
   `Create(IEnumerable<(string Piece, float Score)> vocab, int unkId, ReadOnlySpan<byte> precompiledCharsMap, bool addDummyPrefix, bool escapeWhitespaces, ...)`.

Either would let callers load JSON-Unigram models that have no `.model` protobuf.

## Workaround we're using

Because the only public entry point takes a protobuf stream, we **synthesize a SentencePiece `ModelProto`
on the fly** from `tokenizer.json` and feed the bytes to `Create(Stream)`:

- `pieces` ← `model.vocab` `[piece, score]` (mapping special tokens to `CONTROL`/`UNKNOWN` types).
- `trainer_spec.model_type = UNIGRAM`, `trainer_spec.unk_id = model.unk_id` (+ bos/eos/pad ids).
- `normalizer_spec.precompiled_charsmap` ← the `Precompiled` normalizer's `precompiled_charsmap`
  bytes from `tokenizer.json` (gives byte-exact NFKC parity), plus
  `add_dummy_prefix` / `escape_whitespaces` from the `Metaspace` pre-tokenizer.

This works, but it requires hand-writing a SentencePiece protobuf encoder and re-deriving the
wire schema, which is exactly the kind of work the library could make unnecessary by exposing a
direct factory. It is also fragile across schema changes.
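To show what the workaround involves, here is a minimal sketch of such an encoder. It is not our exact code. The class and method names are hypothetical, and the field numbers (`pieces = 1`, `trainer_spec = 2`, `normalizer_spec = 3`, `model_type = 3`, `unk_id = 40`, and so on) are taken from my reading of the upstream `sentencepiece_model.proto` and should be verified against it. The sketch marks only the unknown piece specially and omits the bos/eos/pad ids and `CONTROL` piece types mentioned above.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class UnigramModelProtoSketch
{
    // Protobuf wire types.
    const int Varint = 0, LengthDelimited = 2, Fixed32 = 5;

    static void WriteVarint(Stream s, ulong value)
    {
        // Emit 7 bits per byte, setting the high bit on every byte except the last.
        while (value >= 0x80) { s.WriteByte((byte)((value & 0x7F) | 0x80)); value >>= 7; }
        s.WriteByte((byte)value);
    }

    static void WriteTag(Stream s, int field, int wireType) =>
        WriteVarint(s, (ulong)((field << 3) | wireType));

    static void WriteInt32(Stream s, int field, int value)
    {
        WriteTag(s, field, Varint);
        WriteVarint(s, unchecked((ulong)(long)value)); // negative values are sign-extended
    }

    static void WriteBytes(Stream s, int field, ReadOnlySpan<byte> payload)
    {
        WriteTag(s, field, LengthDelimited);
        WriteVarint(s, (ulong)payload.Length);
        s.Write(payload);
    }

    static void WriteFloat(Stream s, int field, float value)
    {
        WriteTag(s, field, Fixed32);
        Span<byte> buffer = stackalloc byte[4];
        BinaryPrimitives.WriteSingleLittleEndian(buffer, value);
        s.Write(buffer);
    }

    // ModelProto.SentencePiece: piece = 1, score = 2, type = 3 (assumed field numbers).
    public static byte[] EncodePiece(string piece, float score, int type)
    {
        using var ms = new MemoryStream();
        WriteBytes(ms, 1, Encoding.UTF8.GetBytes(piece));
        WriteFloat(ms, 2, score);
        WriteInt32(ms, 3, type);
        return ms.ToArray();
    }

    public static byte[] EncodeModel(
        IReadOnlyList<(string Piece, float Score)> vocab, int unkId,
        ReadOnlySpan<byte> precompiledCharsMap, bool addDummyPrefix, bool escapeWhitespaces)
    {
        using var model = new MemoryStream();
        for (int i = 0; i < vocab.Count; i++)
        {
            int type = i == unkId ? 2 /* UNKNOWN */ : 1 /* NORMAL */;
            WriteBytes(model, 1, EncodePiece(vocab[i].Piece, vocab[i].Score, type));
        }

        using var trainer = new MemoryStream();
        WriteInt32(trainer, 3, 1);       // trainer_spec.model_type = UNIGRAM
        WriteInt32(trainer, 40, unkId);  // trainer_spec.unk_id
        WriteBytes(model, 2, trainer.ToArray());

        using var normalizer = new MemoryStream();
        WriteBytes(normalizer, 2, precompiledCharsMap);         // precompiled_charsmap
        WriteInt32(normalizer, 3, addDummyPrefix ? 1 : 0);      // add_dummy_prefix
        WriteInt32(normalizer, 5, escapeWhitespaces ? 1 : 0);   // escape_whitespaces
        WriteBytes(model, 3, normalizer.ToArray());

        return model.ToArray();
    }
}
```

The resulting bytes are wrapped in a `MemoryStream` and passed to `SentencePieceTokenizer.Create(Stream)`. Every field number in the sketch is something a public factory would let callers stop caring about.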

## Why it matters

JSON-only Unigram tokenizers are common (multilingual static-embedding models, several HF
encoder models). Without a JSON/in-memory factory, every consumer must either ship a converted
`.model` or reimplement the protobuf synthesis above.

## Repro context

- Model: a `potion-multilingual-128M`-style static embedding model — `tokenizer.json` only,
  `model.type == "Unigram"`, ~500k-entry custom vocab, `[PAD]=0` / `[UNK]=1`, with a
  `Sequence`→`Precompiled` normalizer (`precompiled_charsmap` present) and a `Metaspace`
  pre-tokenizer. No `sentencepiece.model`.
- `Microsoft.ML.Tokenizers` main (commit `901da3e`).
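For the `Sequence`→`Precompiled` normalizer layout in this repro, the charsmap bytes can be located with a short recursive search. This is a sketch under assumptions: the `PrecompiledCharsMapSketch` class and `TryFindCharsMap` method are hypothetical names, and it assumes the usual `tokenizer.json` layout in which a `Sequence` normalizer holds its children under `normalizers` and `precompiled_charsmap` is a base64 string.

```csharp
using System;
using System.Text.Json;

public static class PrecompiledCharsMapSketch
{
    // Depth-first search for a {"type": "Precompiled"} normalizer, descending into
    // {"type": "Sequence", "normalizers": [...]} wrappers. Returns false if none is found.
    public static bool TryFindCharsMap(JsonElement normalizer, out byte[] charsMap)
    {
        charsMap = Array.Empty<byte>();
        if (normalizer.ValueKind != JsonValueKind.Object ||
            !normalizer.TryGetProperty("type", out JsonElement type) ||
            type.ValueKind != JsonValueKind.String)
        {
            return false;
        }

        switch (type.GetString())
        {
            case "Precompiled":
                // The charsmap is stored as base64 text; decode it to raw bytes.
                charsMap = normalizer.GetProperty("precompiled_charsmap").GetBytesFromBase64();
                return true;

            case "Sequence":
                if (!normalizer.TryGetProperty("normalizers", out JsonElement children))
                    return false;
                foreach (JsonElement child in children.EnumerateArray())
                {
                    if (TryFindCharsMap(child, out charsMap))
                        return true;
                }
                return false;

            default:
                return false;
        }
    }
}
```

The decoded bytes are what goes into `normalizer_spec.precompiled_charsmap` in the workaround above.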


[API] Public way to construct SentencePieceTokenizer (Unigram) from tokenizer.json / in-memory pieces+scores #7624