[API] Public way to construct SentencePieceTokenizer (Unigram) from tokenizer.json / in-memory pieces+scores #7624

@ericstj

Summary

Microsoft.ML.Tokenizers fully implements the SentencePiece Unigram model
(SentencePieceUnigramModel), but the only public way to obtain a
SentencePieceTokenizer is SentencePieceTokenizer.Create(Stream), which parses a
SentencePiece protobuf (.model / ModelProto). There is no public API to build a
Unigram tokenizer from a Hugging Face tokenizer.json (or from in-memory pieces + scores).

Many modern HF models ship a JSON-only Unigram tokenizer (tokenizer.json with
model.type == "Unigram", model.vocab as [piece, score] pairs, model.unk_id, and a
normalizer/pre_tokenizer) and no sentencepiece.model/spiece.model/tokenizer.model
protobuf. For these models there is currently no supported way to construct the tokenizer.
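To make the JSON shape concrete, here is a minimal Python sketch (language-neutral in intent, not part of Microsoft.ML.Tokenizers) that pulls the fields listed above out of a tokenizer.json document. The function name `read_unigram_tokenizer_json` and the returned dictionary keys are hypothetical. It assumes the Precompiled normalizer stores `precompiled_charsmap` as a base64 string and that Sequence wrappers nest their children under `normalizers` / `pretokenizers`, which should be checked against the actual file.

```python
import base64
import json


def _walk(node):
    """Yield a normalizer/pre-tokenizer node and any children nested in Sequence wrappers."""
    if not node:
        return
    yield node
    for key in ("normalizers", "pretokenizers"):
        for child in node.get(key, []):
            yield from _walk(child)


def read_unigram_tokenizer_json(text: str) -> dict:
    """Extract the pieces a Unigram tokenizer needs from tokenizer.json text (hypothetical helper)."""
    doc = json.loads(text)
    model = doc["model"]
    if model.get("type") != "Unigram":
        raise ValueError("not a Unigram tokenizer.json")

    # model.vocab is a list of [piece, score] pairs; the index is the token id.
    vocab = [(piece, float(score)) for piece, score in model["vocab"]]

    # Assumption: the charsmap is base64-encoded inside the Precompiled normalizer.
    charsmap = b""
    for node in _walk(doc.get("normalizer")):
        if node.get("type") == "Precompiled":
            charsmap = base64.b64decode(node["precompiled_charsmap"])

    # The Metaspace settings are returned raw; their field names vary across versions.
    metaspace = next(
        (n for n in _walk(doc.get("pre_tokenizer")) if n.get("type") == "Metaspace"),
        None,
    )
    return {
        "vocab": vocab,
        "unk_id": model.get("unk_id"),
        "charsmap": charsmap,
        "metaspace": metaspace,
    }
```

These four values are what either of the factories requested below would need as input.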

Current behavior

  • SentencePieceTokenizer's only constructor is internal SentencePieceTokenizer(ModelProto modelProto, ...).
  • The Sentencepiece.* generated protobuf types (ModelProto, TrainerSpec,
    NormalizerSpec, …) are internal, so callers can't build a ModelProto directly.
  • The public factory SentencePieceTokenizer.Create(Stream, bool addBeginOfSentence, bool addEndOfSentence, ...)
    requires a serialized SentencePiece protobuf stream.
  • By contrast, BpeTokenizer / WordPieceTokenizer expose vocab-file/stream factories, so the
    asymmetry is Unigram-specific.

Request

A public factory to construct a Unigram SentencePieceTokenizer without a protobuf, e.g. one of:

  1. From tokenizer.json: SentencePieceTokenizer.CreateFromTokenizerJson(Stream json, ...)
    (parse model.vocab pieces+scores, model.unk_id, and the normalizer/pre_tokenizer
    precompiled charsmap + metaspace settings).
  2. From in-memory pieces+scores:
    Create(IEnumerable<(string Piece, float Score)> vocab, int unkId, ReadOnlySpan<byte> precompiledCharsMap, bool addDummyPrefix, bool escapeWhitespaces, ...).

Either would let callers load JSON-Unigram models that have no .model protobuf.

Workaround we're using

Since the only public entry is a protobuf stream, we synthesize a SentencePiece ModelProto
on the fly from tokenizer.json and feed the bytes to Create(Stream):

  • pieces ← model.vocab [piece, score] (mapping special tokens to CONTROL/UNKNOWN types).
  • trainer_spec.model_type = UNIGRAM, trainer_spec.unk_id = model.unk_id (+ bos/eos/pad ids).
  • normalizer_spec.precompiled_charsmap ← the Precompiled normalizer's precompiled_charsmap
    bytes from tokenizer.json (gives byte-exact NFKC parity), plus
    add_dummy_prefix / escape_whitespaces from the Metaspace pre-tokenizer.

This works, but it requires hand-writing a SentencePiece protobuf encoder and re-deriving the
wire schema, which is exactly the kind of thing the library could expose directly. It's also
fragile across schema changes.
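For readers unfamiliar with what the synthesis involves, the following Python sketch hand-encodes a minimal ModelProto using the protobuf wire format. It is an illustration under stated assumptions, not the code we actually ship. The field numbers and enum values are recalled from sentencepiece_model.proto and should be verified against the schema the library parses, and `encode_model_proto` and its parameters are hypothetical names.

```python
import struct

# Assumed values from sentencepiece_model.proto; verify before relying on them.
NORMAL, UNKNOWN, CONTROL = 1, 2, 3   # ModelProto.SentencePiece.Type
UNIGRAM = 1                          # TrainerSpec.ModelType


def varint(n: int) -> bytes:
    """Encode an integer as a protobuf varint (negatives as 64-bit two's complement)."""
    if n < 0:
        n += 1 << 64
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def f_varint(field: int, value: int) -> bytes:
    return varint(field << 3 | 0) + varint(value)           # wire type 0


def f_bytes(field: int, data: bytes) -> bytes:
    return varint(field << 3 | 2) + varint(len(data)) + data  # wire type 2


def f_float(field: int, value: float) -> bytes:
    return varint(field << 3 | 5) + struct.pack("<f", value)  # wire type 5


def encode_model_proto(vocab, unk_id, control_ids=(), charsmap=b"",
                       add_dummy_prefix=True, escape_whitespaces=True) -> bytes:
    """Serialize (piece, score) pairs plus settings into ModelProto bytes (hypothetical helper)."""
    out = bytearray()
    for i, (piece, score) in enumerate(vocab):
        kind = UNKNOWN if i == unk_id else CONTROL if i in control_ids else NORMAL
        entry = f_bytes(1, piece.encode("utf-8")) + f_float(2, score) + f_varint(3, kind)
        out += f_bytes(1, entry)                        # ModelProto.pieces
    trainer = f_varint(3, UNIGRAM) + f_varint(40, unk_id)
    out += f_bytes(2, trainer)                          # ModelProto.trainer_spec
    norm = (f_bytes(2, charsmap)                        # precompiled_charsmap
            + f_varint(3, int(add_dummy_prefix))
            + f_varint(5, int(escape_whitespaces)))
    out += f_bytes(3, norm)                             # ModelProto.normalizer_spec
    return bytes(out)
```

The bos/eos/pad ids are omitted here for brevity. Every constant in the sketch is a piece of the schema that the caller has to re-derive and keep in sync, which is the fragility described above.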

Why it matters

JSON-only Unigram tokenizers are common (multilingual static-embedding models, several HF
encoder models). Without a JSON/in-memory factory, every consumer must either ship a converted
.model or reimplement the protobuf synthesis above.

Repro context

  • Model: a potion-multilingual-128M-style static embedding model — tokenizer.json only,
    model.type == "Unigram", ~500k-entry custom vocab, [PAD]=0 / [UNK]=1, with a
    Sequence → Precompiled normalizer (precompiled_charsmap present) and a Metaspace
    pre-tokenizer. No sentencepiece.model.
  • Microsoft.ML.Tokenizers main (commit 901da3e).

Metadata

Labels

enhancement (New feature or request), untriaged (New issue has not been triaged)
