From f1f5faad9470715da7e1407c48a2f8f328b6e861 Mon Sep 17 00:00:00 2001
From: Madhan Kumar Reddy <259025340+MushiSenpai@users.noreply.github.com>
Date: Fri, 26 Jun 2026 23:22:37 +0800
Subject: [PATCH] =?UTF-8?q?docs:=20add=20prepare=5Ffor=5Fmodel()=20v4?=
 =?UTF-8?q?=E2=86=92v5=20migration=20example?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-Authored-By: Claude Opus 4.8
---
 MIGRATION_GUIDE_V5.md | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/MIGRATION_GUIDE_V5.md b/MIGRATION_GUIDE_V5.md
index 88927f26efe6..67ed322f6aaa 100644
--- a/MIGRATION_GUIDE_V5.md
+++ b/MIGRATION_GUIDE_V5.md
@@ -290,7 +290,7 @@ tokenizer.extra_special_tokens # Additional tokens
 
 **Deprecated Methods:**
 - `sanitize_special_tokens()`: Already deprecated in v4, removed in v5.
-- `prepare_seq2seq_batch()`: Deprecated; use `__call__()` with `text_target` parameter instead.
+- `_seq2seq_batch()`: Deprecated; use `__call__()` with `text_target` parameter instead.
 
 ```python
 # v4
@@ -306,6 +306,25 @@ model_inputs["labels"] = model_inputs.pop("input_ids_target")
 **Removed Methods:**
 - `create_token_type_ids_from_sequences()`: Removed from base class. Subclasses that need custom token type ID creation should implement this method directly.
 - `prepare_for_model()`, `build_inputs_with_special_tokens()`, `truncate_sequences()`: Moved from `tokenization_utils_base.py` to `tokenization_python.py` for `PythonBackend` tokenizers. `TokenizersBackend` provides model-ready input via `tokenize()` and `encode()`, so these methods are no longer needed in the base class.
+
+```python
+# v4 — manually build model-ready inputs from pre-tokenized ids
+inputs = tokenizer.prepare_for_model(
+    tokenizer.convert_tokens_to_ids(tokenizer.tokenize(query)),
+    tokenizer.convert_tokens_to_ids(tokenizer.tokenize(passage)),
+    add_special_tokens=True, truncation=True, max_length=512,
+    padding="max_length", return_tensors="pt",
+)
+
+# v5 — call the tokenizer directly; __call__ / encode() return a model-ready BatchEncoding
+inputs = tokenizer(
+    query, passage,
+    truncation=True, max_length=512,
+    padding="max_length", return_tensors="pt",
+)
+```
+`build_inputs_with_special_tokens()` and `truncate_sequences()` follow the same pattern — prefer `tokenizer(...)` / `encode()`. If you only have token ids (not the original text) and must combine a pair, these methods remain available on `PythonBackend` tokenizers in `tokenization_python.py`.
+
 - `_switch_to_input_mode()`, `_switch_to_target_mode()`, `as_target_tokenizer()`: Removed from base class. Use `__call__()` with `text_target` parameter instead.
 
 ```python