From f1f5faad9470715da7e1407c48a2f8f328b6e861 Mon Sep 17 00:00:00 2001
From: Madhan Kumar Reddy <259025340+MushiSenpai@users.noreply.github.com>
Date: Fri, 26 Jun 2026 23:22:37 +0800
Subject: [PATCH] =?UTF-8?q?docs:=20add=20prepare=5Ffor=5Fmodel()=20v4?=
 =?UTF-8?q?=E2=86=92v5=20migration=20example?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 MIGRATION_GUIDE_V5.md | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/MIGRATION_GUIDE_V5.md b/MIGRATION_GUIDE_V5.md
index 88927f26efe6..67ed322f6aaa 100644
--- a/MIGRATION_GUIDE_V5.md
+++ b/MIGRATION_GUIDE_V5.md
@@ -290,7 +290,7 @@ tokenizer.extra_special_tokens  # Additional tokens
 
 **Deprecated Methods:**
 - `sanitize_special_tokens()`: Already deprecated in v4, removed in v5.
-- `prepare_seq2seq_batch()`: Deprecated; use `__call__()` with `text_target` parameter instead.
+- `prepare_seq2seq_batch()`: Deprecated; use `__call__()` with `text_target` parameter instead.
 
 ```python
 # v4
@@ -306,6 +306,25 @@ model_inputs["labels"] = model_inputs.pop("input_ids_target")
 **Removed Methods:**
 - `create_token_type_ids_from_sequences()`: Removed from base class. Subclasses that need custom token type ID creation should implement this method directly.
 - `prepare_for_model()`, `build_inputs_with_special_tokens()`, `truncate_sequences()`: Moved from `tokenization_utils_base.py` to `tokenization_python.py` for `PythonBackend` tokenizers. `TokenizersBackend` provides model-ready input via `tokenize()` and `encode()`, so these methods are no longer needed in the base class.
+
+```python
+# v4: manually build model-ready inputs from pre-tokenized ids
+inputs = tokenizer.prepare_for_model(
+    tokenizer.convert_tokens_to_ids(tokenizer.tokenize(query)),
+    tokenizer.convert_tokens_to_ids(tokenizer.tokenize(passage)),
+    add_special_tokens=True, truncation=True, max_length=512,
+    padding="max_length", return_tensors="pt",
+)
+
+# v5: call the tokenizer directly; __call__ returns a model-ready BatchEncoding
+inputs = tokenizer(
+    query, passage,
+    truncation=True, max_length=512,
+    padding="max_length", return_tensors="pt",
+)
+```
+`build_inputs_with_special_tokens()` and `truncate_sequences()` follow the same pattern: prefer `tokenizer(...)` or `encode()`, which add special tokens and truncate in a single call. If you only have token ids (not the original text) and must combine a pair, these methods remain available on `PythonBackend` tokenizers in `tokenization_python.py`.
+
 - `_switch_to_input_mode()`, `_switch_to_target_mode()`, `as_target_tokenizer()`: Removed from base class. Use `__call__()` with `text_target` parameter instead.
 
 ```python