Commit 818cb8c

floccinauc, VADIM RATNER (VADIMRA@il.ibm.com) authored
Additional explanation for truncated tokenization warning (#130)
Added an optional text to be printed in the truncation warning. It can be used to describe the caller of the tokenizer op.

Co-authored-by: VADIM RATNER [email protected] <[email protected]>
1 parent 3a6adcb commit 818cb8c

1 file changed: +2 -1 lines changed

fusedrug/data/tokenizer/ops/modular_tokenizer_ops.py

Lines changed: 2 additions & 1 deletion
@@ -193,6 +193,7 @@ def __call__(
         on_unknown: Optional[str] = "warn",
         verbose: Optional[int] = 1,
         validate_ends_with_eos: Optional[bool] = None,
+        additional_caller_info_text: Optional[str] = "",
     ) -> NDict:
         """_summary_

@@ -297,7 +298,7 @@ def __call__(
                 len(encoded.overflowing) > 0
             ):  # note, encoded.overflowing may have multiple items, and each item can contain multiple items
                 print(
-                    f"Warning: FastModularTokenizer (pid={os.getpid()}) had to truncate sequence: [{overflow_info}] \
+                    f"Warning: FastModularTokenizer (pid={os.getpid()}, {additional_caller_info_text}) had to truncate sequence: [{overflow_info}] \
                     for tokenizer: {self._tokenizer_path} for sample_id {get_sample_id(sample_dict)}"
                 )
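
To see how the new argument changes the printed warning, here is a minimal sketch. It assumes a hypothetical stand-alone helper, `format_truncation_warning`, which is not part of fusedrug and only mirrors the wording of the f-string in the diff. The overflow text, tokenizer path, and sample id are placeholder values. The sketch uses implicit string concatenation, so its whitespace may differ from the real message, where the backslash continuation keeps the next line's indentation inside the string.

```python
import os


def format_truncation_warning(
    overflow_info: str,
    tokenizer_path: str,
    sample_id: str,
    additional_caller_info_text: str = "",
) -> str:
    # Hypothetical helper (assumption): reproduces only the message wording
    # from the diff, not the tokenizer logic around it.
    return (
        f"Warning: FastModularTokenizer (pid={os.getpid()}, {additional_caller_info_text}) "
        f"had to truncate sequence: [{overflow_info}] "
        f"for tokenizer: {tokenizer_path} for sample_id {sample_id}"
    )


# Caller supplies a short description of where the tokenizer op was invoked.
with_caller = format_truncation_warning(
    overflow_info="placeholder overflow",
    tokenizer_path="placeholder/tokenizer.json",
    sample_id="sample_0",
    additional_caller_info_text="called from the training data pipeline",
)
print(with_caller)

# Default: the argument is an empty string, as in the new signature.
default_msg = format_truncation_warning(
    overflow_info="placeholder overflow",
    tokenizer_path="placeholder/tokenizer.json",
    sample_id="sample_0",
)
print(default_msg)
```

In this sketch the default empty string still leaves the separator in place, so the parenthesized part reads `(pid=<pid>, )`. A caller that wants a cleaner message can pass a short, non-empty description.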
