Bug: Inconsistent error message in tokenizer validation #421

@abhinavansh18

Description
There is a minor inconsistency between the validation logic and the error message for custom token IDs in the '_add_custom_tokens' method.

Location

  • File: 'gemma/gm/text/_tokenizer.py'
  • Method: '_add_custom_tokens'

The Problem
The code correctly validates that the custom token ID 'i' is within the range [0, 98].

However, when this condition is not met, the ValueError that is raised contains an incorrect message:

raise ValueError(
    f'Custom token id {i} for {token!r} is not in [1, 98].'
)
The validation logic uses zero-based indexing, but the error string reflects one-based counting. The lower bound in the message should be 0, not 1.
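
A minimal sketch of what a consistent check and message could look like, assuming the validation is a plain inclusive range comparison. The helper name `check_custom_token_id` and the constant `_MAX_CUSTOM_TOKEN_ID` are hypothetical, used here only for illustration; they are not taken from '_tokenizer.py', and the actual fix may be as small as changing the string.

```python
# Hypothetical constant: the upper bound of the valid range quoted in this issue.
_MAX_CUSTOM_TOKEN_ID = 98


def check_custom_token_id(i: int, token: str) -> None:
    """Raises ValueError if `i` is outside the inclusive range [0, 98]."""
    if not 0 <= i <= _MAX_CUSTOM_TOKEN_ID:
        # The message is built from the same bounds as the check above,
        # so the two cannot drift apart.
        raise ValueError(
            f'Custom token id {i} for {token!r} is not in '
            f'[0, {_MAX_CUSTOM_TOKEN_ID}].'
        )
```

Deriving the bounds in the message from the same values used in the comparison is one way to keep the check and the error text in sync if the range changes later.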

I have a fix ready and can open a pull request to resolve this.
