Skip to content

Conversation

@jquesnelle
Copy link

This fixes #345 by checking for all three combinations of bos/eos existence and using an appropriate template for each case

local_working_dir: DataFolderLike | None = None,
save_filename: str = None, # if defined, the final output filename will be this
tokenizer_name_or_path: str = "gpt2", # tokenizer to use, from HF or a local
bos_token: str = "<|startoftext|>", # whether to add the BOS token nefore each document
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
bos_token: str = "<|startoftext|>", # whether to add the BOS token nefore each document
bos_token: str | None = None, # whether to add the BOS token nefore each document

should disable it by default

Comment on lines +54 to +65
elif self.bos_token is not None or self.eos_token is not None:
special_tokens = []
if self.bos_token:
special_tokens.append(("<BOS>", self.tokenizer.token_to_id(self.bos_token)))
if self.eos_token:
special_tokens.append(("<EOS>", self.tokenizer.token_to_id(self.eos_token)))
if self.bos_token and not self.eos_token:
single = "<BOS> $A"
elif self.eos_token and not self.bos_token:
single = "$A <EOS>"
else:
single = "<BOS> $A <EOS>"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
elif self.bos_token is not None or self.eos_token is not None:
special_tokens = []
if self.bos_token:
special_tokens.append(("<BOS>", self.tokenizer.token_to_id(self.bos_token)))
if self.eos_token:
special_tokens.append(("<EOS>", self.tokenizer.token_to_id(self.eos_token)))
if self.bos_token and not self.eos_token:
single = "<BOS> $A"
elif self.eos_token and not self.bos_token:
single = "$A <EOS>"
else:
single = "<BOS> $A <EOS>"
elif self.bos_token is not None or self.eos_token is not None:
special_tokens = []
single_elems = []
if self.bos_token:
special_tokens.append(("<BOS>", self.tokenizer.token_to_id(self.bos_token)))
single_elems.append("<BOS>")
single_elems.append("$A")
if self.eos_token:
special_tokens.append(("<EOS>", self.tokenizer.token_to_id(self.eos_token)))
single_elems.append("<EOS>")

self._tokenizer.post_processor = TemplateProcessing(
single="$A <EOS>",
special_tokens=[("<EOS>", self.tokenizer.token_to_id(self.eos_token))],
single=single,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
single=single,
single=" ".join(single_elems),

@guipenedo
Copy link
Collaborator

hi, can you please make this pr editable by maintainers?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BOS tokens not properly added in some circumstances

2 participants