
Was the BoS/EoS token used during pretraining? #8

Open
grantdelozier opened this issue May 30, 2023 · 1 comment

Comments

@grantdelozier

The README suggests using the GPT-NeoX-20B tokenizer. That tokenizer has its BoS and EoS tokens mapped to token id 0.

However, when I look at the model implementation in PaLM-rlhf-pytorch, it appears that token id 0 is also used as a padding/mask value.

Were any special tokens used when the model was pretrained? Was token id 0 used as a BoS token, a pad token, or not used at all?

I would like to experiment with fine-tuning this model on document classification tasks because of its ability to accept very long sequences. In other LMs, special tokens like BoS have been helpful for certain document/sequence-level fine-tuning tasks. I appreciate your work on this project!
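
For anyone who wants to check this directly, here is a minimal sketch (assuming the Hugging Face `transformers` tokenizer for `EleutherAI/gpt-neox-20b`; the exact values depend on the tokenizer version) that prints which special tokens are defined and which ids they map to:

```python
# Sketch: inspect the special tokens of the GPT-NeoX-20B tokenizer
# (requires the `transformers` package).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

for name in ("bos_token", "eos_token", "pad_token"):
    token = getattr(tokenizer, name)             # None if not defined
    token_id = getattr(tokenizer, f"{name}_id")  # id, or None if not defined
    print(f"{name}: {token!r} -> id {token_id}")
```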

@conceptofmind
Owner

Hi @grantdelozier,

EOS was used during training. The EOS and PAD tokens sharing the same id should not be an issue, since you normally just append EOS tokens to the end of an example when padding it anyway.
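
For illustration, a minimal sketch of that convention (the `pad_batch` helper is hypothetical and not part of this repo; it assumes EOS and PAD share token id 0, as discussed above):

```python
import torch

EOS_ID = 0  # assumption: EOS (and therefore PAD) map to token id 0

def pad_batch(sequences, pad_id=EOS_ID):
    """Right-pad variable-length lists of token ids with the EOS id."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    return torch.tensor(padded, dtype=torch.long)

# Example: two sequences of different lengths padded to the same width
batch = pad_batch([[12, 345, 6789], [42]])
# tensor([[  12,  345, 6789],
#         [  42,     0,     0]])
```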

I did not add a BOS token, although I could try that with something like the LLaMA tokenizer in another run.

The 2.1B model will hopefully be out tomorrow.

Best,

Enrico
