
Was the BoS/EoS token used during pretraining? #8

Open
grantdelozier opened this issue May 30, 2023 · 1 comment

Comments

@grantdelozier

The README suggests using the GPT-NeoX-20B tokenizer. That tokenizer has its BoS and EoS tokens mapped to token id 0.

However, when I look at the model implementation in PaLM-rlhf-pytorch, it appears that token id 0 is also used as a padding/mask value.

Were any special tokens used when the model was pretrained? Was token id 0 used as a BoS token, a pad token, or not used at all?

I would like to experiment with fine-tuning this model on document classification tasks because of its ability to accept very long sequences. In other LMs, special tokens like BoS have been helpful for certain document/sequence-level fine-tuning tasks. I appreciate your work on this project!
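
For anyone who wants to check this directly, here is a minimal sketch (assuming the Hugging Face `transformers` tokenizer for `EleutherAI/gpt-neox-20b`; the exact values depend on the tokenizer version) that prints which special tokens are defined and which ids they map to:

```python
# Sketch: inspect the special tokens of the GPT-NeoX-20B tokenizer
# (requires the `transformers` package).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

for name in ("bos_token", "eos_token", "pad_token"):
    token = getattr(tokenizer, name)             # None if not defined
    token_id = getattr(tokenizer, f"{name}_id")  # id, or None if not defined
    print(f"{name}: {token!r} -> id {token_id}")
```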

@conceptofmind
Owner

Hi @grantdelozier,

EOS was used during training. The EOS and PAD tokens sharing the same id should not be an issue, since you normally just append EOS tokens to the end of an example when padding it anyway.
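
For illustration, a minimal sketch of that convention (the `pad_batch` helper is hypothetical and not part of this repo; it assumes EOS and PAD share token id 0, as discussed above):

```python
import torch

EOS_ID = 0  # assumption: EOS (and therefore PAD) map to token id 0

def pad_batch(sequences, pad_id=EOS_ID):
    """Right-pad variable-length lists of token ids with the EOS id."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    return torch.tensor(padded, dtype=torch.long)

# Example: two sequences of different lengths padded to the same width
batch = pad_batch([[12, 345, 6789], [42]])
# tensor([[  12,  345, 6789],
#         [  42,     0,     0]])
```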

I did not add a BOS token, although I could try that with something like the LLaMA tokenizer in another run.

The 2.1B model will hopefully be out tomorrow.

Best,

Enrico
