Bert-ita-xxl #42

Open
IreneSucameli opened this issue Apr 6, 2022 · 4 comments
Comments

@IreneSucameli

Hi, could you please specify in what proportions the Wikipedia, OPUS, and OSCAR corpora were used for training ita-bert-xxl? Thanks

@stefan-it
Collaborator

stefan-it commented Apr 11, 2022

Hi @IreneSucameli ,

I just looked at the dataset sizes:

The Wikipedia dump was 2.7GB, OPUS was 10.3GB (the 13GB combined corpus minus the 2.7GB Wikipedia dump), and OSCAR was 68GB.

We did not apply any upsampling/downsampling strategy (as is used, e.g., in some GPT-2-based papers).
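Since the original question asked for percentages, they can be derived directly from the corpus sizes quoted above (a quick sketch; only the GB figures from this thread are used):

```python
# Pre-training corpus sizes in GB, as quoted above for bert-ita-xxl.
sizes = {"Wikipedia": 2.7, "OPUS": 10.3, "OSCAR": 68.0}
total = sum(sizes.values())  # 81.0 GB in total

# With no up-/downsampling, the mixing ratio is just size / total.
for name, gb in sizes.items():
    print(f"{name}: {gb} GB ({gb / total:.1%})")
```

This gives roughly 3.3% Wikipedia, 12.7% OPUS, and 84.0% OSCAR by raw data size.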

@IreneSucameli
Author

Hi @stefan-it ,

thank you for the information provided! And what about the vocabulary size? Could you kindly tell me how many GB the vocabulary is? Thanks

@stefan-it
Collaborator

stefan-it commented Apr 11, 2022

The "normal" and XXL models use the same wordpiece-based vocab of 31,102 tokens.

The vocab was trained on the pre-training corpus of the "normal" model, which has a size of 13GB (OPUS + Wikipedia). Please note that SentencePiece was used to train an SPM model, and we then converted the SPM vocab into a wordpiece-based vocab. This was necessary because in 2019 no library such as Hugging Face Tokenizers existed yet. The SPM vocab size was 31,000. Then 100 "unused" tokens were added (as was done for the original BERT vocab).

The SPM vocab had three special symbols (['<unk>', '<s>', '</s>']), so the effective vocab size is 31,000 - 3 = 30,997. Adding the 100 unused tokens gives 30,997 + 100 = 31,097. Then we need to add the following special tokens for BERT: ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]']. The total/final vocab size is therefore 31,097 + 5 = 31,102.
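The arithmetic above can be spelled out as a small sketch (the token names are those quoted in this thread; the unused-slot naming follows the original BERT convention):

```python
# Reconstructing the vocab-size arithmetic for the Italian BERT vocab.
spm_vocab_size = 31_000                        # SentencePiece (SPM) vocab size
spm_specials = ["<unk>", "<s>", "</s>"]        # dropped when converting to wordpiece
num_unused = 100                               # "[unused0]" .. "[unused99]", as in BERT
bert_specials = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

effective = spm_vocab_size - len(spm_specials)          # 30,997
with_unused = effective + num_unused                    # 31,097
final_size = with_unused + len(bert_specials)           # 31,102

print(final_size)  # 31102
```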

@IreneSucameli
Author

Ok, I see. Thank you very much, you have been very helpful!
