Bert-ita-xxl #42

Open
IreneSucameli opened this issue Apr 6, 2022 · 4 comments
Comments

@IreneSucameli

Hi, could you please specify in what proportions the Wikipedia, OPUS, and OSCAR corpora were used for training ita-bert-xxl? Thanks

@stefan-it
Collaborator

stefan-it commented Apr 11, 2022

Hi @IreneSucameli ,

I just looked at the dataset sizes:

The Wikipedia dump was 2.7GB, OPUS was 10.3GB (the 13GB combined corpus minus the 2.7GB Wikipedia dump), and OSCAR was 68GB.

We did not apply any upsampling/downsampling strategy (as is used, e.g., in some GPT-2-based papers).
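Since the original question asked for percentages, they can be derived directly from the corpus sizes quoted above (a quick sketch; only the GB figures from this thread are used):

```python
# Pre-training corpus sizes in GB, as quoted above for bert-ita-xxl.
sizes = {"Wikipedia": 2.7, "OPUS": 10.3, "OSCAR": 68.0}
total = sum(sizes.values())  # 81.0 GB in total

# With no up-/downsampling, the mixing ratio is just size / total.
for name, gb in sizes.items():
    print(f"{name}: {gb} GB ({gb / total:.1%})")
```

This gives roughly 3.3% Wikipedia, 12.7% OPUS, and 84.0% OSCAR by raw data size.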

@IreneSucameli
Author

Hi @stefan-it ,

thank you for the information provided! And what about the vocabulary size? Could you kindly tell me how many GB the vocabulary is? Thanks

@stefan-it
Collaborator

stefan-it commented Apr 11, 2022

The "normal" and XXL models use the same wordpiece-based vocab of 31,102 tokens.

The vocab was trained on the pre-training corpus of the "normal" model, which has a size of 13GB (OPUS + Wikipedia). Please note that SentencePiece was used to train an SPM model, and we then converted the SPM vocab into a wordpiece-based vocab. This was necessary because in 2019 no library such as Hugging Face Tokenizers existed yet. The SPM vocab size was 31,000. Then 100 "unused" tokens were added (as was done for the original BERT vocab).

The SPM vocab had three special symbols (['<unk>', '<s>', '</s>']), so the effective vocab size is 31,000 - 3 = 30,997. Adding the 100 unused tokens gives 30,997 + 100 = 31,097. Then we need to add the following special tokens for BERT: ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]']. The total/final vocab size is therefore 31,097 + 5 = 31,102.
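The arithmetic above can be spelled out as a small sketch (the token names are those quoted in this thread; the unused-slot naming follows the original BERT convention):

```python
# Reconstructing the vocab-size arithmetic for the Italian BERT vocab.
spm_vocab_size = 31_000                        # SentencePiece (SPM) vocab size
spm_specials = ["<unk>", "<s>", "</s>"]        # dropped when converting to wordpiece
num_unused = 100                               # "[unused0]" .. "[unused99]", as in BERT
bert_specials = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

effective = spm_vocab_size - len(spm_specials)          # 30,997
with_unused = effective + num_unused                    # 31,097
final_size = with_unused + len(bert_specials)           # 31,102

print(final_size)  # 31102
```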

@IreneSucameli
Author

Ok, I see. Thank you very much, you have been very helpful!
