Bert-ita-xxl #42
Hi, could you please specify in what proportions the Wikipedia, OPUS, and OSCAR corpora were used for training ita-bert-xxl? Thanks

Comments
Hi @IreneSucameli, I just looked at the dataset sizes: the Wikipedia dump was 2.7GB, OPUS 10.3GB (the 13GB corpus minus the 2.7GB Wikipedia portion) and OSCAR 68GB. We did not perform any upsampling/downsampling strategy (as is used e.g. in some GPT-2 based papers).
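Since no resampling was applied, the training mix should simply follow the raw corpus sizes. A quick back-of-the-envelope calculation based only on the numbers above (the exact token-level proportions may differ slightly):

```python
# Rough corpus mix for the XXL model, assuming the training mix simply
# follows raw corpus size (no up-/downsampling, as stated above).
sizes_gb = {"Wikipedia": 2.7, "OPUS": 10.3, "OSCAR": 68.0}

total = sum(sizes_gb.values())
for name, gb in sizes_gb.items():
    print(f"{name}: {gb:.1f} GB ({gb / total:.1%})")

# Wikipedia: 2.7 GB (3.3%)
# OPUS: 10.3 GB (12.7%)
# OSCAR: 68.0 GB (84.0%)
```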
Hi @stefan-it, thank you for the information provided! As for the vocabulary, could you kindly tell me how many GB of data it is based on? Thanks
The "normal" and XXL model use the same 31.102 wordpiece-based vocab. The pre-training corpus for the "normal" model is used, that has a size of 13GB (OPUS + Wikipedia). Please note, that sentencepiece was used for training a SPM model, then we converted the SPM vocab into a wordpiece-based vocab. This was necessary, because in 2019 no library such as Hugging Face Tokenizers did exist. The SPM vocab size was 31.000. Then 100 "unused" tokens were added (as it was done in BERT vocab). SPM vocab had three special symbols: |
Ok, I see. Thank you very much, you have been very helpful!