Hi,
did you sample each dataset (Wikipedia, Common Crawl, Subtitles etc.) equally during German-BERT training?
OpenAI uses unequal sampling, which may lead to better results, as stated in the GPT-3 paper:
Note that during training, datasets are not sampled in proportion to their size, but rather datasets we view as higher-quality are sampled more frequently, such that CommonCrawl and Books2 datasets are sampled less than once during training, but the other datasets are sampled 2-3 times. This essentially accepts a small amount of overfitting in exchange for higher quality training data.
If yes, which parameters did you use?
I didn't use a specific sampling method (so all parts are sampled equally). But I think this could be interesting for future work, e.g. to see the effects on downstream tasks :)
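For reference, here is a minimal sketch of what such a weighted corpus mixture could look like. The corpus names, document lists and weights are purely illustrative and are not the values used for German-BERT or GPT-3:

```python
import random

# Hypothetical mixture weights: "higher-quality" corpora get a larger
# sampling probability than their share of the total document count.
# Names and numbers below are illustrative only.
corpora = {
    "wikipedia":    {"docs": [f"wiki_doc_{i}" for i in range(5)],  "weight": 3.0},
    "subtitles":    {"docs": [f"sub_doc_{i}" for i in range(5)],   "weight": 2.0},
    "common_crawl": {"docs": [f"cc_doc_{i}" for i in range(20)],   "weight": 0.5},
}

def sample_batch(corpora, batch_size, rng=random):
    """Draw a batch by first picking a corpus according to its weight,
    then picking a document uniformly from that corpus."""
    names = list(corpora)
    weights = [corpora[name]["weight"] for name in names]
    batch = []
    for _ in range(batch_size):
        name = rng.choices(names, weights=weights, k=1)[0]
        batch.append(rng.choice(corpora[name]["docs"]))
    return batch

print(sample_batch(corpora, batch_size=8))
```

With weights like these, the smaller high-quality corpora are revisited several times per epoch while the large web-crawl corpus is seen less than once, which matches the trade-off described in the GPT-3 quote above.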