German BERT Dataset sampling #16

Open
Phil1108 opened this issue Jun 28, 2020 · 2 comments
@Phil1108

Hi,
did you sample each dataset (Wikipedia, Common Crawl, Subtitles, etc.) equally during German BERT training?
OpenAI uses unequal sampling, which may lead to better results, as stated in the GPT-3 paper:

Note that during training, datasets are not sampled in proportion to their size, but rather datasets we view as higher-quality are sampled more frequently, such that CommonCrawl and Books2 datasets are sampled less than once during training, but the other datasets are sampled 2-3 times. This essentially accepts a small amount of overfitting in exchange for higher quality training data.

If yes, which parameters did you use?

[Attached image: GPT-3-Table]
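
To make the question more concrete, here is a minimal sketch of what I mean by unequal sampling (plain Python, with made-up weights and toy documents, not the actual GPT-3 setup):

```python
import random

# Rough, made-up per-dataset weights (not the real GPT-3 values):
# higher-quality corpora get larger weights so they are drawn more often,
# while a very large corpus like Common Crawl is down-weighted.
datasets = {
    "wikipedia":    {"docs": ["wiki doc 1", "wiki doc 2"], "weight": 3.0},
    "subtitles":    {"docs": ["subtitle doc 1"],           "weight": 2.0},
    "common_crawl": {"docs": ["cc doc 1", "cc doc 2"],     "weight": 0.5},
}

def sample_document(datasets):
    """Pick a dataset according to its weight, then a uniform random document from it."""
    names = list(datasets)
    weights = [datasets[name]["weight"] for name in names]
    chosen = random.choices(names, weights=weights, k=1)[0]
    return random.choice(datasets[chosen]["docs"])

# Repeatedly drawing training examples this way oversamples the high-weight corpora.
print(sample_document(datasets))
```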

@stefan-it
Collaborator

Hi @Phil1108 ,

I didn't use a specific sampling method (all parts were sampled equally). But I think this could be interesting for future work, e.g. to see the effects on downstream tasks :)

@Phil1108
Author

@stefan-it Okay, thanks. Then I'll give it a try and see how it performs in comparison to your models.
