Description
Hi,
Thank you for releasing your pretrained model and saving us training time. I am currently exploring possible applications, but I ran into a problem that will probably affect many other researchers trying to use your model.
AFAIK, you have not released the WebText corpus (although I know this is currently being discussed in issue #24). This is fine by me, except for one aspect: it makes it impossible for me to know whether my test data is somehow included in WebText. This, in turn, makes it impossible for me to tell whether any improvement I am getting is due to the quality of GPT or to the fact that the pretrained model has already seen my test data.
If you do not plan to release WebText in the very near future, I was thinking you could release the Bloom filters you describe in your technical paper (code + filled filters). This would allow us to measure the proportion of 8-grams in our test data that also appear in WebText.
Would this be possible?
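To make the request concrete, here is a rough sketch of the kind of check I have in mind. It is not your implementation: the `BloomFilter` class, its hashing scheme and sizes, and the lowercased whitespace tokenization in `ngrams` are all my own assumptions for illustration, and the real filters would presumably need the same tokenization and normalization you used when filling them.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter (hypothetical; not the filter used for WebText)."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all bit positions from two 64-bit
        # slices of a single SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def ngrams(text, n=8):
    # Assumed normalization: lowercase, split on whitespace.
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def overlap_ratio(test_text, bloom, n=8):
    """Fraction of the n-grams in test_text that the filter reports as seen."""
    grams = ngrams(test_text, n)
    if not grams:
        return 0.0
    return sum(g in bloom for g in grams) / len(grams)
```

With the filled filters in hand, I would call `overlap_ratio(my_test_set, webtext_filter)` on each test set. Since a Bloom filter can return false positives but never false negatives, the resulting number would be an upper bound on the true overlap, which is enough for my purpose.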
Thank you.