Request: Release Bloom Filters for WebText (or provide another method to check whether a given text is in WebText) #63

Hi,

Thank you for releasing your pretrained model and saving us training time. I am currently exploring possible applications, but I ran into a problem that might also annoy many researchers trying to use your model.

AFAIK, you have not released the WebText corpus (although I know this is currently being discussed in issue #24). This is fine by me, except for one aspect: it makes it impossible for me to know whether my test data is somehow included in WebText. That, in turn, makes it impossible for me to tell whether any improvement I am getting is due to the quality of GPT or to the fact that the pretrained model has already seen my test data.

If you do not plan to release WebText in the very near future, I was thinking you could release the Bloom filters you describe in your technical paper (code + filled filters). This would allow us to measure the proportion of 8-grams in our test data that also appear in WebText.
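
To make the request concrete, here is a minimal sketch of the kind of check I have in mind. It is not your implementation: the `BloomFilter` class, its bit-array size and hash count, the `ngrams` and `overlap_fraction` helpers, and the normalization (lower-cased alphanumeric words joined by single spaces) are all my own assumptions for illustration. The filter is filled here from a toy string, whereas the real one would be filled from WebText.

```python
import hashlib
import re


class BloomFilter:
    """Toy Bloom filter. Sizes are illustrative, not those used for WebText."""

    def __init__(self, num_bits=1 << 20, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Double hashing: derive all bit positions from two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd, never zero
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def ngrams(text, n=8):
    # Assumed normalization: lower-cased alphanumeric words, single-space joined.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def overlap_fraction(test_text, bloom, n=8):
    """Fraction of the n-grams in test_text that the filter reports as seen."""
    grams = ngrams(test_text, n)
    if not grams:
        return 0.0  # text shorter than n words has no n-grams to check
    return sum(gram in bloom for gram in grams) / len(grams)


# Stand-in for the training corpus (hypothetical; the real filter would hold WebText).
corpus = "the quick brown fox jumps over the lazy dog near the quiet river bank"
bloom = BloomFilter()
for gram in ngrams(corpus):
    bloom.add(gram)

# Both 8-grams of this sentence occur in the corpus, so the overlap is 1.0.
print(overlap_fraction("The quick brown fox jumps over the lazy dog!", bloom))
```

Because a Bloom filter can return false positives but never false negatives, a number computed this way is an upper bound on the true overlap, which is the safe direction for a contamination check.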

Would this be possible?
Thank you.

 
