
Request: Release Bloom Filters for WebText (or provide other method to check a given text is in WebText) #63

Open
@fabiencro


Hi,

Thank you for releasing your pretrained model and saving us training time. I am currently exploring possible applications, but I ran into a problem that might also annoy many researchers trying to use your model.

AFAIK, you have not released the WebText corpus (although I know this is currently being discussed in issue #24). This is fine by me, except for one aspect: it makes it impossible for me to know whether my test data is somehow included in WebText. This, in turn, makes it impossible for me to tell whether any improvement I am getting is due to the quality of GPT or to the fact that the pretrained model has already seen my test data.

If you do not plan to release WebText in the very near future, I was thinking you could release the Bloom filters you describe in your technical paper (code + filled filters). This would allow us to evaluate the proportion of 8-grams in our test data that are also in WebText.
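For illustration, here is a minimal sketch of the kind of check released filters would enable. Everything in it is an assumption rather than the method from the paper: the `BloomFilter` class, the SHA-256 double-hashing scheme, the lowercase whitespace tokenization in `ngrams`, and the filter size are all hypothetical stand-ins, and a real check would have to match whatever normalization and hashing the released filters actually use.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter backed by a bytearray (illustrative only)."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive all bit positions from two 64-bit
        # slices of a single SHA-256 digest (an assumed scheme).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def ngrams(text: str, n: int = 8) -> list:
    # Assumed normalization: lowercase, split on whitespace.
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def overlap_fraction(text: str, bloom: BloomFilter, n: int = 8) -> float:
    """Fraction of n-grams in `text` that the filter reports as present."""
    grams = ngrams(text, n)
    if not grams:
        return 0.0
    return sum(g in bloom for g in grams) / len(grams)


# Toy usage: fill a filter from a stand-in "training" text, then
# score a test document against it.
train_text = "the quick brown fox jumps over the lazy dog near the river bank"
bloom = BloomFilter(num_bits=1 << 20, num_hashes=5)
for gram in ngrams(train_text):
    bloom.add(gram)

print(overlap_fraction(train_text, bloom))
```

Because a Bloom filter can return false positives but never false negatives, a fraction computed this way is an upper bound on the true 8-gram overlap, which is the safe direction for a contamination check.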

Would this be possible?
Thank you.
