Idea for further filtering #43

Open

davidgilbertson opened this issue Mar 5, 2023 · 0 comments

davidgilbertson commented Mar 5, 2023

I've just run a quick filter to find non-English docs and found 5,052 such cases (of the total 8 million).

It's a fairly crude filter, but I haven't seen any false positives:

import re
import datasets

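# Load the full training split (about 8 million documents).
ds = datasets.load_dataset("openwebtext", split="train")
# Keep only docs that contain none of these common English words (case-insensitive substring match).
ds_filtered = ds.filter(lambda sample: not re.search("(?i)the|that|and|with|this", sample["text"]))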
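One thing to be aware of: the pattern above matches substrings, so a non-English doc containing a word like "anderen" still counts as English because of the embedded "and". Below is a rough sketch of a whole-word variant. The names `STOPWORDS` and `looks_non_english` are made up for illustration, and I have not run this over the dataset, so the 5,052 count would likely change and short English docs without any of these words could be flagged.

```python
import re

# Hypothetical variant: the same five words, but matched only as whole words.
STOPWORDS = re.compile(r"\b(?:the|that|and|with|this)\b", re.IGNORECASE)


def looks_non_english(text):
    """Return True when none of the common English words appear as whole words."""
    return STOPWORDS.search(text) is None


# Assumed usage with the dataset from the snippet above:
# ds_filtered = ds.filter(lambda sample: looks_non_english(sample["text"]))
```

This is stricter about what counts as an English word, so it should catch more non-English docs at the cost of possibly more false positives.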

Samples of the docs are things like this:

Printed with:

for doc in ds_filtered:
    print(doc["text"].replace("\n", " | ")[:400])
    print("\n")
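With a few thousand matches, it can be easier to eyeball just the first handful. Here is a small sketch of that, assuming each row is a dict with a `"text"` key as in the loop above. `preview` and `show_first` are hypothetical helper names, and `sample_docs` is stand-in data, not real dataset content.

```python
from itertools import islice


def preview(text, limit=400):
    """Collapse newlines to ' | ' and truncate, mirroring the print loop above."""
    return text.replace("\n", " | ")[:limit]


def show_first(docs, n=5):
    """Return previews for the first n rows of any iterable of {"text": ...} dicts."""
    return [preview(doc["text"]) for doc in islice(docs, n)]


# Stand-in rows for illustration; in practice pass ds_filtered instead.
sample_docs = [{"text": "Erste Zeile\nZweite Zeile"}, {"text": "x" * 1000}]
for line in show_first(sample_docs, n=2):
    print(line)
    print()
```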

Feel free to close this if you have no plans for future versions of the dataset; I just thought you might like to know.
