Idea for further filtering #43

Open
davidgilbertson opened this issue Mar 5, 2023 · 0 comments

I've just run a quick filter to find non-English docs and found 5,052 such cases (of the total 8 million).

It's a fairly crude filter, but I haven't seen any false positives:

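import re
import datasets

# Load the full training split (about 8 million documents).
ds = datasets.load_dataset("openwebtext", split="train")
# Keep only docs containing none of these common English words (case-insensitive substring match).
ds_filtered = ds.filter(lambda sample: not re.search("(?i)the|that|and|with|this", sample["text"]))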
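
One possible refinement, which is not what I ran above: the pattern matches substrings, so a non-English doc that happens to contain something like "Athen" or "Anderson" counts as English and is kept in the dataset. A minimal sketch with word boundaries could look like the following. The helper name `looks_non_english` is hypothetical, and I haven't run this variant over the full dataset, so the count would likely differ from 5,052.

```python
import re

# Hypothetical helper, not part of the filter that produced the 5,052 count.
# \b anchors each stopword to word boundaries, so "the" inside "Athen"
# no longer counts as evidence of English.
ENGLISH_STOPWORDS = re.compile(r"(?i)\b(?:the|that|and|with|this)\b")


def looks_non_english(text: str) -> bool:
    """Return True when none of the common English stopwords appear as whole words."""
    return ENGLISH_STOPWORDS.search(text) is None


# Usage with the dataset loaded above (assumes the same "text" column):
# ds_filtered = ds.filter(lambda sample: looks_non_english(sample["text"]))
```

The trade-off is that stricter matching flags more docs, including short English docs that simply lack these five words, so it would need a manual check for false positives.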

Samples of the filtered docs look like this:

[image: sample output]

Printed with:

for doc in ds_filtered:
    print(doc["text"].replace("\n", " | ")[:400])
    print("\n")

Feel free to close if you have no plans for future versions of the dataset, just thought you might like to know.
