I've just run a quick filter to find non-English docs and found 5,052 such cases (of the total 8 million).
It's a fairly crude filter, but I haven't seen any false positives:
    import re

    import datasets

    ds = datasets.load_dataset("openwebtext", split="train")
    # Keep only docs that contain none of these common English words (case-insensitive).
    ds_filtered = ds.filter(
        lambda sample: not re.search("(?i)the|that|and|with|this", sample["text"])
    )
Samples of the docs are things like this:
Printed with:
    for doc in ds_filtered:
        # Show the first 400 characters of each doc on a single line.
        print(doc["text"].replace("\n", " | ")[:400])
        print("\n")
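If anyone wants to poke at the heuristic without downloading the dataset, here is a minimal sketch of the same check on plain strings. The names `STOPWORDS` and `looks_non_english` and the sample sentences are my own, made up for illustration; only the five-word pattern comes from the filter above.

```python
import re

# Same five words as the filter above, compiled once, case-insensitive.
# STOPWORDS and looks_non_english are hypothetical names for this sketch.
STOPWORDS = re.compile(r"the|that|and|with|this", re.IGNORECASE)


def looks_non_english(text: str) -> bool:
    """Return True when none of the five words occurs anywhere in the text."""
    return STOPWORDS.search(text) is None


# Made-up sample sentences, not taken from the dataset.
samples = [
    "The quick brown fox jumps over the lazy dog.",
    "Der schnelle braune Fuchs springt über den faulen Hund.",
    "Le renard brun saute par-dessus le chien paresseux.",
]
for s in samples:
    print(looks_non_english(s), s)
```

Note that the pattern matches substrings, not whole words, so a non-English doc containing something like "Mathematik" (which contains "the") is not flagged. That probably makes the check err on the side of missing docs rather than flagging English ones.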
Feel free to close if you have no plans for future versions of the dataset, just thought you might like to know.