Bloom annotator implementation for GneissWeb data #981

Open
2 tasks done
shahrokhDaijavad opened this issue Jan 27, 2025 · 1 comment
Labels
enhancement New feature or request sprint-feb-7

Comments

@shahrokhDaijavad
Member

shahrokhDaijavad commented Jan 27, 2025

Search before asking

  • I searched the issues and found no similar issues.

Component

  • We would like to add a Bloom annotator transform that maps a non-empty input table to an output table with an added is_in_GneissWeb column. Each row in the table corresponds to a UUID and its associated document. The transform checks whether each document's UUID exists in the GneissWeb Bloom filter.

Feature

  • The Bloom Annotator transform assigns a label of 1 if the document's UUID is found in the GneissWeb Bloom filter; otherwise, it assigns 0. The resulting column shows which documents in FineWeb are also present in GneissWeb and which are not. The GneissWeb Bloom filter is just one use case; the Bloom Annotator transform can work with any Bloom filter.

  • Please refer to the README file submitted in the PR for examples.
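
To make the labeling rule above concrete, here is a minimal sketch in plain Python. It is not the implementation from the PR: the `TinyBloomFilter` class, the `annotate` function, and the `id` column name are hypothetical and exist only for illustration, and the toy filter does not read the actual GneissWeb Bloom filter file. Only the `is_in_GneissWeb` column name and the 1/0 labels come from the description above.

```python
import hashlib


class TinyBloomFilter:
    """Toy Bloom filter for illustration only (hypothetical, not the GneissWeb format)."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: str):
        # Derive num_hashes bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def annotate(rows, bloom, id_column="id", label_column="is_in_GneissWeb"):
    """Return a copy of each row with a 1/0 label for Bloom filter membership.

    `rows` is a list of dicts standing in for a table; `id_column` is an assumed name.
    """
    if not rows:
        raise ValueError("input table must be non-empty")
    return [{**row, label_column: 1 if row[id_column] in bloom else 0} for row in rows]


# Usage: build a filter from known UUIDs, then label an input table.
bloom = TinyBloomFilter()
bloom.add("uuid-a")
table = [{"id": "uuid-a", "text": "doc A"}, {"id": "uuid-b", "text": "doc B"}]
annotated = annotate(table, bloom)
```

Because a Bloom filter can return false positives but never false negatives, a label of 0 means the UUID is definitely not in the filter, while a label of 1 means it is present with high probability.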

Are you willing to submit a PR?

  • Yes, I am willing to submit a PR!
@ian-cho
Collaborator

ian-cho commented Jan 28, 2025

@shahrokhDaijavad Thank you for the help! I added details.
@touma-I Please let me know if it is fine. Many thanks.

2 participants