You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I searched the issues and found no similar issues.
Component
We would like to add Bloom annotator transform which maps a non-empty input table to an output table with an added is_in_GneissWeb column. Each row in the table corresponds to a UUID and its associated document. The Bloom annotator transform verifies whether the document's UUID exists in the GneissWeb Bloom filter.
Feature
The Bloom Annotator transform assigns a label of 1 if the document is present in the GneissWeb Bloom filter; otherwise, it assigns 0. This approach provides a clear understanding of which documents in FineWeb are also present in GneissWeb and which are not. The GneissWeb Bloom filter is just one use case; the Bloom Annotator transform can work with any Bloom filter.
Please refer to README file submitted in the PR for examples.
Are you willing to submit a PR?
Yes I am willing to submit a PR!
The text was updated successfully, but these errors were encountered:
Search before asking
Component
Feature
The Bloom Annotator transform assigns a label of 1 if the document is present in the GneissWeb Bloom filter; otherwise, it assigns 0. This approach provides a clear understanding of which documents in FineWeb are also present in GneissWeb and which are not. The GneissWeb Bloom filter is just one use case; the Bloom Annotator transform can work with any Bloom filter.
Please refer to README file submitted in the PR for examples.
Are you willing to submit a PR?
The text was updated successfully, but these errors were encountered: