-
Notifications
You must be signed in to change notification settings - Fork 91
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support the new minhash 25.02 api #445
base: main
Are you sure you want to change the base?
Support the new minhash 25.02 api #445
Conversation
Signed-off-by: Praateek <[email protected]>
# TODO remove this once 24.12.0 becomes the base version of cudf in nemo-curator | ||
MINHASH_PERMUTED_AVAILABLE = CURRENT_CUDF_VERSION >= parse_version("24.12.0") or ( | ||
CURRENT_CUDF_VERSION.is_prerelease | ||
and CURRENT_CUDF_VERSION.base_version >= "24.12.0" | ||
# TODO remove this once 25.02 becomes the base version of cudf in nemo-curator | ||
|
||
# minhash in < 24.12 used to have a minhash(txt) api which was deprecated in favor of | ||
# minhash(a, b) in 25.02 (in 24.12, minhash_permuted(a,b) was introduced) | ||
MINHASH_DEPRECATED_API = ( | ||
CURRENT_CUDF_VERSION.base_version < parse_version("24.12").base_version | ||
) | ||
MINHASH_PERMUTED_AVAILABLE = (CURRENT_CUDF_VERSION.major == 24) & ( | ||
CURRENT_CUDF_VERSION.minor == 12 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait I'm confused, why are we doing it for 24.10 and 24.12 instead of for 24.12 and 25.02?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the code block here helps
if MINHASH_DEPRECATED_API:
# this is 24.10, the variable is named minhash deprecated api,
# because now cudf only needs four args, rather than two args
return ser.str.minhash(seeds=seeds, width=char_ngram)
else:
if MINHASH_PERMUTED_AVAILABLE:
# this is 24.12 because in this version the four arg function is called minhash_permuted
return ser.str.minhash_permuted(
a=seeds_a, b=seeds_b, seed=seeds[0][0], width=char_ngram
)
else:
# this is the 25.02 case where it's neither the deprecated case neither permuted is available
return ser.str.minhash(a=seeds_a, b=seeds_b, seed=seeds[0][0], width=char_ngram)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok I see, thanks! So are we waiting until the next NeMo Curator release (which will have 24.12 as the stable RAPIDS version) before removing MINHASH_DEPRECATED_API
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that's correct! And 25.02+ we wouldn't need any of these variables
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks! I think you just need to update this branch, then feel free to merge.
Description
Solves #426 and piggy backs on https://github.com/NVIDIA/NeMo-Curator/pull/442/files
Checklist