[Data] Adding streaming capability for ray.data.Dataset.unique
#51207
Labels
data
Ray Data-related issues
enhancement
Request for new feature and/or capability
triage
Needs triage (eg: priority, bug/not-bug, and owning component)
Description
The current doc indicates that ray.data.Dataset.unique is a blocking operation: "This operation requires all inputs to be materialized in object store for it to execute." But I presume that, conceptually, a streaming implementation is possible: keep a record of "seen" values and drop any entry whose value is already in the "seen" collection.
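To make the idea concrete, here is a minimal sketch in plain Python of the "seen" collection approach. It is not how Ray Data implements unique, and the name streaming_unique is hypothetical. It assumes the values are hashable and that the input arrives as an iterable of batches (lists), so each batch can be filtered and emitted without waiting for the rest.

```python
from typing import Hashable, Iterable, Iterator, List, Set


def streaming_unique(batches: Iterable[List[Hashable]]) -> Iterator[List[Hashable]]:
    """Yield each incoming batch with previously seen values dropped.

    Hypothetical helper for illustration only; not part of the Ray Data API.
    """
    seen: Set[Hashable] = set()  # every distinct value observed so far
    for batch in batches:
        fresh = []
        for value in batch:
            if value not in seen:
                seen.add(value)
                fresh.append(value)
        if fresh:  # skip batches that contained only duplicates
            yield fresh


# Batches are consumed one at a time; nothing is materialized up front.
batches = [[1, 2, 2], [2, 3], [1, 4]]
result = list(streaming_unique(batches))
print(result)
```

Note that in this sketch the "seen" set still grows with the number of distinct values, and a distributed version would presumably need to share that state or partition the data by value, so the memory saving applies to the input rows rather than to the set of unique values.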
Use case
A streaming unique function would be very useful when the amount of data is too large to be materialized.