
[Data] Adding streaming capability for ray.data.Dataset.unique #51207

Open
marcmk6 opened this issue Mar 10, 2025 · 3 comments
Labels
data (Ray Data-related issues), enhancement (Request for new feature and/or capability), triage (Needs triage (eg: priority, bug/not-bug, and owning component))

Comments

@marcmk6

marcmk6 commented Mar 10, 2025

Description

The current doc indicates that ray.data.Dataset.unique is a blocking operation: "This operation requires all inputs to be materialized in object store for it to execute."
But I presume it's conceptually possible to implement a streaming version: keep a record of "seen" values and drop an entry when its value is already in the "seen" collection.
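
A minimal single-process sketch of that idea (plain Python for illustration, not Ray Data's actual implementation):

def streaming_unique(rows):
    # Keep only the set of distinct values seen so far; emit each row the
    # first time its value appears and drop later duplicates.
    seen = set()
    for row in rows:
        if row not in seen:
            seen.add(row)
            yield row

print(list(streaming_unique([1, 2, 1, 3])))  # [1, 2, 3]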

Use case

A streaming unique function will be very useful when the amount of data is too large to be materialized.

marcmk6 added the enhancement and triage labels on Mar 10, 2025
jcotant1 added the data label on Mar 10, 2025
@wingkitlee0
Contributor

This operation requires all inputs to be materialized in object store for it to execute.

I believe the wording can be clearer. It does not require all the inputs to be materialized at the same time. The same wording actually appears in the docs of other global aggregations.

keep a record of "seen" values and drop an entry when its value is already in the "seen" collection

It is doing that right now, in parallel.

streaming

A true streaming op would give you partial results before everything is done. I don't think that's easy to do without some internal partitioning (?)
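
As a rough illustration of that partitioning idea (the structure and names below are hypothetical, not Ray Data internals): if values are hash-partitioned, each partition's "seen" set is authoritative for its slice of the key space, so first occurrences can be emitted as soon as they arrive.

def partitioned_streaming_unique(rows, num_partitions=4):
    # Each partition owns a disjoint slice of the key space; in a distributed
    # setting each "seen" set could live on a different worker. Here everything
    # runs in one process purely for illustration.
    seen = [set() for _ in range(num_partitions)]
    for row in rows:
        p = hash(row) % num_partitions
        if row not in seen[p]:
            seen[p].add(row)
            yield row  # a partial result, emitted before the input is exhausted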

@marcmk6
Author

marcmk6 commented Mar 11, 2025

It does not require all the inputs to be materialized at the same time.

Could you elaborate? Do you mean the current implementation effectively materializes all the entries of the unique result (rather than materializing every entry of the input dataset, duplicates included, before applying unique)?
I think this can still be very costly if the dataset is large.

The current implementation, although parallel underneath, is still blocking: none of the data entries will go to the next stage before the unique operation is done for the whole input dataset. Is this right?

I believe filter is a streaming operation, so I presume the ideal behavior for unique would be like "filtering/dropping already-seen entries in a streaming way". See the sketch below.
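
For comparison, a minimal sketch of how filter can stream (a hypothetical example, not taken from the docs): it makes an independent per-row decision, so rows can flow to the next stage immediately, whereas unique needs state shared across the whole dataset.

import ray

ds = ray.data.range(10)  # rows like {"id": 0}, {"id": 1}, ...
# The predicate needs no global state, so rows stream through.
print(ds.filter(lambda row: row["id"] % 2 == 0).take_all())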

@wingkitlee0
Contributor

Sorry, I may have been thinking of something that is not turned on by default: https://docs.ray.io/en/latest/data/shuffling-data.html#enabling-push-based-shuffle
All aggregate functions do a partial aggregation first (say, per block) in parallel, and then the partial results are combined into the final result.

is still blocking: none of the data entries will go into next stage before the unique operation is done for the input dataset.

Correct. However, the current API returns a list (to the head node). So it is blocking by design (as an all-to-all operation).
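
For reference, a minimal example of the current API shape (a sketch; output order is not guaranteed):

import ray

ds = ray.data.from_items([{"id": 1}, {"id": 2}, {"id": 1}])
# Dataset.unique returns a plain Python list on the driver, so the whole
# dataset must be consumed before anything comes back.
print(ds.unique("id"))  # e.g. [1, 2]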

I think your suggestion is a parallel version of the following (which may not be easily parallelizable without a shuffle?):

def sequential_unique(data):
    # Track values already seen and keep only the first occurrence of each.
    seen = set()
    for x in data:
        if x not in seen:
            seen.add(x)
    return list(seen)
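
For contrast, a minimal sketch of the per-block partial aggregation described above (hypothetical helper names; in Ray Data the per-block step would run as parallel tasks):

from functools import reduce

def unique_per_block(block):
    # "Map" phase: dedup within one block; blocks can be processed in parallel.
    return set(block)

def parallel_unique(blocks):
    # "Reduce" phase: merge the partial sets into the final result on one node.
    partials = [unique_per_block(b) for b in blocks]
    return list(reduce(lambda a, b: a | b, partials, set()))

blocks = [[1, 2, 2], [2, 3], [3, 4, 1]]
print(sorted(parallel_unique(blocks)))  # [1, 2, 3, 4]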
