[Feature] Deduplication functionality to remove duplicates from topics using incremental identifiers/keys like timestamp #3182

drc-infinyon · 2023-04-21T03:35:45Z

InfinyOn Cloud and Fluvio users need a way to remove duplicate records from the data they collect.
Deduplication can happen at the record level either before the topic, in the producer, or after the topic, in SmartModules.

The deduplication module has the following constraints:

Retention policy - time
Volume of records - record count or size of records

High level ideas

The deduplication process will use an index built on designated keys in the records to identify duplicates.
The module will build the index from historical data in the topic.
The initial implementation is scoped to relatively small datasets with incremental identifiers/keys, such as timestamps, that identify the duplicates.
Based on lessons from this implementation and on user feedback, we will determine the implementation at the stream processing unit level.
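
To make the ideas above concrete, here is a minimal sketch of a key index bounded by both constraints (record count and age). It is not Fluvio's implementation or API: `DedupFilter`, `seed`, `deduplicate`, and the `id`/`ts` field names are all hypothetical, and the sketch assumes records arrive with non-decreasing timestamps, as the incremental-key scope suggests.

```python
from collections import OrderedDict
from dataclasses import dataclass, field
from typing import Iterable, Iterator


@dataclass
class DedupFilter:
    """Hypothetical bounded index of seen keys (illustration only)."""
    max_keys: int    # assumed bound on the number of indexed records
    max_age_ms: int  # assumed time-based retention for index entries
    _seen: "OrderedDict[str, int]" = field(init=False, default_factory=OrderedDict)

    def _evict(self, now_ms: int) -> None:
        # Entries are in insertion order; with non-decreasing timestamps
        # the first entry is always the oldest one.
        while self._seen:
            _, ts = next(iter(self._seen.items()))
            if now_ms - ts <= self.max_age_ms:
                break
            self._seen.popitem(last=False)
        # Enforce the count bound by dropping the oldest keys.
        while len(self._seen) > self.max_keys:
            self._seen.popitem(last=False)

    def is_new(self, key: str, now_ms: int) -> bool:
        """Return True and index the key if it is not currently in the index."""
        self._evict(now_ms)
        if key in self._seen:
            return False
        self._seen[key] = now_ms
        self._evict(now_ms)
        return True

    def seed(self, history: Iterable[dict], key_field: str = "id",
             ts_field: str = "ts") -> None:
        """Build the index from historical records already in the topic."""
        for record in history:
            self.is_new(str(record[key_field]), record[ts_field])


def deduplicate(records: Iterable[dict], filt: DedupFilter,
                key_field: str = "id", ts_field: str = "ts") -> Iterator[dict]:
    """Yield only the records whose key is not in the index."""
    for record in records:
        if filt.is_new(str(record[key_field]), record[ts_field]):
            yield record


if __name__ == "__main__":
    filt = DedupFilter(max_keys=3, max_age_ms=1000)
    filt.seed([{"id": "a", "ts": 0}, {"id": "b", "ts": 100}])
    incoming = [{"id": "b", "ts": 200}, {"id": "c", "ts": 300},
                {"id": "c", "ts": 400}, {"id": "a", "ts": 1500}]
    print([r["id"] for r in deduplicate(incoming, filt)])
```

Note the trade-off the bounds introduce: once a key ages out of the index or is evicted by the count limit, a later record with the same key is no longer recognised as a duplicate. That is why this approach fits the smaller, incrementally keyed datasets named in the initial scope.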

High level diagram of the flow:

To Update:
basic technical design elements describing the solution.

sehz added this to the 0.10.8 milestone Apr 21, 2023

sehz added RFC features/de_duplication labels Apr 21, 2023

sehz modified the milestones: 0.10.8, 0.10.9 May 1, 2023

sehz modified the milestones: 0.10.9, 0.10.10 May 13, 2023

drc-infinyon added this to InfinyOn Public Roadmap May 20, 2023

drc-infinyon moved this to 🏷 Features in InfinyOn Public Roadmap May 20, 2023

sehz removed this from the 0.10.10 milestone Jun 2, 2023

drc-infinyon moved this from 🏷 Features to 🏗 In progress in InfinyOn Public Roadmap Jun 12, 2023

drc-infinyon changed the title ~~deduplication module to remove duplicates from topics using incremental identifiers/keys like timestamp~~ [Feature] Deduplication functionality to remove duplicates from topics using incremental identifiers/keys like timestamp Jul 5, 2023
