[Feature] Deduplication functionality to remove duplicates from topics using incremental identifiers/keys like timestamp #3182

Open
drc-infinyon opened this issue Apr 21, 2023 · 0 comments

Comments

@drc-infinyon
Contributor

InfinyOn Cloud and Fluvio users need to remove duplicates from the records they collect.
Deduplication can happen at the record level either before the topic, at the producer, or after the topic, in SmartModules.

The deduplication module has the following constraints:

  • Retention policy - time
  • Volume of records - record count or size of records

High level ideas

  • The deduplication process will use an index built on designated keys in the records to identify duplicates.
  • The module will build the index from the historical data in the topic.
  • The initial implementation is scoped to relatively small datasets with incremental identifiers/keys, such as timestamps, which identify the duplicates.
  • Based on the lessons from this implementation and on user feedback, we will define the implementation at the stream processing unit level.
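
The ideas above can be sketched as a small in-memory index, bounded by the two constraints listed earlier (retention time and record count). This is only an illustration under stated assumptions, not the Fluvio design: `Bounds`, `DedupIndex`, `from_history`, and `observe` are hypothetical names, keys are plain strings, and timestamps are assumed to be non-decreasing milliseconds.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical bounds on the index, mirroring the constraints in this issue.
pub struct Bounds {
    /// Retention policy: keys older than this many milliseconds are forgotten.
    pub max_age_ms: u64,
    /// Volume of records: at most this many keys are remembered.
    pub max_keys: usize,
}

/// Hypothetical index of recently seen keys (not an existing Fluvio type).
pub struct DedupIndex {
    bounds: Bounds,
    seen: HashSet<String>,
    /// Remembered keys in arrival order, with the timestamp they arrived at.
    order: VecDeque<(u64, String)>,
}

impl DedupIndex {
    pub fn new(bounds: Bounds) -> Self {
        Self { bounds, seen: HashSet::new(), order: VecDeque::new() }
    }

    /// Builds the index by replaying historical `(key, timestamp_ms)` pairs.
    pub fn from_history<I>(bounds: Bounds, history: I) -> Self
    where
        I: IntoIterator<Item = (String, u64)>,
    {
        let mut index = Self::new(bounds);
        for (key, ts) in history {
            index.observe(&key, ts);
        }
        index
    }

    /// Returns `true` if the key is new (keep the record),
    /// `false` if it is a duplicate (drop the record).
    pub fn observe(&mut self, key: &str, now_ms: u64) -> bool {
        self.evict_expired(now_ms);
        if self.seen.contains(key) {
            return false;
        }
        self.seen.insert(key.to_string());
        self.order.push_back((now_ms, key.to_string()));
        // Enforce the record-count bound by forgetting the oldest keys.
        while self.order.len() > self.bounds.max_keys {
            if let Some((_, old)) = self.order.pop_front() {
                self.seen.remove(&old);
            }
        }
        true
    }

    /// Enforces the time bound; relies on timestamps being non-decreasing.
    fn evict_expired(&mut self, now_ms: u64) {
        while let Some(ts) = self.order.front().map(|(ts, _)| *ts) {
            if now_ms.saturating_sub(ts) <= self.bounds.max_age_ms {
                break;
            }
            if let Some((_, old)) = self.order.pop_front() {
                self.seen.remove(&old);
            }
        }
    }
}
```

Because the index forgets keys once either bound is exceeded, a duplicate that arrives after its original has been evicted is treated as new. That trade-off is consistent with the initial scope of relatively small datasets.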

High level diagram of the flow:
[diagram: deduplication]

To update:
Add the basic technical design elements describing the solution.

@sehz sehz added this to the 0.10.8 milestone Apr 21, 2023
@sehz sehz modified the milestones: 0.10.8, 0.10.9 May 1, 2023
@sehz sehz modified the milestones: 0.10.9, 0.10.10 May 13, 2023
@drc-infinyon drc-infinyon moved this to 🏷 Features in InfinyOn Public Roadmap May 20, 2023
@sehz sehz removed this from the 0.10.10 milestone Jun 2, 2023
@drc-infinyon drc-infinyon moved this from 🏷 Features to 🏗 In progress in InfinyOn Public Roadmap Jun 12, 2023
@drc-infinyon drc-infinyon changed the title deduplication module to remove duplicates from topics using incremental identifiers/keys like timestamp [Feature] Deduplication functionality to remove duplicates from topics using incremental identifiers/keys like timestamp Jul 5, 2023
Projects
Status: 🏗 In progress
2 participants