incremental matching #1183
Replies: 9 comments
-
@fgregg, is there an example anywhere of performing incremental matching with dedupe? Is it feasible to do this?
-
Once you have a good set of matches, you can turn those into gazetteer matches. The Gazetteer class has methods for adding new records to the gazetteer.
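To make that concrete, a rough sketch of that workflow, assuming dedupe 2.x. The field definitions, the record dicts (messy_records, canonical_records, newly_accepted_records), and the training flow here are illustrative placeholders; check the gazetteer example and the docs for your dedupe version.

```python
import dedupe

# Illustrative field definition; newer dedupe releases use variable
# objects (e.g. dedupe.variables.String) instead of dicts.
fields = [
    {"field": "name", "type": "String"},
    {"field": "address", "type": "String"},
]

gazetteer = dedupe.Gazetteer(fields)

# messy_records / canonical_records are dicts of {record_id: {field: value}}.
# The canonical side is built from the matches you already trust.
gazetteer.prepare_training(messy_records, canonical_records)
dedupe.console_label(gazetteer)   # label a handful of candidate pairs
gazetteer.train()

# Index the canonical records to match against.
gazetteer.index(canonical_records)

# Records you later accept as canonical can be added with another
# index() call; the existing index does not have to be rebuilt.
gazetteer.index(newly_accepted_records)
```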
-
Excited to hear it's feasible. Sorry for not seeing it clearly, but is https://dedupeio.github.io/dedupe-examples/docs/gazetteer_example.html an example of this, where canon_file is the pre-matched data and messy_file is the new records? I'll dig into that if so. Thanks.
-
If I'm reading that example right, the canon_file is loaded entirely into the .index method of the gazetteer, and then I can cycle through the messy_file records asking the gazetteer to .search for possible matches, and I get back the matches as a tuple with the messy id, the canon id, and a score. It seems a 'settings' file can provide other details (I've yet to see what is in these settings files, so I assume it holds the training data and various parameters, but I'm not sure what else). If I'm down the wrong path, please fire back even just pseudocode steps for how to run an incremental match. Thanks @fgregg!
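That reading is essentially it. A minimal sketch of the loop, assuming a settings file written by an earlier training run; the file name, record dicts, and threshold below are illustrative. One note on the settings file: it holds the trained model (learned weights and blocking predicates), not your data.

```python
import dedupe

# Load the trained model; no retraining needed just to score new records.
with open("gazetteer_learned_settings", "rb") as f:
    gazetteer = dedupe.StaticGazetteer(f)

# canon_records: the pre-matched, canonical data, keyed by canon id.
gazetteer.index(canon_records)

# messy_records: the new records, keyed by their own ids.
results = gazetteer.search(messy_records, n_matches=1, threshold=0.5)

for messy_id, matches in results:
    for canon_id, score in matches:
        print(messy_id, canon_id, score)
```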
-
@fgregg, is there an example of using the gazetteer in a way that avoids loading the entire canon_file into memory?
-
@rderidder-lda
-
Hi!
I have not spent time figuring out the best way to implement incremental matching using dedupe. I am hopeful it will meet my needs, but it will be some time before I get back to spending time on figuring it out.
1. It does not handle anything to do with selecting the 'best' record in the match group. It only groups up the matches. You will have to code your own rules to select the 'best' record in each group (a rough sketch of one such rule follows below).
2. As above, this is all your own logic, especially if you want to go to field-level survivorship. Dedupe only goes as far as creating the match groups for you. It does this very well, using great matching algorithms and a few options you can play with. But once those matches are found, it's up to you what you want to do with them.
In short, it's a matching tool, not a mastering tool.
R
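As a purely hypothetical illustration of the kind of 'best record' rule described above (nothing here is part of dedupe), picking the most complete record in each canon id group might look like this:

```python
from collections import defaultdict

def completeness(record: dict) -> int:
    """Crude quality measure: count of non-empty fields."""
    return sum(1 for value in record.values() if value not in (None, ""))

def best_per_group(records: dict, canon_id_of: dict) -> dict:
    """records: {record_id: {field: value}}; canon_id_of: {record_id: canon_id}."""
    groups = defaultdict(list)
    for record_id, record in records.items():
        groups[canon_id_of[record_id]].append(record)
    # One whole record survives per group; field-level survivorship would
    # instead merge values field by field with rules of your own.
    return {canon_id: max(members, key=completeness)
            for canon_id, members in groups.items()}
```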
-
@rderidder-lda, thanks for giving me your time to respond. I tried the CSV dedupe example, but I'm not sure what sample_size I need to give for my client data. I saw prepare_training(self, data, training_file=None, sample_size=1500, blocked_proportion=0.9) in the code base. My client data has 3,84,984 rows of data. How do I work out what sample_size to pass so that dedupe works well? Thanks
-
Sorry, I haven't played with the sample size much. Mine is at 500. I believe it's the 'training' sample size, right? So maybe it depends more on how many records you have in your training file.
I haven't used the CSV example. I used the large-volume example; I think it was the 'mysql' one or something like that.
R
On Tue, Jul 23, 2024 at 9:17 AM prk2331 wrote:
@rderidder-lda <https://github.com/rderidder-lda>
thanks for giving me your time to respond
I tried the CSV dedupe example
but I'm not getting what sample size I need to give for my client data
I saw the below in the code base
```python
def prepare_training(
    self,
    data: Data,
    training_file: TextIO | None = None,
    sample_size: int = 1500,
    blocked_proportion: float = 0.9,
) -> None:
```
my client data has 3,84,984 rows of data
how do I calculate what sample size I need to pass so that dedupe works well?
Thanks
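To make the sample_size discussion above concrete, a small sketch of passing it explicitly when preparing training, mirroring the pattern in the csv_example. The field names, the data_d dict, the 'training.json' file, and the value 15000 are all illustrative assumptions, not recommendations; sample_size controls the training sample drawn for active learning, not the size of your full dataset.

```python
import os
import dedupe

fields = [
    {"field": "name", "type": "String"},
    {"field": "address", "type": "String"},
]

deduper = dedupe.Dedupe(fields)

# data_d: {record_id: {"name": ..., "address": ...}, ...}
if os.path.exists("training.json"):
    with open("training.json") as tf:
        deduper.prepare_training(data_d, training_file=tf, sample_size=15000)
else:
    deduper.prepare_training(data_d, sample_size=15000)

dedupe.console_label(deduper)   # interactive labeling
deduper.train()
```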
-
Say I've run dedupe on millions of records, and now have the entity map sorted out, and it all looks good: all my matched records are grouped up by their canon id. (I use "canon id" as in the example https://dedupeio.github.io/dedupe-examples/docs/mysql_example.html.)
I now have one new record that comes along, and I just want to add it to the best group possible (or confirm it has no good match). Is there a way to do that without rerunning matching on everything?
What is the best way to do this, with performance in mind?
I don't want to re-process the millions of records; I trust that they are not going to change. I just want to 'add' this new record to one of the canon ids.
The blocking map table is also still available from the previous run over the millions of records, if there is a way to make use of it.
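One hedged sketch using the gazetteer API discussed earlier in this thread: keep a trained settings file, index the canonical records once (for example in a long-running process), and then each new record is only a search() call. The file name, threshold, and helper function below are illustrative assumptions.

```python
import dedupe

with open("gazetteer_learned_settings", "rb") as f:
    gazetteer = dedupe.StaticGazetteer(f)

# Index the existing canonical records once (e.g. at service start-up);
# after that, each new record costs only a search() call.
gazetteer.index(canonical_records)

def match_one(record_id, record, threshold=0.5):
    """Return the best (canon_id, score) for a single new record, or None."""
    results = gazetteer.search({record_id: record},
                               n_matches=1, threshold=threshold)
    for _messy_id, matches in results:
        if matches:
            return matches[0]
    return None

# If there is no good match, the record becomes canonical itself and can
# be added to the index for future searches:
# gazetteer.index({record_id: record})
```

This still means loading and indexing the canonical records once per process; reusing the existing blocking map table directly would likely mean working with dedupe's lower-level blocking and scoring methods, as the mysql example does.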