incremental matching #1183
Replies: 9 comments
-
@fgregg, is there an example anywhere of performing incremental matching with dedupe? Is it feasible to do this?
-
Once you have a good set of matches, you can turn those into gazetteer matches. The Gazetteer class has methods for adding new records to the gazetteer.
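To make that concrete, a rough sketch of that workflow, assuming dedupe 2.x. The field definitions, the record dicts (messy_records, canonical_records, newly_accepted_records), and the training flow here are illustrative placeholders; check the gazetteer example and the docs for your dedupe version.

```python
import dedupe

# Illustrative field definition; newer dedupe releases use variable
# objects (e.g. dedupe.variables.String) instead of dicts.
fields = [
    {"field": "name", "type": "String"},
    {"field": "address", "type": "String"},
]

gazetteer = dedupe.Gazetteer(fields)

# messy_records / canonical_records are dicts of {record_id: {field: value}}.
# The canonical side is built from the matches you already trust.
gazetteer.prepare_training(messy_records, canonical_records)
dedupe.console_label(gazetteer)   # label a handful of candidate pairs
gazetteer.train()

# Index the canonical records to match against.
gazetteer.index(canonical_records)

# Records you later accept as canonical can be added with another
# index() call; the existing index does not have to be rebuilt.
gazetteer.index(newly_accepted_records)
```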
-
Excited to hear it's feasible. Sorry for not seeing it clearly, but is https://dedupeio.github.io/dedupe-examples/docs/gazetteer_example.html an example of this, where canon_file is the pre-matched data and messy_file is the new records? I'll dig into that if so. Thanks.
-
If I'm reading that example right, the canon_file is loaded entirely into the .index method of the gazetteer, and then I can cycle through the messy_file records asking the gazetteer to .search for possible matches, and I get back the matches as a tuple with the messy id, the canon id, and a score. It seems a 'settings' file can provide other details (I've yet to see what is in these settings files, so I assume it holds the training data and various parameters, but I'm not sure what else). If I'm down the wrong path, please fire back even just pseudocode steps for how to run an incremental match. Thanks @fgregg!
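That reading is essentially it. A minimal sketch of the loop, assuming a settings file written by an earlier training run; the file name, record dicts, and threshold below are illustrative. One note on the settings file: it holds the trained model (learned weights and blocking predicates), not your data.

```python
import dedupe

# Load the trained model; no retraining needed just to score new records.
with open("gazetteer_learned_settings", "rb") as f:
    gazetteer = dedupe.StaticGazetteer(f)

# canon_records: the pre-matched, canonical data, keyed by canon id.
gazetteer.index(canon_records)

# messy_records: the new records, keyed by their own ids.
results = gazetteer.search(messy_records, n_matches=1, threshold=0.5)

for messy_id, matches in results:
    for canon_id, score in matches:
        print(messy_id, canon_id, score)
```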
-
@fgregg, is there an example of using the gazetteer in a way that avoids loading the entire canon_file into memory?
-
@rderidder-lda
-
Hi!
I have not spent time figuring out the best way to implement incremental matching using dedupe. I am hopeful it will meet my needs, but it will be some time before I get back to spending time on figuring it out.
1. It does not handle anything to do with selecting the 'best' record in the match group. It only groups up the matches. You will have to code your own rules to select the 'best' record in each group (a rough sketch of one such rule follows below).
2. As above, this is all your own logic, especially if you want to go to field-level survivorship. Dedupe only goes as far as creating the match groups for you. It does this very well, using great matching algorithms and a few options you can play with. But once those matches are found, it's up to you what you want to do with them.
In short, it's a matching tool, not a mastering tool.
R
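As a purely hypothetical illustration of the kind of 'best record' rule described above (nothing here is part of dedupe), picking the most complete record in each canon id group might look like this:

```python
from collections import defaultdict

def completeness(record: dict) -> int:
    """Crude quality measure: count of non-empty fields."""
    return sum(1 for value in record.values() if value not in (None, ""))

def best_per_group(records: dict, canon_id_of: dict) -> dict:
    """records: {record_id: {field: value}}; canon_id_of: {record_id: canon_id}."""
    groups = defaultdict(list)
    for record_id, record in records.items():
        groups[canon_id_of[record_id]].append(record)
    # One whole record survives per group; field-level survivorship would
    # instead merge values field by field with rules of your own.
    return {canon_id: max(members, key=completeness)
            for canon_id, members in groups.items()}
```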
-
@rderidder-lda, thanks for giving me your time to respond. I tried the CSV dedupe example, but I'm not sure what sample_size I need to give for my client data. I saw prepare_training(self, data, training_file=None, sample_size=1500, blocked_proportion=0.9) in the code base. My client data has 3,84,984 rows of data. How do I work out what sample_size to pass so that dedupe works well? Thanks
-
Sorry, I haven't played with the sample size much. Mine is at 500. I believe it's the 'training' sample size, right? So maybe it depends more on how many records you have in your training file.
I haven't used the CSV example. I used the large-volume example; I think it was the 'mysql' one or something like that.
R
On Tue, Jul 23, 2024 at 9:17 AM prk2331 wrote:
@rderidder-lda <https://github.com/rderidder-lda>
thanks for giving me your time to respond
I tried the CSV dedupe example
but I'm not getting what sample size I need to give for my client data
I saw the below in the code base
```python
def prepare_training(
    self,
    data: Data,
    training_file: TextIO | None = None,
    sample_size: int = 1500,
    blocked_proportion: float = 0.9,
) -> None:
```
my client data has 3,84,984 rows of data
how do I calculate what sample size I need to pass so that dedupe works well?
Thanks
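To make the sample_size discussion above concrete, a small sketch of passing it explicitly when preparing training, mirroring the pattern in the csv_example. The field names, the data_d dict, the 'training.json' file, and the value 15000 are all illustrative assumptions, not recommendations; sample_size controls the training sample drawn for active learning, not the size of your full dataset.

```python
import os
import dedupe

fields = [
    {"field": "name", "type": "String"},
    {"field": "address", "type": "String"},
]

deduper = dedupe.Dedupe(fields)

# data_d: {record_id: {"name": ..., "address": ...}, ...}
if os.path.exists("training.json"):
    with open("training.json") as tf:
        deduper.prepare_training(data_d, training_file=tf, sample_size=15000)
else:
    deduper.prepare_training(data_d, sample_size=15000)

dedupe.console_label(deduper)   # interactive labeling
deduper.train()
```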
-
Say I've run dedupe on millions of records, and now have the entity map sorted out, and it all looks good: all my matched records are grouped up by their canon id. (I use "canon id" as in the example https://dedupeio.github.io/dedupe-examples/docs/mysql_example.html.)
I now have one new record that comes along, and I just want to add it to the best group possible (or confirm it has no good match). Is there a way to do that without rerunning matching on everything?
What is the best way to do this, with performance in mind?
I don't want to re-process the millions of records; I trust that they are not going to change. I just want to 'add' this new record to one of the canon ids.
The blocking map table is also still available from the previous run over the millions of records, if there is a way to make use of it.
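One hedged sketch using the gazetteer API discussed earlier in this thread: keep a trained settings file, index the canonical records once (for example in a long-running process), and then each new record is only a search() call. The file name, threshold, and helper function below are illustrative assumptions.

```python
import dedupe

with open("gazetteer_learned_settings", "rb") as f:
    gazetteer = dedupe.StaticGazetteer(f)

# Index the existing canonical records once (e.g. at service start-up);
# after that, each new record costs only a search() call.
gazetteer.index(canonical_records)

def match_one(record_id, record, threshold=0.5):
    """Return the best (canon_id, score) for a single new record, or None."""
    results = gazetteer.search({record_id: record},
                               n_matches=1, threshold=threshold)
    for _messy_id, matches in results:
        if matches:
            return matches[0]
    return None

# If there is no good match, the record becomes canonical itself and can
# be added to the index for future searches:
# gazetteer.index({record_id: record})
```

This still means loading and indexing the canonical records once per process; reusing the existing blocking map table directly would likely mean working with dedupe's lower-level blocking and scoring methods, as the mysql example does.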