Function find_near_matches is alternating results for the same situation #40

RagnarCris · 2022-03-22T15:18:15Z

Hi, my name is Cristiano and I'm using the library to determine how far someone has read, by comparing a transcript to a text to be read.
And strangely, for the same parameters and strings, the function find_near_matches returned two different results, in 5 to 6 times that i runned the script.

I'm using the version 0.7.3 of fuzzysearch.
I'm leaving an example of a script in which this case is happening. If you want, just run around 7 times to see the results changing at least two times.

Thanks in advance!

Example:

from fuzzysearch import find_near_matches

def fuzzy_extract(qs, ls, threshold):
    for match in find_near_matches(qs, ls, max_l_dist=10):
        match = ls[match.start:match.end]
        index = ls.find(match)
        yield (match, index)

ref = "bruna ficava perto da casa de fátima e brincava de bola perto da lagoa"


ret = list(fuzzy_extract("d dê néká m",ref,0))
print("fuzzy_search: "+str(ret))

P.S.: The text in the script is in Portuguese.

The output that i get by running 7 times this script is:

The text was updated successfully, but these errors were encountered:

taleinat · 2022-07-14T18:45:42Z

Hi Cristiano, thanks for this report!

That is indeed very strange. I have been able to reproduce this locally using your script; many thanks for that!

I'll look into this.

jcsilva · 2023-03-29T13:41:30Z

Hi @taleinat!

I'm facing a similar issue here. Do you have any suggestion on this?

taleinat · 2023-03-29T14:07:37Z

Hi,

Unfortunately I haven't had a chance to pin this down.

I will mention, though, that the original question is searching for a pattern of length 11 ("d dê néká m") with max_l_dist=10 (the threshold argument seems to be unused!). The results seem like an edge-case, and do seem to indicate a bug, but I also think that @RagnarCris was not running the fuzzy search they actually intended to run, and that's the main reason the results were not as expected.

@jcsilva, I suggest opening a separate issue with the relevant details, and I'll try to see what can be done.

taleinat · 2023-03-29T14:12:08Z

@RagnarCris, apologies for not having followed up your issue.

Is this still relevant for you?

jcsilva · 2023-04-05T14:20:27Z

Hi @taleinat ,

sorry for my late response. To provide a bit more of context, I'm trying to replace @RagnarCris on this topic, as he is now working in other subject here. So, I can open a new issue, but it would be basically the same described here.

As you mentioned, the threshold is not used in the example provided. Actually it is just a small snippet of a major function ... this example could be reduced to:

rom fuzzysearch import find_near_matches

def fuzzy_extract(qs, ls):
    for match in find_near_matches(qs, ls, max_l_dist=10):
        match = ls[match.start:match.end]
        index = ls.find(match)
        yield (match, index)

ref = "bruna ficava perto da casa de fátima e brincava de bola perto da lagoa"


ret = list(fuzzy_extract("d dê néká m",ref))
print("fuzzy_search: "+str(ret))

Using this example, we would like to find a pattern of length 11 with max_l_dis=10. Actually we need to use this "weird" configuration, where pattern length and max_l_dist are close because the text and the pattern usually are a bit different (the text comes from one source of information and the pattern from another one, and they produce data with different qualities).

The issue we didn't understand is that the index return by find_near_matches varies between consecutive executions even using the same input. We thought the algorithm would be deterministic, right? But we observed those unexpected results and we would like to ask you help to try to understand it.

taleinat added the bug label Jul 14, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Function find_near_matches is alternating results for the same situation #40

Function find_near_matches is alternating results for the same situation #40

RagnarCris commented Mar 22, 2022 •

edited

Loading

taleinat commented Jul 14, 2022

jcsilva commented Mar 29, 2023

taleinat commented Mar 29, 2023 •

edited

Loading

taleinat commented Mar 29, 2023

jcsilva commented Apr 5, 2023

Function find_near_matches is alternating results for the same situation #40

Function find_near_matches is alternating results for the same situation #40

Comments

RagnarCris commented Mar 22, 2022 • edited Loading

taleinat commented Jul 14, 2022

jcsilva commented Mar 29, 2023

taleinat commented Mar 29, 2023 • edited Loading

taleinat commented Mar 29, 2023

jcsilva commented Apr 5, 2023

RagnarCris commented Mar 22, 2022 •

edited

Loading

taleinat commented Mar 29, 2023 •

edited

Loading