Benchmark #358
Glad to hear that the library proves to be useful :) Do I understand it correctly that you have essentially two columns?
Not exactly. On the one hand, I have lists of named entities extracted from different texts. On the other, multiple columns from a PostgreSQL table, each containing a dictionary with multiple combinations of keywords. The algorithm needs to find the best match and provide a similarity score. If we are talking about written text, I am using the Damerau-Levenshtein similarity score on glyphs only. If, on the other hand, the texts are transcripts, I am using the Levenshtein similarity score both on glyphs and on metaphone3 encodings. In summary, it is a many-vs-many comparison run in parallel. This is the real scenario. The Pandas dataframe was just for testing, but it was also many vs many.
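A minimal sketch of such a many-vs-many comparison with RapidFuzz's batch API, assuming `rapidfuzz` is installed. The entity and keyword lists are hypothetical placeholders, not the actual PostgreSQL data, and the phonetic (metaphone3) leg is left out, since it would be the same call run over phonetically encoded strings:

```python
# A sketch of the many-vs-many matching described above. Assumes
# "pip install rapidfuzz"; the entity and keyword lists are hypothetical
# placeholders, not the author's actual PostgreSQL data.
from rapidfuzz import process
from rapidfuzz.distance import DamerauLevenshtein

entities = ["Jon Smith", "Acme Corp"]            # named entities from texts
keywords = ["John Smith", "ACME Corporation"]    # flattened keyword combinations

# cdist scores every entity against every keyword in one call;
# workers=-1 parallelizes the scoring loop across all cores.
scores = process.cdist(
    entities,
    keywords,
    scorer=DamerauLevenshtein.normalized_similarity,
    workers=-1,
)

# Best match per entity: the keyword with the highest similarity in each row.
for entity, row in zip(entities, scores):
    best = row.argmax()
    print(f"{entity} -> {keywords[best]} (score={row[best]:.2f})")
```

For the transcript case, a `Levenshtein` scorer over both the raw strings and their phonetic encodings follows the same pattern; `workers=-1` covers the "run in parallel" part, since the scoring loop runs multithreaded in native code.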
Did you try to compare with https://github.com/ashvardanian/StringZilla?
Not in depth. However, I just had a quick look at their two benchmarks:

**Long sequences**

This benchmark is mentioned in the readme, which claims they are significantly faster than other libraries.
Usually I would assume that they simply didn't know about better implementations. However, their benchmark script (https://github.com/ashvardanian/StringZilla/blob/main/scripts/bench_similarity.ipynb) actually includes e.g.
In addition, StringZilla is a lot slower than their reported number, while all other libraries are actually faster on my machine. So without them mentioning the CPU they tested this on (maybe one with AVX-512?), I am questioning their results. I also find it questionable that they only publish results for the libraries they find to be much slower 🤷‍♂️

**Short strings**

For their short-string test they actually perform better than rapidfuzz.
This has the additional advantage that everyone now compares the same strings. This gives the following results:
When comparing only empty strings I get:
Maybe there are things in the wrapping code which could be improved a bit in rapidfuzz to get the same results 🤔
rapidfuzz can do a lot better by using
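Empty-string comparisons make the distance computation itself trivial, so they isolate the per-call wrapping overhead discussed here. A minimal sketch of such a micro-benchmark, assuming rapidfuzz and jellyfish are installed (the iteration count is arbitrary):

```python
# Empty-string comparisons make the distance computation itself trivial,
# so this times almost pure per-call wrapping overhead. Assumes
# "pip install rapidfuzz jellyfish"; the iteration count is arbitrary.
import timeit

import jellyfish
from rapidfuzz.distance import Levenshtein

n = 1_000_000
print("rapidfuzz:", timeit.timeit(lambda: Levenshtein.distance("", ""), number=n))
print("jellyfish:", timeit.timeit(lambda: jellyfish.levenshtein_distance("", ""), number=n))
```

Batching many comparisons into a single call, for example with `rapidfuzz.process.cdist`, pays that wrapping cost once per batch instead of once per pair; that is one plausible reading of the truncated sentence above, offered as an assumption rather than the author's confirmed point.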
This is not an issue, but to also mention the positive things: when comparing the execution time on 10 million string pairs (2-22 characters long) stored in a Pandas dataframe, RapidFuzz's Damerau-Levenshtein beats both pyxDamerauLevenshtein's and jellyfish's.
Interestingly, if you run one string pair only, pyxDamerauLevenshtein is considerably faster, but after only a few string pairs RapidFuzz catches up and beats it.
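A minimal sketch of how such a timing comparison can be reproduced, assuming `pip install rapidfuzz pyxDamerauLevenshtein`; the random strings follow the 2-22 character description, but the pair count, seed, and alphabet are illustrative assumptions:

```python
# A sketch of the single-pair vs. many-pairs timing comparison described
# above. Assumes "pip install rapidfuzz pyxDamerauLevenshtein"; pair count
# and alphabet are illustrative assumptions.
import random
import string
import timeit

from pyxdameraulevenshtein import damerau_levenshtein_distance
from rapidfuzz.distance import DamerauLevenshtein

random.seed(42)  # fixed seed so every run and every library sees the same data

def rand_str() -> str:
    # Random lowercase string, 2-22 characters as in the description above.
    return "".join(random.choices(string.ascii_lowercase, k=random.randint(2, 22)))

pairs = [(rand_str(), rand_str()) for _ in range(100_000)]
a, b = pairs[0]

# A single pair: fixed per-call overhead dominates the measurement.
print("rapidfuzz, 1 pair:", timeit.timeit(lambda: DamerauLevenshtein.distance(a, b), number=1))
print("pyxdl,     1 pair:", timeit.timeit(lambda: damerau_levenshtein_distance(a, b), number=1))

# Many pairs: the faster core implementation wins.
print("rapidfuzz, all:", timeit.timeit(lambda: [DamerauLevenshtein.distance(x, y) for x, y in pairs], number=1))
print("pyxdl,     all:", timeit.timeit(lambda: [damerau_levenshtein_distance(x, y) for x, y in pairs], number=1))
```

One caveat: the two libraries may implement different Damerau-Levenshtein variants (restricted optimal string alignment vs. the unrestricted distance), so the returned values are not necessarily identical; for timing purposes the sketch treats them as comparable.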