Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving the Diagonal Calculation #21

Open
Turnerj opened this issue May 4, 2020 · 4 comments
Open

Improving the Diagonal Calculation #21

Turnerj opened this issue May 4, 2020 · 4 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@Turnerj
Copy link
Owner

Turnerj commented May 4, 2020

A few points:

  • Diagonal calculation isn't effective with its use of caching vector-sized strings, specifically the target vector which needs to be flipped. It flips it multiple times because of how the data iterates. If the diagonal calculation could be both diagonal within a column the width of the vector, it would eliminate this.
  • Threading doesn't make use of the diagonal calculation because the boundaries cannot get the correct data easily (both the back boundary for the next thread and the forward boundary for the current thread). Using the opposite of the technique above, instead of diagonal within a column the width of the vector, do diagonal in the row the width of the vector.

Preliminary calculations on performance for single threaded improvement (take these with a grain of salt, they all don't calculate correctly):

No target lookups and no shuffles/permutations aka. Process Column (at best - we would still need one of these calls per vector but we would avoid a lot of duplicates)

Method Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
Quickenshtein 23.67 ms 0.125 ms 0.117 ms - - - -

No source lookups aka. Process Row (at best - we would still need one of these calls per vector but we would avoid a lot of duplicates)

Method Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
Quickenshtein 24.50 ms 0.143 ms 0.134 ms - - - 40 B

No source or target lookups (INVALID)

Method Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
Quickenshtein 21.20 ms 0.138 ms 0.130 ms - - - 40 B

For comparison, it takes about 26.8 ms to calculate it currently with the regular diagonal calculation. This means at best, single threaded could have a 12% improvement however it is likely to be closer to 8% for the "correct" solution.

No idea about how the threading code would perform though.

@Turnerj Turnerj added the enhancement New feature or request label May 4, 2020
@Turnerj Turnerj added the help wanted Extra attention is needed label Jun 20, 2020
@RobSchoenaker
Copy link

RobSchoenaker commented Nov 3, 2021

Not sure if I am talking rubbish, but you may want to consider the single row version
https://rosettacode.org/wiki/Levenshtein_distance#C.2B.2B
It's discussed here:
https://www.youtube.com/watch?v=Cu7Tl7FGigQ

@Turnerj
Copy link
Owner Author

Turnerj commented Nov 3, 2021

Hey @RobSchoenaker - thanks for the link however Quickenshtein already uses that optimization and a bunch more.
If you're curious about some of the exotic optimizations, you can check out this video of me talking about it at dotNETConf last year (https://www.youtube.com/watch?v=JiOYajl2Mds) or any of my blog posts on the subject:

This particular issue is referring to a diagonal calculation approach which allows us to take advantage of SIMD (watch my video or that last blog post if you want to know a bit about SIMD). Basically diagonally calculating allows us to remove the dependency issue when calculating multiple cells at once. We do require 2x the memory that the traditional single row needs but for long strings, we absolutely crush it in speed.

I do plan to have a few more blog posts about Levenshtein Distance in the future too as there are definitely some cool things I'd like to check out with it.

@RobSchoenaker
Copy link

@Turnerj thx for the quick response. I see my comment did not make any sense :-)
Will check out the dotNETConf things!

@Turnerj
Copy link
Owner Author

Turnerj commented Nov 3, 2021

No problem mate! Hopefully you find some of the links I've given interesting.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

2 participants