Implement Optimal String Alignment (OSA) Distance Algorithm #464
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #464      +/-   ##
==========================================
+ Coverage   94.93%   95.04%   +0.10%
==========================================
  Files         240      241       +1
  Lines       10156    10213      +57
  Branches     1441     1450       +9
==========================================
+ Hits         9642     9707      +65
+ Misses        395      389       -6
+ Partials      119      117       -2
466bbb8 to 2f49a03
I’m a bit confused about the Codecov report here. Even though this PR shows 100% coverage for the changes made, there are two indirect changes that are pulling down the overall coverage, which is blocking the PR. I think this might be related to the fact that I refreshed my branch with the latest master, but I’m not entirely sure. What do you think about ignoring the Codecov warning for now, lowering the overall coverage, and opening an issue to address the coverage for the two classes that are not hitting the mark? This seems like the best route, though there’s a chance the fix might not happen. Alternatively, I could add some unit tests to cover those two classes in this PR, but that would expand the scope quite a bit. Let me know your thoughts!
I have opened an issue and will try to fix the coverage in HashMap and TimSort, so let's keep this PR open for now, until I have a PR that closes #465.
Fixing the issue first would be ideal even though it wasn't created by this PR. But if you don't have time or can't fix it for any other reason, I can merge this PR as-is. Let me know if I need to merge it sooner.
Summary
This pull request introduces an implementation of the Optimal String Alignment (OSA) Distance algorithm. This string metric is used to measure the difference between two sequences (typically strings) by calculating the minimum number of operations required to transform one string into another. The operations considered include insertion, deletion, substitution, and the transposition of two adjacent characters.
What is the Optimal String Alignment (OSA) Distance?
The Optimal String Alignment (OSA) distance, also referred to as the restricted Damerau-Levenshtein distance, is a variation of the classic Levenshtein distance with one additional operation: the transposition of adjacent characters. This metric is particularly useful in scenarios where such transpositions are common, such as typographical errors or spelling mistakes.
Key Operations: insertion, deletion, substitution, and transposition of two adjacent characters.
Difference from Damerau-Levenshtein Distance:
Like the general Damerau-Levenshtein distance, the OSA distance allows transpositions of adjacent characters. It differs in that no substring may be edited more than once: after two characters have been transposed, neither of them can take part in any further operation. This makes OSA more suitable for applications where such operations are expected to be simple, like correcting minor spelling errors.
How It Works
The algorithm uses dynamic programming to compute the distance. The main idea is to build a matrix where each cell (i, j) represents the OSA distance between the first i characters of string s1 and the first j characters of string s2.
The algorithm proceeds as follows:
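Initialization:
Filling the Matrix:
For each pair of characters in s1 and s2, calculate the cost of insertion, deletion, substitution, and transposition.
Final Output:
The bottom-right cell of the matrix holds the OSA distance between s1 and s2.
Example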
Consider two strings: "example" and "exmaple".
The OSA distance between these two strings is 1 because you can transform "exmaple" into "example" by a single transposition of the characters 'm' and 'a'.
Another example:
Here, the distance is 3 due to the following operations:
Motivation
Why Use OSA Distance?
The OSA distance is particularly advantageous in applications where adjacent character transpositions are common. This is typically the case in the following scenarios:
Time Complexity
The time complexity of the OSA distance algorithm is O(n * m), where n is the length of the first string and m is the length of the second string. This makes it efficient for moderate-length strings but may become computationally expensive for very long strings. However, this complexity is comparable to other similar algorithms, such as Levenshtein and Damerau-Levenshtein, making OSA a practical choice for many real-world applications.
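Illustrative sketch
The dynamic-programming procedure described under How It Works can be sketched as follows. This is a minimal Python illustration, not the code submitted in this PR: the function name osa_distance is hypothetical, and the base-case setup (first row and column hold the prefix lengths) is the standard choice for this kind of matrix, assumed here rather than taken from the description. The two nested loops over the matrix are where the O(n * m) running time comes from.

```python
def osa_distance(s1: str, s2: str) -> int:
    """Optimal String Alignment (restricted Damerau-Levenshtein) distance.

    Hypothetical helper for illustration; not the implementation from the PR.
    """
    n, m = len(s1), len(s2)

    # d[i][j] = OSA distance between the first i chars of s1
    # and the first j chars of s2.
    d = [[0] * (m + 1) for _ in range(n + 1)]

    # Assumed base cases: turning a prefix into the empty string
    # (or the reverse) costs one deletion/insertion per character.
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Transposition of two adjacent characters.
            if (
                i > 1
                and j > 1
                and s1[i - 1] == s2[j - 2]
                and s1[i - 2] == s2[j - 1]
            ):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)

    return d[n][m]


if __name__ == "__main__":
    print(osa_distance("example", "exmaple"))  # 1
    print(osa_distance("CA", "ABC"))           # 3
```

The second call shows the restriction in practice: under OSA, "CA" to "ABC" costs 3, whereas the unrestricted Damerau-Levenshtein distance is 2 (transpose "CA" to "AC", then insert "B" between the transposed characters), a sequence OSA rules out because it edits the same substring twice.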