Implement Optimal String Alignment (OSA) Distance Algorithm #464

Kalkwst · 2024-08-21T07:10:51Z

Summary

This pull request introduces an implementation of the Optimal String Alignment (OSA) Distance algorithm. This string metric is used to measure the difference between two sequences (typically strings) by calculating the minimum number of operations required to transform one string into another. The operations considered include insertion, deletion, substitution, and the transposition of two adjacent characters.

What is the Optimal String Alignment (OSA) Distance?

The Optimal String Alignment (OSA) distance, also referred to as the restricted Damerau-Levenshtein distance, is a variation of the classic Levenshtein distance with an additional operation—transposition of adjacent characters. This metric is particularly useful in scenarios where such transpositions are common, such as typographical errors or spelling mistakes.

Key Operations:

Insertion: Add a character to the string.
Deletion: Remove a character from the string.
Substitution: Replace one character with another.
Transposition: Swap two adjacent characters.

Difference from Damerau-Levenshtein Distance:

While the OSA distance allows for transpositions like the general Damerau-Levenshtein distance, it differs in that it restricts the transposition to be a single operation, ensuring that the same characters are not involved in multiple operations in the same position. This makes OSA more suitable for applications where such operations are expected to be simple, like correcting minor spelling errors.

How It Works

The algorithm uses dynamic programming to compute the distance. The main idea is to build a matrix where each cell (i, j) represents the OSA distance between the first i characters of string s1 and the first j characters of string s2.

The algorithm proceeds as follows:

Initialization:
- The matrix is initialized with base cases where one string is empty.
Filling the Matrix:
- For each character in s1 and s2, calculate the cost of insertion, deletion, substitution, and transposition.
- The value at each matrix cell is the minimum of these costs.
Final Output:
- The OSA distance is found in the cell corresponding to the comparison of the entire lengths of s1 and s2.

Example

Consider two strings: "example" and "exmaple".

The OSA distance between these two strings is 1 because you can transform "exmaple" into "example" by a single transposition of the characters 'm' and 'a'.

Another example:

int distance = OptimalStringAlignmentDistance("kitten", "sitting");
// Output: 3

Here, the distance is 3 due to the following operations:

Substitution of 'k' with 's'.
Substitution of 'e' with 'i'.
Insertion of 'g' at the end.

Motivation

Why Use OSA Distance?

The OSA distance is particularly advantageous in applications where adjacent character transpositions are common. This is typically the case in the following scenarios:

Typographical Error Detection: When users accidentally swap two adjacent characters while typing.
Spelling Correction: In applications like search engines and text editors where suggestions for misspelled words are needed.
Natural Language Processing (NLP): In tasks like fuzzy string matching where small errors in word forms are common and should be efficiently handled.

Time Complexity

The time complexity of the OSA distance algorithm is O(n * m), where n is the length of the first string and m is the length of the second string. This makes it efficient for moderate-length strings but may become computationally expensive for very long strings. However, this complexity is comparable to other similar algorithms, such as Levenshtein and Damerau-Levenshtein, making OSA a practical choice for many real-world applications.

I have performed a self-review of my code
My code follows the style guidelines of this project
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
Comments in areas I changed are up to date
I have added comments to hard-to-understand areas of my code
I have made corresponding changes to the README.md

codecov · 2024-08-21T07:16:32Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.04%. Comparing base (6b37d04) to head (c4e0e1a).
Report is 1 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #464      +/-   ##
==========================================
+ Coverage   94.93%   95.04%   +0.10%     
==========================================
  Files         240      241       +1     
  Lines       10156    10213      +57     
  Branches     1441     1450       +9     
==========================================
+ Hits         9642     9707      +65     
+ Misses        395      389       -6     
+ Partials      119      117       -2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

…gnment

Kalkwst · 2024-08-21T09:09:52Z

I’m a bit confused about the Codecov report here. Even though this PR shows 100% coverage for the changes made, there are two indirect changes that are pulling down the overall coverage, which is blocking the PR.

I think this might be related to the fact that I refreshed my branch with the latest master, but I’m not entirely sure. What do you think about ignoring the Codecov warning for now, lowering the overall coverage, and opening an issue to address the coverage for the two classes that are not hitting the mark? This seems like the best route, though there’s a chance the fix might not happen.

Alternatively, I could add some unit tests to cover those two classes in this PR, but that would expand the scope quite a bit.

Let me know your thoughts!

Kalkwst · 2024-08-21T09:22:03Z

I have opened an issue, and will try to fix the coverage in HashMap and TimSort, so let's keep this PR open for now, until I have a PR that closes #465

siriak

Fixing the issue first would be ideal even though it wasn't created by this PR. But if you don't have time or can't fix it for any other reason, I can merge this PR as-is. Let me know if I need to merge it sooner.

…gnment

Kalkwst added 2 commits August 21, 2024 09:59

Add Optimal String Alignment Algorithm Implementation

b5e5f5f

Update README.md

0477aec

Kalkwst requested a review from siriak as a code owner August 21, 2024 07:10

Fix Codacy suggestions

2f49a03

Kalkwst force-pushed the feature/string-similarity/OptimalStringAlignment branch from 466bbb8 to 2f49a03 Compare August 21, 2024 07:21

Kalkwst added 2 commits August 21, 2024 10:26

Remove null tests

eea5aae

Merge branch 'master' into feature/string-similarity/OptimalStringAli…

67cbcd5

…gnment

siriak approved these changes Aug 21, 2024

View reviewed changes

Merge branch 'master' into feature/string-similarity/OptimalStringAli…

c4e0e1a

…gnment

siriak merged commit 5eb0254 into TheAlgorithms:master Sep 21, 2024
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement Optimal String Alignment (OSA) Distance Algorithm #464

Implement Optimal String Alignment (OSA) Distance Algorithm #464

Kalkwst commented Aug 21, 2024

codecov bot commented Aug 21, 2024 •

edited

Loading

Kalkwst commented Aug 21, 2024

Kalkwst commented Aug 21, 2024

siriak left a comment

Implement Optimal String Alignment (OSA) Distance Algorithm #464

Implement Optimal String Alignment (OSA) Distance Algorithm #464

Conversation

Kalkwst commented Aug 21, 2024

Summary

What is the Optimal String Alignment (OSA) Distance?

Key Operations:

Difference from Damerau-Levenshtein Distance:

How It Works

Example

Motivation

Why Use OSA Distance?

Time Complexity

codecov bot commented Aug 21, 2024 • edited Loading

Codecov Report

Kalkwst commented Aug 21, 2024

Kalkwst commented Aug 21, 2024

siriak left a comment

Choose a reason for hiding this comment

codecov bot commented Aug 21, 2024 •

edited

Loading