Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Optimal String Alignment (OSA) Distance Algorithm #464

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

Kalkwst
Copy link
Contributor

@Kalkwst Kalkwst commented Aug 21, 2024

Summary

This pull request introduces an implementation of the Optimal String Alignment (OSA) Distance algorithm. This string metric is used to measure the difference between two sequences (typically strings) by calculating the minimum number of operations required to transform one string into another. The operations considered include insertion, deletion, substitution, and the transposition of two adjacent characters.

What is the Optimal String Alignment (OSA) Distance?

The Optimal String Alignment (OSA) distance, also referred to as the restricted Damerau-Levenshtein distance, is a variation of the classic Levenshtein distance with an additional operation—transposition of adjacent characters. This metric is particularly useful in scenarios where such transpositions are common, such as typographical errors or spelling mistakes.

Key Operations:

  1. Insertion: Add a character to the string.
  2. Deletion: Remove a character from the string.
  3. Substitution: Replace one character with another.
  4. Transposition: Swap two adjacent characters.

Difference from Damerau-Levenshtein Distance:

While the OSA distance allows for transpositions like the general Damerau-Levenshtein distance, it differs in that it restricts the transposition to be a single operation, ensuring that the same characters are not involved in multiple operations in the same position. This makes OSA more suitable for applications where such operations are expected to be simple, like correcting minor spelling errors.

How It Works

The algorithm uses dynamic programming to compute the distance. The main idea is to build a matrix where each cell (i, j) represents the OSA distance between the first i characters of string s1 and the first j characters of string s2.

The algorithm proceeds as follows:

  1. Initialization:

    • The matrix is initialized with base cases where one string is empty.
  2. Filling the Matrix:

    • For each character in s1 and s2, calculate the cost of insertion, deletion, substitution, and transposition.
    • The value at each matrix cell is the minimum of these costs.
  3. Final Output:

    • The OSA distance is found in the cell corresponding to the comparison of the entire lengths of s1 and s2.

Example

Consider two strings: "example" and "exmaple".

The OSA distance between these two strings is 1 because you can transform "exmaple" into "example" by a single transposition of the characters 'm' and 'a'.

Another example:

int distance = OptimalStringAlignmentDistance("kitten", "sitting");
// Output: 3

Here, the distance is 3 due to the following operations:

  1. Substitution of 'k' with 's'.
  2. Substitution of 'e' with 'i'.
  3. Insertion of 'g' at the end.

Motivation

Why Use OSA Distance?

The OSA distance is particularly advantageous in applications where adjacent character transpositions are common. This is typically the case in the following scenarios:

  • Typographical Error Detection: When users accidentally swap two adjacent characters while typing.
  • Spelling Correction: In applications like search engines and text editors where suggestions for misspelled words are needed.
  • Natural Language Processing (NLP): In tasks like fuzzy string matching where small errors in word forms are common and should be efficiently handled.

Time Complexity

The time complexity of the OSA distance algorithm is O(n * m), where n is the length of the first string and m is the length of the second string. This makes it efficient for moderate-length strings but may become computationally expensive for very long strings. However, this complexity is comparable to other similar algorithms, such as Levenshtein and Damerau-Levenshtein, making OSA a practical choice for many real-world applications.

  • I have performed a self-review of my code
  • My code follows the style guidelines of this project
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Comments in areas I changed are up to date
  • I have added comments to hard-to-understand areas of my code
  • I have made corresponding changes to the README.md

@Kalkwst Kalkwst requested a review from siriak as a code owner August 21, 2024 07:10
Copy link

codecov bot commented Aug 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.95%. Comparing base (351b95b) to head (67cbcd5).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #464      +/-   ##
==========================================
- Coverage   94.97%   94.95%   -0.03%     
==========================================
  Files         237      238       +1     
  Lines       10058    10115      +57     
  Branches     1422     1431       +9     
==========================================
+ Hits         9553     9605      +52     
- Misses        389      392       +3     
- Partials      116      118       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Kalkwst Kalkwst force-pushed the feature/string-similarity/OptimalStringAlignment branch from 466bbb8 to 2f49a03 Compare August 21, 2024 07:21
@Kalkwst
Copy link
Contributor Author

Kalkwst commented Aug 21, 2024

I’m a bit confused about the Codecov report here. Even though this PR shows 100% coverage for the changes made, there are two indirect changes that are pulling down the overall coverage, which is blocking the PR.

I think this might be related to the fact that I refreshed my branch with the latest master, but I’m not entirely sure. What do you think about ignoring the Codecov warning for now, lowering the overall coverage, and opening an issue to address the coverage for the two classes that are not hitting the mark? This seems like the best route, though there’s a chance the fix might not happen.

Alternatively, I could add some unit tests to cover those two classes in this PR, but that would expand the scope quite a bit.

Let me know your thoughts!

@Kalkwst
Copy link
Contributor Author

Kalkwst commented Aug 21, 2024

I have opened an issue, and will try to fix the coverage in HashMap and TimSort, so let's keep this PR open for now, until I have a PR that closes #465

Copy link
Member

@siriak siriak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixing the issue first would be ideal even though it wasn't created by this PR. But if you don't have time or can't fix it for any other reason, I can merge this PR as-is. Let me know if I need to merge it sooner.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants