Implementation of Cosine Similarity Algorithm for Strings #459

Merged
merged 3 commits into from
Aug 1, 2024

@Kalkwst commented Jul 31, 2024

Summary

This PR introduces an algorithm that calculates the Cosine Similarity between two strings. Cosine Similarity measures how similar two non-zero vectors of an inner product space are, and is defined as the cosine of the angle between them. This implementation determines how similar two strings are based on the frequencies of their characters; because character frequencies are never negative, the result lies between 0 (no characters in common) and 1 (identical frequency proportions).

Algorithm Overview

The Cosine Similarity algorithm works by representing each string as a vector of character frequencies, then calculating the cosine of the angle between these vectors. The steps involved are as follows:

  1. Vector Representation: Convert each string into a vector where each element represents the frequency of a character in the string.
  2. Dot Product Calculation: Compute the dot product of the two vectors.
  3. Magnitude Calculation: Calculate the magnitude (or length) of each vector.
  4. Similarity Calculation: Compute the cosine similarity as the dot product of the vectors divided by the product of their magnitudes.
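
Written as a formula, the four steps amount to the following, where A and B denote the character-frequency vectors of the two strings and c ranges over all characters (these symbols are chosen here for illustration and do not appear in the PR):

```latex
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{c} A_c B_c}{\sqrt{\sum_{c} A_c^{2}} \; \sqrt{\sum_{c} B_c^{2}}}
```

For example, for the strings "ab" and "a", A = (1, 1) and B = (1, 0) over the characters (a, b), so the dot product is 1, the magnitudes are √2 and 1, and the similarity is 1/√2 ≈ 0.707.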

Pseudocode

Below is a high-level pseudocode representation of the algorithm:

function calculateCosineSimilarity(string1, string2):
    // Step 1: Vector Representation
    vector1 = getCharacterFrequencyVector(string1)
    vector2 = getCharacterFrequencyVector(string2)
    
    // Step 2: Dot Product Calculation
    dotProduct = calculateDotProduct(vector1, vector2)
    
    // Step 3: Magnitude Calculation
    magnitude1 = calculateMagnitude(vector1)
    magnitude2 = calculateMagnitude(vector2)
    
    // Step 4: Similarity Calculation
    if magnitude1 == 0 or magnitude2 == 0:
        return 0.0
    else:
        return dotProduct / (magnitude1 * magnitude2)
    
function getCharacterFrequencyVector(string):
    vector = {}
    for character in string:
        if character in vector:
            vector[character] += 1
        else:
            vector[character] = 1
    return vector

function calculateDotProduct(vector1, vector2):
    dotProduct = 0
    for character in vector1:
        if character in vector2:
            dotProduct += vector1[character] * vector2[character]
    return dotProduct

function calculateMagnitude(vector):
    magnitude = 0
    for value in vector.values():
        magnitude += value * value
    return sqrt(magnitude)
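
For readers who want to try the pseudocode, here is a minimal runnable sketch in Python. It is not the code merged in this PR; the function name `cosine_similarity` and the use of `collections.Counter` are assumptions made for illustration, and the zero-magnitude handling simply mirrors the pseudocode above.

```python
import math
from collections import Counter


def cosine_similarity(first: str, second: str) -> float:
    """Cosine similarity of two strings based on character frequencies."""
    # Step 1: map each character to its number of occurrences.
    freq1 = Counter(first)
    freq2 = Counter(second)

    # Step 2: only characters present in both strings contribute to the dot product.
    dot_product = sum(count * freq2[ch] for ch, count in freq1.items() if ch in freq2)

    # Step 3: Euclidean length of each frequency vector.
    magnitude1 = math.sqrt(sum(count * count for count in freq1.values()))
    magnitude2 = math.sqrt(sum(count * count for count in freq2.values()))

    # Step 4: an empty string yields a zero vector; return 0.0 as the pseudocode does.
    if magnitude1 == 0 or magnitude2 == 0:
        return 0.0
    return dot_product / (magnitude1 * magnitude2)
```

Calling `cosine_similarity("abc", "xyz")` returns `0.0` because the strings share no characters, and an empty input also returns `0.0` rather than dividing by zero.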

Applications of Cosine Similarity

  1. Information Retrieval:
    • Search Engines: Enhancing search results by measuring the similarity between user queries and indexed documents.
    • Recommendation Systems: Recommending items (e.g., books, movies) based on similarity to user preferences or previous interactions.
  2. Data Mining:
    • Clustering: Grouping similar data points together in clustering algorithms, such as k-means clustering.
    • Anomaly Detection: Identifying outliers or unusual patterns in data by comparing similarity scores.
  3. Genomics:
    • Sequence Alignment: Measuring similarity between genetic sequences to identify homologous regions or evolutionary relationships.
  4. Image Analysis:
    • Object Recognition: Comparing feature vectors of images to recognize objects or patterns.
  5. Text Analysis and Natural Language Processing (NLP):
    • Document Similarity: Determining how similar two documents are, which is useful in plagiarism detection, document clustering, and search engines.
    • Sentence and Phrase Matching: Finding similarity between sentences or phrases for applications like question-answering systems and text summarization.


  • I have performed a self-review of my code
  • My code follows the style guidelines of this project
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Comments in areas I changed are up to date
  • I have added comments to hard-to-understand areas of my code
  • I have made corresponding changes to the README.md

@Kalkwst requested a review from siriak as a code owner July 31, 2024 09:04

codecov bot commented Jul 31, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.95%. Comparing base (b0838cb) to head (3d97073).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #459      +/-   ##
==========================================
+ Coverage   94.91%   94.95%   +0.03%     
==========================================
  Files         235      236       +1     
  Lines        9967    10022      +55     
  Branches     1408     1416       +8     
==========================================
+ Hits         9460     9516      +56     
+ Misses        391      389       -2     
- Partials      116      117       +1     

@siriak left a comment

Looks good, I just have one comment about a test

The test Calculate_PartiallyMatchingStrings_ReturnsCorrectValue previously used the formula 3 / (Math.Sqrt(5) * Math.Sqrt(5)) to calculate the expected value. Upon review, it was noticed that the expected value simplifies directly to 3/5.

This commit updates the test to use the simplified expected value of 3/5 instead of the more complex formula. This change makes the test easier to understand and maintains the same correctness.
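
As a quick sanity check of that simplification, the sketch below compares the two forms of the expected value in Python rather than in the project's own test code. The variable names are hypothetical, and the comparison uses a tolerance because `sqrt(5) * sqrt(5)` is not guaranteed to equal exactly 5 in floating-point arithmetic.

```python
import math

# Expected value as originally written in the test: 3 / (sqrt(5) * sqrt(5)).
original_expected = 3 / (math.sqrt(5) * math.sqrt(5))

# Simplified expected value: sqrt(5) * sqrt(5) is mathematically 5.
simplified_expected = 3 / 5

# Compare with a tolerance instead of ==, since the square roots are rounded.
values_agree = math.isclose(original_expected, simplified_expected)
print(values_agree)
```

The simplified form avoids the rounding introduced by the two square roots, which is one more reason it is the cleaner expected value for a test.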

My bad for not catching this simplification earlier!

@siriak left a comment

Looks good, thanks!

@siriak merged commit 9eb2196 into TheAlgorithms:master Aug 1, 2024
4 checks passed
@Kalkwst deleted the feature/cosine-similarity branch August 1, 2024 06:40