Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Hamming Distance KNN similarity metric for long property #193

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

htmlboss
Copy link

@htmlboss htmlboss commented May 11, 2022

This is a draft PR so I can leverage the github action to validate my tests, since the gradle setup for this project does not work on Apple silicon (M1). Not a Java expert so I can't quickly figure out a fix 🤷 Feel free to take a look, but I will mark this as ready for review when my changes are complete!

My end goal is to enable neo4j to build a KNN graph for finding top K similar images by leveraging perceptual hashes (pHash).
A pHash is simply a 64-bit unsigned integer (i.e. long in Java). I will not be adding any pHash-specific code to this PR. Any two pHashes can be compared for similarity by calculating the Hamming Distance between them.

Computing the Hamming Distance isn't specific to pHashes, and is useful in genetic computing, error detection/correction, text analysis, where both inputs are of equal length, and where graph structures are common. In my humble (and noob) opinion, it makes sense to offer this feature to the wider community.

From reading your docs, I believe the most natural area to extend with this metric is the LongPropertySimilarityComputer (https://neo4j.com/docs/graph-data-science/current/algorithms/knn/#_scalar_numbers). I intend for the existing default behaviour to remain unchanged, and to allow a new configuration of the KNN algorithm such that

nodeProperties: {
    pHashLongProperty: 'HAMMING_DISTANCE'
}

TODO: expand summary of changes + questions.
Changes summary:

  • Introduce a new metric for scalar numbers, HAMMING_DISTANCE, which will compute a 0-1 normalized double representing the Hamming Distance between two long properties.
  • Renamed the default LONG_PROPERTY_METRIC to NORMALIZED_ABSOLUTE_DIFFERENCE. Not really satisfied with the new name but I'm open to suggestions!

Before submitting this PR, please make sure:

  • You signed the Neo4j CLA (Contributor License Agreement) so that we are allowed to ship your code in our library (signed with [email protected]).
  • Your contribution is covered by tests

@htmlboss
Copy link
Author

Hmm looks like CI needs approval from first-time contributors. I'll boot up my linux box later to work on this!

Comment on lines 33 to 34
public static double longMetric(long left, long right) {
return Long.bitcount(left ^ right);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: normalize to 0.0-1.0 range.

@jjaderberg
Copy link
Contributor

the gradle setup for this project does not work on Apple silicon (M1)

@htmlboss what problems are you having? Several people working on this project use M1s. There were some problems with double precision but I think those were fixed--I'll ask. Feel free to open an issue for whatever problems you're seeing.

@Mats-SX
Copy link
Contributor

Mats-SX commented May 17, 2022

For the record, I'm working with an M1 Mac and I'm using the Zulu distribution of the JDK, currently version 11.0.14 (11.0.14-zulu). This works with the default setup. There is a part of the project that is geared towards JDK 17, but this is for the next version of Neo4j (5.0) which is still in development. You should be fine with JDK11, which we recommend.

However, it is possible to use JDK17 as well, and we routinely test for that also. There I would recommend the Temurin distribution, 17.0.3-tem.

@htmlboss
Copy link
Author

Thanks for the suggestions! I guess my IntelliJ Idea configuration isn't setup properly 😝 @Mats-SX I'll give your versions a shot when I get a chance!

@htmlboss htmlboss force-pushed the implement-hamming-distance-metric branch from 0d3e307 to d297bdb Compare May 20, 2022 16:18
@FlorentinD FlorentinD mentioned this pull request May 23, 2022
@htmlboss htmlboss force-pushed the implement-hamming-distance-metric branch from d297bdb to 3a6e285 Compare May 29, 2022 15:27
@htmlboss htmlboss force-pushed the implement-hamming-distance-metric branch from 3a6e285 to 23a0a4c Compare June 21, 2022 15:01
@htmlboss htmlboss changed the title [WIP] Add Hamming Distance KNN similarity metric for long property Add Hamming Distance KNN similarity metric for long property Jun 21, 2022
@FlorentinD
Copy link
Contributor

Hey,
@htmlboss I was just wondering if you found the time to try to out Mats suggestion.

Would love to know if there was any further blocker, if we can help you or if you just ended up with a different solution :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants