Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix simhash slicing and add tests. Fixes #19. #20

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

tfmorris
Copy link
Contributor

This adds very basic tests for all the static methods in SimHashUtils and fixes the simhash slicing algorithm.

The fixed version uses the current text representation, but I'd actually suggest switching to just using Longs instead of text and computing the slices uses bitmasks. This will not only make the computation of the slices faster and easier to understand, but will speed up the sorting and comparisons during the shuffle phase of clustering (but it does require changes elsewhere in the system).

@habernal
Copy link
Contributor

Good catch, Tom! Since your contributions are getting non-trivial, I'd like to ask you for filling our contributor's license - this is what we apply for all DKPro open-source software (after discussions with the legal department at the Darmstadt Technical University). Please consult http://dkpro.github.io/contributing/ (you can send it via e-mail: [email protected] ). Thanks :)

@habernal
Copy link
Contributor

I hope it did not scare you off, Tom :)

  • I'm adding this to 1.0.1 milestone as we want to keep the functionality of 1.0.0 the very same as in the LREC article.

@habernal habernal added this to the 1.0.1 milestone Mar 21, 2016
@tfmorris
Copy link
Contributor Author

You didn't scare me off. :-) I just needed some quiet time to review the agreement.

Plus, as I suggested above, I've gone off on a different tack and implemented a new hashing scheme and built a little benchmarking framework so I can compare. I'll have more PRs in the pipe as soon as I stop playing around (and sign the CLA).

Also includes a more efficient slicing algorithm that
could be used, but requires changes elsewhere in the system
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants