Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix adding duplicate items #7

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Conversation

Bladieblah
Copy link

There's no check whether an item already exists in the tree, leading to unnecessarily deep trees, this should fix that issue.

@benhoyt
Copy link
Owner

benhoyt commented Nov 9, 2021

Thanks. This looks good at first glance, but I don't think it's correct. You can have multiple items with zero distance between them. For example, in the case of image hashing, consider the case when more than one image has the same image hash:

import collections
import pybktree

Image = collections.namedtuple('Image', 'data hash')

def image_distance(a, b):
    return pybktree.hamming_distance(a.hash, b.hash)

image1 = Image('one', 123)
image2 = Image('two', 42)
image3 = Image('three', 123)

t = pybktree.BKTree(image_distance)
t.add(image1)
t.add(image2)
t.add(image3)
print(list(t))
# [Image(data='one', hash=123), Image(data='two', hash=42), Image(data='three', hash=123)]

Iterating over the tree should produce all three images, even though image1 and image3 have the same hash.

What's your use case where this has come up?

@Bladieblah
Copy link
Author

Good point, I didn't realise hashes might be used. I was using it to cluster transaction data (textual data), which contain many duplicate samples. I ended with very deep trees, in one case a tree with 27k layers where it should have been 3 (which can still be many nodes of course). I based this change on the wikipedia article explaining the insertion procedure. The article also only considers textual data though, which probably explains the oversight.

In this case it could be solved by replacing the check on distance with a check if the two items are identical, right?

@Bladieblah
Copy link
Author

I added the check, but also kept the check on distance since that one is probably less expensive, especially in the case of images.

@benhoyt
Copy link
Owner

benhoyt commented Nov 10, 2021

Hmm, this does change the behavior (as shown by the failing doctests), because now you can't add more than one of the same item to the tree. This might be more correct, but I'm hesitant to change the behavior without releasing a new major version (which I'm probably not going to do at this point). My suggestion, if you need this, would be to check that you're not adding an equal/identical item with a separate dict.

@c01o
Copy link

c01o commented May 29, 2023

it's nice if we have a flag to modify the behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants