Thank you for highlighting this issue!
We've decided to take a somewhat more sophisticated approach than the one suggested, and it requires changes in the core.
I'll post a link here to the issue/PR in the core once it is available.
I found this only because I was curious how fastembed calculates token ids for bm25/bm42, so I took a deep dive:
Problem
The function compute_token_id takes the absolute value of the signed 32-bit integer returned by the mmh3 hash library.
This could lead to token id overlap, e.g. if hash of "black" is -42 and "white" is 42 (this is exaggerated of course), the token id would be 42 for both of them.
Is this expected behaviour?
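To make the overlap concrete, here is a minimal sketch using the made-up hash values from the example above. The function token_id_abs is a hypothetical stand-in for the abs-based computation described here, not the actual fastembed code, and the values -42 and 42 are illustrative, not real mmh3 output.

```python
def token_id_abs(signed_hash: int) -> int:
    """Hypothetical stand-in: token id as the absolute value of a signed hash."""
    return abs(signed_hash)


# Illustrative (made-up) signed 32-bit hash values for two different tokens.
hash_black = -42
hash_white = 42

ids = {
    "black": token_id_abs(hash_black),
    "white": token_id_abs(hash_white),
}

# Both tokens end up with the same id, so they become indistinguishable.
collides = ids["black"] == ids["white"]
print(ids, collides)
```

Any pair of tokens whose signed hashes differ only in sign would collide in the same way.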
Proposed Solution
The hash call could instead return an unsigned 32-bit int (cast to a positive 64-bit int, which would be no problem for qdrant since it is 64-bit native).
This could be achieved by specifying the following extra arguments:
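As a rough sketch of the idea, a signed 32-bit hash can be reinterpreted as unsigned by masking it with 0xFFFFFFFF, which keeps all 2^32 values distinct. The helper to_unsigned32 below is hypothetical and uses only the standard library, so it runs without mmh3. As far as I know, mmh3.hash also accepts a signed keyword that should give the same result directly, but treat that as an assumption to check against the installed version.

```python
def to_unsigned32(signed_hash: int) -> int:
    """Hypothetical helper: reinterpret a signed 32-bit int as unsigned.

    Negative values map into the upper half of the 32-bit range,
    so -42 and 42 no longer share an id.
    """
    return signed_hash & 0xFFFFFFFF


# Illustrative (made-up) signed hash values, as in the example above.
id_black = to_unsigned32(-42)
id_white = to_unsigned32(42)

print(id_black, id_white, id_black != id_white)
```

The resulting ids stay below 2^32, so they fit comfortably in a 64-bit integer.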
I'm happy to provide a PR if you would like this behaviour to be changed.