Token ID computation in BM25/BM42: Possible id overlap due to absolute value casting of hash #369

freinold · 2024-10-17T18:38:56Z

I only found this because I was curious how fastembed calculates token ids for bm25/bm42, so I took a deep dive:

Problem

The function compute_token_id takes the absolute value of the signed 32-bit integer returned by the mmh3 hash library.

    @classmethod
    def compute_token_id(cls, token: str) -> int:
        return abs(mmh3.hash(token))

This could lead to token id overlap: if the hash of "black" were -42 and the hash of "white" were 42 (an exaggerated example, of course), both tokens would get the token id 42.

Is this expected behaviour?
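
To make the overlap concrete, here is a minimal sketch that uses plain integers in place of real hash values. The helper name abs_token_id and the sample values are made up for illustration; they are not real mmh3 outputs for any token.

```python
# Illustration only: these integers stand in for signed 32-bit hash values.
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def abs_token_id(signed_hash: int) -> int:
    """Mirror the abs()-based mapping on a signed 32-bit hash value."""
    return abs(signed_hash)


# Two distinct hash values that differ only in sign map to the same id.
assert abs_token_id(-42) == abs_token_id(42) == 42

# The 2**32 possible hash values are folded into 2**31 + 1 possible ids
# (0 through 2**31), so the usable id space is roughly halved.
num_inputs = INT32_MAX - INT32_MIN + 1
num_ids = abs_token_id(INT32_MIN) + 1
assert num_inputs == 2**32
assert num_ids == 2**31 + 1
```

So besides ordinary hash collisions, every non-zero id can be reached from two different hash values.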

Proposed Solution

The hash call could instead return an unsigned 32-bit int (stored as a positive 64-bit int, which would be no problem for Qdrant since it is 64-bit native).
This could be achieved by specifying the following extra arguments:

    @classmethod
    def compute_token_id(cls, token: str) -> int:
        return mmh3.hash(token, seed=0, signed=False)
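
As a rough sketch of why the unsigned variant avoids the sign-folding overlap: reinterpreting a signed 32-bit value as unsigned keeps its low 32 bits, so values that differ only in sign stay distinct. The helper to_unsigned32 below is hypothetical and only mimics that reinterpretation on plain integers; I'm assuming it matches what mmh3 returns with signed=False.

```python
def to_unsigned32(signed_hash: int) -> int:
    """Reinterpret a signed 32-bit value as unsigned by keeping its low 32 bits."""
    return signed_hash & 0xFFFFFFFF


# Non-negative values are unchanged.
assert to_unsigned32(42) == 42

# Negative values land in the upper half of the unsigned range.
assert to_unsigned32(-42) == 2**32 - 42

# The pair that overlapped under abs() no longer shares an id.
assert to_unsigned32(-42) != to_unsigned32(42)

# Every result fits in an unsigned 32-bit int, and therefore in a 64-bit int.
assert 0 <= to_unsigned32(-2**31) < 2**32
```

This mapping is one-to-one over the 32-bit range, so the only remaining overlaps would be genuine hash collisions.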

I'm happy to provide a PR if you would like this behaviour to be changed.

joein · 2024-10-18T14:40:11Z

Hi @freinold,

Thank you for highlighting this issue!
We've decided to take a somewhat more sophisticated approach than the one suggested, and it requires changes in the core.
I'll post a link to the issue/PR in the core here once it is available.
