Building ivf for large datasets #327

jenhsia · 2024-03-10T03:15:23Z

For a large corpus, _build_ivf(self) often gets stuck at the codes = codes.sort() step.
To avoid sorting a massive list, we can:

Create an ivf_dict which maps from partition index to the list of embedding indices that belong to that partition.
Using ivf_dict, we can easily create the following without sorting:

a list of embedding indices ordered by partition (ivf), by just concatenating the dictionary values in partition order, and
a list of the number of embeddings belonging to each partition (ivf_lengths).
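
The idea above can be sketched in plain Python. This is only an illustration under assumptions, not the code from this PR: the function name build_ivf_without_sort is hypothetical, codes is assumed to be a sequence where codes[i] is the partition index of the i-th embedding, and the real _build_ivf works on tensors rather than lists.

```python
from collections import defaultdict
from itertools import chain


def build_ivf_without_sort(codes, num_partitions):
    """Hypothetical helper: group embedding indices by partition without sorting.

    codes[i] is assumed to be the partition index of the i-th embedding.
    """
    # Map each partition index to the embedding indices assigned to it.
    ivf_dict = defaultdict(list)
    for emb_idx, partition in enumerate(codes):
        ivf_dict[partition].append(emb_idx)

    # Visit partitions in increasing order; empty partitions contribute nothing.
    buckets = [ivf_dict.get(p, []) for p in range(num_partitions)]

    # Concatenating the buckets gives the embedding indices ordered by partition.
    ivf = list(chain.from_iterable(buckets))
    # One length per partition, including zeros for empty partitions.
    ivf_lengths = [len(bucket) for bucket in buckets]
    return ivf, ivf_lengths


ivf, ivf_lengths = build_ivf_without_sort([2, 0, 2, 1, 0], num_partitions=4)
```

Because embeddings are appended in index order, each bucket is already ascending, so the result matches what a stable sort of codes would give. It is built in a single pass over the embeddings plus one pass over the partitions.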


jenhsia added 2 commits March 9, 2024 21:30

Add pid_list to searcher.collection, where the i-th element of the list corresponds to the pid of the i-th element

8214cc9

add ivf optimization

1e39efc
