Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory vs performance (and features) #43

Open
faassen opened this issue Jan 28, 2025 · 0 comments
Open

Memory vs performance (and features) #43

faassen opened this issue Jan 28, 2025 · 0 comments

Comments

@faassen
Copy link
Collaborator

faassen commented Jan 28, 2025

I think it's neat we have multiple levels of memory size and features. I'm going to sketch out what we have now with some speculative extensions. The more memory you give, the more features and performance becomes available. The total overhead of this index is still pretty low, though I'm not sure "with fast text access" grows it beyond the size other non-succinct indexes can offer.

Smallest

  • RLFMIndex - smallest but slowest
  • FMndex

With locate()

  • RLFMIndex with quantized suffix array
  • FMIndex with quantized suffix array

With multi text support (text ids, \0 string)

We add the "Doc array". This supports multiple separate texts to be searched separated by \0.
This can be used with or without locate() support. This allows starts-with and the like, as well as text ids.

This supports a usage scenario where if you have an immutable data structure you can have all its strings backed by
fm-index and still support search.

This feature can be type-driven: we cannot construct multiple texts (and reject a text with a \0 in the middle?) without enabling this feature first. The APIs surrounding this can all be behind a special type just like locate is.

With fast text access

We store the original text along with the index (and suffix array). This takes more memory, but reconstruction of strings is much faster. This use case is also described by the XPath paper. Can be used with or without the doc array.

The APIs don't need any changing; existing APIs for searching text and multi text just become faster as they use the real string as a backend.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant