multiple texts #33

Draft
wants to merge 3 commits into
base: master


faassen
Collaborator

@faassen faassen commented Jan 17, 2025

As discussed in #24, FM-Index currently doesn't properly support multiple \0 bytes in the text.

It would be nice to support this. This PR is a draft for what the API to support this feature could look like:

  • a TextBuilder making it easy to construct zero-separated texts.

  • a LocationInfo struct that lets you obtain the text id for a location, as well as the text that a zero-separated location belongs to

  • a text method to get the original text belonging to such a zero-separated location (both on the index and on LocationInfo)

  • various new search methods that are boundary aware. Each of them should forbid the use of \0 in the query itself (maybe this should be forbidden in general; that depends on how well it works):

    • contains (which is basically the same as search)

    • starts_with

    • ends_with

    • exact

    • maybe in the future lexicographical queries too.
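To make the intended semantics concrete, here is a rough sketch of the pieces listed above. It assumes a naive linear scan over the concatenated bytes rather than FM-Index backward search, and the names `add_text`, `build`, `location_info`, `Boundary` and `matching_text_ids` are placeholders invented for illustration, not the API in this PR.

```rust
/// Hypothetical builder: concatenates texts, terminating each with a 0 byte.
#[derive(Default)]
pub struct TextBuilder {
    bytes: Vec<u8>,
}

impl TextBuilder {
    pub fn new() -> Self {
        Self::default()
    }

    /// Append one text. Panics if the text itself contains a 0 byte.
    pub fn add_text(mut self, text: &[u8]) -> Self {
        assert!(!text.contains(&0), "texts must not contain \\0");
        self.bytes.extend_from_slice(text);
        self.bytes.push(0);
        self
    }

    pub fn build(self) -> Vec<u8> {
        self.bytes
    }
}

/// Hypothetical result type: a position plus the text it falls into.
#[derive(Debug, PartialEq, Eq)]
pub struct LocationInfo<'a> {
    pub position: usize,
    pub text_id: usize,
    pub text: &'a [u8],
}

/// Resolve a position in the concatenated data to its text id and text.
pub fn location_info(data: &[u8], position: usize) -> LocationInfo<'_> {
    // The text id is the number of separators before the position.
    let text_id = data[..position].iter().filter(|&&b| b == 0).count();
    let start = data[..position].iter().rposition(|&b| b == 0).map_or(0, |i| i + 1);
    let end = data[position..].iter().position(|&b| b == 0).map_or(data.len(), |i| position + i);
    LocationInfo { position, text_id, text: &data[start..end] }
}

/// The four boundary-aware query kinds from the list above.
#[derive(Clone, Copy)]
pub enum Boundary {
    Contains,
    StartsWith,
    EndsWith,
    Exact,
}

/// Ids of the texts that match `pattern`; `None` if the pattern contains \0.
pub fn matching_text_ids(data: &[u8], pattern: &[u8], mode: Boundary) -> Option<Vec<usize>> {
    if pattern.contains(&0) {
        return None;
    }
    // Drop the final terminator so splitting does not yield a trailing empty text.
    let body = if data.ends_with(&[0]) { &data[..data.len() - 1] } else { data };
    let ids = body
        .split(|&b| b == 0)
        .enumerate()
        .filter(|(_, text)| match mode {
            Boundary::Contains => {
                pattern.is_empty() || text.windows(pattern.len()).any(|w| w == pattern)
            }
            Boundary::StartsWith => text.starts_with(pattern),
            Boundary::EndsWith => text.ends_with(pattern),
            Boundary::Exact => *text == pattern,
        })
        .map(|(id, _)| id)
        .collect();
    Some(ids)
}
```

With the texts "banana", "bandana" and "ana", a StartsWith query for "ban" would match the first two texts only, while an Exact query for "ana" would match only the third. A real implementation would answer these through the index instead of scanning.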

We could gate all these methods with a HasDoc trait so that they don't exist if you don't construct an index with a doc. That would allow us to implement this functionality on FMIndex only (or first) without worrying about RLFMIndex yet.
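One way the gating could work is sketched below, assuming a type parameter that records whether the index was built with a doc. `Index`, `NoDoc`, `Doc`, `with_doc` and `text_id` are stand-ins made up for this sketch; only the HasDoc name comes from the proposal, and the real FMIndex is not shaped like this.

```rust
/// Marker for an index built without text boundaries.
pub struct NoDoc;

/// Sorted positions of the 0 separators; a stand-in for a real doc structure.
pub struct Doc {
    separators: Vec<usize>,
}

/// Toy index, generic over whether boundary information is present.
pub struct Index<D> {
    data: Vec<u8>,
    doc: D,
}

impl Index<NoDoc> {
    pub fn new(data: Vec<u8>) -> Self {
        Index { data, doc: NoDoc }
    }
}

impl Index<Doc> {
    pub fn with_doc(data: Vec<u8>) -> Self {
        // Record where every separator sits; the positions come out sorted.
        let separators = data
            .iter()
            .enumerate()
            .filter_map(|(i, &b)| (b == 0).then_some(i))
            .collect();
        Index { data, doc: Doc { separators } }
    }
}

impl<D> Index<D> {
    /// Available on every index, with or without a doc.
    pub fn len(&self) -> usize {
        self.data.len()
    }
}

/// Boundary-aware methods live behind this trait.
pub trait HasDoc {
    fn text_id(&self, position: usize) -> usize;
}

impl HasDoc for Index<Doc> {
    fn text_id(&self, position: usize) -> usize {
        // Binary search: count the separators strictly before the position.
        self.doc.separators.partition_point(|&s| s < position)
    }
}
```

Because HasDoc is only implemented for `Index<Doc>`, calling `text_id` on an index created with `Index::new` would be a compile error rather than a runtime failure.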

@faassen faassen closed this Jan 17, 2025
@faassen faassen reopened this Jan 17, 2025
@faassen faassen marked this pull request as draft January 17, 2025 13:11
@@ -24,6 +25,53 @@ pub trait SearchIndex: BackwardIterableIndex {
{
Search::new(self).search(pattern)
}

// If we created a HasDoc trait (or something better named) for those
Owner


Another option might be to implement the new APIs on a new trait or struct, such as MultiTextSearchIndex, that wraps a BackwardIterableIndex instance.

  • We can avoid adding Doc data to every FM-Index variant like FMIndex and RLFMIndex.
  • We can consider different APIs for the single-text search index (SearchIndex) and the multi-text search index (MultiTextSearchIndex). For instance, a locate query on the former may just return pattern positions, whereas on the latter it may return positions together with a TextId.
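A minimal sketch of the wrapper idea follows. It assumes a stand-in trait `SingleTextLocate` in place of the crate's BackwardIterableIndex, and a naive scanning index in place of an FM-Index; both are invented here for illustration. The choice to report offsets relative to the start of each text is also just one possibility.

```rust
/// Stand-in for a single-text index; NOT the crate's BackwardIterableIndex.
pub trait SingleTextLocate {
    fn text(&self) -> &[u8];
    /// Positions of every occurrence of `pattern` in the whole text.
    fn locate(&self, pattern: &[u8]) -> Vec<usize>;
}

/// Naive scanning implementation, used only to make the sketch runnable.
pub struct NaiveIndex {
    pub data: Vec<u8>,
}

impl SingleTextLocate for NaiveIndex {
    fn text(&self) -> &[u8] {
        &self.data
    }

    fn locate(&self, pattern: &[u8]) -> Vec<usize> {
        if pattern.is_empty() {
            return Vec::new();
        }
        self.data
            .windows(pattern.len())
            .enumerate()
            .filter_map(|(i, w)| (w == pattern).then_some(i))
            .collect()
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TextId(pub usize);

/// Wrapper that adds text boundaries on top of any single-text index.
pub struct MultiTextSearchIndex<I> {
    inner: I,
    separators: Vec<usize>,
}

impl<I: SingleTextLocate> MultiTextSearchIndex<I> {
    pub fn new(inner: I) -> Self {
        let separators = inner
            .text()
            .iter()
            .enumerate()
            .filter_map(|(i, &b)| (b == 0).then_some(i))
            .collect();
        Self { inner, separators }
    }

    /// Like the inner locate, but each hit is (text id, offset within that text).
    pub fn locate(&self, pattern: &[u8]) -> Vec<(TextId, usize)> {
        self.inner
            .locate(pattern)
            .into_iter()
            .map(|pos| {
                // Number of separators before the hit gives the text id.
                let id = self.separators.partition_point(|&s| s < pos);
                let start = if id == 0 { 0 } else { self.separators[id - 1] + 1 };
                (TextId(id), pos - start)
            })
            .collect()
    }
}
```

The boundary data lives only in the wrapper, so the wrapped index type needs no Doc field of its own.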

Collaborator Author


Yes, I think it would be nice to have a search index API that doesn't allow any \0 (or at least one at the end), and a multi index that accepts \0 (or a builder pattern), that offers the extended APIs.
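For the single-text side, the check could be as small as the sketch below. It reads the parenthetical above as "no \0 anywhere, except optionally a single terminator at the very end", which is an assumption about the intent; `TextError` and `validate_single_text` are hypothetical names.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum TextError {
    /// A 0 byte was found somewhere other than the final position.
    InteriorZero { position: usize },
}

/// Accept a text for a single-text index: no 0 bytes, except optionally one at the end.
pub fn validate_single_text(text: &[u8]) -> Result<(), TextError> {
    // Ignore one trailing terminator, then reject any remaining 0 byte.
    let body = if text.ends_with(&[0]) { &text[..text.len() - 1] } else { text };
    match body.iter().position(|&b| b == 0) {
        Some(position) => Err(TextError::InteriorZero { position }),
        None => Ok(()),
    }
}
```

The multi-text index would skip this check and either accept \0 directly or take its input through a builder.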

So let's explore this. I'm still not entirely sure how to implement doc, so I hope to get this into a shape where you can help me implement it.

Owner


Got it! I'll also try to implement the multi-text search based on the APIs you outlined.
