multiple texts #33

Draft
wants to merge 3 commits into
base: master
1 change: 1 addition & 0 deletions src/lib.rs
@@ -145,6 +145,7 @@ mod rlfmi;
mod sais;
mod seal;
mod search;
mod text_builder;
mod util;

pub use crate::fm_index::FMIndex;
129 changes: 129 additions & 0 deletions src/search.rs
@@ -8,6 +8,7 @@ use crate::character::Character;
use crate::fm_index::FMIndex;
#[cfg(doc)]
use crate::rlfmi::RLFMIndex;
use crate::text_builder::TextId;

/// A search index.
///
@@ -24,6 +25,53 @@ pub trait SearchIndex: BackwardIterableIndex {
{
Search::new(self).search(pattern)
}

// If we created a HasDoc trait (or something better named) for those
Owner

Another option would be to implement the new APIs on a new trait or struct, such as MultiTextSearchIndex, that wraps a BackwardIterableIndex instance.

  • We can avoid adding Doc data to every FM-Index variant like FMIndex and RLFMIndex.
  • We can design different APIs for the single-text search index SearchIndex and the multi-text search index MultiTextSearchIndex. For instance, a locate query in the former may just return pattern positions, whereas in the latter it may return positions together with a TextId.
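
A minimal sketch of the wrapper idea above, under loud assumptions: `SingleTextIndex` and `NaiveIndex` are hypothetical stand-ins (a linear scan, not an FM-index) so the example runs on its own, and `from_texts` is an invented constructor. Only the shape is the point: the wrapper owns the per-text start offsets, so the wrapped index needs no extra data, and `locate` returns a `TextId` alongside each position.

```rust
/// Hypothetical stand-in for a single-text index; not the crate's real trait.
pub trait SingleTextIndex {
    /// Start offsets of every occurrence of `pattern` in the indexed text.
    fn locate(&self, pattern: &[u8]) -> Vec<u64>;
}

/// Naive linear-scan index, used only to make the sketch runnable.
pub struct NaiveIndex {
    text: Vec<u8>,
}

impl SingleTextIndex for NaiveIndex {
    fn locate(&self, pattern: &[u8]) -> Vec<u64> {
        // `windows(0)` would panic, so reject the empty pattern up front.
        if pattern.is_empty() {
            return Vec::new();
        }
        self.text
            .windows(pattern.len())
            .enumerate()
            .filter(|(_, w)| *w == pattern)
            .map(|(i, _)| i as u64)
            .collect()
    }
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct TextId(usize);

/// Wraps a single-text index and remembers where each text starts.
pub struct MultiTextSearchIndex<I> {
    inner: I,
    /// Start offset of each text in the concatenation, in increasing order.
    starts: Vec<u64>,
}

impl MultiTextSearchIndex<NaiveIndex> {
    /// Concatenate `texts`, terminating each one with a 0 byte.
    pub fn from_texts(texts: &[&str]) -> Self {
        let mut text = Vec::new();
        let mut starts = Vec::new();
        for t in texts {
            starts.push(text.len() as u64);
            text.extend_from_slice(t.as_bytes());
            text.push(0);
        }
        MultiTextSearchIndex {
            inner: NaiveIndex { text },
            starts,
        }
    }
}

impl<I: SingleTextIndex> MultiTextSearchIndex<I> {
    /// Occurrences as (text id, offset within that text).
    pub fn locate(&self, pattern: &[u8]) -> Vec<(TextId, u64)> {
        self.inner
            .locate(pattern)
            .into_iter()
            .map(|pos| {
                // Index of the last start offset that is <= pos.
                let i = self.starts.partition_point(|&s| s <= pos) - 1;
                (TextId(i), pos - self.starts[i])
            })
            .collect()
    }
}
```

With the texts "banana" and "bandana", locating `b"ana"` reports two hits in the first text and one in the second, each with an offset relative to its own text rather than to the concatenation.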

Collaborator Author

Yes, I think it would be nice to have a search index API that doesn't allow any \0 (or at least one at the end), and a multi index that accepts \0 (or a builder pattern), that offers the extended APIs.

So let's explore this. I'm still not entirely sure how to implement doc, so I hope to get this into a shape where you can help me implement it.
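
As a rough illustration of the "no \0 in a single-text index" rule, here is a sketch of an input check. It reads the parenthetical as "a single terminating 0 is tolerated", which is an assumption, and `check_single_text` and `InteriorZero` are hypothetical names that do not exist in the crate.

```rust
/// Error describing the first misplaced 0 byte (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct InteriorZero {
    pub position: usize,
}

/// Accept a text for a single-text index: no 0 bytes, except that one
/// trailing 0 is tolerated (assumed reading of the rule).
pub fn check_single_text(text: &[u8]) -> Result<(), InteriorZero> {
    // Strip one trailing 0, if present, before scanning.
    let body = match text.split_last() {
        Some((&0, rest)) => rest,
        _ => text,
    };
    match body.iter().position(|&b| b == 0) {
        Some(position) => Err(InteriorZero { position }),
        None => Ok(()),
    }
}
```

A multi-text builder would skip this check for its own separators but could still apply it to each part that is added.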

Owner

Got it! I'll also try to implement the multi-text search based on the APIs you outlined.

// indexes that maintain the `Doc` structure, we could move the following
// methods into a trait that depends on that.

    /// Given a text id, return the text associated with it.
    ///
    /// This is the actual text, excluding zero separators.
    fn text(&self, id: TextId) -> Vec<Self::T> {
        todo!()
    }

    /// Search for texts that contain pattern.
    ///
    /// This is identical to search(), except if pattern were to
    /// contain a null character. (should we allow it?)
    fn search_contains<K>(&self, pattern: K) -> Search<Self>
    where
        K: AsRef<[Self::T]>,
    {
        todo!();
    }

    /// Search for texts that start with pattern.
    fn search_starts_with<K>(&self, pattern: K) -> Search<Self>
    where
        K: AsRef<[Self::T]>,
    {
        todo!();
    }

    /// Search for texts that end with pattern.
    fn search_ends_with<K>(&self, pattern: K) -> Search<Self>
    where
        K: AsRef<[Self::T]>,
    {
        todo!();
    }

    /// Search for texts that are exactly pattern.
    fn search_exact<K>(&self, pattern: K) -> Search<Self>
    where
        K: AsRef<[Self::T]>,
    {
        todo!();
    }
}

impl<I: BackwardIterableIndex> SearchIndex for I {}
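
One way the four anchored searches (contains, starts with, ends with, exact) could eventually be expressed is by wrapping the pattern in separators before an ordinary substring search. The sketch below is not the PR's implementation: it assumes every text has a 0 on both sides, which means a leading 0 before the first text that the PR's builder does not add, and it uses a naive scan instead of the FM-index. `Anchor`, `anchored_pattern`, `concat`, and `count` are hypothetical names.

```rust
/// How a query is anchored relative to a single text.
pub enum Anchor {
    Contains,
    StartsWith,
    EndsWith,
    Exact,
}

/// Build the byte pattern to look up in a 0-delimited concatenation.
pub fn anchored_pattern(pattern: &[u8], anchor: Anchor) -> Vec<u8> {
    let (lead, trail) = match anchor {
        Anchor::Contains => (false, false),
        Anchor::StartsWith => (true, false),
        Anchor::EndsWith => (false, true),
        Anchor::Exact => (true, true),
    };
    let mut out = Vec::with_capacity(pattern.len() + 2);
    if lead {
        out.push(0);
    }
    out.extend_from_slice(pattern);
    if trail {
        out.push(0);
    }
    out
}

/// Concatenate texts as 0 t1 0 t2 0 ... (the leading 0 is an assumption).
pub fn concat(texts: &[&str]) -> Vec<u8> {
    let mut out = vec![0u8];
    for t in texts {
        out.extend_from_slice(t.as_bytes());
        out.push(0);
    }
    out
}

/// Naive occurrence count, standing in for an index lookup.
pub fn count(haystack: &[u8], needle: &[u8]) -> usize {
    if needle.is_empty() {
        return 0;
    }
    haystack
        .windows(needle.len())
        .filter(|w| *w == needle)
        .count()
}
```

For the anchored variants each occurrence corresponds to exactly one text, whereas a plain contains query can match several times inside the same text. The scheme only works if the user's pattern itself contains no 0, which bears on the open question in the doc comment above.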
@@ -135,4 +183,85 @@ where
}
results
}

    /// List the position of all occurrences with an iterator.
    ///
    /// TODO: we could also provide an `IntoIterator` for search that returns this.
    pub fn locate_iter(&self) -> LocationInfoIterator<I> {
        LocationInfoIterator::new(self.index, self.s, self.e)
    }
}

pub struct LocationInfoIterator<'a, I>
where
    I: SearchIndex + HasPosition,
{
    index: &'a I,
    k_iterator: std::ops::Range<u64>,
}

impl<'a, I> LocationInfoIterator<'a, I>
where
    I: SearchIndex + HasPosition,
{
    pub fn new(index: &'a I, start: u64, end: u64) -> Self {
        LocationInfoIterator {
            index,
            k_iterator: start..end,
        }
    }
}

impl<'a, I> Iterator for LocationInfoIterator<'a, I>
where
    I: SearchIndex + HasPosition,
{
    type Item = LocationInfo<'a, I>;

    fn next(&mut self) -> Option<Self::Item> {
        let k = self.k_iterator.next()?;
        Some(LocationInfo {
            index: self.index,
            k,
        })
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        self.k_iterator.size_hint()
    }
}

pub struct LocationInfo<'a, I>
where
    I: SearchIndex + HasPosition,
{
    index: &'a I,
    k: u64,
}

impl<I> LocationInfo<'_, I>
where
    I: SearchIndex + HasPosition,
{
    /// the position of a location within the larger text
    pub fn position(&self) -> u64 {
        self.index.get_sa::<seal::Local>(self.k)
    }

    // the existence of the following methods could depend on the
    // `HasDoc` trait.

    /// the text id that the location belongs to.
    ///
    /// Each 0 separated text has a unique id identifying it.
    pub fn text_id(&self) -> TextId {
        todo!()
    }

    /// the original text at this text id
    ///
    /// This does not include the 0 characters at its boundaries.
    pub fn text(&self) -> Vec<I::T> {
        todo!()
    }
}
44 changes: 44 additions & 0 deletions src/text_builder.rs
@@ -0,0 +1,44 @@
use crate::Character;

/// A text builder lets you construct a text from multiple parts.
/// Internally each part is separated by the 0 character.
pub struct TextBuilder<T>
where
    T: Character,
{
    id_counter: usize,
    text: Vec<T>,
}

/// A unique id identifying this text.
#[derive(Clone, Copy, Debug, Eq, PartialEq, PartialOrd, Ord, Hash)]
pub struct TextId(usize);

impl<T> TextBuilder<T>
where
    T: Character,
{
    /// Create a new empty text builder.
    pub fn new() -> TextBuilder<T> {
        TextBuilder {
            id_counter: 0,
            text: vec![],
        }
    }

    /// Add a text to the builder.
    ///
    /// Returns a unique id for this text.
    // `pub` so the builder is usable from outside this module.
    pub fn add_text(&mut self, text: &[T]) -> TextId {
        let id = TextId(self.id_counter);
        self.id_counter += 1;
        self.text.extend_from_slice(text);
        self.text.push(T::zero());
        id
    }

    /// Finish the build and return the text.
    pub fn build(self) -> Vec<T> {
        self.text
    }
}
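
To make the builder's contract concrete, here is a self-contained usage sketch. It is an approximation rather than the PR's code: `ByteTextBuilder` is a byte-only stand-in for the generic `TextBuilder<T>` so the example needs no `Character` trait, and `text_of` is a hypothetical helper showing how a text could be recovered from the built concatenation by its id.

```rust
#[derive(Clone, Copy, Debug, Eq, PartialEq)]
pub struct TextId(usize);

/// Byte-only stand-in for the generic builder (assumption for this sketch).
#[derive(Default)]
pub struct ByteTextBuilder {
    id_counter: usize,
    text: Vec<u8>,
}

impl ByteTextBuilder {
    pub fn new() -> Self {
        Self::default()
    }

    /// Append `text` followed by a 0 terminator and return its id.
    pub fn add_text(&mut self, text: &[u8]) -> TextId {
        let id = TextId(self.id_counter);
        self.id_counter += 1;
        self.text.extend_from_slice(text);
        self.text.push(0);
        id
    }

    /// Finish the build and return the concatenated text.
    pub fn build(self) -> Vec<u8> {
        self.text
    }
}

/// Recover the text for `id` from a built concatenation (hypothetical helper).
pub fn text_of(built: &[u8], id: TextId) -> Option<&[u8]> {
    // Every text is terminated by 0, so the number of 0 bytes is the
    // number of texts; splitting on 0 yields them in id order.
    let count = built.iter().filter(|&&b| b == 0).count();
    if id.0 >= count {
        return None;
    }
    built.split(|&b| b == 0).nth(id.0)
}
```

Adding "abc" and then "de" yields ids 0 and 1 and the bytes `abc\0de\0`. The linear scan in `text_of` is only for illustration; an index would more likely store the separator offsets.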