-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FMIndex gets wrong location after \0 #24
Comments
A related failure is this one:
Instead it reports |
I just verified that this is unrelated to the changes in 0.2.0; this behavior already exists before in the fid version. |
I noticed that these tests don't fail if I set level to 0. The first one doesn't fail with level 1 either, but the second one does. |
From trying to read the code, I think level 0 means one bit per character. Higher levels bring that size down to less bits per character (presumably at the cost of performance). But unfortunately higher levels appear to have bugs. :( |
The problem doesn't appear to be in the construction |
I continue to add more comments as I'm debugging this. It appears to behave correctly unless there is a \0 immediately before the pattern being searched. If there is any text in between, like |
I doubt the problem is in the wavelet matrix either (given that the bug persists in multiple implementations; and I did a few basic sanity checks with 0). I've looked into count character calculation. Count characters seems to be correct; it counts the correct amount of occurrences for 0. I don't know how @ajalab I have a work around for the various other issues I reported with very short strings, but this is a bit of a show stopper for my project using it, as I really need multiple \0 separated strings that are indexed. I guess I can continue with level(0) for a while, but I may need help debugging this from you! |
I am still thinking about this issue. I have read the following in the paper Fast in-memory XPath search using compressed indexes (section 3.2):
where $ is our null, and M is the (virtual) matrix. I never quite understood what this text about a special ordering was about, and I still don't understand it, but I'm starting to wonder whether this could be related to this problem? |
I still haven't figured out the root cause, but it looks there is a problem in suffix array value computation from the sampled array in // in fm_index.rs
#[test]
fn test_tmp_get_sa() {
let text = "world\0hello world\0";
let fm_index = FMIndex::new(text.to_string().into_bytes(), IdConverter::new(256), 2);
let i = fm_index.get_sa::<Local>(2) as usize;
assert_eq!(i, 11);
// assertion `left == right` failed
// left: 5
// right: 11
} The suffix array of the text should be I suspect there is a problem in For debugging purpose, let me show the table of sorted suffixes of the text.
|
It looks you are right. I found that the standard definition of LF-mapping does not work if there are multiple zeros. The below table shows sorted suffixes of a text
Notably,
Given a position
As implemented in where However, the results computed by the above definition do not match the actual positions.
I guess this is the reason why the paper introduces another data structure called Doc, which maps the positions of
Let's consider how we introduce the structure Doc to our library. |
Ah, so it makes sense FM-Index didn't work the way expected! I saw your tests using multiple Thanks so much for looking at this paper! I never connected the fact that the reason I have been staring at this paper for quite a while as I'm trying to implement it for XML data. But with fm-index integration I was on more shaky ground. I have a way to map tree leaf nodes with text to text identifiers (bitmap For my purposes it would be very useful to have the ability to obtain and use text identifiers too in the API. That is, an additional API on I have made implementations of I'd be happy to collaborate designing and implementing these APIs with you if you're open to these facilities. Tomorrow I will sit down and try to sketch out what could be a nice API we could implement. |
I've sketched out bits of such an API on the associated draft PR. The question is whether this functionality belongs in FM-Index or not. Obviously I like it. The tests currently suggest this support is there. But you can argue this stuff not really part of the original FM-Index feature set at all. Without deep integration of I think a text search index that is succinct and has support for multiple texts would be a cool library to have. The use case it opens up is for applications to give their vectors and strings to |
Thank you for drafting the API! I definitely agree that our |
I'm getting an idea about how the Doc data structure should be incorporated to FM-Index. If I understand correctly, that structure should be used along with C ( I'll try writing a sketch for the implementation. |
Awesome, this will open up so many new possibilities! Basically it can serve as a backend to serve strings while enabling search to any read-only data structure. Of course reconstructing the string may be too slow; the XPath paper mentions it keeps a separate uncompressed string around for that purpose. So perhaps we can add that as an optional feature to this library too - trading memory for performance, another option for the builder. |
Unfortunately, after some thoughts with pen and paper I found that my initial assumption was wrong. To make multi-text search work, we need to introduce two improvements:
1. Make sure LF-mapping (
|
As we discussed yesterday, I'll skip implementing the end-marker-order aware suffix array computation in |
The following test fails and I don't understand why.
hello
is at position 6, yet it returns 0.The text was updated successfully, but these errors were encountered: