Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FMIndex gets wrong location after \0 #24

Open
faassen opened this issue Jan 9, 2025 · 17 comments
Open

FMIndex gets wrong location after \0 #24

faassen opened this issue Jan 9, 2025 · 17 comments

Comments

@faassen
Copy link
Collaborator

faassen commented Jan 9, 2025

    #[test]
    fn test_locate_after_null() {
        let text = "world\0hello world\0".to_string().into_bytes();
        let fm_index = FMIndex::new(
            text,
            IdConverter::new(256),
            SuffixOrderSampler::new().level(2),
        );
        let search = fm_index.search_backward("hello");
        let mut res = search.locate();
        res.sort();
        assert_eq!(res, vec![6]);
    }

The following test fails and I don't understand why. hello is at position 6, yet it returns 0.

@faassen
Copy link
Collaborator Author

faassen commented Jan 9, 2025

A related failure is this one:

    #[test]
    fn test_locate_after_null2() {
        let text = "world\0hello world\0hello\0".to_string().into_bytes();
        let fm_index = FMIndex::new(
            text,
            IdConverter::new(256),
            SuffixOrderSampler::new().level(2),
        );
        let search = fm_index.search_backward("hello");
        let mut res = search.locate();
        res.sort();
        assert_eq!(res, vec![6, 18]);
    }

Instead it reports [18, 18].

@faassen
Copy link
Collaborator Author

faassen commented Jan 9, 2025

I just verified that this is unrelated to the changes in 0.2.0; this behavior already exists before in the fid version.

@faassen
Copy link
Collaborator Author

faassen commented Jan 9, 2025

I noticed that these tests don't fail if I set level to 0. The first one doesn't fail with level 1 either, but the second one does.

@faassen
Copy link
Collaborator Author

faassen commented Jan 9, 2025

From trying to read the code, I think level 0 means one bit per character. Higher levels bring that size down to less bits per character (presumably at the cost of performance). But unfortunately higher levels appear to have bugs. :(

@faassen
Copy link
Collaborator Author

faassen commented Jan 9, 2025

The problem doesn't appear to be in the construction sais:sais, as if I replace this with the implementation get_suffix_array I found in the tests, the bug is exactly the same.

@faassen
Copy link
Collaborator Author

faassen commented Jan 9, 2025

I continue to add more comments as I'm debugging this. It appears to behave correctly unless there is a \0 immediately before the pattern being searched. If there is any text in between, like \0afoo, where a is in the middle, it appears to behave correctly.

@faassen
Copy link
Collaborator Author

faassen commented Jan 9, 2025

I doubt the problem is in the wavelet matrix either (given that the bug persists in multiple implementations; and I did a few basic sanity checks with 0).

I've looked into count character calculation. Count characters seems to be correct; it counts the correct amount of occurrences for 0.

I don't know how cs is used exactly yet; bucket start position calculation seems to create a 0 for character 0 always. Perhaps this confuses lf_map or get_sa somehow, but I don't understand enough about this to help.

@ajalab I have a work around for the various other issues I reported with very short strings, but this is a bit of a show stopper for my project using it, as I really need multiple \0 separated strings that are indexed. I guess I can continue with level(0) for a while, but I may need help debugging this from you!

@faassen
Copy link
Collaborator Author

faassen commented Jan 14, 2025

I am still thinking about this issue. I have read the following in the paper Fast in-memory XPath search using compressed indexes (section 3.2):

Because there are several $'s in T, we fix a special ordering such that the end marker of the i-th text appears at F[i] in M (see Figure 1, bottom right). This generates a valid Tbwt of all the texts and makes it easy to extract the i-th text starting from its $-terminator.

where $ is our null, and M is the (virtual) matrix.

I never quite understood what this text about a special ordering was about, and I still don't understand it, but I'm starting to wonder whether this could be related to this problem?

@ajalab ajalab mentioned this issue Jan 14, 2025
@ajalab
Copy link
Owner

ajalab commented Jan 16, 2025

I still haven't figured out the root cause, but it looks there is a problem in suffix array value computation from the sampled array in FMIndex::get_sa.

// in fm_index.rs
    #[test]
    fn test_tmp_get_sa() {
        let text = "world\0hello world\0";
        let fm_index = FMIndex::new(text.to_string().into_bytes(), IdConverter::new(256), 2);

        let i = fm_index.get_sa::<Local>(2) as usize;
        assert_eq!(i, 11);
        // assertion `left == right` failed
        //  left: 5
        // right: 11
    }

The suffix array of the text should be SA = [17, 5, 11, 16, ...] but it returns 5 as SA[2].

I suspect there is a problem in get_sa or lf_map when dealing with multiple zeros.


For debugging purpose, let me show the table of sorted suffixes of the text.

  • p is a row number.
  • SA is the position of suffix in the original text, and hence the column is the suffix array of the table.
  • L is the Burrows-Wheeler transform of the text.
 p SA text             L
 0 17 $                d
 1  5 $hello world$    d
 2 11  world$          o
 3 16 d$               l
 4  4 d$hello world$   l
 5  7 ello world$      h
 6  6 hello world$     $
 7 15 ld$              r
 8  3 ld$hello world$  r
 9  8 llo world$       e
10  9 lo world$        l
11 10 o world$         l
12 13 orld$            w
13  1 orld$hello world$w
14 14 rld$             o
15  2 rld$hello world$ o
16 12 world$            
17  0 world$hello world$

@ajalab
Copy link
Owner

ajalab commented Jan 16, 2025

I never quite understood what this text about a special ordering was about, and I still don't understand it, but I'm starting to wonder whether this could be related to this problem?

It looks you are right. I found that the standard definition of LF-mapping does not work if there are multiple zeros.

The below table shows sorted suffixes of a text a$c$b$ ($ represents zero).

      suffix
 p SA F    L
-------------
 0  5 $    c
 1  1 $a$c$b
 2  3 $c$  a
 3  2 a$c$ $
 4  0 b$a$c$
 5  4 c$   $

Notably,

  • F column ($$$abc) has the first characters of suffixes. They are sorted in lexicographic order.
  • L column (cba$$$) has the last characters of the rotated substrings (extended from each suffix). They are Burrows-Wheeler transform of the original text and $T^{bwt}$ in the paper.

Given a position $p$, the LF-Mapping $LF(p)$ (or last-to-first mapping in Wikipedia) computes the position of the character $L[p]$ in the array F. For instance,

  • $L[3]$ (the 3rd $ after b in the original text) is at F[1], so $LF(3) = 1$.
  • $L[4]$ (the 2nd $ after c in the original text) is at F[0], so $LF(4) = 0$.
  • $L[5]$ (the 1st $ after a in the original text) is at F[2], so $LF(5) = 2$.

As implemented in FMIndex::lf_map and mentioned in Fast in-memory XPath search using compressed indexes paper, FM-Index computes LF-mapping with the below formula:

$$LF(p) = C[L[p]] + \mathrm{rank}_{L[p]}(L, p),$$

where $C[c]$ (cs in the crate) is the total number of characters smaller than $c$.

However, the results computed by the above definition do not match the actual positions.

  • $LF(3) = C[\$] + \mathrm{rank}_\$(L, 3) = 0 + 0 = 0$.
  • $LF(4) = C[\$] + \mathrm{rank}_\$(L, 4) = 0 + 1 = 1$.
  • $LF(5) = C[\$] + \mathrm{rank}_\$(L, 5) = 0 + 2 = 2$.

I guess this is the reason why the paper introduces another data structure called Doc, which maps the positions of $s in L to the occurrences in the original text (or text identifiers in the paper).

Now $T^{bwt}$ contains all end-markers in some permuted order. This permutation is represented with a data structure Doc, that maps from positions of $s in $T^{bwt}$ to text identifiers.

  • $Doc[\mathrm{rank}_\$(L, 3)] = Doc[0] = 2$ (3rd $).
  • $Doc[\mathrm{rank}_\$(L, 4)] = Doc[1] = 1$ (2nd $).
  • $Doc[\mathrm{rank}_\$(L, 5)] = Doc[2] = 0$ (1st $).

Let's consider how we introduce the structure Doc to our library.

@faassen
Copy link
Collaborator Author

faassen commented Jan 16, 2025

Ah, so it makes sense FM-Index didn't work the way expected! I saw your tests using multiple $ so I thought it was a solved problem, and only slowly did I realize the connection to this and the mysterious section in the XPath paper.

Thanks so much for looking at this paper! I never connected the fact that the reason Doc is introduced is to solve the $ position problem - with my limited understanding of fm-index I got lost reading that part.

I have been staring at this paper for quite a while as I'm trying to implement it for XML data. vers-vec has a branch which contains a balanced parentheses implementation. In my implementation I have also connected tags to nodes as described by the paper.

But with fm-index integration I was on more shaky ground. I have a way to map tree leaf nodes with text to text identifiers (bitmap B in the paper). I also used a rank/select bitmap to determine the text id based on position, which I think deviates from the paper.

For my purposes it would be very useful to have the ability to obtain and use text identifiers too in the API.

That is, an additional API on Search that gives matching text identifiers. Also an API that given a text identifier gives back the range of positions from the starting $ to the ending $.

I have made implementations of starts-with, ends-with, = and contains in my XML codebase as well, but I think it would be quite neat if we added them to fm-index itself as it looks like we can using the Doc structure.

I'd be happy to collaborate designing and implementing these APIs with you if you're open to these facilities. Tomorrow I will sit down and try to sketch out what could be a nice API we could implement.

@faassen
Copy link
Collaborator Author

faassen commented Jan 17, 2025

I've sketched out bits of such an API on the associated draft PR.

The question is whether this functionality belongs in FM-Index or not. Obviously I like it. The tests currently suggest this support is there. But you can argue this stuff not really part of the original FM-Index feature set at all.

Without deep integration of Doc, my use case for FM-Index is really blocked I think (unless I disable sampling), so that speaks for adding this feature.

I think a text search index that is succinct and has support for multiple texts would be a cool library to have. The use case it opens up is for applications to give their vectors and strings to FM-Index, storing in a really compact form yet have a good way to do search in it.

@ajalab
Copy link
Owner

ajalab commented Jan 18, 2025

Thank you for drafting the API! I definitely agree that our fm-index crate should support multiple texts by adopting the Doc data structure. I'll take a look at #33.

@ajalab
Copy link
Owner

ajalab commented Jan 26, 2025

I'm getting an idea about how the Doc data structure should be incorporated to FM-Index. If I understand correctly, that structure should be used along with C (cs in our crate) when computing LF-mapping lf_map and FL-mapping fl_map.

I'll try writing a sketch for the implementation.

@faassen
Copy link
Collaborator Author

faassen commented Jan 26, 2025

Awesome, this will open up so many new possibilities! Basically it can serve as a backend to serve strings while enabling search to any read-only data structure.

Of course reconstructing the string may be too slow; the XPath paper mentions it keeps a separate uncompressed string around for that purpose. So perhaps we can add that as an optional feature to this library too - trading memory for performance, another option for the builder.

@ajalab
Copy link
Owner

ajalab commented Jan 29, 2025

I'm getting an idea about how the Doc data structure should be incorporated to FM-Index. If I understand correctly, that structure should be used along with C (cs in our crate) when computing LF-mapping lf_map and FL-mapping fl_map.

Unfortunately, after some thoughts with pen and paper I found that my initial assumption was wrong.

To make multi-text search work, we need to introduce two improvements:

  1. Make sure LF-mapping (lf_map) work for multi-texts
  2. Introduce Doc data structure.

1. Make sure LF-mapping (lf_map) work for multi-texts

To achieve this, we need to take the ordering of end markers into consideration.

Because there are several $'s in T, we fix a special ordering such that the end marker of the i-th text appears at F[i] in M (see Figure 1, bottom right).

This means we need to distinguish every end marker in a joined text. For instance, if we have a joined text $T = \mathrm{foo$bar$baz$}$ (where $ is an end marker), we have to treat it as $T = \mathrm{foo$_0bar$_1baz$_2}$.

This matters to the construction of a suffix array (sa) and C array (cs).

  • We need to fix sais to recognize the ordering of end markers when computing the suffix array. I believe this can be achieved by carrying around the index of each character to process, but it will be a bit difficult as the original algorithm is complicated.
  • We need to make sure C array retain the number of end markers before the given end marker like $C[$_2] = 2$, so that the LF-mapping definition $LF(p) = C[L[p]] + \mathrm{rank}_{L[p]}(L, p)$ must work for multi-texts as well. Currently, the C array (cs) is represented as Vec<usize>, and we need to consider better representation for this. One promising option would be splitting C array for usual characters and end markers, and make the latter optional. Perhaps the latter can be implemented efficiently using bit vector.

2. Introduce Doc data structure.

After the first step is done, we can introduce Doc data structure for advanced queries like start-with. The construction of Doc can be done with a temporary data structure that maps the start position of a text to its text ID (e.g., for $T = \mathrm{foo$_0bar$_1baz$_2}$, $\lbrace0 \mapsto 0, 4 \mapsto 1, 8 \mapsto 2\rbrace$ ). This can be a hash map or a sparse bit vector.

@ajalab
Copy link
Owner

ajalab commented Jan 30, 2025

As we discussed yesterday, I'll skip implementing the end-marker-order aware suffix array computation in sais. Instead, I'll use a naive suffix array construction algorithm for PoC first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants