Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Efficient range operations over multiple Set objects #8

Open
davidblewett opened this issue Jan 9, 2018 · 12 comments
Open

Efficient range operations over multiple Set objects #8

davidblewett opened this issue Jan 9, 2018 · 12 comments

Comments

@davidblewett
Copy link
Contributor

I am building an application that generates new FST objects every hour. I would like to be able to do efficient range operations over a group of these FST files. Conceptually, something like combining the __getitem__ implementation with the OpBuilder. Basically, returning a KeyStreamIterator for multiple Set objects' range operation (pseudo code:

 s1 = Set.from_iter(["bar", "baz", "foo", "moo"])
 s2 = Set.from_iter(["bing", "jar, "foo", "moo"])
 s2 = Set.from_iter(["bap", "bonk, "foo", "moo"])
(s1 + s2 + s3)["b":"f"]

The existing API seems cumbersome for doing operations across multiple. You have to pick a (possibly arbitrary) instance, then call .union / .intersection etc with the rest of the objects. Would it be possible to have some sort of MultiSet class that could perform efficient calls across multiple Set objects?

I can take a crack at implementing it; it appears the underlying Rust API would support this.

@BurntSushi
Copy link

it appears the underlying Rust API would support this

It does indeed! Ping me if anyone needs help with this.

@davidblewett
Copy link
Contributor Author

@BurntSushi : can you give me an example of doing this in Rust? I can probably figure out how to translate it to Python/cffi. I'm wondering if we'd to add something to the C layer for this.

@BurntSushi
Copy link

@BurntSushi
Copy link

(That example does a little more than what you're asking, by applying a regex search. But you can just remove that bit.)

@davidblewett
Copy link
Contributor Author

@BurntSushi : I was looking at that, but the regex bit did indeed throw me off. I was looking more closely at OpBuilder; it appears to aggregate calls across multiple FST objects, but I don't see one for range. The range operations appear to use a StreamBuilder, which appears to only be offered by a single Set object?

@davidblewett
Copy link
Contributor Author

For example:
let mut stream = set.range().ge("b").lt("e").into_stream();

I don't see a corresponding .range() method for OpBuilder?

@BurntSushi
Copy link

@davidblewett You have to apply the range to each set, e.g.,

let (set1, set2, set3) = ...;
let mut stream = OpBuilder::new()
    .add(set1.range().ge("b").lt("e"))
    .add(set2.range().ge("b").lt("e"))
    .add(set3.range().ge("b").lt("e"))
    .union();

@davidblewett
Copy link
Contributor Author

Got it. Will fiddle with it.

@davidblewett
Copy link
Contributor Author

@jbaiter : it seems the C interface could be simplified a lot if instead of functions specifically for set::Difference / set::Intersection / set::Union , they were defined to take set::Stream? I'm going to have to add a version of the fst_set_opbuilder_* / free /next functions anyway for that type.

Are there some sort of safety guarantees in retaining the more specific versions?

@davidblewett
Copy link
Contributor Author

@BurntSushi : I tried in #10 , but hit a probably dumb Rust mistake on my part.

@davidblewett
Copy link
Contributor Author

@BurntSushi would you mind taking a look at #10 ? I'm not sure how to fix that compiler error.

@BurntSushi
Copy link

@davidblewett Ah yup! Apologies. I missed your previous ping. :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants