docs: Add benchmarks for fst
alexpovel authored Aug 21, 2023
1 parent 70e5918 commit ad32181
Showing 5 changed files with 124 additions and 36 deletions.
1 change: 1 addition & 0 deletions Cargo.toml
@@ -31,6 +31,7 @@ itertools = "0.11.0"

[dev-dependencies]
criterion = { version = "0.5.1", features = ["html_reports"] }
fst = "0.4.7"
phf = { version = "0.11.1", features = ["macros"] }
rstest = "0.18.1"
trie-rs = "0.1.1"
123 changes: 89 additions & 34 deletions README.md
@@ -79,21 +79,24 @@ The itch to be scratched is the following:
- the list is to be distributed as part of the binary

A couple possible approaches come to mind. The summary table, where `n` is the number of
words in the dictionary and `k` the number of characters in a word to look up, is (for
more context, see the individual sections below):

| Approach             | Pre-compile preprocessing[^1] | Compile time prepr. | Runtime lookup                | Binary size              |
| -------------------- | ----------------------------- | ------------------- | ----------------------------- | ------------------------ |
| `b4s`                | [`O(n log n)`][slice-sort]    | Single ref: `O(1)`  | [`O(log n)`][b4s-lib]         | `O(n)`                   |
| [`fst`][fst-repo]    | [`O(n log n)`][fst-build][^2] | Single ref: `O(1)`  | [`O(k)`][fst-lookup]          | [`< O(n)`][fst-size][^3] |
| [slice][slice]       | [`O(n log n)`][slice-sort]    | Many refs: `O(n)`   | [`O(log n)`][slice-binsearch] | `~ O(3n)`                |
| [`phf`][phf-repo]    | None                          | Many refs: `O(n)`   | Hash: `O(1)`                  | `~ O(3n)`                |
| [`HashSet`][hashset] | None                          | Many refs: `O(n)`   | Hash: `O(1)`                  | `~ O(3n)`                |
| padded `&str`        | [`~ O(n log n)`][pad-file]    | Single ref: `O(1)`  | Bin. search: `O(log n)`       | `~ O(n)`                 |

This crate is an attempt to provide a solution with:

1. **good, not perfect runtime performance**,
2. very little, [one-time](https://doc.rust-lang.org/cargo/reference/build-scripts.html#rerun-if-changed) compile-time preprocessing needed (just sorting),
3. **essentially no additional startup cost** (unlike, say, constructing a `HashSet` at
runtime)[^4],
4. **binary sizes as small as possible**,
5. **compile times as fast as possible**.

@@ -120,12 +123,7 @@ A simple slice is an obvious choice, and can be generated in a build script.
```rust
static WORDS: &[&str] = &["abc", "def", "ghi", "jkl"];

assert_eq!(WORDS.binary_search(&"ghi").unwrap(), 2);
```
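Generating that slice ahead of time is typically done in a build script. A minimal sketch, where the `words.txt` file name, the inline fallback list, and the output location are illustrative assumptions, not taken from this crate:

```rust
// build.rs (sketch): turn a newline-separated word list into a sorted
// `&[&str]` literal, ready to be `include!`d by the crate.
use std::{env, fs};

fn generate_slice_source(words: &mut Vec<&str>) -> String {
    words.sort_unstable(); // binary search requires sorted input
    let items: Vec<String> = words.iter().map(|w| format!("{w:?}")).collect();
    format!("static WORDS: &[&str] = &[{}];", items.join(", "))
}

fn main() {
    // Fall back to an inline list so the sketch also runs outside a build script.
    let raw = fs::read_to_string("words.txt").unwrap_or_else(|_| "def\nabc\njkl\nghi".into());
    let mut words: Vec<&str> = raw.lines().filter(|w| !w.is_empty()).collect();
    let source = generate_slice_source(&mut words);
    let out_dir = env::var("OUT_DIR").unwrap_or_else(|_| env::temp_dir().display().to_string());
    fs::write(format!("{out_dir}/words.rs"), source).expect("failed to write generated slice");
    println!("cargo:rerun-if-changed=words.txt");
}
```

The generated file would then be pulled in via `include!(concat!(env!("OUT_DIR"), "/words.rs"));`.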

There are two large pains in this approach:
@@ -143,9 +141,8 @@ There are two large pains in this approach:

### Hash Set

Regular [`HashSet`s][hashset] are not available at compile time. Crates like
[`phf`][phf-repo] change that:

```rust
use phf::{phf_set, Set};
@@ -157,19 +154,52 @@ static WORDS: Set<&'static str> = phf_set! {
"jkl"
};

assert!(WORDS.contains(&"ghi"));
```

Similar downsides as in the slice case apply: very long compile times, and
considerable binary bloat from the many individual references. A hash set is
ultimately a slice with computed indices, so this is expected.
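That "slice with computed indices" view can be made concrete with a toy, std-only sketch. Note that `phf` computes a *perfect* hash at build time; the FNV-1a hashing and linear probing here are merely for intuition:

```rust
// Toy illustration: a static "hash set" is a slot array plus an index
// computation. phf generates a perfect hash (no probing); this runtime
// version uses FNV-1a and linear probing just to show the idea.
fn fnv1a(s: &str) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in s.bytes() {
        h ^= u64::from(b);
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

fn build<'a>(words: &[&'a str], slots: usize) -> Vec<Option<&'a str>> {
    let mut table = vec![None; slots]; // must have slots > words.len()
    for &w in words {
        let mut i = (fnv1a(w) as usize) % slots;
        while table[i].is_some() {
            i = (i + 1) % slots; // linear probing on collision
        }
        table[i] = Some(w);
    }
    table
}

fn contains(table: &[Option<&str>], word: &str) -> bool {
    let slots = table.len();
    let mut i = (fnv1a(word) as usize) % slots;
    for _ in 0..slots {
        match table[i] {
            Some(w) if w == word => return true,
            Some(_) => i = (i + 1) % slots, // keep probing
            None => return false,           // empty slot: definitely absent
        }
    }
    false
}
```

With a perfect hash, the probing loop disappears entirely: every key maps directly to its slot.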

### Finite State Transducer/Acceptor (Automaton)

The [`fst`][fst-repo] crate is a fantastic candidate, brought up by its author (of
[`ripgrep`](https://github.com/BurntSushi/ripgrep) and
[`regex`](https://github.com/rust-lang/regex) fame) in a [discussion on
`b4s`](https://users.rust-lang.org/t/fast-string-lookup-in-a-single-str-containing-millions-of-unevenly-sized-substrings/98040/7?u=alexpovel):

```rust
use fst::Set; // Don't need FST, just FSA here

static WORDS: &[&str] = &["abc", "def", "ghi", "jkl"];

let set = Set::from_iter(WORDS.into_iter()).unwrap();
assert!(set.contains("ghi"));
```

It offers:

- [almost free (in time and space)
deserialization](https://users.rust-lang.org/t/fast-string-lookup-in-a-single-str-containing-millions-of-unevenly-sized-substrings/98040/9?u=alexpovel):
its serialization format is identical to its in-memory representation, unlike [other
solutions](#higher-order-data-structures), facilitating start-up performance
- compression[^3] (important for
[publishing](https://github.com/rust-lang/crates.io/issues/195)), making it the only
candidate in this comparison natively leading to *smaller* size than the original word
list
- extension points (fuzzy and case-insensitive searching, bring-your-own-automaton etc.)

`fst` might be the best choice for many applications, and better than `b4s` in most
scenarios. However, **lookup performance is worse than `b4s` by a factor of roughly 10** (so
`b4s` isn't obsolete... close call though 👀). For faster lookups, [but giving up
compression](https://users.rust-lang.org/t/fast-string-lookup-in-a-single-str-containing-millions-of-unevenly-sized-substrings/98040/13?u=alexpovel)
([TANSTAAFL](https://en.wikipedia.org/wiki/No_such_thing_as_a_free_lunch)!), try an
automaton from
[`regex-automata`](https://docs.rs/regex-automata/latest/regex_automata/dfa/index.html#example-deserialize-a-dfa).

Note that if your use case would otherwise require an initial decompression step,
`fst`'s built-in compression can offset its slower runtime lookups, so it might
still come out ahead overall.

### Single, sorted and padded string

Another approach could be to use a single string (saving pointer bloat), but pad all
@@ -184,7 +214,8 @@ static WORDS: &str = "abc␣␣def␣␣ghi␣␣jklmn";

The binary search implementation is then straightforward, as the elements are of known,
fixed lengths (in this case, 5). This approach was [found to not perform
well](#benchmarks). Find its (bare-bones) implementation in the
[benchmarks](./benches/main.rs).
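For intuition, here is a bare-bones, std-only sketch of that fixed-width binary search, using an ASCII space as the pad character so byte offsets and block boundaries coincide (the benchmarked version may differ in details):

```rust
use std::cmp::Ordering;

/// Binary-search one string made of equally sized, space-padded blocks.
fn contains_padded(haystack: &str, block_size: usize, needle: &str) -> bool {
    if needle.len() > block_size {
        return false; // too long to be any block
    }
    // Pad the needle to block width; padded words sort like the raw words,
    // since the space pad compares below all letters.
    let padded = format!("{needle: <width$}", width = block_size);
    let (mut lo, mut hi) = (0, haystack.len() / block_size);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        let block = &haystack[mid * block_size..(mid + 1) * block_size];
        match block.cmp(padded.as_str()) {
            Ordering::Less => lo = mid + 1,
            Ordering::Greater => hi = mid,
            Ordering::Equal => return true,
        }
    }
    false
}
```

With the haystack `"abc  def  ghi  jklmn"` and block size 5, lookups of `"ghi"` or `"jklmn"` succeed, while `"gh"` does not: its padded form `"gh   "` matches no block.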

### Higher-order data structures

@@ -268,15 +299,39 @@ The benchmarks are not terribly scientific (low sample sizes etc.), but serve as
guideline and sanity check. Run them yourself from the repository root with `cargo
install just && just bench`.


## Note on name

The 3-letter name is neat. Should you have a more meaningful, larger project that could
make better use of it, let me know. I might move this crate to a different name.

[^1]: Note that pre-compile preprocessing is ordinarily performed only **a single
time**, unless the word list itself changes. This column might be moot, and
considered essentially zero-cost. This viewpoint benefits this crate.
[^2]: Building itself is `O(n)`, but the raw input might be unsorted (as is assumed for
all other approaches as well). Sorting is `O(n log n)`, so building the automaton
collapses to `O(n + n log n)` = `O(n log n)`.
[^3]: As an automaton, the finite state transducer (in this case, finite state acceptor)
compresses all common prefixes, like a [trie](https://en.wikipedia.org/wiki/Trie),
**but also all suffixes**, unlike a prefix tree. That's a massive advantage should
compression be of concern. Languages like German benefit greatly. Take the example
of `übersehen`: its countless
[conjugations](https://www.duden.de/konjugation/uebersehen_uebersehen) share their
endings with many other words, so each ending is encoded only once in the entire
automaton. The prefix `über` is likewise shared among many words, and is also
encoded only once. Compression is built-in.
[^4]: The [program this crate was initially designed
for](https://github.com/alexpovel/betterletters) is sensitive to startup-time, as
the program's main processing is *rapid*. Even just 50ms of startup time would be
very noticeable, slowing down a program run by a factor of about 10.

[slice-sort]: https://doc.rust-lang.org/std/primitive.slice.html#method.sort
[fst-repo]: https://github.com/BurntSushi/fst
[fst-build]: https://docs.rs/fst/0.4.7/fst/struct.SetBuilder.html
[slice-binsearch]: https://doc.rust-lang.org/std/primitive.slice.html#method.binary_search
[phf-repo]: https://github.com/rust-phf/rust-phf
[hashset]: https://doc.rust-lang.org/std/collections/struct.HashSet.html
[pad-file]: ./benches/main.rs
[slice]: https://doc.rust-lang.org/std/primitive.slice.html
[b4s-lib]: ./src/lib.rs
[fst-lookup]: https://blog.burntsushi.net/transducers/#ordered-sets
[fst-size]: https://blog.burntsushi.net/transducers/#the-dictionary
Binary file modified assets/benchmark.png
Binary file modified assets/benchmarks-with-linear-search.png
36 changes: 34 additions & 2 deletions benches/main.rs
@@ -1,13 +1,22 @@
use b4s::AsciiChar;
use b4s::SortedString;
use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion};
use fst::Set;
use itertools::Itertools;
use std::collections::HashSet;

fn generate_hashset(words: Vec<&str>) -> HashSet<&str> {
HashSet::from_iter(words.iter().cloned())
}

fn generate_fst(words: Vec<&str>) -> Set<Vec<u8>> {
let mut builder = fst::SetBuilder::memory();
for word in words {
builder.insert(word).unwrap();
}
builder.into_set()
}

/// Turns a vector of strings into a single string like:
/// `Foo␠␠␠␠␠␠Hello␠␠␠␠World␠␠␠␠Bar␠␠␠␠␠␠Automatic`
fn generate_padded_string_without_delimiter(
@@ -25,6 +34,17 @@ fn generate_padded_string_without_delimiter(
out
}

fn get_words() -> Vec<&'static str> {
let mut words = include_str!("de-short.txt")
.split('\n')
.filter(|w| !w.is_empty())
.collect::<Vec<_>>();

words.sort();

words
}

/// Implements binary search over the output of `generate_padded_string_without_delimiter`.
fn binary_search_padded(word: &str, string: &str, block_size: usize) -> bool {
let num_blocks = string.len() / block_size;
@@ -86,16 +106,19 @@ pub fn criterion_bench(c: &mut Criterion) {
group.warm_up_time(std::time::Duration::from_secs(1)); // Default is 3s
}

let words = get_words();

for length in &[
10, 100, 1_000, 5_000, 10_000, 15_000, 20_000, 30_000, 50_000, 100_000, 200_000, 300_000,
400_000, 500_000,
] {
let words = compress_list(words.clone(), *length);

// Some hideous Hungarian notation going on, but whatever...
let words_set = generate_hashset(words.clone());

let words_fst = generate_fst(words.clone());

const DELIMITER: char = '\n';
let words_single_string_with_delimiter = words.clone().join(&DELIMITER.to_string());
let sorted_string = SortedString::new_checked(
@@ -145,6 +168,12 @@ pub fn criterion_bench(c: &mut Criterion) {
|b, i| b.iter(|| sorted_string.binary_search(black_box(i))),
);

group.bench_with_input(
BenchmarkId::new("fst", &parameter_string),
repr_word,
|b, i| b.iter(|| words_fst.contains(black_box(i))),
);

group.bench_with_input(
BenchmarkId::new("padded", &parameter_string),
repr_word,
@@ -160,6 +189,8 @@ pub fn criterion_bench(c: &mut Criterion) {
);

group.bench_with_input(
// Be careful: this is *much* slower than all others, making the
// `violin.svg` plot and its linear axis look useless.
BenchmarkId::new("linear", &parameter_string),
repr_word,
|b, i| b.iter(|| words.contains(black_box(i))),
@@ -171,7 +202,8 @@ pub fn criterion_bench(c: &mut Criterion) {
let results = vec![
words.binary_search(repr_word).is_ok(),
words_set.contains(repr_word),
sorted_string.binary_search(repr_word).is_ok(),
words_fst.contains(repr_word),
binary_search_padded(
repr_word,
&words_single_padded_string_without_delimiter,
