docs: Add benchmarks for fst
alexpovel authored Aug 21, 2023
1 parent 70e5918 commit ad32181
Showing 5 changed files with 124 additions and 36 deletions.
1 change: 1 addition & 0 deletions Cargo.toml
@@ -31,6 +31,7 @@ itertools = "0.11.0"

[dev-dependencies]
criterion = { version = "0.5.1", features = ["html_reports"] }
fst = "0.4.7"
phf = { version = "0.11.1", features = ["macros"] }
rstest = "0.18.1"
trie-rs = "0.1.1"
123 changes: 89 additions & 34 deletions README.md
@@ -79,21 +79,24 @@ The itch to be scratched is the following:
- the list is to be distributed as part of the binary

A couple possible approaches come to mind. The summary table, where `n` is the number of
words in the dictionary and `k` the number of characters in a word to look up, is (for
more context, see the individual sections below):

| Approach             | Pre-compile preprocessing[^1] | Compile time prepr. | Runtime lookup                | Binary size              |
| -------------------- | ----------------------------- | ------------------- | ----------------------------- | ------------------------ |
| `b4s`                | [`O(n log n)`][slice-sort]    | Single ref: `O(1)`  | [`O(log n)`][b4s-lib]         | `O(n)`                   |
| [`fst`][fst-repo]    | [`O(n log n)`][fst-build][^2] | Single ref: `O(1)`  | [`O(k)`][fst-lookup]          | [`< O(n)`][fst-size][^3] |
| [slice][slice]       | [`O(n log n)`][slice-sort]    | Many refs: `O(n)`   | [`O(log n)`][slice-binsearch] | `~ O(3n)`                |
| [`phf`][phf-repo]    | None                          | Many refs: `O(n)`   | Hash: `O(1)`                  | `~ O(3n)`                |
| [`HashSet`][hashset] | None                          | Many refs: `O(n)`   | Hash: `O(1)`                  | `~ O(3n)`                |
| padded `&str`        | [`~ O(n log n)`][pad-file]    | Single ref: `O(1)`  | Bin. search: `O(log n)`       | `~ O(n)`                 |

This crate is an attempt to provide a solution with:

1. **good, not perfect runtime performance**,
2. very little, [one-time](https://doc.rust-lang.org/cargo/reference/build-scripts.html#rerun-if-changed) compile-time preprocessing needed (just sorting),
3. **essentially no additional startup cost** (unlike, say, constructing a `HashSet` at
runtime)[^4],
4. **binary sizes as small as possible**,
5. **compile times as fast as possible**.

@@ -120,12 +123,7 @@ A simple slice is an obvious choice, and can be generated in a build script.
```rust
static WORDS: &[&str] = &["abc", "def", "ghi", "jkl"];

assert_eq!(WORDS.binary_search(&"ghi").unwrap(), 2);
```
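Generating that slice ahead of time is typically done in a build script. A minimal sketch, where the `words.txt` file name, the inline fallback list, and the output location are illustrative assumptions, not taken from this crate:

```rust
// build.rs (sketch): turn a newline-separated word list into a sorted
// `&[&str]` literal, ready to be `include!`d by the crate.
use std::{env, fs};

fn generate_slice_source(words: &mut Vec<&str>) -> String {
    words.sort_unstable(); // binary search requires sorted input
    let items: Vec<String> = words.iter().map(|w| format!("{w:?}")).collect();
    format!("static WORDS: &[&str] = &[{}];", items.join(", "))
}

fn main() {
    // Fall back to an inline list so the sketch also runs outside a build script.
    let raw = fs::read_to_string("words.txt").unwrap_or_else(|_| "def\nabc\njkl\nghi".into());
    let mut words: Vec<&str> = raw.lines().filter(|w| !w.is_empty()).collect();
    let source = generate_slice_source(&mut words);
    let out_dir = env::var("OUT_DIR").unwrap_or_else(|_| env::temp_dir().display().to_string());
    fs::write(format!("{out_dir}/words.rs"), source).expect("failed to write generated slice");
    println!("cargo:rerun-if-changed=words.txt");
}
```

The generated file would then be pulled in via `include!(concat!(env!("OUT_DIR"), "/words.rs"));`.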

There are two large pains in this approach:
@@ -143,9 +141,8 @@ There are two large pains in this approach:

### Hash Set

Regular [`HashSet`s][hashset] are not available at compile time. Crates like
[`phf`][phf-repo] change that:

```rust
use phf::{phf_set, Set};
@@ -157,19 +154,52 @@ static WORDS: Set<&'static str> = phf_set! {
"jkl"
};

assert!(WORDS.contains(&"ghi"));
```

Similar downsides as in the slice case apply: very long compile times, and
considerable binary bloat from the many individual references. A hash set is
ultimately a slice with computed indices, so this is expected.
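That "slice with computed indices" view can be made concrete with a toy, std-only sketch. Note that `phf` computes a *perfect* hash at build time; the FNV-1a hashing and linear probing here are merely for intuition:

```rust
// Toy illustration: a static "hash set" is a slot array plus an index
// computation. phf generates a perfect hash (no probing); this runtime
// version uses FNV-1a and linear probing just to show the idea.
fn fnv1a(s: &str) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in s.bytes() {
        h ^= u64::from(b);
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

fn build<'a>(words: &[&'a str], slots: usize) -> Vec<Option<&'a str>> {
    let mut table = vec![None; slots]; // must have slots > words.len()
    for &w in words {
        let mut i = (fnv1a(w) as usize) % slots;
        while table[i].is_some() {
            i = (i + 1) % slots; // linear probing on collision
        }
        table[i] = Some(w);
    }
    table
}

fn contains(table: &[Option<&str>], word: &str) -> bool {
    let slots = table.len();
    let mut i = (fnv1a(word) as usize) % slots;
    for _ in 0..slots {
        match table[i] {
            Some(w) if w == word => return true,
            Some(_) => i = (i + 1) % slots, // keep probing
            None => return false,           // empty slot: definitely absent
        }
    }
    false
}
```

With a perfect hash, the probing loop disappears entirely: every key maps directly to its slot.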

### Finite State Transducer/Acceptor (Automaton)

The [`fst`][fst-repo] crate is a fantastic candidate, brought up by its author (of
[`ripgrep`](https://github.com/BurntSushi/ripgrep) and
[`regex`](https://github.com/rust-lang/regex) fame) in a [discussion on
`b4s`](https://users.rust-lang.org/t/fast-string-lookup-in-a-single-str-containing-millions-of-unevenly-sized-substrings/98040/7?u=alexpovel):

```rust
use fst::Set; // Don't need FST, just FSA here

static WORDS: &[&str] = &["abc", "def", "ghi", "jkl"];

let set = Set::from_iter(WORDS.into_iter()).unwrap();
assert!(set.contains("ghi"));
```

It offers:

- [almost free (in time and space)
deserialization](https://users.rust-lang.org/t/fast-string-lookup-in-a-single-str-containing-millions-of-unevenly-sized-substrings/98040/9?u=alexpovel):
its serialization format is identical to its in-memory representation, unlike [other
solutions](#higher-order-data-structures), facilitating start-up performance
- compression[^3] (important for
[publishing](https://github.com/rust-lang/crates.io/issues/195)), making it the only
candidate in this comparison natively leading to *smaller* size than the original word
list
- extension points (fuzzy and case-insensitive searching, bring-your-own-automaton etc.)

`fst` might be the best choice for many applications, and better than `b4s` in most
scenarios. However, **lookup performance is worse than `b4s` by a factor of roughly 10** (so
`b4s` isn't obsolete... close call though 👀). For faster lookups, [but giving up
compression](https://users.rust-lang.org/t/fast-string-lookup-in-a-single-str-containing-millions-of-unevenly-sized-substrings/98040/13?u=alexpovel)
([TANSTAAFL](https://en.wikipedia.org/wiki/No_such_thing_as_a_free_lunch)!), try an
automaton from
[`regex-automata`](https://docs.rs/regex-automata/latest/regex_automata/dfa/index.html#example-deserialize-a-dfa).

Note that if your use case would otherwise require an initial decompression step,
`fst`'s built-in compression can offset its slower runtime lookups, so it might
still come out ahead overall.

### Single, sorted and padded string

Another approach could be to use a single string (saving pointer bloat), but pad all
@@ -184,7 +214,8 @@ static WORDS: &str = "abc␣␣def␣␣ghi␣␣jklmn";

The binary search implementation is then straightforward, as the elements are of known,
fixed lengths (in this case, 5). This approach was [found to not perform
well](#benchmarks). Find its (bare-bones) implementation in the
[benchmarks](./benches/main.rs).
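For intuition, here is a bare-bones, std-only sketch of that fixed-width binary search, using an ASCII space as the pad character so byte offsets and block boundaries coincide (the benchmarked version may differ in details):

```rust
use std::cmp::Ordering;

/// Binary-search one string made of equally sized, space-padded blocks.
fn contains_padded(haystack: &str, block_size: usize, needle: &str) -> bool {
    if needle.len() > block_size {
        return false; // too long to be any block
    }
    // Pad the needle to block width; padded words sort like the raw words,
    // since the space pad compares below all letters.
    let padded = format!("{needle: <width$}", width = block_size);
    let (mut lo, mut hi) = (0, haystack.len() / block_size);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        let block = &haystack[mid * block_size..(mid + 1) * block_size];
        match block.cmp(padded.as_str()) {
            Ordering::Less => lo = mid + 1,
            Ordering::Greater => hi = mid,
            Ordering::Equal => return true,
        }
    }
    false
}
```

With the haystack `"abc  def  ghi  jklmn"` and block size 5, lookups of `"ghi"` or `"jklmn"` succeed, while `"gh"` does not: its padded form `"gh   "` matches no block.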

### Higher-order data structures

@@ -268,15 +299,39 @@ The benchmarks are not terribly scientific (low sample sizes etc.), but serve as
guideline and sanity check. Run them yourself from the repository root with `cargo
install just && just bench`.


## Note on name

The 3-letter name is neat. Should you have a more meaningful, larger project that could
make better use of it, let me know. I might move this crate to a different name.

[^1]: Note that pre-compile preprocessing is ordinarily performed only **a single
time**, unless the word list itself changes. This column might be moot, and
considered essentially zero-cost. This viewpoint benefits this crate.
[^2]: Building itself is `O(n)`, but the raw input might be unsorted (as is assumed for
all other approaches as well). Sorting is `O(n log n)`, so building the automaton
collapses to `O(n + n log n)` = `O(n log n)`.
[^3]: As an automaton, the finite state transducer (in this case, finite state acceptor)
compresses all common prefixes, like a [trie](https://en.wikipedia.org/wiki/Trie),
**but also all suffixes**, unlike a prefix tree. That's a massive advantage should
compression be of concern. Languages like German benefit greatly. Take the example
of `übersehen`: its countless
[conjugations](https://www.duden.de/konjugation/uebersehen_uebersehen) share their
endings with many other words, so each ending is encoded only once in the entire
automaton. The prefix `über` is likewise shared among many words, and is also
encoded only once. Compression is built-in.
[^4]: The [program this crate was initially designed
for](https://github.com/alexpovel/betterletters) is sensitive to startup-time, as
the program's main processing is *rapid*. Even just 50ms of startup time would be
very noticeable, slowing down a program run by a factor of about 10.

[slice-sort]: https://doc.rust-lang.org/std/primitive.slice.html#method.sort
[fst-repo]: https://github.com/BurntSushi/fst
[fst-build]: https://docs.rs/fst/0.4.7/fst/struct.SetBuilder.html
[slice-binsearch]: https://doc.rust-lang.org/std/primitive.slice.html#method.binary_search
[phf-repo]: https://github.com/rust-phf/rust-phf
[hashset]: https://doc.rust-lang.org/std/collections/struct.HashSet.html
[pad-file]: ./benches/main.rs
[slice]: https://doc.rust-lang.org/std/primitive.slice.html
[b4s-lib]: ./src/lib.rs
[fst-lookup]: https://blog.burntsushi.net/transducers/#ordered-sets
[fst-size]: https://blog.burntsushi.net/transducers/#the-dictionary
Binary file modified assets/benchmark.png
Binary file modified assets/benchmarks-with-linear-search.png
36 changes: 34 additions & 2 deletions benches/main.rs
@@ -1,13 +1,22 @@
use b4s::AsciiChar;
use b4s::SortedString;
use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion};
use fst::Set;
use itertools::Itertools;
use std::collections::HashSet;

fn generate_hashset(words: Vec<&str>) -> HashSet<&str> {
HashSet::from_iter(words.iter().cloned())
}

fn generate_fst(words: Vec<&str>) -> Set<Vec<u8>> {
let mut builder = fst::SetBuilder::memory();
for word in words {
builder.insert(word).unwrap();
}
builder.into_set()
}

/// Turns a vector of strings into a single string like:
/// `Foo␠␠␠␠␠␠Hello␠␠␠␠World␠␠␠␠Bar␠␠␠␠␠␠Automatic`
fn generate_padded_string_without_delimiter(
@@ -25,6 +34,17 @@ fn generate_padded_string_without_delimiter(
out
}

fn get_words() -> Vec<&'static str> {
let mut words = include_str!("de-short.txt")
.split('\n')
.filter(|w| !w.is_empty())
.collect::<Vec<_>>();

words.sort();

words
}

/// Implements binary search over the output of `generate_padded_string_without_delimiter`.
fn binary_search_padded(word: &str, string: &str, block_size: usize) -> bool {
let num_blocks = string.len() / block_size;
@@ -86,16 +106,19 @@ pub fn criterion_bench(c: &mut Criterion) {
group.warm_up_time(std::time::Duration::from_secs(1)); // Default is 3s
}

let words = get_words();

for length in &[
10, 100, 1_000, 5_000, 10_000, 15_000, 20_000, 30_000, 50_000, 100_000, 200_000, 300_000,
400_000, 500_000,
] {
let words = compress_list(words.clone(), *length);

// Some hideous Hungarian notation going on, but whatever...
let words_set = generate_hashset(words.clone());

let words_fst = generate_fst(words.clone());

const DELIMITER: char = '\n';
let words_single_string_with_delimiter = words.clone().join(&DELIMITER.to_string());
let sorted_string = SortedString::new_checked(
@@ -145,6 +168,12 @@ pub fn criterion_bench(c: &mut Criterion) {
|b, i| b.iter(|| sorted_string.binary_search(black_box(i))),
);

group.bench_with_input(
BenchmarkId::new("fst", &parameter_string),
repr_word,
|b, i| b.iter(|| words_fst.contains(black_box(i))),
);

group.bench_with_input(
BenchmarkId::new("padded", &parameter_string),
repr_word,
@@ -160,6 +189,8 @@ pub fn criterion_bench(c: &mut Criterion) {
);

group.bench_with_input(
// Be careful: this is *much* slower than all others, making the
// `violin.svg` plot and its linear axis look useless.
BenchmarkId::new("linear", &parameter_string),
repr_word,
|b, i| b.iter(|| words.contains(black_box(i))),
@@ -171,7 +202,8 @@ pub fn criterion_bench(c: &mut Criterion) {
let results = vec![
words.binary_search(repr_word).is_ok(),
words_set.contains(repr_word),
sorted_string.binary_search(repr_word).is_ok(),
words_fst.contains(repr_word),
binary_search_padded(
repr_word,
&words_single_padded_string_without_delimiter,
