
Inaccurate size limit approximation #1264


Closed

codido opened this issue May 13, 2025 · 4 comments

@codido

codido commented May 13, 2025

What version of regex are you using?

1.11.1

Describe the bug at a high level.

While RegexBuilder::size_limit() is documented as setting only an approximate size limit, the approximation can be quite inaccurate, in some cases off by a factor of ~3.

What are the steps to reproduce the behavior?

The following (admittedly hacky) benchmark shows the difference between the minimal size limit (within a margin of error) required to build a regular expression and the actual heap usage:

use std::fs::read_to_string;

use {regex::RegexBuilder, tikv_jemallocator::Jemalloc};

#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

const MIN_WORDS: usize = 1024;
const MAX_WORDS: usize = 16384;
const WORDS_STEP: usize = 16;
const MAX_REGEX_SIZE: usize = 8 * 1024 * 1024;
const REGEX_SIZE_MARGIN: usize = 4096;

fn estimate_heap_size<T>(thing: T) -> usize {
    let epoch = tikv_jemalloc_ctl::epoch::mib().unwrap();
    let allocated = tikv_jemalloc_ctl::stats::allocated::mib().unwrap();

    // Advance the epoch so jemalloc refreshes its cached stats, then record
    // the allocated byte count while `thing` is still alive.
    epoch.advance().unwrap();
    let with = allocated.read().unwrap();

    drop(thing);

    // Refresh and read again after dropping; the difference approximates the
    // heap memory that was owned by `thing`.
    epoch.advance().unwrap();
    let without = allocated.read().unwrap();

    with.saturating_sub(without)
}

// Binary-search for (approximately) the smallest size limit that still allows
// `pattern` to compile, then measure the built regex's actual heap usage.
fn benchmark_regex_size(pattern: &str) -> Option<(usize, usize)> {
    let mut lower = 0;
    let mut upper = MAX_REGEX_SIZE;

    loop {
        let limit = lower + (upper - lower).div_ceil(2);
        if limit == MAX_REGEX_SIZE {
            return None;
        }

        if let Ok(re) = RegexBuilder::new(pattern).size_limit(limit).build() {
            if upper - lower <= REGEX_SIZE_MARGIN {
                let heap_size = estimate_heap_size(re);
                return Some((limit, heap_size));
            }

            upper = limit;
        } else {
            lower = limit;
        }
    }
}

fn main() {
    let words = read_to_string("/usr/share/dict/words").unwrap();

    println!("No. Words, Size Limit, Heap Size");
    for i in (MIN_WORDS..=MAX_WORDS).step_by(WORDS_STEP) {
        print!("{i}, ");
        let regex = words.lines().take(i).collect::<Vec<_>>().join("|");
        if let Some((regex_size, heap_size)) = benchmark_regex_size(&regex) {
            println!("{regex_size}, {heap_size}");
        } else {
            println!("-, -");
        }
    }
}

It relies on the following dependencies:

[dependencies]
regex = "1.11.1"
tikv-jemalloc-ctl = { version = "0.6", features = ["stats"] }
tikv-jemallocator = "0.6"

What is the actual behavior?

The following chart shows size limits vs. the actual heap usage:

[Chart: number of words vs. minimal size limit and measured heap size]

@BurntSushi
Member

Please provide a concrete example.

@codido
Author

codido commented May 13, 2025

Taking one of the first examples from the benchmark above, for brevity:

1080|10-point|10th|11-point|12-point|16-point|18-point|1st|2|20-point|2,4,5-t|2,4-d|2D|2nd|30-30|3-D|3-d|3D|3M|3rd|48-point|4-D|4GL|4H|4th|5-point|5-T|5th|6-point|6th|7-point|7th|8-point|8th|9-point|9th|-a|A|A.|a|a'|a-|a.|A-1|A1|a1|A4|A5|AA|aa|A.A.A.|AAA|aaa|AAAA|AAAAAA|AAAL|AAAS|Aaberg|Aachen|AAE|AAEE|AAF|AAG|aah|aahed|aahing|aahs|AAII|aal|Aalborg|Aalesund|aalii|aaliis|aals|Aalst|Aalto|AAM|aam|AAMSI|Aandahl|A-and-R|Aani|AAO|AAP|AAPSS|Aaqbiye|Aar|Aara|Aarau|AARC|aardvark|aardvarks|aardwolf|aardwolves|Aaren|Aargau|aargh|Aarhus|Aarika|Aaron|aaron|Aaronic|aaronic|Aaronical|Aaronite|Aaronitic|Aaron's-beard|Aaronsburg|Aaronson|AARP|aarrgh|aarrghh|Aaru|AAS|aas|A'asia|aasvogel|aasvogels|AAU|AAUP|AAUW|AAVSO|AAX|A-axes|A-axis|A.B.|AB|Ab|ab|ab-|A.B.A.|ABA|Aba|aba|Ababa|Ababdeh|Ababua|abac|abaca|abacas|abacate|abacaxi|abacay|abaci|abacinate|abacination|abacisci|abaciscus|abacist|aback|abacli|Abaco|abacot|abacterial|abactinal|abactinally|abaction|abactor|abaculi|abaculus|abacus|abacuses|Abad|abada|Abadan|Abaddon|abaddon|abadejo|abadengo|abadia|Abadite|abaff|abaft|Abagael|Abagail|Abagtha|Abailard|abaisance|abaised|abaiser|abaisse|abaissed|abaka|Abakan|abakas|Abakumov|abalation|abalienate|abalienated|abalienating|abalienation|abalone|abalones|Abama|abamp|abampere|abamperes|abamps|Abana|aband|abandon|abandonable|abandoned|abandonedly|abandonee|abandoner|abandoners|abandoning|abandonment|abandonments|abandons|abandum|abanet|abanga|Abanic|abannition|Abantes|abapical|abaptiston|abaptistum|Abarambo|Abarbarea|Abaris|abarthrosis|abarticular|abarticulation|Abas|abas|abase|abased|abasedly|abasedness|abasement|abasements|abaser|abasers|abases|Abasgi|abash|abashed|abashedly|abashedness|abashes|abashing|abashless|abashlessly|abashment|abashments|abasia|abasias|abasic|abasing|abasio|abask|abassi|Abassieh|Abassin|abastard|abastardize|abastral|abatable|abatage|Abate|abate|abated|abatement|abatements|abater|abaters|abates|abatic|abating|abatis|abatised|abatises|abatjour|abatjours|abaton|abator|abators|ABATS|abattage|abattis|abattised|abattises|abattoir|abattoirs|abattu|abattue|Abatua|abature|abaue|abave|abaxial|abaxile|abay|abayah|abaze|abb|Abba|abba|abbacies|abbacomes|abbacy|Abbadide|Abbai|abbandono|abbas|abbasi|Abbasid|abbasid|abbassi|Abbassid|Abbasside|Abbate|abbate|abbatial|abbatical|abbatie|abbaye|Abbe|abbe|abbes|abbess|abbesses|abbest|Abbevilean|Abbeville|Abbevillian|abbevillian|Abbey|abbey|abbeys|abbeystead|abbeystede|Abbi

The minimal size limit for building this pattern appears to be 78688 bytes, but the estimated heap usage is around 240k.
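
For a quick check of that threshold outside the benchmark (a sketch; it assumes the pattern above is saved in a file named pattern), one can try building with limits just below and above the reported value:

use std::fs::read_to_string;

use regex::RegexBuilder;

fn main() {
    let pattern = read_to_string("pattern").unwrap();
    let pattern = pattern.trim();

    // Try limits around the reported ~78688-byte threshold and see whether
    // compilation succeeds under each one.
    for limit in [70_000, 78_688, 90_000] {
        let result = RegexBuilder::new(pattern).size_limit(limit).build();
        println!("size_limit = {limit}: built ok? {}", result.is_ok());
    }
}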

@BurntSushi
Member

The point of the size limit is to place a bound on heap growth, something that cannot easily be derived from the length of the pattern alone. The word "approximate" is used in the docs multiple times to make it clear that you shouldn't expect anything precise from it.
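
As an illustration (a hypothetical pattern, not taken from the docs), a pattern only a few bytes long can demand a disproportionate amount of heap once counted repetitions are expanded, which is exactly what the limit is meant to guard against:

use regex::RegexBuilder;

fn main() {
    // A short pattern whose nested counted repetitions expand into a large
    // compiled program, far out of proportion to its textual length.
    let pattern = "(?:a{100}){100}";

    // Under a small size limit, compilation is expected to be rejected
    // instead of allocating an outsized amount of heap.
    let small = RegexBuilder::new(pattern).size_limit(1 << 16).build();
    println!("64 KiB limit: built ok? {}", small.is_ok());

    // Under a much more generous limit, it should build.
    let large = RegexBuilder::new(pattern).size_limit(1 << 24).build();
    println!("16 MiB limit: built ok? {}", large.is_ok());
}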

The top-level RegexBuilder::size_limit API corresponds to the lower-level NFA size limit method in regex-automata. Its docs are a little more precise and should give a clue as to the problem here: the limit is applied to each NFA built. For any given regex, multiple NFAs may be built.
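
To make that concrete, here is a rough sketch (against regex-automata 0.4, with the pattern stored in a file named pattern, as in the program below) that builds just the forward Thompson NFA under a per-NFA size limit and reports its heap usage:

use regex_automata::nfa::thompson::{self, NFA};

fn main() -> anyhow::Result<()> {
    let pattern = std::fs::read_to_string("pattern")?;
    let pattern = pattern.trim();

    // Build only the forward Thompson NFA, with the kind of per-NFA size
    // limit that the top-level size_limit knob corresponds to.
    let nfa = NFA::compiler()
        .configure(thompson::Config::new().nfa_size_limit(Some(1 << 20)))
        .build(pattern)?;

    // This reports the heap used by this one NFA only; the meta regex built
    // on top of it allocates more than this.
    println!("forward NFA heap usage: {} bytes", nfa.memory_usage());
    Ok(())
}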

On top of that, there is a smattering of other stuff that is built as well. For example, this program uses lower level APIs and demonstrates how disabling features can decrease heap memory usage substantially:

use regex_automata::meta::{Config, Regex};

#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

fn main() -> anyhow::Result<()> {
    let pattern = std::fs::read_to_string("pattern")?;
    let pattern = pattern.trim();

    let regex = Regex::new(pattern).unwrap();
    println!("default size: {:?}", estimate_heap_size(regex));

    let config =
        Config::new().hybrid(false).dfa(false).onepass(false).backtrack(false);
    let regex = Regex::builder().configure(config).build(pattern).unwrap();
    println!("default size: {:?}", estimate_heap_size(regex));

    Ok(())
}

fn estimate_heap_size<T>(thing: T) -> usize {
    let epoch = tikv_jemalloc_ctl::epoch::mib().unwrap();
    let allocated = tikv_jemalloc_ctl::stats::allocated::mib().unwrap();

    epoch.advance().unwrap();
    let with = allocated.read().unwrap();

    drop(thing);

    epoch.advance().unwrap();
    let without = allocated.read().unwrap();

    with.saturating_sub(without)
}

Where the pattern file contains your example above. The output on my machine is:

$ cargo r -rq
default size: 246592
default size: 148768

With all the features disabled, the heap usage is now about twice the minimal size limit you found for this pattern. That makes sense, because two NFAs are built for this pattern, so you get roughly double the heap memory.
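
A rough way to see that (a sketch, again against regex-automata 0.4) is to compile the forward and reverse Thompson NFAs for the same pattern separately and add up their heap usage:

use regex_automata::nfa::thompson::{self, NFA};

fn main() -> anyhow::Result<()> {
    let pattern = std::fs::read_to_string("pattern")?;
    let pattern = pattern.trim();

    // The forward NFA.
    let fwd = NFA::compiler().build(pattern)?;
    // A reverse NFA for the same pattern. The size limit is applied to each
    // NFA individually, not to their sum.
    let rev = NFA::compiler()
        .configure(thompson::Config::new().reverse(true))
        .build(pattern)?;

    println!("forward NFA: {} bytes", fwd.memory_usage());
    println!("reverse NFA: {} bytes", rev.memory_usage());
    println!("combined:    {} bytes", fwd.memory_usage() + rev.memory_usage());
    Ok(())
}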

This is very tricky to reason about because the heap memory used can be very pattern-dependent. Your example, in particular, contains regex meta characters beyond the alternation operator. Specifically, the . (dot) character. If we escape those, we get a different result for heap memory used:

$ cargo r -rq
default size: 113344
default size: 53632
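
For reference, one way to produce such an escaped variant is sketched below. It assumes the pattern came from the first 1024 lines of /usr/share/dict/words, as in the benchmark above; regex::escape also escapes a few extra characters (like -) that don't strictly need it, but the result is still a plain alternation of literals:

use std::fs::read_to_string;

fn main() {
    let words = read_to_string("/usr/share/dict/words").unwrap();
    // Escape each word so that characters like `.` are treated literally,
    // then join the words into one big alternation.
    let escaped = words
        .lines()
        .take(1024)
        .map(regex::escape)
        .collect::<Vec<_>>()
        .join("|");
    std::fs::write("pattern-escaped", escaped).unwrap();
}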

Why? In this case, when the regex is just an alternation of literals with no special regex meta characters, an optimization kicks in that builds a special trie that is more memory efficient. You can see this by asking for the Thompson NFA size via regex-cli:

$ regex-cli debug thompson -q -f pattern
        parse time:  194.83µs
    translate time:  165.06µs
  compile nfa time:  303.404µs
            memory:  55124
            states:  2184
       pattern len:  1
       capture len:  1
        has empty?:  false
          is utf8?:  true
       is reverse?:  false
   line terminator:  "\n"
       lookset any:  ∅
lookset prefix any:  ∅

$ regex-cli debug thompson -q -f pattern-escaped
        parse time:  192.633µs
    translate time:  165.076µs
  compile nfa time:  222.074µs
            memory:  18500
            states:  626
       pattern len:  1
       capture len:  1
        has empty?:  false
          is utf8?:  true
       is reverse?:  false
   line terminator:  "\n"
       lookset any:  ∅
lookset prefix any:  ∅

If you remove the -q flag above, you'll see the actual NFA.

Another introspection tool is logging. If we modify the program a bit to ask for logging, regex will dump some data about its strategy:

use regex_automata::meta::{Config, Regex};

#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

fn main() -> anyhow::Result<()> {
    env_logger::init();

    let pattern = std::fs::read_to_string("pattern-escaped")?;
    let pattern = pattern.trim();

    let config =
        Config::new().hybrid(false).dfa(false).onepass(false).backtrack(false);
    let regex = Regex::builder().configure(config).build(pattern).unwrap();
    println!("default size: {:?}", estimate_heap_size(regex));

    Ok(())
}

fn estimate_heap_size<T>(thing: T) -> usize {
    let epoch = tikv_jemalloc_ctl::epoch::mib().unwrap();
    let allocated = tikv_jemalloc_ctl::stats::allocated::mib().unwrap();

    epoch.advance().unwrap();
    let with = allocated.read().unwrap();

    drop(thing);

    epoch.advance().unwrap();
    let without = allocated.read().unwrap();

    with.saturating_sub(without)
}

And my Cargo.toml (note that the logging feature of regex-automata is enabled):

[package]
publish = false
name = "regex-i1264-size-limit-inaccuracy"
version = "0.1.0"
edition = "2024"

[dependencies]
anyhow = "1.0.98"
env_logger = "0.11.8"
regex-automata = { version = "0.4.9", features = ["logging"] }
tikv-jemalloc-ctl = { version = "0.6", features = ["stats"] }
tikv-jemallocator = "0.6.0"

[[bin]]
name = "regex-i1264-size-limit-inaccuracy"
path = "main.rs"

[profile.release]
debug = true

And its output:

$ RUST_LOG=trace cargo r -rq
[2025-05-13T13:15:40Z DEBUG regex_automata::meta::regex] building meta regex with 1 patterns:
[2025-05-13T13:15:40Z DEBUG regex_automata::meta::regex] PatternID(0): 1080|10-point|10th|11-point|12-point|16-point|18-point|1st|2|20-point|2,4,5-t|2,4-d|2D|2nd|30-30|3-D|3-d|3D|3M|3rd|48-point|4-D|4GL|4H|4th|5-point|5-T|5th|6-point|6th|7-point|7th|8-point|8th|9-point|9th|-a|A|A\.|a|a'|a-|a\.|A-1|A1|a1|A4|A5|AA|aa|A\.A\.A\.|AAA|aaa|AAAA|AAAAAA|AAAL|AAAS|Aaberg|Aachen|AAE|AAEE|AAF|AAG|aah|aahed|aahing|aahs|AAII|aal|Aalborg|Aalesund|aalii|aaliis|aals|Aalst|Aalto|AAM|aam|AAMSI|Aandahl|A-and-R|Aani|AAO|AAP|AAPSS|Aaqbiye|Aar|Aara|Aarau|AARC|aardvark|aardvarks|aardwolf|aardwolves|Aaren|Aargau|aargh|Aarhus|Aarika|Aaron|aaron|Aaronic|aaronic|Aaronical|Aaronite|Aaronitic|Aaron's-beard|Aaronsburg|Aaronson|AARP|aarrgh|aarrghh|Aaru|AAS|aas|A'asia|aasvogel|aasvogels|AAU|AAUP|AAUW|AAVSO|AAX|A-axes|A-axis|A\.B\.|AB|Ab|ab|ab-|A\.B\.A\.|ABA|Aba|aba|Ababa|Ababdeh|Ababua|abac|abaca|abacas|abacate|abacaxi|abacay|abaci|abacinate|abacination|abacisci|abaciscus|abacist|aback|abacli|Abaco|abacot|abacterial|abactinal|abactinally|abaction|abactor|abaculi|abaculus|abacus|abacuses|[... snip ...]
[2025-05-13T13:15:40Z DEBUG regex_automata::util::prefilter] prefixes (len=Some(237), exact=false) extracted before optimization: Seq[E("1080"), I("10-p"), E("10th"), I("11-p"), I("12-p"), I("16-p"), I("18-p"), E("1st"), E("2"), I("20-p"), I("2,4,"), I("2,4-"), E("2D"), E("2nd"), I("30-3"), E("3-D"), E("3-d"), E("3D"), E("3M"), E("3rd"), I("48-p"), E("4-D"), E("4GL"), E("4H"), E("4th"), I("5-po"), E("5-T"), E("5th"), I("6-po"), E("6th"), I("7-po"), E("7th"), I("8-po"), E("8th"), I("9-po"), E("9th"), E("-a"), E("A"), E("A."), E("a"), E("a\'"), E("a-"), E("a."), E("A-1"), E("A1"), E("a1"), E("A4"), E("A5"), E("AA"), E("aa"), I("A.A."), E("AAA"), E("aaa"), I("AAAA"), E("AAAL"), E("AAAS"), I("Aabe"), I("Aach"), E("AAE"), E("AAEE"), E("AAF"), E("AAG"), E("aah"), I("aahe"), I("aahi"), E("aahs"), E("AAII"), E("aal"), I("Aalb"), I("Aale"), I("aali"), E("aals"), I("Aals"), I("Aalt"), E("AAM"), E("aam"), I("AAMS"), I("Aand"), I("A-an"), E("Aani"), E("AAO"), E("AAP"), I("AAPS"), I("Aaqb"), E("Aar"), I("Aara"), E("AARC"), I("aard"), I("Aare"), I("Aarg"), I("aarg"), I("Aarh"), I("Aari"), I("Aaro"), I("aaro"), I("Aaro"), I("aaro"), I("Aaro"), E("AARP"), I("aarr"), E("Aaru"), E("AAS"), E("aas"), I("A\'as"), I("aasv"), E("AAU"), E("AAUP"), E("AAUW"), I("AAVS"), E("AAX"), I("A-ax"), E("A.B."), E("AB"), E("Ab"), E("ab"), E("ab-"), I("A.B."), E("ABA"), E("Aba"), E("aba"), I("Abab"), I("abac"), I("Abac"), I("abac"), E("Abad"), I("abad"), I("Abad"), I("abad"), I("Abad"), I("abaf"), I("Abag"), I("Abai"), I("abai"), I("abak"), I("Abak"), I("abak"), I("Abak"), I("abal"), I("Abam"), I("abam"), I("Aban"), I("aban"), I("Aban"), I("aban"), I("Aban"), I("abap"), I("Abar"), I("abar"), E("Abas"), I("abas"), I("Abas"), I("abas"), E("abasing"), E("abasio"), E("abask"), E("abassi"), E("Abassieh"), E("Abassin"), E("abastard"), E("abastardize"), E("abastral"), E("abatable"), E("abatage"), E("Abate"), E("abate"), E("abated"), E("abatement"), E("abatements"), E("abater"), E("abaters"), E("abates"), E("abatic"), E("abating"), E("abatis"), E("abatised"), E("abatises"), E("abatjour"), E("abatjours"), E("abaton"), E("abator"), E("abators"), E("ABATS"), E("abattage"), E("abattis"), E("abattised"), E("abattises"), E("abattoir"), E("abattoirs"), E("abattu"), E("abattue"), E("Abatua"), E("abature"), E("abaue"), E("abave"), E("abaxial"), E("abaxile"), E("abay"), E("abayah"), E("abaze"), E("abb"), E("Abba"), E("abba"), E("abbacies"), E("abbacomes"), E("abbacy"), E("Abbadide"), E("Abbai"), E("abbandono"), E("abbas"), E("abbasi"), E("Abbasid"), E("abbasid"), E("abbassi"), E("Abbassid"), E("Abbasside"), E("Abbate"), E("abbate"), E("abbatial"), E("abbatical"), E("abbatie"), E("abbaye"), E("Abbe"), E("abbe"), E("abbes"), E("abbess"), E("abbesses"), E("abbest"), E("Abbevilean"), E("Abbeville"), E("Abbevillian"), E("abbevillian"), E("Abbey"), E("abbey"), E("abbeys"), E("abbeystead"), E("abbeystede"), E("Abbi")]
[2025-05-13T13:15:40Z DEBUG regex_automata::util::prefilter] prefixes (len=Some(34), exact=false) extracted after optimization: Seq[E("1080"), I("10-p"), E("10th"), I("11-p"), I("12-p"), I("16-p"), I("18-p"), E("1st"), E("2"), I("30-3"), E("3-D"), E("3-d"), E("3D"), E("3M"), E("3rd"), I("48-p"), E("4-D"), E("4GL"), E("4H"), E("4th"), I("5-po"), E("5-T"), E("5th"), I("6-po"), E("6th"), I("7-po"), E("7th"), I("8-po"), E("8th"), I("9-po"), E("9th"), E("-a"), E("A"), E("a")]
[2025-05-13T13:15:40Z DEBUG regex_automata::meta::literal] skipping Aho-Corasick because there are too few literals
[2025-05-13T13:15:40Z DEBUG regex_automata::meta::strategy] creating prefilter from 34 literals: [E("1080"), I("10-p"), E("10th"), I("11-p"), I("12-p"), I("16-p"), I("18-p"), E("1st"), E("2"), I("30-3"), E("3-D"), E("3-d"), E("3D"), E("3M"), E("3rd"), I("48-p"), E("4-D"), E("4GL"), E("4H"), E("4th"), I("5-po"), E("5-T"), E("5th"), I("6-po"), E("6th"), I("7-po"), E("7th"), I("8-po"), E("8th"), I("9-po"), E("9th"), E("-a"), E("A"), E("a")]
[2025-05-13T13:15:40Z DEBUG aho_corasick::packed::api] trying to build Teddy packed matcher
[2025-05-13T13:15:40Z DEBUG aho_corasick::packed::teddy::builder] skipping Teddy (mask len: 1) because there are too many patterns
[2025-05-13T13:15:40Z DEBUG aho_corasick::nfa::noncontiguous] building non-contiguous NFA
[2025-05-13T13:15:40Z DEBUG aho_corasick::util::prefilter] no prefilter available
[2025-05-13T13:15:40Z DEBUG aho_corasick::nfa::noncontiguous] non-contiguous NFA built, <states: 84, size: 20249>
[2025-05-13T13:15:40Z DEBUG aho_corasick::ahocorasick] forcefully chose DFA
[2025-05-13T13:15:40Z DEBUG aho_corasick::dfa] building DFA
[2025-05-13T13:15:40Z DEBUG aho_corasick::dfa] DFA built, <states: 164, size: 44080, alphabet len: 39, stride: 64>
[2025-05-13T13:15:40Z DEBUG regex_automata::util::prefilter] prefilter built: aho-corasick
[2025-05-13T13:15:40Z DEBUG regex_automata::nfa::thompson::builder] intermediate NFA compilation via builder is complete, intermediate NFA size: 1085 states, 40132 bytes on heap
[2025-05-13T13:15:40Z DEBUG regex_automata::nfa::thompson::builder] NFA compilation via builder complete, final NFA size: 626 states, 18500 bytes on heap, has empty? false, utf8? true
[2025-05-13T13:15:40Z DEBUG regex_automata::nfa::thompson::compiler] HIR-to-NFA compilation complete, config: Config { utf8: Some(true), reverse: None, nfa_size_limit: Some(Some(10485760)), shrink: Some(false), which_captures: Some(All), look_matcher: Some(LookMatcher { lineterm: \n }) }
[2025-05-13T13:15:40Z DEBUG regex_automata::meta::wrappers] PikeVM built
[2025-05-13T13:15:40Z DEBUG regex_automata::meta::strategy] skipping reverse anchored optimization because the regex is not always anchored at the end
[2025-05-13T13:15:40Z DEBUG regex_automata::meta::strategy] skipping reverse suffix optimization because we don't have a lazy DFA or a full DFA
[2025-05-13T13:15:40Z DEBUG regex_automata::meta::strategy] skipping reverse inner optimization because we don't have a lazy DFA or a full DFA
[2025-05-13T13:15:40Z DEBUG regex_automata::meta::strategy] using core strategy
default size: 50752

Overall, this is working as intended. "size limit" might be a bad name for the API, but it is serving its intended purpose. Back when the API was first introduced, the internals of regex were much simpler and it was probably closer to an actual indicator of heap memory used. But now it's just a way to make regex compilation fail when it gets "too big," where "too big" is something that needs to be calibrated. If it weren't for repetitions in the concrete syntax, I don't believe there would be a need for the API at all.
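
In application code, that calibration often amounts to picking a budget and treating the size-limit error as a signal to reject the pattern. A minimal sketch (the 1 MiB budget and the function name are arbitrary):

use regex::{Error, Regex, RegexBuilder};

fn compile_untrusted(pattern: &str) -> Result<Regex, String> {
    // The budget bounds the growth of each compiled NFA, not the total heap
    // usage of the final regex.
    RegexBuilder::new(pattern)
        .size_limit(1024 * 1024)
        .build()
        .map_err(|err| match err {
            Error::CompiledTooBig(limit) => {
                format!("pattern too big to compile (limit: {limit} bytes)")
            }
            other => format!("invalid pattern: {other}"),
        })
}

fn main() {
    match compile_untrusted("(?:a{100}){100}") {
        Ok(_) => println!("compiled"),
        Err(msg) => println!("rejected: {msg}"),
    }
}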

BurntSushi closed this as not planned on May 13, 2025
@codido
Author

codido commented May 13, 2025

Thank you!
