-
Notifications
You must be signed in to change notification settings - Fork 462
Inaccurate size limit approximation #1264
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Please provide a concrete example. |
Taking one of the first examples there for brevity:
The minimal size limit for building this pattern appears to be 78688 bytes, but the estimated heap usage is around 240k. |
The point of the size limit is to place a limit on heap growth that cannot easily be done from the length of the pattern. The word "approximate" is used in the docs multiple times to make it clear you shouldn't expect anything precise from this. The top-level On top of that, there is a smattering of other stuff that is built as well. For example, this program uses lower level APIs and demonstrates how disabling features can decrease heap memory usage substantially: use regex_automata::meta::{Config, Regex};
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
fn main() -> anyhow::Result<()> {
let pattern = std::fs::read_to_string("pattern")?;
let pattern = pattern.trim();
let regex = Regex::new(pattern).unwrap();
println!("default size: {:?}", estimate_heap_size(regex));
let config =
Config::new().hybrid(false).dfa(false).onepass(false).backtrack(false);
let regex = Regex::builder().configure(config).build(pattern).unwrap();
println!("default size: {:?}", estimate_heap_size(regex));
Ok(())
}
fn estimate_heap_size<T>(thing: T) -> usize {
let epoch = tikv_jemalloc_ctl::epoch::mib().unwrap();
let allocated = tikv_jemalloc_ctl::stats::allocated::mib().unwrap();
epoch.advance().unwrap();
let with = allocated.read().unwrap();
drop(thing);
epoch.advance().unwrap();
let without = allocated.read().unwrap();
with.saturating_sub(without)
} Where the
With all the features disabled, the heap usage is now about twice the memory limit you found for this pattern. Which makes sense because two NFAs are built for this pattern. So you get roughly double the heap memory. This is very tricky to reason about because the heap memory used can be very pattern dependent. Your example, in particular, contains regex meta characters beyond the alternation operator. Specifically,
Why? In this case, when the regex is just an alternation of literals with no special regex meta characters, an optimization kicks in that builds a special trie that is more memory efficient. You can see this by asking for the Thompson NFA size via
If you remove the Another introspection tool is logging. If we modify the program a bit to ask for logging, use regex_automata::meta::{Config, Regex};
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
fn main() -> anyhow::Result<()> {
env_logger::init();
let pattern = std::fs::read_to_string("pattern-escaped")?;
let pattern = pattern.trim();
let config =
Config::new().hybrid(false).dfa(false).onepass(false).backtrack(false);
let regex = Regex::builder().configure(config).build(pattern).unwrap();
println!("default size: {:?}", estimate_heap_size(regex));
Ok(())
}
fn estimate_heap_size<T>(thing: T) -> usize {
let epoch = tikv_jemalloc_ctl::epoch::mib().unwrap();
let allocated = tikv_jemalloc_ctl::stats::allocated::mib().unwrap();
epoch.advance().unwrap();
let with = allocated.read().unwrap();
drop(thing);
epoch.advance().unwrap();
let without = allocated.read().unwrap();
with.saturating_sub(without)
} And my [package]
publish = false
name = "regex-i1264-size-limit-inaccuracy"
version = "0.1.0"
edition = "2024"
[dependencies]
anyhow = "1.0.98"
env_logger = "0.11.8"
regex-automata = { version = "0.4.9", features = ["logging"] }
tikv-jemalloc-ctl = { version = "0.6", features = ["stats"] }
tikv-jemallocator = "0.6.0"
[[bin]]
name = "regex-i1264-size-limit-inaccuracy"
path = "main.rs"
[profile.release]
debug = true And its output:
Overall, this is working as intended. "size limit" might be a bad name for the API, but it is serving its intended purpose. Back when the API was first introduced, the internals of |
Thank you! |
What version of regex are you using?
1.11.1
Describe the bug at a high level.
While
RegexBuilder::size_limit()
only sets an approximation for the size limit, the approximation can be quite inaccurate, in some cases by a factor of ~3.What are the steps to reproduce the behavior?
The following (admittedly hacky) benchmark shows the difference between the minimal size limit (within a margin of error) required to build a regular expression and the actual heap usage:
It relies on the following dependencies:
What is the actual behavior?
The following chart shows size limits vs. the actual heap usage:
The text was updated successfully, but these errors were encountered: