`<regex>`: Speed up searches for regexes that start with assertions #5576

muellerj2 · 2025-06-07T22:36:50Z

Towards #5468. This extends the skip heuristic to the remaining unhandled assertions: word boundaries and lookahead assertions.

The handling of word boundaries turns out to be much simpler in _Matcher2::_Skip() than in _Matcher2::_Is_wbound(), because we don't have to do any special handling for the start or end of the input string: _Skip() is never called for the start (or as the comment at the top says, --_First_arg is valid) and it cannot skip beyond the end of the input string (_Last) anyway. So we only have to handle the middle case and have to keep looking for the first position where the word character property changes (\b) or where it does not change (\B).
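The "middle case" scan can be sketched as follows. This is a minimal illustration, not the code from the PR: `is_word_char` and `skip_to_word_boundary` are hypothetical names, the word-character test is a plain ASCII stand-in for whatever the matcher uses, and indices replace the iterators of the real `_Skip()`.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical ASCII-only stand-in for the matcher's word-character test.
inline bool is_word_char(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9') || c == '_';
}

// Returns the first index in [pos, text.size()) where the word-character
// property changes (negated == false, like \b) or does not change
// (negated == true, like \B). Returns text.size() if there is no such index.
// Precondition: pos >= 1, so text[pos - 1] is valid (mirrors "--_First_arg is valid").
inline std::size_t skip_to_word_boundary(std::string_view text, std::size_t pos, bool negated) {
    assert(pos >= 1 && pos <= text.size());
    bool prev_is_word = is_word_char(text[pos - 1]);
    for (; pos < text.size(); ++pos) {
        const bool cur_is_word = is_word_char(text[pos]);
        const bool at_boundary = prev_is_word != cur_is_word;
        if (at_boundary != negated) {
            return pos; // candidate position; the full matcher still has to confirm it
        }
        prev_is_word = cur_is_word;
    }
    return text.size(); // never skips beyond the end of the input
}
```

The end of the input needs no special case in this sketch because the function only returns a candidate; whether a boundary exists at the very end is left to the full match attempt.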
For negative lookahead assertions, the easiest thing to do is to just ignore the assertion body (i.e., to assume that the assertion succeeds) and keep looking for a suitable node that can be used for skipping in the remainder of the main regex.
- Keep in mind that _Skip() is just a heuristic, so it's fine if we don't exclude some non-matches. There is only one thing we must not do here: Skip something that could match.
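The safety condition can be checked from the outside with the public `<regex>` API. The sketch below is an illustration under assumptions, not part of the PR: it treats "ignore the assertion body" as "look for the literal that follows the assertion", and the helper names `literal_candidates` and `match_positions` are made up for this example. The candidate positions must be a superset of the real match positions; extra candidates are harmless.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Candidate start positions when the assertion body is ignored:
// every occurrence of the literal that follows the assertion.
std::vector<std::size_t> literal_candidates(const std::string& text, const std::string& literal) {
    assert(!literal.empty());
    std::vector<std::size_t> result;
    for (std::size_t pos = text.find(literal); pos != std::string::npos; pos = text.find(literal, pos + 1)) {
        result.push_back(pos);
    }
    return result;
}

// Start positions of the actual regex matches, found by repeated searches.
std::vector<std::size_t> match_positions(const std::string& text, const std::regex& re) {
    std::vector<std::size_t> result;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        result.push_back(static_cast<std::size_t>(it->position()));
    }
    return result;
}
```

For the text `bibendum bibe` and the regex `(?!bibendum)bibe`, the literal `bibe` yields the candidates 0 and 9, while the regex matches only at 9. Position 0 is a non-match that the heuristic does not exclude, which is allowed; no real match is skipped.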
For positive lookahead assertions, we can look for the first place where the start of the assertion and the regex after the assertion both match.
- Because this is implemented via recursion, I enforce a maximum recursion depth of 50 to prevent stack overflow. This limit is only reached by regexes that start with more than fifty positive lookahead assertions, so this should be more than enough for any practical regex. And the worst thing that happens if it is reached is that the optimization is (maybe only partially) disengaged.
- While the code might look similar to the one that caused the quadratic slowdown in #5452 ("`<regex>`: Nonlinear slowdown with increasing string length"), this does not suffer from the same problem because we only move forward in the checks: If evaluating the start of the assertion body or the regex after the assertion body tells us that we should skip until position x, then we search next for the first possible position the other part of the regex matches starting from x. No position is evaluated more than once for the assertion and once for the remainder of the regex after the assertion, which implies a running time linear in the input.
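The forward-only alternation described above can be sketched like this. It is a simplified model, not the PR's implementation: both the start of the assertion body and the regex after the assertion are assumed to begin with a plain literal, the names `find_or_end` and `skip_positive_lookahead` are hypothetical, and the loop is iterative, so the recursion depth limit does not appear here.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// First occurrence of `literal` at or after `pos`, or text.size() if there is none.
std::size_t find_or_end(std::string_view text, std::string_view literal, std::size_t pos) {
    const std::size_t found = text.find(literal, pos);
    return found == std::string_view::npos ? text.size() : found;
}

// Finds the first position at or after `pos` where both literals can start,
// or text.size() if there is none. Each search resumes where the other one
// stopped, so neither search ever moves backwards.
std::size_t skip_positive_lookahead(std::string_view text, std::string_view assertion_prefix,
    std::string_view rest_prefix, std::size_t pos) {
    assert(pos <= text.size());
    std::size_t candidate = find_or_end(text, assertion_prefix, pos);
    for (;;) {
        // Where can the regex after the assertion start, at or after the candidate?
        const std::size_t rest_pos = find_or_end(text, rest_prefix, candidate);
        if (rest_pos == candidate) {
            return candidate; // both parts accept this position
        }
        // Where can the assertion start, at or after that position?
        candidate = find_or_end(text, assertion_prefix, rest_pos);
        if (candidate == rest_pos) {
            return candidate; // both parts accept this position
        }
    }
}
```

The loop terminates because `candidate` strictly increases on every full iteration until both searches agree, at the latest at `text.size()`.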
The skip heuristic now also continues evaluation beyond _N_endif nodes. (Nothing relevant happens at such nodes. They don't imply anything about the input string. Their only feature is that all the branches of an _N_if node join here again, but the problematic case is when branching happens, not when branches end.)
- Because of #5539 ("`<regex>`: Avoid generating unnecessary single-branch _N_if nodes"), this doesn't actually make a difference if the current matcher and parser are combined. It can improve performance, though, when the current matcher evaluates an NFA generated by an old parser.

Tests

I added some tests for lookahead assertions. The tests use regex_replace() because it performs several searches in a single test call.
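As an illustration of why `regex_replace()` is convenient here, the following checks exercise several searches per call with regexes that start with a lookahead assertion. These are examples written for this description, not the tests added in the PR; the function name `check_lookahead_replacements` is made up.

```cpp
#include <cassert>
#include <regex>
#include <string>

void check_lookahead_replacements() {
    // Negative lookahead at the start: "ab" is rejected, "ac" and "ad" are replaced.
    assert(std::regex_replace(std::string("ab ac ad"), std::regex("(?!ab)a."), "X") == "ab X X");

    // Positive lookahead at the start: only the "a" that is followed by "c" is replaced.
    assert(std::regex_replace(std::string("ab ac ad"), std::regex("(?=ac)a"), "X") == "ab Xc ad");
}
```

Each call has to find every match in the string, so a skip heuristic that jumped over a valid match would show up as a missing replacement.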

There already was test coverage for skipping of word boundaries in VSO_0000000_regex_use's test_DDB_153116_replacements() function. But the test coverage had a gap: It didn't check for correct behavior if two (or more) non-word characters are next to each other. I closed this gap by adding some spaces to the existing tests.
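The kind of input that closes this gap looks like the following. This is an illustrative check, not the actual content of `test_DDB_153116_replacements()`; the function name `check_word_boundary_replacements` is an assumption. The input contains two adjacent spaces, so the search has to pass a position where neither neighbor is a word character.

```cpp
#include <cassert>
#include <regex>
#include <string>

void check_word_boundary_replacements() {
    // \b: "cd" is replaced only where it starts a word, here after two spaces.
    assert(std::regex_replace(std::string("ab  cd abcd"), std::regex(R"(\bcd)"), "X") == "ab  X abcd");

    // \B: "cd" is replaced only where it continues a word.
    assert(std::regex_replace(std::string("ab  cd abcd"), std::regex(R"(\Bcd)"), "X") == "ab  cd abX");
}
```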

Benchmark

| benchmark | before | after | speedup |
|---|---:|---:|---:|
| `bm_lorem_search/"bibe"/2` | 29157.3 | 29157.3 | 1.00 |
| `bm_lorem_search/"bibe"/3` | 59375 | 59988.8 | 0.99 |
| `bm_lorem_search/"bibe"/4` | 117188 | 117188 | 1.00 |
| `bm_lorem_search/"(bibe)"/2` | 47711.9 | 47433 | 1.01 |
| `bm_lorem_search/"(bibe)"/3` | 97656.2 | 98349.4 | 0.99 |
| `bm_lorem_search/"(bibe)"/4` | 200911 | 196725 | 1.02 |
| `bm_lorem_search/"(bibe)+"/2` | 62779 | 64174.1 | 0.98 |
| `bm_lorem_search/"(bibe)+"/3` | 125558 | 125558 | 1.00 |
| `bm_lorem_search/"(bibe)+"/4` | 251116 | 251116 | 1.00 |
| `bm_lorem_search/"(?:bibe)+"/2` | 51562.5 | 51562.5 | 1.00 |
| `bm_lorem_search/"(?:bibe)+"/3` | 102534 | 102539 | 1.00 |
| `bm_lorem_search/"(?:bibe)+"/4` | 204041 | 204041 | 1.00 |
| `bm_lorem_search/R"(\bbibe)"/2` | 245857 | 85449.2 | 2.88 |
| `bm_lorem_search/R"(\bbibe)"/3` | 470948 | 172631 | 2.73 |
| `bm_lorem_search/R"(\bbibe)"/4` | 941265 | 353021 | 2.67 |
| `bm_lorem_search/R"(\Bibe)"/2` | 256319 | 199507 | 1.28 |
| `bm_lorem_search/R"(\Bibe)"/3` | 531250 | 409807 | 1.30 |
| `bm_lorem_search/R"(\Bibe)"/4` | 1045850 | 837054 | 1.25 |
| `bm_lorem_search/R"((?=....)bibe)"/2` | 452080 | 53125 | 8.51 |
| `bm_lorem_search/R"((?=....)bibe)"/3` | 889369 | 102534 | 8.67 |
| `bm_lorem_search/R"((?=....)bibe)"/4` | 1759380 | 244849 | 7.19 |
| `bm_lorem_search/R"((?=bibe)....)"/2` | 306920 | 59988.8 | 5.12 |
| `bm_lorem_search/R"((?=bibe)....)"/3` | 599888 | 98349.4 | 6.10 |
| `bm_lorem_search/R"((?=bibe)....)"/4` | 1339290 | 194972 | 6.87 |
| `bm_lorem_search/R"((?!lorem)bibe)"/2` | 455097 | 48828.1 | 9.32 |
| `bm_lorem_search/R"((?!lorem)bibe)"/3` | 906808 | 96256.9 | 9.42 |
| `bm_lorem_search/R"((?!lorem)bibe)"/4` | 1843160 | 194972 | 9.45 |

StephanTLavavej · 2025-06-11T23:20:40Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2025-06-14T07:41:52Z

I resolved a trivial adjacent-add conflict with #5535 in <regex>.

StephanTLavavej · 2025-06-14T09:41:54Z

Must go faster! 🚗 🦖 😻

<regex>: Speed up searches for regexes that start with assertions

8b321d2

muellerj2 requested a review from a team as a code owner June 7, 2025 22:36

github-project-automation bot added this to STL Code Reviews Jun 7, 2025

github-project-automation bot moved this to Initial Review in STL Code Reviews Jun 7, 2025

StephanTLavavej added the labels "performance" (Must go faster) and "regex" (meow is a substring of homeowner) Jun 8, 2025

StephanTLavavej self-assigned this Jun 8, 2025

StephanTLavavej approved these changes Jun 10, 2025

StephanTLavavej removed their assignment Jun 10, 2025

StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Jun 10, 2025

StephanTLavavej mentioned this pull request Jun 10, 2025

Maintainer priorities #4700

Open

StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Jun 11, 2025

Merge branch 'main' into regex-speedup-assertion-searches

adc8323

StephanTLavavej approved these changes Jun 14, 2025

StephanTLavavej merged commit a4c5e3f into microsoft:main Jun 14, 2025
39 checks passed

github-project-automation bot moved this from Merging to Done in STL Code Reviews Jun 14, 2025

muellerj2 mentioned this pull request Jun 14, 2025

<regex>: Limit recursion in the parser #5588

Open

muellerj2 deleted the regex-speedup-assertion-searches branch June 16, 2025 21:31
