<regex>: Speed up searches for regexes that start with assertions #5576

Merged

muellerj2
Contributor

Towards #5468. This extends the skip heuristic to the remaining unhandled assertions: word boundaries and lookahead assertions.

  • Handling word boundaries turns out to be much simpler in _Matcher2::_Skip() than in _Matcher2::_Is_wbound(), because no special handling is needed for the start or end of the input string: _Skip() is never called for the start (as the comment at the top says, --_First_arg is valid), and it cannot skip beyond the end of the input string (_Last) anyway. So we only have to handle the middle case: keep looking for the first position where the word character property changes (\b) or does not change (\B).
  • For negative lookahead assertions, the easiest thing to do is to just ignore the assertion body (i.e., to assume that the assertion succeeds) and keep looking for a suitable node that can be used for skipping in the remainder of the main regex.
    • Keep in mind that _Skip() is just a heuristic, so it's fine if we don't exclude some non-matches. There is only one thing we must not do here: Skip something that could match.
  • For positive lookahead assertions, we can look for the first place where the start of the assertion and the regex after the assertion both match.
    • Because this is implemented via recursion, I enforce a maximum recursion depth of 50 to prevent stack overflow. This limit is only reached by regexes that start with more than fifty positive lookahead assertions, so this should be more than enough for any practical regex. And the worst thing that happens if it is reached is that the optimization is (maybe only partially) disengaged.
    • While the code might look similar to the code that caused the quadratic slowdown in #5452 (<regex>: Nonlinear slowdown with increasing string length), it does not suffer from the same problem because the checks only move forward: if evaluating the start of the assertion body or the regex after the assertion body tells us to skip to position x, then we next search for the first position at or after x where the other part can match. No position is evaluated more than once for the assertion and once for the remainder of the regex after the assertion, which implies a running time linear in the input.
  • The skip heuristic now also continues evaluation beyond _N_endif nodes. (Nothing relevant happens at such nodes. They don't imply anything about the input string. Their only feature is that all the branches of an _N_if node join here again, but the problematic case is when branching happens, not when branches end.)
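
The two main ideas in the list above can be illustrated with a small standalone sketch. This is not the code in _Matcher2::_Skip(): the names is_word, skip_to_wbound, skip_to_literal, and skip_lookahead are hypothetical, the input is a std::string_view instead of an iterator range, and plain literal prefixes stand in for "the start of the assertion body" and "the regex after the assertion".

```cpp
#include <cassert>
#include <cstddef>
#include <cctype>
#include <string_view>

// Hypothetical helper: word characters in the \b sense, assumed to be [A-Za-z0-9_].
inline bool is_word(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
}

// First position p >= pos where the word-character property changes between
// text[p - 1] and text[p] (negated == false, like \b) or does not change
// (negated == true, like \B). pos >= 1 mirrors the precondition that the
// previous character is available. Returns text.size() when nothing is found,
// so the end of the input is left for the real matcher to evaluate.
inline std::size_t skip_to_wbound(std::string_view text, std::size_t pos, bool negated) {
    assert(pos >= 1 && pos <= text.size());
    for (; pos < text.size(); ++pos) {
        const bool changes = is_word(text[pos - 1]) != is_word(text[pos]);
        if (changes != negated) {
            return pos;
        }
    }
    return text.size();
}

// Hypothetical stand-in for "skip to the first position where this part could match".
inline std::size_t skip_to_literal(std::string_view text, std::size_t pos, std::string_view lit) {
    const std::size_t found = text.find(lit, pos);
    return found == std::string_view::npos ? text.size() : found;
}

// Positive lookahead (?=A)B: alternate between the two parts, always moving
// forward, until both agree on the same position. Each part scans any given
// position at most once, so the total work stays linear in the input.
inline std::size_t skip_lookahead(std::string_view text, std::size_t pos,
                                  std::string_view assertion_prefix, std::string_view rest_prefix) {
    for (;;) {
        const std::size_t pa = skip_to_literal(text, pos, assertion_prefix);
        const std::size_t pb = skip_to_literal(text, pa, rest_prefix);
        if (pa == pb) {
            return pa; // both parts can start here (or we reached the end)
        }
        pos = pb; // pb > pa, so the search never moves backwards
    }
}
```

The sketch uses an explicit loop for a single lookahead. Recursion, and with it the depth limit of 50 described above, only enters the picture when the regex after one assertion itself begins with another assertion.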

Tests

I added some tests for lookahead assertions. The tests use regex_replace() because it performs several searches in a single test call.

There already was test coverage for skipping of word boundaries in VSO_0000000_regex_use's test_DDB_153116_replacements() function. But the test coverage had a gap: It didn't check for correct behavior if two (or more) non-word characters are next to each other. I closed this gap by adding some spaces to the existing tests.
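
For illustration, checks in this style could look like the following sketch. The helper replace_all and the inputs are hypothetical and are not the tests added in this PR; they only show how one regex_replace() call exercises several searches, including inputs where two non-word characters are adjacent.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical test helper: one regex_replace() call runs a search for every
// match in the input, so each assertion below covers several searches.
inline std::string replace_all(const std::string& input, const char* pattern, const char* fmt) {
    const std::regex re(pattern); // ECMAScript grammar by default
    return std::regex_replace(input, re, fmt);
}

inline void check_assertion_searches() {
    // Positive lookahead before and around the literal.
    assert(replace_all("bibe lorem bibe", R"((?=bibe)....)", "X") == "X lorem X");
    assert(replace_all("bibendum bibe", R"((?=.....)bibe)", "X") == "Xndum bibe");

    // Negative lookahead: the assertion body is irrelevant for skipping.
    assert(replace_all("lorem bibe bibe", R"((?!lorem)bibe)", "X") == "lorem X X");

    // Word boundaries with two adjacent non-word characters (double spaces).
    assert(replace_all("ab  cd  cdcd abcd", R"(\bcd)", "X") == "ab  X  Xcd abcd");
    assert(replace_all("ab  cd  cdcd abcd", R"(\Bcd)", "X") == "ab  cd  cdX abX");
}
```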

Benchmark

| benchmark | before | after | speedup |
|---|---:|---:|---:|
| bm_lorem_search/"bibe"/2 | 29157.3 | 29157.3 | 1.00 |
| bm_lorem_search/"bibe"/3 | 59375 | 59988.8 | 0.99 |
| bm_lorem_search/"bibe"/4 | 117188 | 117188 | 1.00 |
| bm_lorem_search/"(bibe)"/2 | 47711.9 | 47433 | 1.01 |
| bm_lorem_search/"(bibe)"/3 | 97656.2 | 98349.4 | 0.99 |
| bm_lorem_search/"(bibe)"/4 | 200911 | 196725 | 1.02 |
| bm_lorem_search/"(bibe)+"/2 | 62779 | 64174.1 | 0.98 |
| bm_lorem_search/"(bibe)+"/3 | 125558 | 125558 | 1.00 |
| bm_lorem_search/"(bibe)+"/4 | 251116 | 251116 | 1.00 |
| bm_lorem_search/"(?:bibe)+"/2 | 51562.5 | 51562.5 | 1.00 |
| bm_lorem_search/"(?:bibe)+"/3 | 102534 | 102539 | 1.00 |
| bm_lorem_search/"(?:bibe)+"/4 | 204041 | 204041 | 1.00 |
| bm_lorem_search/R"(\bbibe)"/2 | 245857 | 85449.2 | 2.88 |
| bm_lorem_search/R"(\bbibe)"/3 | 470948 | 172631 | 2.73 |
| bm_lorem_search/R"(\bbibe)"/4 | 941265 | 353021 | 2.67 |
| bm_lorem_search/R"(\Bibe)"/2 | 256319 | 199507 | 1.28 |
| bm_lorem_search/R"(\Bibe)"/3 | 531250 | 409807 | 1.30 |
| bm_lorem_search/R"(\Bibe)"/4 | 1045850 | 837054 | 1.25 |
| bm_lorem_search/R"((?=….)bibe)"/2 | 452080 | 53125 | 8.51 |
| bm_lorem_search/R"((?=….)bibe)"/3 | 889369 | 102534 | 8.67 |
| bm_lorem_search/R"((?=….)bibe)"/4 | 1759380 | 244849 | 7.19 |
| bm_lorem_search/R"((?=bibe)….)"/2 | 306920 | 59988.8 | 5.12 |
| bm_lorem_search/R"((?=bibe)….)"/3 | 599888 | 98349.4 | 6.10 |
| bm_lorem_search/R"((?=bibe)….)"/4 | 1339290 | 194972 | 6.87 |
| bm_lorem_search/R"((?!lorem)bibe)"/2 | 455097 | 48828.1 | 9.32 |
| bm_lorem_search/R"((?!lorem)bibe)"/3 | 906808 | 96256.9 | 9.42 |
| bm_lorem_search/R"((?!lorem)bibe)"/4 | 1843160 | 194972 | 9.45 |
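
For readers who want to try a comparable measurement without the repository's benchmark harness, here is a rough sketch using only the standard library. It is not the bm_lorem_search benchmark: repeat_text, timing, time_search, and speedup are hypothetical names, the haystack construction is an assumption, and nanoseconds are this sketch's own choice of unit. The speedup column above is consistent with before divided by after.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical: grow the haystack by repeating a paragraph n times.
inline std::string repeat_text(const std::string& paragraph, std::size_t n) {
    std::string out;
    out.reserve(paragraph.size() * n);
    for (std::size_t i = 0; i < n; ++i) {
        out += paragraph;
    }
    return out;
}

struct timing {
    std::size_t hits;     // number of searches that found a match
    double ns_per_search; // average wall-clock time per regex_search call
};

// Time repeated regex_search calls over the same haystack.
inline timing time_search(const std::string& haystack, const std::regex& re, std::size_t iterations) {
    using clock = std::chrono::steady_clock;
    std::size_t hits = 0;
    const auto start = clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        if (std::regex_search(haystack, re)) {
            ++hits;
        }
    }
    const auto stop = clock::now();
    const double total_ns = std::chrono::duration<double, std::nano>(stop - start).count();
    return {hits, iterations == 0 ? 0.0 : total_ns / static_cast<double>(iterations)};
}

// Speedup as used in the table: time before divided by time after.
inline double speedup(double before, double after) {
    assert(after > 0.0);
    return before / after;
}
```

Placing the only match at the very end of a long haystack makes the search spend most of its time skipping non-matching positions, which is the part this PR changes.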

@muellerj2 muellerj2 requested a review from a team as a code owner June 7, 2025 22:36
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Jun 7, 2025
@StephanTLavavej StephanTLavavej added performance Must go faster regex meow is a substring of homeowner labels Jun 8, 2025
@StephanTLavavej StephanTLavavej self-assigned this Jun 8, 2025
@StephanTLavavej StephanTLavavej removed their assignment Jun 10, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Jun 10, 2025
@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Jun 11, 2025
@StephanTLavavej
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej
Member

I resolved a trivial adjacent-add conflict with #5535 in <regex>.

@StephanTLavavej StephanTLavavej merged commit a4c5e3f into microsoft:main Jun 14, 2025
39 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews Jun 14, 2025
@StephanTLavavej
Member

Must go faster! 🚗 🦖 😻

@muellerj2 muellerj2 deleted the regex-speedup-assertion-searches branch June 16, 2025 21:31