Description
Split from #5452. Other than the performance fix in #5457, I listed a few optimizations that could be investigated for _Skip
in this comment:
- The evaluation of
_N_if
should really do a breadth-first rather than a depth-first search. I think the most effective way to achieve that in the long term is to construct something equivalent to an_N_class
node (or perhaps two) that matches all possible characters that could appear at the start of a match of a regular expression, and then make_Skip
only evaluate this special node. This special node could either be computed at the end of parsing (to be stored in an extended version of the root node) or at the beginning of matching.- The
_N_str
implementation should be translated into an actual call tofind
orfind_if
(depending on syntax option flags) or perhaps even be lowered to a call tosearch
._Skip
should not just always give up when encountering an_N_endif
(or_N_rep
) node, but this requires some analysis when it is appropriate to keep going.
I also mentioned dynamic programming/memoization as a potential alternative to breadth-first search in a follow-up comment:
We memoize the first possible match for each branch and only evaluate again when we move beyond this position. That means we need to retain some kind of map between nodes and positions. Recognizing if we moved beyond a memoized position is complicated by the fact that the iterators on the searched string might not be random access iterators.
But I no longer think this alternative is viable because we can't retain this cache between regex_search
calls, so memoization is not sufficient to avoid quadratic running time in all cases when the skip optimization tests all possible branches.
In addition, this approach also penalizes cases where regex_search
is only called once or so and the regex (but not the first branch in the regex) matches close to the start of the searched string. In this case, this approach spends up to a linear amount of time to test the first branch, even though most of the string doesn't have to be read at all.