Skip to content

Latest commit

 

History

History
19 lines (14 loc) · 4.13 KB

README.md

File metadata and controls

19 lines (14 loc) · 4.13 KB

regexer

Regular expression engine written in C#. You will need Visual Studio 2012 to open the soloution and Doxygen to build the documentation. The solution contains a test project with simple tests as well.

Features

  • Capturing Groups: You can enclose specific parts of the pattern in round brackets, creating a capture group; when you match a string you will have a collection of the content of the capture groups along with other information such as the position of the first and last character. Groups can be optionally named with the following syntax: (?<name>pattern), this allows groups to be retrieved by name as well as by index. Group indexing starts from 1, with 0 being the whole match. For more info see RegexMatch, RegexGroup and buildMatch().
  • Match Replacement: You can search for specific patterns in a string and replace them with other content; be it the value returned by a function or a specific string. In both cases, you will have access to the matched part of the string either as a parameter of the function or through a special syntax in the replacement string: {n}, where n is the 1-based index of the capture group (0 is the whole match). For more info see the two replacement overloads, with a function and with a string.
  • Lookarounds: also called zero-width assertions, allow you to check if a certain pattern is found or not found before or after the current cursor location but without moving the cursor itself (which is another way of saying that the matched part is discarded). Thus there are four possibilities: positive lookahead (?=pattern), negative lookahead (?!pattern), positive lookbehind (?<=pattern) and negative lookbehind (?<!pattern). See here for a password validation regex using lookarounds; the actual matching logic is implemented in LookaheadToken and LookbehindToken.
  • Convenience Wrappers allowing a more flexible and comfortable usage; you will find static methods in the Regex class following .NET's regular expression interface and extension methods on strings for added clarity and ease of use.
  • Standard Syntax: Pattern are specified using the standard syntax for regexes; the or operator |, quantifiers ?, *, +, and {m,n} with the usual shortcuts, start of string ^ and end of string $, named groups (?<name>pattern) and character classes \w, \W, \d, \D, \s, \S, [pattern] and [^pattern].

Interesting points

The most interesting parts are the compilation stage, where the pattern is first parsed into tokens then converted to a syntax tree and the matching logic itself, where the tree is traversed and "polimorphic recursion" (let's call it this way) shows all its power.

One of the most challenging aspects of regular expression matching is backtracking. Backtracking is necessary because greedy quantifiers are really too gredy and eat up too many characters of the input string, making the following parts of the pattern to fail matching; moreover, lazy quantifiers turn out to be really too lazy and can make the match fail as well. For example, consider the string aaaaaa and the pattern (a+)a. The matching starts with the first capturing group, which matches the whole input; following, a fails to match because we are at the end of the string, so there is no character a to match. But this is obviously wrong, as the input actually matches the pattern! The solution is to ask the first group, a+, to kindly back off a little so that a can match the last character. Things get interesting for more complicated patterns, such as a+a+a+a+; try to think a situation where a lazy quantifier is being too lazy and screwing things up.