A scanner/lexer with sufficient regex support, minimal compile time and support of multiple scanner modes

About scnr

This crate provides a scanner/lexer with sufficient regex support and minimal compile time. The scanners support multiple scanner modes out of the box. Scanner modes are known from Lex/Flex as Start conditions.

The crate is still in an early phase and not yet ready for production. Early adopters can use it quite safely; if you find a bug, please report it.

How to use it

use scnr::ScannerBuilder;

static PATTERNS: &[&str] = &[
    r";",                          // Semicolon
    r"0|[1-9][0-9]*",              // Number
    r"//.*(\r\n|\r|\n)",           // Line comment
    r"/\*([.\r\n--*]|\*[^/])*\*/", // Block comment
    r"[a-zA-Z_]\w*",               // Identifier
    r"=",                          // Assignment
];

const INPUT: &str = r#"
// This is a comment
a = 10;
b = 20;
/* This is a block comment
   that spans multiple lines */
c = a;
"#;

fn main() {
    let scanner = ScannerBuilder::new()
        .add_patterns(PATTERNS)
        .build()
        .expect("ScannerBuilder error");
    let find_iter = scanner.find_iter(INPUT);
    for ma in find_iter {
        println!("Match: {:?}: '{}'", ma, &INPUT[ma.span().range()]);
    }
}

Guard rails

  • The scanners should be built quickly.
  • The scanners are based solely on DFAs; no backtracking is implemented.
  • The scanners will probably never support u8 input, i.e. both the patterns and the input must be of a type convertible to &str. We concentrate on programming languages rather than byte sequences.

Not supported regex features

We don't support anchored matches, i.e. ^, $, \b, \B, \A, \z and so on are not available. Mostly this can be tolerated because of the overall properties of the scanner; in particular, the fact that the longest match wins mitigates the need for such anchors.

To elaborate on this a bit more:

Let's say you have a pattern for the keyword if and a pattern for an identifier /[a-zA-Z_][a-zA-Z0-9_]*/. Both could match the input if, but the keyword will win if, and only if, its pattern is inserted before the identifier's pattern. If the scanner encounters an input like ifi, the identifier will match because of the longest match rule. With these guarantees it is simply unnecessary to declare the keyword 'if' with attached word boundaries (\b).
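
A small sketch of this behavior, reusing the ScannerBuilder API from the usage example above (it assumes that the token types reported in the Match debug output correspond to the indices in the pattern slice):

use scnr::ScannerBuilder;

// The keyword pattern comes first, the identifier pattern second.
static PATTERNS: &[&str] = &[
    r"if",                     // Keyword
    r"[a-zA-Z_][a-zA-Z0-9_]*", // Identifier
];

const INPUT: &str = "if ifi";

fn main() {
    let scanner = ScannerBuilder::new()
        .add_patterns(PATTERNS)
        .build()
        .expect("ScannerBuilder error");
    // "if"  -> the keyword wins, because its pattern is inserted first.
    // "ifi" -> the identifier wins, because the longest match wins.
    for ma in scanner.find_iter(INPUT) {
        println!("Match: {:?}: '{}'", ma, &INPUT[ma.span().range()]);
    }
}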

Also, we currently don't support flags (i, m, s, R, U, u, x), as in r"(?i)a+(?-i)b+". I need to evaluate whether this is a problem, but at the moment I believe it is tolerable.

There is no need for capture groups in the context of token matching, so I see no necessity to implement this feature.

Not supported Flex features

As follows from the regex restrictions above, the anchors ^ and $ are currently not supported.

Lookahead

As of version 0.3.4 scnr supports trailing contexts, like in Flex, e.g. ab/cd.

In addition to what Flex provides, scnr supports not only positive lookahead but also negative lookahead.

These lookahead constraints can be configured via the Pattern struct, which now contains an optional member lookahead. The inner type of the Option is Lookahead, which contains a pattern string and a flag that determines whether the lookahead pattern should match (positive lookahead) or not match (negative lookahead).

With the help of a positive lookahead you can define a semantic like

match pattern R only if it is followed by pattern S

On the other hand with a negative lookahead you can define a semantic like

match pattern R only if it is NOT followed by pattern S

The lookahead pattern, denoted above as S, is not considered part of the matched string.

Greediness of repetitions

Some words about greediness.

Normal Lex/Flex POSIX matching is greedy. It adheres, in some sense, to the longest match rule, but it imposes backtracking overhead on the scanner's runtime.

Since scnr works with minimized DFAs only (this is the current situation and may change), it always matches repetitions like * and + non-greedily.

Exit conditions on repetitions

However, you have to be very specific about the content of the repeated expression, in the sense that the transition from the repeated expression to the following part of the regular expression should be unambiguous.

Let's have a look at this regex with a repeated expression of . in the middle.

/\*.*\*/

The DFA for this looks like this:

[Figure CppComments1: DFA for /\*.*\*/]

The crucial point is state 3, where it depends on the input whether the automaton continues the repetition or proceeds with the following part, here state 1. But . matches * too, which introduces an ambiguity that contradicts the common notion of a deterministic finite automaton. How this is resolved depends on the implementation of the scanner runtime, so it should clearly be avoided.

So, the first thing we can do is to be more precise about the content of the repeated expression. We can remove the * from the .:

/\*[.--*]*\*/

[Figure CppComments2: DFA for /\*[.--*]*\*/]

This looks more deterministic, but it reveals another problem, one that was actually already inherent in the first variant.

Scanning the following input will mess up the match:

/* a* */

The scanner enters state 1 when it reads the * after the a and then fails to match the space where it instead expects a /. The reason is that the repeated expression doesn't take the part that follows it into account.

So, we need to become more specific about this aspect, too:

/\*([.--*]|\*[^/])*\*/

This says that the repeated expression is any character except *, or a * followed by a character other than /.

[Figure CppComments3: DFA for /\*([.--*]|\*[^/])*\*/]

This solution does the job perfectly, because its automaton is able to return to the repetition if the exit condition fails.
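
This can be checked with the same API as in the usage example above; the whole comment, including the inner *, should come back as a single match:

use scnr::ScannerBuilder;

// The refined block comment pattern: the automaton can return to the
// repetition when a `*` is not followed by `/`.
static PATTERNS: &[&str] = &[r"/\*([.--*]|\*[^/])*\*/"];

const INPUT: &str = "/* a* */";

fn main() {
    let scanner = ScannerBuilder::new()
        .add_patterns(PATTERNS)
        .build()
        .expect("ScannerBuilder error");
    for ma in scanner.find_iter(INPUT) {
        println!("Match: {:?}: '{}'", ma, &INPUT[ma.span().range()]);
    }
}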

Scanner modes

A more flexible, but also slightly more complex, approach to the obstacles mentioned above (ambiguity in exit conditions and handling of the following expression inside the repeated expression) is to introduce a second scanner mode. It is entered on the comment start /\\*, handles all tokens inside a comment, and enters INITIAL mode again on the comment end \\*/.

The scanner modes can be defined for instance in json:

[
  {
    "name": "INITIAL",
    "patterns": [["/\\*", 1]],
    "transitions": [[1, 1]]
  },
  {
    "name": "COMMENT",
    "patterns": [
      ["\\*/", 2],
      ["[.\\r\\n]", 3]
    ],
    "transitions": [[2, 0]]
  }
]

Note that this kind of JSON data can be deserialized into Vec<ScannerMode> thanks to serde and serde_json.

Here you see two modes. The scanner always starts in mode 0, usually INITIAL. When it encounters token type 1, the comment start, it switches to mode 1, COMMENT. In this mode the comment end token type 2 has higher precedence than the [.\\r\\n] token type 3, simply because it has a lower index in the patterns slice. On token type 2 the scanner switches back to mode INITIAL. All other input is covered by token type 3, the comment content.

In this scenario you can imagine that the parser knows that token type 3 is comment content and can handle it accordingly.
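
The following is a minimal sketch of how these mode definitions can be wired up, assuming serde_json as a dependency; the builder method name add_scanner_modes used here is an assumption for illustration, so please check the crate documentation for the exact name:

use scnr::{ScannerBuilder, ScannerMode};

const MODES: &str = r#"
[
  {
    "name": "INITIAL",
    "patterns": [["/\\*", 1]],
    "transitions": [[1, 1]]
  },
  {
    "name": "COMMENT",
    "patterns": [
      ["\\*/", 2],
      ["[.\\r\\n]", 3]
    ],
    "transitions": [[2, 0]]
  }
]
"#;

const INPUT: &str = "/* a comment */";

fn main() {
    // Deserialize the mode definitions shown above.
    let modes: Vec<ScannerMode> =
        serde_json::from_str(MODES).expect("invalid scanner mode JSON");
    // NOTE: `add_scanner_modes` is an assumed method name for registering the
    // deserialized modes with the builder.
    let scanner = ScannerBuilder::new()
        .add_scanner_modes(&modes)
        .build()
        .expect("ScannerBuilder error");
    // The scanner switches to COMMENT mode on token type 1 and back to
    // INITIAL on token type 2; the comment content comes out as token type 3.
    for ma in scanner.find_iter(INPUT) {
        println!("Match: {:?}: '{}'", ma, &INPUT[ma.span().range()]);
    }
}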
