Skip to content
blackwhale edited this page Aug 22, 2011 · 21 revisions

Welcome to FReD's wiki. FReD is an acronym for "Fast Regular Expressions for D", accepted GSOC 2011 project originally named Enhance regular expressions.

Intro

Having regular expression (regex) engine in standard library is totally expected for any modern language. The solid and performant implementation of regex is one of the greatest points of pride (if not of sale).

While concrete results and benchmarks are frequently biased, there are some general properties. Regexes usually came in two flavors: Finite Automation (FA, be it deterministic or non-deterministic) and backtracking, each of these with their respective set of traits. The popular implementation strategy is Virtual Machine approach, which is quite extendable for further optimization.

FA are more stable O(N) in time on input, however they are usually more limited as implementing features (such as back references) becomes problematic, also not to mention the time for constructing DFA from pattern. Backtracking allows easily implementing some interesting features like back referencing, lookahead/lookbehind and such, but has a dreadful exponential time on input in worst case.

Now speaking of D, what the standard library currently has is essentially an optimized recursive backtracking based on VM. There are plenty not implemented features that it may very well have had. Also there are a lot of issues that need proper resolution.

Basically, the goal was seen as making new fast and robust regular expression library that leverages D's great capabilities in generic programming. Also hopefully using modern D2 idioms.

Originally proposed things

Enhance and bugfix the existing engine in std.regex

I spent sometime fixing and adding advanced stuff to it (lookahead & non-grouping parens). Later on I uncovered a couple of fundamental flaws in it that make this undertaking sort of unfeasible.

Still, that was great tour on hard to detect regex implementation bugs and common delusions.

Make alternative FA based from scratch

This turns out to be the focal point of the project. Currently the best concieved candidate for this is VM-based Thompson-style NFA engine. Doing a backtracking replacement engine is in question.

Static regexes

This involves using CTFE capabilities of D to do a lot of various sensible things if pattern is known at compile time. The obvious part consists in things like compiling regex and creating all relevant data structures. More interestingly it could do an expensive analysis and optimizations, not to mention using specialy tailored engine.

Actual design decisions

  • Unicode support

The relevant standard is UTS 18. FReD must provide Level 1 support. Level 2 is going to be done partialy with emphasis on the most useful things and having the ability to explicitly turn this extra stuff on/off for performance reasons. UTF-8, UTF-16 & UTF32 should all be supported transparently (like the actual std implementation tries to do). Normalization issues are better be factored out into separate UTF normalizer.

Good human-readable overview is here: http://www.regular-expressions.info/unicode.html

  • Syntax

ECMA 262 compatible syntax (think JavaScript) with a couple of popular extensions like named match groups (as oposed to just numbered), probably free spacing mode a-la perl and comments.

  • StreamInterface, a way to support streaming regexs

  • IR, an intermediate representation of regex

  • two engines: traditional backtracking and Thompson NFA, with the latter chosen as the default

  • static regex, that generates a streamlined native code for backtracker engine via mixins