<html>
<head>
<link rel="alternate" title="Ocean of Awareness RSS" type="application/rss+xml" title="RSS" href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/index.rss" />
<title>Ocean of Awareness</title>
<style type="text/css">
strong {font-weight: 700;}
</style>
</head>
<body>
<div
style="color:white;background-color:#38B0C0;padding:1em;clear:left;text-align:center;">
<h1>Ocean of Awareness</h1>
</div>
<div style="margin:0;padding:10px 30px 10px 10px;width:150px;float:left;border-right:2px solid #38B0C0">
<p>
<strong>Jeffrey Kegler's blog</strong>
about Marpa, his new parsing algorithm,
and other topics of interest</p>
<p><a href="http://www.jeffreykegler.com/">Jeffrey's personal website</a></p>
<p>
<a href="https://twitter.com/jeffreykegler" class="twitter-follow-button" data-show-count="false">Follow @jeffreykegler</a>
</p>
<p style="text-align:center">
<!-- Place this code where you want the badge to render. -->
<a href="//plus.google.com/101567692867247957860?prsrc=3" rel="publisher" style="text-decoration:none;">
<img src="//ssl.gstatic.com/images/icons/gplus-32.png" alt="Google+" style="border:0;width:32px;height:32px;"/></a>
</p>
<h3>Marpa resources</h3>
<p><a href="http://jeffreykegler.github.com/Marpa-web-site/">The Marpa website</a></p>
<p>The Ocean of Awareness blog: <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog">home page</a>,
<a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/metapages/chronological.html">chronological index</a>,
and
<a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/metapages/annotated.html">annotated index</a>.
</p>
</div>
<div style="margin-left:190px;border-left:2px solid #38B0C0;padding:25px;">
<h3>Sun, 20 Dec 2015</h3>
<br />
<center><a name="topdown"> <h2>Top-down parsing is guessing</h2> </a>
</center>
<html>
<head>
</head>
<body><p>
<!--
marpa_r2_html_fmt --no-added-tag-comment --no-ws-ok-after-start-tag
-->
</p>
<p>
Top-down parsing is guessing. Literally.
Bottom-up parsing is looking.
</p>
<p>The way you'll often hear that phrased is that top-down parsing is
looking, starting at the top,
and bottom-up parsing is looking, starting at the bottom.
But that is misleading, because the input is at the bottom --
at the top there is nothing to look at.
A usable top-down parser
<b>must</b>
have a bottom-up component,
even if that component is just lookahead.
</p>
<p>A more generous, but still accurate, way to describe the top-down
component of parsers is "prediction".
And prediction is, indeed, a very useful component of a parser,
when used in combination with other techniques.
</p>
<p>Of course, if a parser does nothing but predict,
it can predict only one input.
Top-down parsing must always be combined with a bottom-up
component.
This bottom-up component may be as modest as lookahead, but it
<b>must</b>
be there or else top-down parsing is really not parsing at all.
</p>
<h2>So why is top-down parsing used so much?</h2>
<p>Top-down parsing may be unusable in its pure form,
but from one point of view that is irrelevant.
Top-down parsing's biggest advantage is that it is highly flexible --
there's no reason to stick to its "pure" form.
</p>
<p>A top-down parser can be written as a series of subroutine calls --
a technique called recursive descent.
Recursive descent
allows you to hook in custom-written bottom-up logic at every
top-down choice point,
and it is a technique which is
completely understandable to programmers with little or no training
in parsing theory.
When dealing with recursive descent parsers,
it is more useful to be a seasoned, far-thinking programmer
than it is to be a mathematician.
This makes recursive descent very appealing to
seasoned, far-thinking programmers,
and they are the audience that counts.
</p>
<h2>Switching techniques</h2>
<p>You can even use the flexibility of top-down to switch
away from top-down parsing.
For example, you could claim that a top-down parser could do anything my
own parser
(<a href="http://jeffreykegler.github.io/Marpa-web-site/">Marpa</a>)
could do, because a recursive descent parser can call
a Marpa parser.
</p>
<p>
A less dramatic switchoff, and one that still leaves the parser with a good claim to be
basically top-down, is very common.
Arithmetic expressions are essential for a computer language.
But they are also
among the many things
top-down parsing cannot handle, even with ordinary lookahead.
Even so, most computer languages these days are parsed top-down --
by recursive descent.
These recursive descent parsers deal with expressions
by temporarily handing control over to a bottom-up operator
precedence parser.
Neither of these parsers is
extremely smart about the hand-over and hand-back
-- it is up to the programmer to make sure the two play together nicely.
But used with caution, this approach works.
</p>
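<p>
To make the hand-over concrete, here is a minimal sketch of the technique
(my own illustration, not code from this post; the token format and all names
are made up for the example).
A recursive descent routine parses the statement level,
and delegates expressions to a bottom-up precedence-climbing routine:
</p><blockquote><pre>
use strict;
use warnings;

# Hypothetical token stream for the input "x = 2 + 3 * 4 ;"
my @tokens = ( ['ID','x'], ['=','='], ['NUM',2], ['+','+'],
               ['NUM',3], ['*','*'], ['NUM',4], [';',';'] );
my %prec = ( '+' => 1, '-' => 1, '*' => 2, '/' => 2 );

sub peek       { return $tokens[0] }
sub next_token { return shift @tokens }

# Top-down part: one hand-written subroutine per rule.
sub parse_assignment {
    my $name = next_token();           # ID
    next_token();                      # '='
    my $value = parse_expression(0);   # hand control to the bottom-up part
    next_token();                      # ';'
    return [ 'assign', $name->[1], $value ];
}

# Bottom-up part: operator precedence, by "precedence climbing".
sub parse_expression {
    my ($min_prec) = @_;
    my $left = next_token()->[1];      # a NUM, in this toy grammar
    while ( defined peek()
            and exists $prec{ peek()->[0] }
            and $prec{ peek()->[0] } >= $min_prec )
    {
        my $op    = next_token()->[0];
        my $right = parse_expression( $prec{$op} + 1 );
        $left = [ $op, $left, $right ];
    }
    return $left;
}

my $tree = parse_assignment();
# $tree is [ 'assign', 'x', [ '+', 2, [ '*', 3, 4 ] ] ]
</pre></blockquote>
<p>
The point of the sketch is the single call in the middle:
the top-down parser decides
<b>where</b>
an expression starts,
and the operator precedence parser decides
<b>what</b>
it contains.
Keeping that boundary clean is the part left to the programmer.
</p>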
<h2>Top-down parsing and language-oriented programming</h2>
<p>But what about taking top-down methods into the future of language-oriented programming,
extensible languages, and grammars which write grammars?
Here we are forced to confront the reality -- that the effectiveness of
top-down parsing comes entirely from the foreign elements that are added to it.
Starting from a basis of top-down parsing is literally starting
with nothing.
As I have shown in more detail
<a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2015/12/composable.html">elsewhere</a>,
top-down techniques simply do not have enough horsepower to deal with grammar-driven programming.
</p>
<p>Perl 6 grammars are top-down -- PEG with lots of extensions.
These extensions
include backtracking, backtracking control,
a new style of tie-breaking and lots of opportunity
for the programmer to intervene and customize everything.
But behind it all is a top-down parse engine.
</p>
<p>One aspect of Perl 6 grammars might be seen as breaking
out of the top-down trap.
That trick of switching over to a
bottom-up operator precedence parser for expressions,
which I mentioned above,
is built into Perl 6 and semi-automated.
(I say semi-automated because making sure the two parsers "play nice"
with each other is not automated -- that's still up to the programmer.)
</p>
<p>As far as I know, this semi-automation of expression handling
is new with Perl 6 grammars, and it
may prove handy for duplicating what is done
in recursive descent parsers.
But it adds no new technique to those already in use.
And features
like
</p><ul>
<li>multiple types of expression, which can be told apart
based on their context,</li>
<li><i>n</i>-ary expressions for arbitrary
<i>n</i>, and</li>
<li>the autogeneration of multiple rules, each allowing
a different precedence scheme,
for expressions of arbitrary arity and associativity,
</li>
</ul>
<p>
all of which are available and in current use in
<a href="http://jeffreykegler.github.io/Marpa-web-site/">Marpa</a>,
are impossible for the technology behind Perl 6 grammars.
</p>
<p>I am a fan of the Perl 6 effort.
Obviously, I have doubts about one specific set of hopes for Perl 6 grammars.
But these hopes have not been central to the Perl 6 effort,
and I will be an eager student of the Perl 6 team's work over the coming months.
</p>
<h2>Comments</h2>
<p>
To learn more about Marpa,
there's
<a href="http://savage.net.au/Marpa.html">the
official web site maintained by Ron Savage</a>.
I also have
<a href="http://jeffreykegler.github.io/Marpa-web-site/">a Marpa web site</a>.
Comments on this post can be made in
<a href="http://groups.google.com/group/marpa-parser">
Marpa's Google group</a>,
or on our IRC channel: #marpa at freenode.net.
</p>
</body>
</html>
<br />
<p>posted at: 15:21 |
<a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2015/12/topdown.html">direct link to this entry</a>
</p>
<div style="color:#38B0C0;padding:1px;text-align:center;">
§
§
§
</div>
<h3>Sun, 13 Dec 2015</h3>
<br />
<center><a name="composable"> <h2>Grammar reuse</h2> </a>
</center>
<html>
<head>
</head>
<body><p>
<!--
marpa_r2_html_fmt --no-added-tag-comment --no-ws-ok-after-start-tag
-->
</p>
<p>Every year the Perl 6 community creates an "Advent" series of posts.
I always follow these, but
<a href="https://perl6advent.wordpress.com/2015/12/08/day-8-grammars-generating-grammars/">one
in particular</a>
caught my attention this year.
It presents a vision of a future where programming is language-driven.
A vision that I share.
The post went on to encourage its readers to follow up on this vision,
and suggested an approach.
But I do not think the particular approach suggested would be fruitful.
In this post I'll explain why.
</p>
<h2>Reuse</h2>
<p>The focus of the Advent post was language-driven programming,
and that is the aspect that excites me most.
But the points that I wish to make are more easily understood if
I root them in a narrower, but more familiar issue
-- grammar reuse.
</p>
<p>
Most programmers will be very familiar with grammar reuse from regular expressions.
In the regular expression ("RE") world,
programming by cutting and pasting
is very practical and often practiced.
</p>
<p>
For this post I will consider grammar reusability to be the ability
to join two grammars and create a third.
This is also sometimes called grammar composition.
For this purpose, I will widen the term "grammar" to include
RE's and PEG parser specifications.
Ideally, when you compose two grammars,
what you get is
</p><ul>
<li>a language you can reasonably predict, and
</li>
<li>if each of the two original grammars can be parsed in reasonable time,
a language that can be parsed in reasonable time.
</li>
</ul>
<p>
Not all language representations are reusable.
RE's are, and BNF is.
PEG looks like a combination of BNF and RE's,
but PEG, in fact, is its own very special form of parser specification.
And PEG parser specifications are
one of the least reusable language representations ever invented.
</p>
<h2>Reuse and regular expressions</h2>
<p>RE's are as well-behaved under
reuse as a language representation can get.
The combination of two RE's is always another RE,
and you can reasonably determine what language the combined RE recognizes by
examining it.
Further, every RE is parseable in linear time.
</p>
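<p>
As a small illustration (mine, not from this post),
composing RE's in Perl is just interpolation,
and the result is itself an RE whose language you can
predict by reading it:
</p><blockquote><pre>
use strict;
use warnings;

# Two small RE's, written independently (illustrative only)
my $name_re   = qr/[A-Za-z_]\w*/;
my $number_re = qr/\d+(?:[.]\d+)?/;

# Their composition is just another RE
my $assignment_re = qr/^\s*$name_re\s*=\s*$number_re\s*$/;

print "match\n" if 'answer = 42' =~ $assignment_re;
</pre></blockquote>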
<p>The one downside, often mentioned by critics, is that RE's
do not scale in terms of readability.
Here, however, the problem is not really one of reusability.
The problem is that RE's are quite limited in their capabilities,
and programmers often exploit the excellent behavior of RE's under reuse
to push them into applications for which RE's just do not have the power.
</p>
<h2>Reuse and PEG</h2>
<p>When programmers first look at PEG syntax, they often think they've encountered
paradise. They see both BNF and RE's, and imagine they'll have the
best of each.
But the convenient behavior of
RE's depends on their unambiguity.
You simply cannot write
an ambiguous RE -- it's impossible.
</p>
<p>
More powerful and more flexible, BNF allows you to describe many more grammars --
including ambiguous ones.
How does PEG resolve this? With a Gordian knot approach.
Whenever it encounters an ambiguity, it throws all but one of the choices away.
The author of the PEG specification gets some control over what is thrown away --
he specifies an order of preference for the choices.
But the degree of control is less than it seems,
and in practice PEG is
the nitroglycerin of parsing -- marvelous when it works,
but tricky and dangerous.
</p>
<p>
Consider these 3 PEG specifications:
</p><blockquote><pre>
("a"|"aa")"a"
("aa"|"a")"a"
A = "a"A"a"/"aa"</pre></blockquote>
<p>
All three clearly accept only strings which are repetitions
of the letter "<tt>a</tt>".
But which strings?
For the answers,
suggestions for dealing with PEG if you are committed to it,
and more,
look at
<a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2015/03/peg.html">
my previous post on PEG</a>.
</p>
<p>
When getting an RE or a BNF grammar to work,
you can go back to the grammar and ask
yourself "Does my grammar look like my intended language?".
With PEG, this is not really possible.
With practice, you might get used to figuring out single line PEG specs
like the first two above.
(If you can get the last one, you're amazing.)
But tracing these through the multiple rule layers required by useful grammars is,
in practice, not really possible.
</p>
<p>
In real life,
PEG specifications are written by hacking them until the test suite works.
And, once you get a PEG specification to pass the test suite for a practical-sized grammar,
you are very happy to leave it alone.
Trying to compose two PEG specifications is rolling the dice with the odds against you.
</p>
<h2>Reuse and the native Perl 6 parser</h2>
<p>
The native Perl 6 parser is an extended PEG parser.
The extensions are very interesting from the PEG point of view.
The PEG "tie breaking" has been changed,
and backtracking can be used.
These features mean the Perl 6 parser can be extended to languages
well beyond what
ordinary PEG parsers can handle.
But, if you use the extra features, reuse will be even trickier than
if you stuck with vanilla PEG.
</p>
<h2>Reuse and general BNF parsing</h2>
<p>
As mentioned, general BNF is reusable, and so general BNF parsers like
<a href="http://savage.net.au/Marpa.html">Marpa</a>
are as reusable as regular expressions, with two caveats.
First, if the two grammars are not doing their own lexing, their lexers will have
to be compatible.
</p>
<p>Second,
with regular expressions you had the advantage that
<b>every</b>
regular expression parses in linear time, so that speed was guaranteed to be acceptable.
Marpa users reuse grammars and pieces of grammars all the time.
The result is always the language specified by the merged BNF,
and I've never heard anyone complain that performance deteriorated.
</p><p>
But, while it may not happen often,
it is possible to combine two Marpa grammars that run in linear time
and end up with one that does not.
You can guarantee your merged Marpa grammar will stay linear if you follow two rules:
</p><ul>
<li>keep the grammar unambiguous;</li>
<li>don't use an unmarked middle recursion.</li>
</ul>
<p>
Unmarked middle recursions are not things you're likely to need a lot: they
are those palindromes where you have to count to find the middle:
grammars like "<tt>A ::= a | a A a</tt>".
If you use a middle recursion at all, it is almost certainly going to
be marked, like "<tt>A ::= b | a A a</tt>",
which generates strings like "<tt>aabaa</tt>".
With Marpa, as with RE's, reuse is easy and practical.
And, as I hope to show in a future post, unlike RE's,
Marpa opens the road to language-driven programming.
</p>
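<p>
As a sketch of what reuse looks like in practice
(my own example, using Marpa::R2's Scanless interface;
the two grammar fragments and all names are made up for illustration),
composition can be as simple as concatenating the BNF sources:
</p><blockquote><pre>
use strict;
use warnings;
use Marpa::R2;

# Two BNF fragments, written separately
my $expression_bnf = q{
    expression ::= number op number
    op         ~ [-+*/]
    number     ~ [\d]+
};
my $assignment_bnf = q{
    assignment ::= name '=' expression
    name       ~ [a-zA-Z]+
};

# Composition: concatenate the BNF and pick a start symbol
my $dsl = q{
    :default ::= action => ::array
    :start ::= assignment
    :discard ~ whitespace
    whitespace ~ [\s]+
} . $assignment_bnf . $expression_bnf;

my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
my $input   = 'total = 2 + 40';
$recce->read( \$input );
my $value = ${ $recce->value() };   # nested arrays for the merged language
</pre></blockquote>
<p>
The resulting language is exactly the one the merged BNF describes,
which is the point of the exercise.
</p>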
<h2>Perl 6</h2>
<p>I'm a fan of the Perl 6 effort.
I certainly
<b>should</b>
be a supporter, after the many favors they've done for me
and the Marpa community over the years.
The considerations of this post
will disappoint some of the
hopes for applications of the native Perl 6 parser.
But these applications have not been central to the Perl 6 effort,
of which I will be an eager student over the coming months.
</p>
<h2>Comments</h2>
<p>
To learn more about Marpa,
there's
<a href="http://savage.net.au/Marpa.html">the
official web site maintained by Ron Savage</a>.
I also have
<a href="http://jeffreykegler.github.io/Marpa-web-site/">a Marpa web site</a>.
Comments on this post can be made in
<a href="http://groups.google.com/group/marpa-parser">
Marpa's Google group</a>,
or on our IRC channel: #marpa at freenode.net.
</p>
</body>
</html>
<br />
<p>posted at: 10:13 |
<a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2015/12/composable.html">direct link to this entry</a>
</p>
<div style="color:#38B0C0;padding:1px;text-align:center;">
§
§
§
</div>
<h3>Sat, 29 Aug 2015</h3>
<br />
<center><a name="fast_handy"> <h2>Fast handy languages</h2> </a>
</center>
<html>
<head>
</head>
<body><p>
<!--
marpa_r2_html_fmt --no-added-tag-comment --no-ws-ok-after-start-tag
-->
</p>
<p>Back around 1980, I had access to UNIX and a language I wanted to parse.
I knew that UNIX had all the latest CS tools.
So I expected to type in my BNF and "Presto, Language!".
</p>
<p>Not so easy, I was told.
Languages were difficult things created with complex tools
written by experts who understood the issues.
I recall thinking that,
while English has a syntax that is
as hard as they come,
toddlers manage to parse it
just fine.
But experts are experts,
and more so at second-hand.
</p>
<p>I was steered to an LALR-based parser called yacc.
(Readers may be more familiar with bison, a yacc successor.)
LALR had extended the class of quickly parseable grammars a bit
beyond recursive descent.
But recursive descent was simple in principle,
and its limits were easy to discover and work around.
LALR, on the other hand, was OK when it worked, but
figuring out why it failed when it failed
was more like decryption than debugging,
and this was the case both with parser development
and run-time errors.
I soon gave up on yacc
and found another way to solve my problem.
</p>
<p>Few people complained about yacc on the Internet.
If you noise it about that you are unable
to figure out how to use
what everybody says is the state-of-the-art tool,
the conclusions drawn may not be the ones you want.
But my experience seems to have been more than common.
</p>
<p>LALR's claim to fame was that it was the basis of the
industry-standard C compiler.
Over three decades,
its maintainers suffered amid the general silence.
But by 2006, they'd had enough.
GCC (the new industry standard)
ripped its LALR engine out.
By then
<a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2014/09/chron.html">the
trend back to recursive descent</a>
was well underway.
</p>
<h3>A surprise discovery</h3>
<p>Back in the 1970's,
there had been more powerful alternatives
to LALR and recursive descent.
But they were
<a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2013/04/fast_enough.html">reputed
to be slow</a>.
</p>
<p>For some applications slow is OK.
In 2007 I decided that a parsing tool that parsed
all context-free languages at state-of-the-art speeds,
slow or fast as the case might be,
would be a useful addition to programmer toolkits.
And I ran into a surprise.
</p>
<p>Hidden in the literature was an amazing discovery --
a 1991 article by Joop Leo that
described how to modify Earley's
algorithm to be fast for every language class in practical use.
(When I say "fast" in this article, I will mean "linear".)
Leo's article had been almost completely ignored --
my project (<a href="http://savage.net.au/Marpa.html">Marpa</a>)
would become its first
practical implementation.
</p>
<h3>Second-order languages</h3>
<p>The implications of Leo's discovery go well beyond speed.
If you can rely on the BNF that you write always producing
a practical parser, you can auto-generate your language.
In fact,
you can write languages which write languages.
</p>
<h3>Which languages are fast?</h3>
<p>The Leo/Earley algorithm is not fast
for every BNF-expressible language.
BNF is powerful, and you can write exponentially
ambiguous languages in it.
But programmers these days
mostly care about unambiguous languages --
we are accustomed to tools and techniques
that parse only a subset of these.
</p>
<p>
As I've said, Marpa is fast for every language
class in practical use today.
Marpa is almost certainly fast for any language
that a modern programmer has in mind.
Unless you peek ahead at the hints I am about to give you,
in fact, it is actually
<b>hard</b>
to write an unambiguous
grammar that goes non-linear on Marpa.
Simply mixing up lots of left, right and middle recursions
will
<b>not</b>
be enough to make an
unambiguous grammar go non-linear.
You will also need to violate a rule
in the set that
I am about to give you.
</p>
<p>To guarantee that Marpa is fast for your BNF language,
follow three rules:
</p>
<ul>
<li>Rule 1: Your BNF must be unambiguous.
</li>
<li>Rule 2: Your BNF must have no "unmarked" middle recursions.
</li>
<li>Rule 3: All of the right-recursive symbols
in your BNF must be dedicated
to right recursion.
</li>
</ul>
<p>Rule 3 turns out to be very easy to obey.
I discuss it in detail in the next section,
which will be about how to break these rules and
get away with it.
</p>
<p>Before we do that,
let's look at what an "unmarked" middle recursion is.
Here's an example of a "marked" middle recursion:
</p><blockquote><pre>
M ::= 'b'
M ::= 'a' M 'a'
</pre></blockquote><p>
Here the "b" symbol is the marker.
This marked middle recursion generates sequences like
</p><blockquote><pre>
b
a b a
a a b a a
</pre></blockquote>
<p>Now for an "unmarked" middle recursion:
</p><blockquote><pre>
M ::= 'a' 'a'
M ::= 'a' M 'a'
</pre></blockquote><p>
This unmarked middle recursion generates sequences like
</p><blockquote><pre>
a a
a a a a
a a a a a a
</pre></blockquote><p>
In this middle recursion there is no marker.
To know where the middle is,
you have to scan all the way to the end,
and then count back.
</p>
<p>A rule of thumb is that if you can "eyeball" the middle
of a long sequence,
the recursion is marked.
If you can't, it is unmarked.
Unfortunately, we can't characterize exactly what a marker
must look like -- a marker can encode the moves of a Turing machine,
so marker detection is undecidable.
</p>
<h3>How to get away with breaking the rules</h3>
<p>The rules about ambiguity and recursions are "soft".
If you only use limited ambiguity and
keep your rule-breaking recursions short,
your parser will stay fast.
</p>
<p>Above, I promised to explain rule 3, which insisted that
a right recursive symbol be "dedicated".
A right recursive symbol is "dedicated" if it appears only
as the recursive symbol in a right recursion.
If your grammar is unambiguous, but you've used an "undedicated"
right-recursive symbol, that is easy to fix.
Just rewrite the grammar, replacing the "undedicated" symbol
with two different symbols.
Dedicate one of the two to the right recursion,
and use the other symbol everywhere else.
</p>
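<p>
As an illustration of that rewrite
(my own reading of the rule, not an example from this post),
suppose the right-recursive symbol
"<tt>item_list</tt>"
is also used in other rules:
</p><blockquote><pre>
statement ::= 'print' item_list
statement ::= item_list '=' item_list
item_list ::= item
item_list ::= item item_list
</pre></blockquote><p>
Splitting the symbol in two,
and dedicating one of the new symbols to the right recursion,
might look like this:
</p><blockquote><pre>
statement ::= 'print' item_list
statement ::= item_list '=' item_list
item_list ::= item_seq
item_seq  ::= item
item_seq  ::= item item_seq
</pre></blockquote><p>
Here
"<tt>item_seq</tt>"
appears only at the right edge of rules,
dedicated to the recursion,
while
"<tt>item_list</tt>"
is the symbol used everywhere else.
</p>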
<h3>When NOT to use Marpa</h3>
<p>The languages I have described as "fast" for Marpa
include all those in practical use and many more.
But do you really want to use Marpa for all of them?
Here are four cases for which Marpa is probably not
your best alternative.
</p>
<p>The first case: a language that parses easily with a regular
expression.
The regular expression will be much faster.
Don't walk away from a good thing.
</p>
<p>The second case:
a language
that is easily parsed using a single
loop and some state that fits into constant space.
This parser might be very easy to write and maintain.
If you are using a much slower higher level language,
Marpa's optimized C language
may be a win on CPU speed.
But, as before, if you have a good thing,
don't walk away from it.
</p><p>The third case:
a variation on the second.
Here your single loop might be getting out of hand,
making you yearn for the syntax-driven convenience
of Marpa,
but your state still fits into constant space.
In its current implementation, Marpa keeps all of its
parse tables forever, so Marpa does
<b>not</b>
parse in constant space.
Keeping the tables
allows Marpa to deal with the full structure of its
input, in a way that SAX-ish approaches cannot.
But if space requirements are an issue,
and your application allows a simplified constant-space
approach,
Marpa's power and convenience may not be enough to
make up for that.
</p>
<p>The fourth case:
a language that
</p><ul>
<li>is very small;
</li>
<li>changes slowly or not at all, and does not grow in complexity;
</li>
<li>merits careful hand-optimization, and has available the staff
to do it;
</li>
<li>merits and has available the kind of on-going support that will
keep your code optimized under changing circumstances; and
</li>
<li>is easily parseable via recursive descent.</li>
</ul>
<p>
It is rare that all of these are the case,
but when that happens,
recursive descent is often preferable to Marpa.
Lua and JSON
are two languages which meet the above criteria.
In Lua's case, it targets platforms with very restricted memories,
which is an additional reason to prefer recursive descent --
Marpa has a relatively large footprint.
</p>
<p>It was not good luck that made
both Lua and JSON good targets for recursive descent --
they were designed around its limits.
JSON is a favorite test target of Marpa for just these reasons.
There are carefully hand-optimized C language parsers for us to
benchmark against.
</p>
<p>We get closer and closer,
but Marpa will
never beat small hand-optimized JSON parsers in software.
However, while recursive descent is a technique for hand-writing parsers,
Marpa is a mathematical algorithm.
Someday,
instructions for manipulating Earley items could be implemented directly
in silicon.
When and if that day comes,
Earley's algorithm will beat recursive descent even at
parsing the grammars that were designed for it.
</p>
<h3>Comments</h3>
<p>Comments on this post can be made in
<a href="http://groups.google.com/group/marpa-parser">
Marpa's Google group</a>,
or on our IRC channel: #marpa at freenode.net.
To learn more about Marpa,
there's
<a href="http://savage.net.au/Marpa.html">the
official web site maintained by Ron Savage</a>.
I also have
<a href="http://jeffreykegler.github.io/Marpa-web-site/">a Marpa web site</a>.
</p>
</body>
</html>
<br />
<p>posted at: 16:36 |
<a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2015/08/fast_handy.html">direct link to this entry</a>
</p>
<div style="color:#38B0C0;padding:1px;text-align:center;">
§
§
§
</div>
<h3>Tue, 10 Mar 2015</h3>
<br />
<center><a name="linear"> <h2>Linear? Yeah right.</h2> </a>
</center>
<html>
<head>
</head>
<body><p>
<!--
marpa_r2_html_fmt --no-added-tag-comment --no-ws-ok-after-start-tag
-->
</p>
<h3>Linear?</h3>
<p>I have claimed that my new parser,
<a href="http://savage.net.au/Marpa.html">Marpa</a>,
is linear for vast classes of grammars,
going well beyond what the traditional parsers can do.
But skepticism is justified.
When it comes to parsing algorithms,
there have been a lot of time complexity claims
that are hand-wavy, misleading or just
plain false.
This post describes how someone,
who
is exercising the appropriate degree of skepticism,
might conclude that believing Marpa's claims is a reasonable
and prudent thing to do.
</p>
<p>
Marpa's linearity claims seem to be,
in comparison with the other parsers in practical use
today,
bold.
Marpa claims linearity,
not just for every class of grammar for which
yacc/bison, PEG and recursive descent currently claim linearity,
but for considerably more.
(The mathematical details of these claims
are in a
<a href="#DETAILS">section at the end</a>.)
It seems too good to be true.
</p>
<h3>Why should I believe you?</h3>
<p>The most important thing to realize,
in assessing the believability of Marpa's time complexity claims,
is that they are not new.
They were already proved in a long-accepted paper in the refereed literature.
They are the time complexity claims proved by Joop Leo for his algorithm
in 1991, over two decades ago.
Marpa is derived from Leo's algorithm, and its time complexity claims
are those proved for Leo's algorithm.
</p>
<p>Above I said that Marpa's time complexity claims "seem" bold.
On any objective assessment, they are in fact a bit of a yawn.
The claims
<b>seem</b>
surprising only because a lot of people
are unaware of Leo's results.
That is, they are surprising in the same sense that someone who
had avoided hearing about radio waves would be surprised
to learn that he can
communicate instantly
with someone on the other side of the world.
</p>
<p>So, if there's so little to prove, why does the
<a href="https://www.academia.edu/10341474/Marpa_A_practical_general_parser_the_recognizer">
Marpa paper</a>
have proofs?
In Marpa, I made many implementation decisions about,
and some changes to, the Leo/Earley algorithm.
None of my changes produced better time complexity results --
my only claim is that I did not change the Leo/Earley
algorithm in a way that slowed it down.
To convince myself of this claim, I reworked
the original proofs of
Leo and Earley,
changing them to reflect my changes,
and demonstrated that the results
that Leo had obtained still held.
</p>
<p>Proofs of this kind,
which introduce no new mathematical techniques,
but simply take a previous result and march from
here to there by well-known means,
are called "tedious".
In journals, where there's a need to conserve space,
they are usually omitted,
especially if,
as is the case with Marpa's time complexity proofs,
the results are intuitively quite plausible.
</p>
<h3>Getting from plausible to near-certain</h3>
<p>So let's say you are not going to work through every line
of Marpa's admittedly tedious proofs.
We've seen that the results are intuitively plausible,
as long as you don't reject the previous literature.
But can we do better than merely "plausible"?
</p><p>As an aside, many people misunderstand the phrase
"mathematically proven", especially as it applies to branches
of math like parsing theory.
The fact is that proofs in papers often contain errors.
Usually these are minor,
and don't affect the result.
On the other hand, Jay Earley's paper,
while one of the best Computer Science papers ever published,
also contained a very
serious error.
And this error slipped past his Ph.D. committee and
his referees.
Mathematical arguments and proofs do not allow us to achieve
absolute certainty.
They can only improve the degree of certainty.
</p>
<p>There's a second way to dramatically increase
your degree of conviction
in Marpa's linearity claims, and it is quite simple.
Create examples of problematic grammars,
run them and time them.
This is not as satisfying as a mathematical proof,
because no set of test grammars can be exhaustive.
But if you can't find a counter-example
to Marpa's linearity claims among the grammars of
most interest to you,
that should help lift
your level of certainty to "certain for
all practical purposes".
</p>
<p>Much of this increase in certainty can be
achieved without bothering to run your own tests.
Marpa is in wide use at this point.
If Marpa was going quadratic on grammars
for which it claimed to be linear,
and these were grammars of practical interest,
that would very likely have been noticed by now.
</p>
<h3>I'm still not convinced</h3>
<p>Let's suppose all this has not brought you to
the level of certainty you need to use Marpa.
That means the reasonable thing is to continue to
struggle to work with the restrictions of the
traditional algorithms, right?
No, absolutely not.
</p>
<p>OK, so you don't believe that Marpa preserves
the advances in power and speed made by Leo.
Does that mean that parsers have to stay underpowered?
No, it simply means that there should be a
more direct implementation of Leo's 1991 algorithm,
bypassing Marpa.
</p><p>But if you are looking for an implementation of Leo's
1991 algorithm,
I think you may end up coming back to Marpa as the most
reasonable choice.
Marpa's additional features
include the ability to use custom,
procedural logic,
as you can with recursive descent.
And Marpa has worked out a lot of the
implementation details for you.
</p><h3>Comments</h3>
<p>Comments on this post can be made in
<a href="http://groups.google.com/group/marpa-parser">
Marpa's Google group</a>,
or on our IRC channel: #marpa at freenode.net.
To learn more about Marpa,
there's
<a href="http://savage.net.au/Marpa.html">the
official web site maintained by Ron Savage</a>.
I also have
<a href="http://jeffreykegler.github.io/Marpa-web-site/">a Marpa web site</a>.
</p>
<h3>Appendix: Some technical details</h3>
<p>
<a name="DETAILS">Above</a>
I talked about algorithms, classes of grammars and their
linearity claims.
I didn't give details because most folks aren't interested.
For those who are, they are in this section.
</p>
<p>
yacc is linear for a grammar class called LALR,
which is a subset of another grammar class
called LR(1).
If you are willing to hassle with GLR,
bison claims linearity for all of LR(1).
Recursive descent is a technique, not an algorithm,
but it is top-down with look-ahead,
and therefore can be seen as some form of LL(k),
where k depends on how it is implemented.
In practice, I suspect k is never much bigger than 3,
and usually pretty close to 1.
With packratting,
PEG can be made linear for everything it
parses but there is a catch --
<a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2015/03/peg.html">
only in limited cases do you know
what language your PEG grammar actually parses</a>.
In current practice, that means your PEG grammar
must be LL(1).
Some of the PEG literature looks at techniques for
extending this as far as LL-regular, but there are no
implementations, and it remains to be seen if the
algorithms described are practical.
</p>
<p>
The
<a href="https://www.academia.edu/10341474/Marpa_A_practical_general_parser_the_recognizer">
Marpa paper</a>
contains a proof,
based on a proof of the same claim by
Joop Leo,
that Marpa is linear for LR-regular grammars.
The LR-regular grammars
include the LR(k) grammars for every k.
So Marpa is linear for LR(1), LR(2), LR(8675309), etc.
LR-regular also includes LL-regular.
So every class of grammar under discussion
in the PEG literature is
already parsed in linear time by Marpa.
From this,
it is also safe to conclude that,
if a grammar can be parsed by
anything reasonably described as recursive descent,
it can be parsed in linear time by Marpa.
</p>
</body>
</html>
<br />
<p>posted at: 22:01 |
<a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2015/03/linear.html">direct link to this entry</a>
</p>
<div style="color:#38B0C0;padding:1px;text-align:center;">
§
§
§
</div>
<h3>Mon, 02 Mar 2015</h3>
<br />
<center><a name="peg"> <h2>PEG: Ambiguity, precision and confusion</h2> </a>
</center>
<html>
<head>
</head>
<body><p>
<!--
marpa_r2_html_fmt --no-added-tag-comment --no-ws-ok-after-start-tag
-->
</p>
<h3>Precise?</h3>
<p><a href="http://bford.info/packrat/">PEG parsing</a>
is a new notation
for a notoriously tricky algorithm that goes back
to the earliest computers.
In its PEG form,
this algorithm acquired a seductive new interface,
one that looks like the best of
extended BNF combined with the best of regular expressions.
Looking at a sample of it, you are tempted to imagine
that writing a parser has suddenly become a very straightforward
matter. Not so.
</p>
<p>For those not yet in the know on this,
I'll illustrate with a pair of examples from
<a href="http://www.romanredz.se/papers/FI2008.pdf">
an excellent 2008 paper by Redziejowski</a>.
Let's start with these two PEG specifications.
</p><blockquote><pre>
("a"|"aa")"a"
("aa"|"a")"a"
</pre></blockquote><p>
One of these two PEG grammars accepts
the string "<tt>aaa</tt>" but not the string "<tt>aa</tt>".
The other does the opposite -- it accepts
the string "<tt>aa</tt>" but not the string "<tt>aaa</tt>".
Can you tell which one?
(For the answer,
see page 4 of
<a href="http://www.romanredz.se/papers/FI2008.pdf">
Redziejowski 2008</a>.)
</p><p>
Here is another example:
</p><blockquote><pre>
A = "a"A"a"/"aa"
</pre></blockquote><p>