
<html>
<head>
<link rel="alternate" title="Ocean of Awareness RSS" type="application/rss+xml" href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/index.rss" />
<title>Ocean of Awareness</title>
<style type="text/css">
   strong {font-weight: 700;}
</style>
</head>
<body>
<div
  style="color:white;background-color:#38B0C0;padding:1em;clear:left;text-align:center;">
<h1>Ocean of Awareness</h1>
</div>
  <div style="margin:0;padding:10px 30px 10px 10px;width:150px;float:left;border-right:2px solid #38B0C0">
  <p>
  <strong>Jeffrey Kegler's blog</strong>
  about Marpa, his new parsing algorithm,
    and other topics of interest</p>
  <p><a href="http://jeffreykegler.github.io/personal/">Jeffrey's personal website</a></p>
      <p>
	<a href="https://twitter.com/jeffreykegler" class="twitter-follow-button" data-show-count="false">Follow @jeffreykegler</a>
      </p>
      <p style="text-align:center">
	<!-- Place this code where you want the badge to render. -->
	<a href="//plus.google.com/101567692867247957860?prsrc=3" rel="publisher" style="text-decoration:none;">
	<img src="//ssl.gstatic.com/images/icons/gplus-32.png" alt="Google+" style="border:0;width:32px;height:32px;"/></a>
      </p>
  <h3>Marpa resources</h3>
  <p><a href="http://jeffreykegler.github.io/Marpa-web-site/">The Marpa website</a></p>
  <p>The Ocean of Awareness blog: <a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog">home page</a>,
  <a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/metapages/chronological.html">chronological index</a>,
  and
  <a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/metapages/annotated.html">annotated index</a>.
  </p>
  </div>
  <div style="margin-left:190px;border-left:2px solid #38B0C0;padding:25px;">
<h3>Sat, 01 Jun 2019</h3>
<br />
<center><a name="vgap"> <h2>Infinite Lookahead and Ruby Slippers</h2> </a>
</center>
<html>
  <head>
  </head>
  <body>
    <!--
      marpa_r2_html_fmt --no-added-tag-comment --no-ws-ok-after-start-tag
      -->
    <h2>About this post</h2>
    <p>This post presents a practical, compact example which
    demonstrates a use case for both infinite lookahead
    and Ruby Slippers parsing.
    While the example itself is very simple,
    this post may not be a good first tutorial --
    it focuses on Marpa implementation strategy,
    instead of basics.
    </p>
    <h2>About Urbit</h2>
    <p>The example described in this post is one part of
    <tt>hoonlint</tt>.
    <tt>hoonlint</tt>, currently under development,
    will be a "lint" program for a language called Hoon.
    </p>
    <p>
    Hoon is part of
    <a href="https://urbit.org/">the Urbit project</a>.
    Urbit is an effort to return control of the Internet
    experience to the individual user.
    (The Urbit community has, generously, been supporting my work on Hoon.)
    </p>
    <p>
    The original Internet and its predecessors were cosy places.
    Users controlled their experience.
    Authority was so light you
    could forget it was there,
    but so adequate to its task that you could forget why it
    was necessary.
    What we old timers do remember of the early Internet was the feeling of entering
    into a "brave new world".
    </p>
    <p>
The Internet grew beyond our imaginings,
and our pure wonder of decades ago now seems ridiculous.
But the price has been a shift
of power which should be no laughing matter.
Control of our Internet experience now resides in
servers,
run by entities which make no secret of having their own interests.
Less overt, but increasingly obvious, is the single-mindedness with which they pursue
those interests.
</p>
<p>
And the stakes have risen.
In the early days,
we used the Internet as a supplement in our intellectual lives.
Today, we depend on it in our financial and social lives.
Today, the server-sphere can be a hostile  place.
Going forward it may well become a theater of war.
    </p>
    <p>We could try to solve this problem by running our own servers.
    But this is a lot of work, and only leaves us in touch with those
    willing and able to do that.  In practice, this seems to be nobody.
    </p>
    <p>
Urbit seeks to solve these problems with
hassle-free personal servers, called urbits.
Urbits are journaling databases, so they are incorruptible.
To make sure they can be run anywhere in the cloud<a id="footnote-1-ref" href="#footnote-1">[1]</a>,
they are based on a tiny virtual machine, called Nock.
To keep urbits compact and secure,
Urbit takes on code bloat directly --
Urbit is an original design from a clean slate,
with a new protocol stack.
    </p>
    <h2>About Hoon</h2>
    <p>
    Nock's "machine language" takes the form of trees of arbitrary precision integers.
    The integers can be interpreted as strings, floats, etc.,
    as desired.
    And the trees can be interpreted as lists,
    giving Nock a resemblance to a LISP VM.
    Nock does its own memory management
    and takes care of its own garbage collection.<a id="footnote-2-ref" href="#footnote-2">[2]</a>
    </p>
    <p>
    Traditionally, there are two ways to enter machine language:
    <ul>
    <li>Physically, for example,
    by toggling it into a machine's front panel.
    Originally, entering it physically was the only way.
    </li>
    <li>Indirectly, using
    assembler or some higher-level language, like C.
    Once these indirect methods existed, they
    rapidly took over as the most common way to create machine language.
    </li>
    </ul>
    Like traditional
    machine language, Nock cannot be written directly.
    Hoon is Urbit's equivalent of C -- it is Urbit's
    "close to the metal" higher level language.
    </p>
    <p>
    Not that Hoon looks much like C,
    or anything else you've ever seen.
    This is a Hoon program that takes an integer argument,
    call it <tt>n</tt>,
    and returns the counting numbers from 1 up to, but not including, <tt>n</tt>:
    <pre><tt>
    |=  end=@                                               ::  1
    =/  count=@  1                                          ::  2
    |-                                                      ::  3
    ^-  (list @)                                            ::  4
    ?:  =(end count)                                        ::  5
      ~                                                     ::  6
    :-  count                                               ::  7
    $(count (add 1 count))                                  ::  8
    </tt></pre>
    </p>
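    <p>
    For readers more at home in a mainstream language,
    the Hoon sample can be read roughly as the Python sketch below.
    It is an illustration rather than a faithful translation:
    the function name is invented here,
    and a loop stands in for Hoon's recursion.
    As written, both versions stop as soon as <tt>count</tt> equals <tt>end</tt>,
    so the list runs from 1 up to <tt>end - 1</tt>.
    </p>

```python
def counting_numbers(end: int) -> list[int]:
    """Rough Python reading of the Hoon sample (hypothetical name)."""
    result = []
    count = 1                     # Hoon line 2: =/  count=@  1
    while count != end:           # Hoon line 5: ?:  =(end count)
        result.append(count)      # Hoon line 7: :-  count
        count += 1                # Hoon line 8: $(count (add 1 count))
    return result                 # Hoon line 6: ~ ends the list
```

    <p>
    For example, <tt>counting_numbers(5)</tt> yields <tt>[1, 2, 3, 4]</tt>.
    </p>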
    <p>
    Hoon comments begin with a "<tt>::</tt>" and run until the next
    newline.
    The above Hoon sample uses comments to show line numbers.
    </p>
    <p>
    The example for this post will be
    a <tt>hoonlint</tt> subset: a multi-line comment linter.
    Multi-line comments are the only Hoon syntax we will talk about.
    (For those who want to know more about Hoon,
    <a href="https://urbit.org/docs/learn/hoon/">there is a tutorial</a>.)
    </p>
    <p>
    </p>
    <h2>About Hoon comments</h2>
    <p>
    In basic Hoon syntax, multi-line comments are free-form.
    In practice, Hoon authors tend to follow a set of conventions.
    </p>
    <h3>Pre-comments</h3>
    <p>
    In the simplest case, a comment must precede the code it
    describes, and be at the same indent.
    These simple cases are called "pre-comments".<a id="footnote-3-ref" href="#footnote-3">[3]</a>
    For example, this code contains a pre-comment:
    <pre><tt>
	  :: pre-comment 1
	  [20 (mug bod)]
    </tt></pre>
    </p>
    <h3>Inter-comments</h3>
    <p>
    Hoon multi-line comments may also
    contain "inter-comments".
    The inter-comments are aligned depending on the syntax.
    In the display below, the inter-comments are aligned with the "rune" of the enclosing sequence.
    A "rune" is Hoon's rough equivalent of a "keyword".
    Runes are always digraphs of special ASCII characters.
    The rune in the following code is
    <tt>:~</tt>,
    and the sequence it introduces
    includes pre-comments, inter-comments and meta-comments.
    </p>
    <pre><tt>
      :~  [3 7]
      ::
	  :: pre-comment 1
	  [20 (mug bod)]
      ::
	  :: pre-comment 2
	  [2 yax]
      ::
	  :: pre-comment 3
	  [2 qax]
    ::::
    ::    :: pre-comment 4
    ::    [4 qax]
      ::
	  :: pre-comment 5
	  [5 tay]
      ==
    </tt></pre>
    <p>
    When inter-comments are empty, as they are in the above,
    they are called "breathing comments", because they serve to separate,
    or allow some "air" between, elements of a sequence.
    For clarity,
    the pre-comments in the above are further indicated:
    all and only pre-comments contain the text "<tt>pre-comment</tt>".
    </p>
    <h3>Meta-comments</h3>
    <p>
    The above code also contains a third kind of comment -- meta-comments.
    Meta-comments must occur at the far left margin -- at column 1.
    These are called meta-comments, because they are allowed
    to be outside the syntax structure.
    One common use for meta-comments is "commenting out" other syntax.
    In the above display, the meta-comments "comment out"
    the comment labeled "<tt>pre-comment 4</tt>"
    and its associated code.
    </p>
    <h3>Staircase comments</h3>
    <p>Finally, there are "staircase comments", which are used
    to indicate the larger structure of Hoon sequences and other
    code.
    For example,
    <pre><tt>
    ::                                                      ::
    ::::  3e: AES encryption  (XX removed)                  ::
      ::                                                    ::
      ::
    ::                                                      ::
    ::::  3f: scrambling                                    ::
      ::                                                    ::
      ::    ob                                              ::
      ::
    </tt> </pre>
    <p>
    Each staircase consists of three parts.
    In lexical order, these parts are
    an upper riser,
    a tread, and a lower riser.
    The upper riser is a sequence of comments at the same
    alignment as an inter-comment.
    The tread is also at the inter-comment alignment,
    but must be 4 colons ("<tt>::::</tt>") followed
    by whitespace.
    The lower riser is a sequence of comments
    indented two spaces more than the tread.
    </p>
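    <p>
    The three-part shape of a staircase can be checked mechanically.
    What follows is a minimal Python sketch, not the <tt>hoonlint</tt> code.
    The helper names are hypothetical, columns are 0-based,
    indentation is assumed to use spaces only,
    and a bare "<tt>::::</tt>" at the end of a line is assumed to count
    as a tread.
    The sketch recognizes exactly one staircase.
    </p>

```python
def comment_column(line):
    """Return the 0-based column of a leading '::', or None if not a comment."""
    stripped = line.lstrip(" ")
    if not stripped.startswith("::"):
        return None
    return len(line) - len(stripped)


def is_tread(line, column):
    """A tread is exactly four colons at `column`, then whitespace or end of line."""
    if comment_column(line) != column:
        return False
    rest = line[column + 4:]
    return line[column:column + 4] == "::::" and (rest == "" or rest[0].isspace())


def is_staircase(lines, column):
    """Check for upper riser(s), one tread, then lower riser(s), and nothing else."""
    index = 0
    # Upper riser: comments at the inter-comment column that are not treads.
    while (index != len(lines)
           and comment_column(lines[index]) == column
           and not is_tread(lines[index], column)):
        index += 1
    if index == 0 or index == len(lines) or not is_tread(lines[index], column):
        return False
    index += 1  # step over the tread
    lower_start = index
    # Lower riser: comments indented two spaces more than the tread.
    while index != len(lines) and comment_column(lines[index]) == column + 2:
        index += 1
    return index != lower_start and index == len(lines)
```

    <p>
    Fed the second staircase from the display above, with the inter-comment
    alignment taken as column 0, <tt>is_staircase</tt> returns true.
    Dropping the lower riser or the upper riser makes it return false.
    </p>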
    <h2>Hoon comment conventions</h2>
    <p>Hoon's basic syntax allows comments to be free-form.
    In practice, there are strict conventions for these comments,
    conventions we would like to enforce with <tt>hoonlint</tt>.
    <ol>
    <li>A multi-line comment may contain
    an "inter-part", a "pre-part",
    or both.
    </li>
    <li>If both an inter-part and a pre-part are present,
    the inter-part must precede the pre-part.
    </li>
    <li>The inter-part is a non-empty sequence of inter-comments
    and staircases.
    </li>
    <li>A pre-part is a non-empty sequence of pre-comments.
    </li>
    <li>Meta-comments may be inserted anywhere in either the pre-part
    or the inter-part.
    </li>
    <li>Comments which do not obey the above rules are
    <b>bad comments</b>.
    A <b>good comment</b> is any comment which is not a bad comment.
    </li>
    <li>A comment is not regarded as a meta-comment
    if it can be parsed as a structural comment.
    A <b>structural comment</b> is any good comment which is
    not a meta-comment.
    </li>
    </ol>
    <h2>Grammar</h2>
    <p>We will implement these conventions using the BNF
    of this section.
    The sections to follow outline the strategy behind the BNF.
    <pre><tt>
    :start ::= gapComments
    gapComments ::= OptExceptions Body
    gapComments ::= OptExceptions
    Body ::= InterPart PrePart
    Body ::= InterPart
    Body ::= PrePart
    InterPart ::= InterComponent
    InterPart ::= InterruptedInterComponents
    InterPart ::= InterruptedInterComponents InterComponent

    InterruptedInterComponents ::= InterruptedInterComponent+
    InterruptedInterComponent ::= InterComponent Exceptions
    InterComponent ::= Staircases
    InterComponent ::= Staircases InterComments
    InterComponent ::= InterComments

    InterComments ::= InterComment+

    Staircases ::= Staircase+
    Staircase ::= UpperRisers Tread LowerRisers
    UpperRisers ::= UpperRiser+
    LowerRisers ::= LowerRiser+

    PrePart ::= ProperPreComponent OptPreComponents
    ProperPreComponent ::= PreComment
    OptPreComponents ::= PreComponent*
    PreComponent ::= ProperPreComponent
    PreComponent ::= Exception

    OptExceptions ::= Exception*
    Exceptions ::= Exception+
    Exception ::= MetaComment
    Exception ::= BadComment
    Exception ::= BlankLine
    </tt></pre>
    <h2>Technique: Combinator</h2>
    <p>Our comment linter is implemented as a combinator.
    The main <tt>hoonlint</tt> parser invokes this combinator when it encounters
    a multi-line comment.
    Because of the main parser,
    we do not have to worry about confusing comments with
    Hoon's various string and in-line text syntaxes.
    </p>
    <p>Note that while combinator parsing is useful,
    it is a technique that can be oversold.
    Combinators have been much talked about in the functional programming
    literature<a id="footnote-4-ref" href="#footnote-4">[4]</a>,
    but the current flagship functional programming language compiler,
    the Glasgow Haskell Compiler,
    does not use combinators to parse its version of Haskell --
    instead it uses a parser in the yacc lineage.<a id="footnote-5-ref" href="#footnote-5">[5]</a>
    As a parsing technique on its own,
    the use of combinators is simply another way of packaging recursive
    descent with backtracking,
    and the two techniques share the same power,
    the same performance,
    and the same downsides.
    </p>
    <p>Marpa is much more powerful than either LALR (yacc-lineage) parsers or combinators,
    so we can save combinator parsing for those cases where
    combinator parsing really is helpful.
    One such case is lexer mismatch.
    </p>
    <h3>Lexer mismatch</h3>
    <p>Early programming languages, like FORTRAN and BASIC,
    were line-structured -- designed to be parsed line-by-line.<a id="footnote-6-ref" href="#footnote-6">[6]</a>
    After ALGOL, new languages were usually block-structured.
    Blocks can start or end in the middle of a line,
    and can span multiple lines.
    And blocks are often nested.
    </p>
    <p>A line-structured language requires its lexer to think in
    terms of lines,
    but this approach is completely useless for a block-structured
    language.
    Combining both line-structured and block-structured logic in the same lexer
    usually turns the lexer's code into a rat's nest.
    </p>
    <p>Calling a combinator every time
    a line-structured block is encountered eliminates the problem.
    The main lexer can assume that the code is block-structured,
    and all the line-by-line logic can go into combinators.
    </p>
    <h2>Technique: Non-determinism</h2>
    <p>
    Our grammar is non-deterministic,
    but unambiguous.
    It is unambiguous because,
    for every input,
    it will produce no more than one parse.
    </p>
    <p>
    It is non-deterministic because there is a case
    where it tracks two possible parses at once.
    The comment linter cannot immediately distinguish between
    a prefix of the upper riser of a staircase,
    and a prefix of a sequence of inter-comments.
    When a tread and lower riser are encountered,
    the parser knows it has found a staircase,
    but not until then.
    And if the parse is of an inter-comment sequence,
    the comment linter will
    not be sure of this until the end of the sequence.
    </p>
    <h2>Technique: Infinite lookahead</h2>
    <p>
    As just pointed out,
    the comment linter does not know whether it is parsing a staircase or
    an inter-comment sequence until either
    <ul>
    <li>it finds a tread and lower riser, in which case
    it knows the correct parse will be a staircase; or
    </li>
    <li>it successfully reaches the end of the inter-comment sequence,
    in which case it knows the correct parse is an inter-comment sequence.
    </li>
    </ul>
    To determine which of these two choices is the correct parse,
    the linter needs to read
    an arbitrarily long sequence of tokens --
    in other words, the linter needs to perform infinite lookahead.
    </p>
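    <p>
    The idea of carrying both readings forward can be shown with a toy
    state-set simulation in Python.
    This is only a sketch: it is neither Leo's algorithm nor Marpa's
    implementation, and the line labels (<tt>aligned</tt>, <tt>tread</tt>,
    <tt>lower</tt>) and state names are invented for the illustration.
    A line labeled <tt>aligned</tt> is a comment at the inter-comment alignment,
    which fits both hypotheses.
    </p>

```python
# (state, line label) -> set of next states. Missing entries mean "dead end".
TRANSITIONS = {
    ("start", "aligned"): {"inter", "upper"},   # both readings begin alive
    ("inter", "aligned"): {"inter"},            # still an inter-comment sequence
    ("upper", "aligned"): {"upper"},            # still an upper riser
    ("upper", "tread"): {"tread"},              # only a staircase has a tread
    ("tread", "lower"): {"lower"},
    ("lower", "lower"): {"lower"},
}


def track(labels):
    """Return the set of live states after each line."""
    live = {"start"}
    history = []
    for label in labels:
        live = set().union(*(TRANSITIONS.get((state, label), set()) for state in live))
        history.append(frozenset(live))
    return history


def verdict(labels):
    """Decide, at end of input, which reading survived (or None)."""
    final = track(labels)[-1] if labels else frozenset()
    if "lower" in final:
        return "staircase"
    if "inter" in final:
        return "inter-comments"
    return None
```

    <p>
    However many <tt>aligned</tt> lines are read, two states stay live.
    The choice is made only when a tread and lower riser arrive,
    or when the input ends.
    </p>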
    <p>Humans deal with infinite lookaheads all the time --
    natural languages are full of situations that require them.<a id="footnote-7-ref" href="#footnote-7">[7]</a>
    Modern language designers labor to avoid the need
    for infinite lookahead,
    but even so
    cases where it is desirable pop up.<a id="footnote-8-ref" href="#footnote-8">[8]</a>
    </p>
    <p>
    Fortunately, in 1991, Joop Leo published a method that
    allows computers to emulate infinite lookahead efficiently.
    Marpa uses Joop's technique.
    Joop's algorithm is complex,
    but the basic idea is to do what humans do in the same circumstance --
    keep all the possibilities in mind until the evidence comes in.
    </p>
    <p>
    </p>
    <h2>Technique: the Ruby Slippers</h2>
    <p>Recall that, according to our conventions,
    our parser does not recognize a meta-comment unless
    no structural comment can be recognized.
    We could implement this in BNF,
    but it is much more elegant to use the Ruby Slippers.<a id="footnote-9-ref" href="#footnote-9">[9]</a>
    </p>
    <p>As those already familiar with Marpa may recall,
    the Ruby Slippers are invoked when a Marpa parser finds itself
    unable to proceed with its current set of input tokens.
    At this point, the lexer can ask the Marpa parser what token it <b>does</b> want.
    Once the lexer is told what the "wished-for" token is,
    it can concoct one, out of nowhere if necessary, and pass it to the Marpa parser,
    which then proceeds happily.
    In effect, the lexer acts like Glinda the Good Witch of Oz,
    while the Marpa parser plays the role of Dorothy.
    </p>
    <p>In our implementation, the Marpa parser, by default,
    looks only for structural comments.
    If the Marpa parser of our
    comment linter finds that the current input line is not
    a structural comment,
    the Marpa parser halts
    and tells the lexer that there is a problem.
    The lexer then asks the Marpa parser what it is looking for.
    In this case, the answer will always be the same:
    the Marpa parser will be looking for a meta-comment.
    The lexer checks to see if the current line is a comment
    starting at column 1.
    If there is a comment starting at column 1,
    the lexer tells the Marpa parser that its wish has come true --
    there is a meta-comment.
    </p>
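    <p>
    The exchange between lexer and parser can be sketched in Python.
    The <tt>ToyParser</tt> class below is a hypothetical stand-in and is
    not the Marpa API. All of its method names are invented,
    and "structural" is reduced to "a comment starting at one of the
    expected columns" (0-based, spaces only).
    </p>

```python
def comment_column(line):
    """Return the 0-based column of a leading '::', or None if not a comment."""
    stripped = line.lstrip(" ")
    if not stripped.startswith("::"):
        return None
    return len(line) - len(stripped)


class ToyParser:
    """Hypothetical stand-in for the parser. This is not the Marpa API."""

    def __init__(self, structural_columns):
        self.structural_columns = set(structural_columns)
        self.tokens = []

    def read_structural(self, line):
        """Try to read the line as a structural comment; False means rejection."""
        if comment_column(line) in self.structural_columns:
            self.tokens.append("StructuralComment")
            return True
        return False

    def wished_for(self):
        """After a rejection, the only token this toy parser wishes for."""
        return "MetaComment"

    def read(self, token):
        self.tokens.append(token)


def lex_line(parser, line):
    """Feed one line to the parser, granting its wish when possible."""
    if parser.read_structural(line):
        return "StructuralComment"
    wish = parser.wished_for()              # the lexer asks what the parser wants
    if wish == "MetaComment" and comment_column(line) == 0:
        parser.read(wish)                   # the wish comes true
        return wish
    return None                             # no wish can be granted for this line
```

    <p>
    If column 0 is itself one of the structural columns,
    a comment at the left margin is read as structural and never as a
    meta-comment, which matches the convention stated earlier.
    </p>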
    <p>Another way to view the Ruby Slippers is as a kind of exception
    mechanism for grammars.
    In this application, we treat inability to read a structural
    comment as an exception.
    When the exception occurs,
    if possible, we read a meta-comment.
    </p>
    <h2>Technique: Error Tokens</h2>
    <p><b>Error tokens</b> are a specialized use of the Ruby Slippers.
    The application for this parser is "linting" --
    checking that the comments follow conventions.
    As such, the main product of the parser is not the parse --
    it is the list of errors gathered along the way.
    So stopping the parser at the first error does not make sense.
    </p>
    <p>
    What is desirable is to treat all inputs as valid,
    so that the parsing always runs to the end of input,
    in the process producing a list of the errors.
    To do this, we want to set up the parser so that it reads
    special "error tokens" whenever it encounters a reportable error.
    </p>
    <p>This is perfect for the Ruby Slippers.
    If an "exception" occurs,
    as above described for meta-comments,
    but no meta-comment is available,
    we treat it as a second level exception.
    </p>
    <p>When would no meta-comment be available?
    There are two cases:
    <ul><li>The line read is a comment,
    but it does not start at column 1.
    </li>
    <li>The line read is a blank line (all whitespace).
    </li>
    </ul>
    <p>On the second exception level, the current line
    will be read as either a <tt>&lt;BlankLine&gt;</tt>,
    or a <tt>&lt;BadComment&gt;</tt>.
    We know that every line must lex as either a
    <tt>&lt;BlankLine&gt;</tt>
    or a <tt>&lt;BadComment&gt;</tt> because our comment linter
    is called as a combinator,
    and the parent Marpa parser guarantees this.
    </p>
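    <p>
    The two exception levels amount to a fallback chain that never fails,
    so the linter can always run to the end of its input.
    Here is a minimal Python sketch of that chain.
    The function names and the reduction of "structural" to a set of
    expected columns are assumptions made for the illustration;
    the real linter decides structure with the grammar above.
    </p>

```python
def classify(line, structural_columns):
    """Pick a token for one line, falling back level by level."""
    stripped = line.lstrip(" ")
    column = len(line) - len(stripped)
    is_comment = stripped.startswith("::")
    if is_comment and column in structural_columns:
        return "StructuralComment"      # the default reading
    if is_comment and column == 0:
        return "MetaComment"            # first-level exception
    if stripped.strip() == "":
        return "BlankLine"              # second-level exception
    if is_comment:
        return "BadComment"             # second-level exception
    # The caller is assumed to pass only comment lines and blank lines.
    raise ValueError("not a comment or blank line")


def lint(lines, structural_columns):
    """Read every line and return (line number, error token) pairs."""
    errors = []
    for number, line in enumerate(lines, start=1):
        token = classify(line, structural_columns)
        if token in ("BlankLine", "BadComment"):
            errors.append((number, token))
    return errors
```

    <p>
    The result is a list of errors rather than a parse,
    and one bad line does not stop later lines from being checked.
    </p>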
    <h2>Technique: Ambiguity</h2>
    <p>Marpa allows ambiguity,
    which could have been exploited as a technique.
    For example, in a simpler BNF than the one we used above,
    it might be ambiguous whether a meta-comment belongs to an <tt>&lt;InterPart&gt;</tt>
    which immediately precedes it;
    or to a <tt>&lt;PrePart&gt;</tt> which immediately follows it.
    We could solve the dilemma by noting that it does not matter:
    All we care about is spotting bad comments and blank lines,
    so that picking one of two ambiguous parses at random will work fine.
    </p>
    <p>
    But ambiguity sometimes brings efficiency problems,
    and keeping the grammar unambiguous can be a good way of avoiding them.<a id="footnote-10-ref" href="#footnote-10">[10]</a>
    Also, requiring the grammar to be unambiguous allows
    an additional check that is useful in the development phase.
    In our code we test each parse for ambiguity.
    If we find one, we know that <tt>hoonlint</tt> has a coding error.
    </p>
    <p>
    Keeping the parser unambiguous makes the BNF
    less elegant than it could be.
    To avoid ambiguity,
    we introduced extra symbols;
    introduced extra rules;
    and restricted the use of ambiguous tokens.
    </p>
    <p>Recall that I am using the term "ambiguous" in the strict technical
    sense that it has in parsing theory, so that a parser is only ambiguous
    if it can produce two valid parses for one string.
    An unambiguous parser
    can allow non-determinism and
    can have ambiguous tokens.
    In fact, our example grammar does both of these things,
    but is nonetheless unambiguous.
    </p>
    <h3>Extra symbols</h3>
    <p>One example of an extra symbol introduced to make this parser
    unambiguous is <tt>&lt;ProperPreComponent&gt;</tt>.
    <tt>&lt;ProperPreComponent&gt;</tt>
    is used to ensure that a
    <tt>&lt;PrePart&gt;</tt>
    never begins with a meta-comment.<a id="footnote-11-ref" href="#footnote-11">[11]</a>
    </p>
    <p>The BNF requires that the first line of a
    <tt>&lt;PrePart&gt;</tt>
    must be a
    <tt>&lt;ProperPreComponent&gt;</tt>.
    This means that, if a
    <tt>&lt;MetaComment&gt;</tt> is found
    at the boundary between an
    <tt>&lt;InterPart&gt;</tt>
    and a
    <tt>&lt;PrePart&gt;</tt>,
    it cannot be the first line of the
    <tt>&lt;PrePart&gt;</tt>
    and so must be the last line of the
    <tt>&lt;InterPart&gt;</tt>.
    </p>
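    <p>
    The counting argument in footnote 11 can be made concrete.
    The Python sketch below simply enumerates where the dividing line
    between the inter-part and the pre-part could fall when <tt>n</tt>
    meta-comments sit between them.
    The function and its flag are hypothetical, and the sketch counts
    possibilities rather than running a parser.
    </p>

```python
def dividing_lines(n_meta, pre_part_may_start_with_meta):
    """List (metas ending the inter-part, metas starting the pre-part) per parse."""
    splits = []
    for in_pre in range(n_meta + 1):
        if in_pre and not pre_part_may_start_with_meta:
            continue  # ruled out: the pre-part must open with a pre-comment
        splits.append((n_meta - in_pre, in_pre))
    return splits
```

    <p>
    With the flag set, 3 meta-comments give 4 possible dividing lines.
    With the flag cleared, which is what the grammar above enforces,
    only one remains: all the meta-comments belong to the inter-part.
    </p>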
    <h3>Extra rules</h3>
    <p>In our informal explanation of the comment conventions,
    we stated that an inter-part is a sequence, each element of
    which is an inter-comment or a staircase.
    While BNF that directly implemented this rule would be correct,
    it would also be highly ambiguous:
    If an inter-comment occurs before a tread or an upper riser line,
    it could also be parsed as part of the upper riser.
    </p>
    <p>To eliminate the ambiguity,
    we stipulate that if a comment <b>can</b> be parsed as part of a staircase,
    then it <b>must</b> be parsed as part of a staircase.
    This stipulation still leaves the grammar non-deterministic --
    we may not know if our comment could be part of a staircase until
    many lines later.
    </p>
    <p>With our stipulation we know that, if an
    <tt>&lt;InterComponent&gt;</tt>
    contains
    a staircase, then that staircase must come before any of the inter-comments.
    In an <tt>&lt;InterComponent&gt;</tt>
    both staircases and inter-comments are optional, so the
    unambiguous representation of
    <tt>&lt;InterComponent&gt;</tt>
    is
    <pre><tt>
    InterComponent ::= Staircases
    InterComponent ::= Staircases InterComments
    InterComponent ::= InterComments
    </tt></pre>
    Notice that, although
    both staircases and inter-comments are optional,
    we do not include the case where both are omitted.
    This is because we insist that an
    <tt>&lt;InterComponent&gt;</tt>
    contain at least one line.
    </p>
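    <p>
    One way to check that the three rules cover exactly the intended cases
    is to restate them as a regular expression over a one-letter encoding.
    In this Python sketch, which is only an illustration,
    <tt>S</tt> stands for one whole staircase and <tt>I</tt> for one
    inter-comment; the encoding and the function name are made up here.
    </p>

```python
import re

# Staircases, Staircases InterComments, or InterComments:
# one or more S then any number of I, or else one or more I.
INTER_COMPONENT = re.compile(r"S+I*|I+")


def is_inter_component(elements):
    """True if the encoded string is a valid single InterComponent."""
    return INTER_COMPONENT.fullmatch(elements) is not None
```

    <p>
    The empty string is rejected, because an
    <tt>&lt;InterComponent&gt;</tt> must contain at least one line,
    and so is any string in which an <tt>I</tt> comes before an <tt>S</tt>.
    </p>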
    <h3>Ambiguous tokens</h3>
    <p>Our parser is not ambiguous, but
    it <b>does</b> allow ambiguous tokens.
    For example, a comment with inter-comment alignment
    could be either an
    <tt>&lt;InterComment&gt;</tt>
    or an
    <tt>&lt;UpperRiser&gt;</tt>;
    and our lexer returns both.
    The parser remains unambiguous, however, because
    only one of these two tokens will wind up in the
    final parse.
    </p>
    <p>Call the set of tokens returned
    by our lexer for a single line,
    a "token set".
    If the token set contains more than one token,
    the tokenization is ambiguous for that line.
    If the token set contains only one token,
    the token set is called a "singleton",
    and tokenization is unambiguous for that line.
    </p>
    <p>
    To keep
    this parser unambiguous, we restrict the
    ambiguity at the lexer level.
    For example,
    our lexer is set up so
    that a meta-comment is never one of the alternatives
    in a lexical ambiguity.
    If a token set contains a
    <tt>&lt;MetaComment&gt;</tt>,
    that token set must be a singleton.
    The Ruby Slippers are used to enforce this.<a id="footnote-12-ref" href="#footnote-12">[12]</a>
    Similarly, the Ruby Slippers are used to guarantee that
    any set of tokens containing either
    a <tt>&lt;BadComment&gt;</tt>
    or a
    <tt>&lt;BlankLine&gt;</tt> is a singleton.
    </p>
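    <p>
    The restriction on token sets is easy to state as a check.
    The Python sketch below is an illustration only:
    the token names follow the grammar above, but the two helper
    functions are hypothetical and are not taken from the
    <tt>hoonlint</tt> code.
    </p>

```python
# Tokens that may only ever appear alone in a token set.
SINGLETON_ONLY = {"MetaComment", "BadComment", "BlankLine"}


def is_ambiguous(token_set):
    """Tokenization of a line is ambiguous if it offers more than one token."""
    return len(token_set) > 1


def respects_restriction(token_set):
    """True unless a singleton-only token shares its set with another token."""
    return not (set(token_set) & SINGLETON_ONLY) or len(token_set) == 1
```

    <p>
    The set containing both <tt>InterComment</tt> and <tt>UpperRiser</tt>
    is ambiguous but allowed.
    A set containing <tt>MetaComment</tt> together with anything else
    would violate the restriction.
    </p>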
    <h2>Code</h2>
    <p>This post did not walk the reader through the code.
    Instead, we talked in terms of strategy.
    <a href="https://github.com/jeffreykegler/Ocean-of-Awareness-blog/tree/gh-pages/code/vgap">
    The code is available on Github</a>
    in unit test form.
    For those who want to see the comment-linter combinator in context,
    a version of the code embedded in <tt>hoonlint</tt>
    is also on Github.<a id="footnote-13-ref" href="#footnote-13">[13]</a>
    </p>
    <h2>Comments on this blog post, etc.</h2>
    <p>
      To learn about Marpa,
      my Earley/Leo-based parser,
      there is the
      <a href="http://savage.net.au/Marpa.html">semi-official web site, maintained by Ron Savage</a>.
      The official, but more limited, Marpa website
      <a href="http://jeffreykegler.github.io/Marpa-web-site/">is my personal one</a>.
      Comments on this post can be made in
      <a href="http://groups.google.com/group/marpa-parser">
        Marpa's Google group</a>,
      or on our IRC channel: #marpa at freenode.net.
    </p>
    <h2>Footnotes</h2>
<p id="footnote-1"><b>1.</b>
In their present form, urbits run on top of Unix and UDP.
 <a href="#footnote-1-ref">&#8617;</a></p>
<p id="footnote-2"><b>2.</b>
    Garbage collection and arbitrary precision may seem too high-level
    for something considered a "machine language",
    but our concepts evolve.
    The earliest machine languages required programmers to
    write their own memory caching logic
    and to create their own floating
    point representations,
    both things we now regard as much too low-level
    to deal with even at the lowest software level.
 <a href="#footnote-2-ref">&#8617;</a></p>
<p id="footnote-3"><b>3.</b>
    This post attempts to follow standard Hoon terminology, but
    for some details of Hoon's whitespace conventions,
    there is no settled terminology,
    and I have invented terms as necessary.
    The term "pre-comment" is one of those inventions.
 <a href="#footnote-3-ref">&#8617;</a></p>
<p id="footnote-4"><b>4.</b>
    For a brief survey of this literature,
    see the entries from 1990 to 1996
    in my <a href="https://jeffreykegler.github.io/personal/timeline_v3">
    "timeline" of parsing history</a>.
 <a href="#footnote-4-ref">&#8617;</a></p>
<p id="footnote-5"><b>5.</b>
<a
    href="https://github.com/ghc/ghc/blob/master/compiler/parser/Parser.y">This
    is the LALR grammar for GHC</a>, from GHC's Github mirror.
 <a href="#footnote-5-ref">&#8617;</a></p>
<p id="footnote-6"><b>6.</b>
    This is simplified.
    There were provisions for line continuation, etc.
    But, nonetheless, the lexers for these languages worked in
    terms of lines, and had no true concept of a "block".
 <a href="#footnote-6-ref">&#8617;</a></p>
<p id="footnote-7"><b>7.</b>
    An example of a requirement for infinite lookahead
    is the sentence "The horse raced past the barn fell".
    Yes, this sentence is not, in fact, infinitely long,
    but the subclause "raced past the barn" could be anything,
    and therefore could be arbitrarily long.
    In isolation, this example sentence may seem unnatural,
    a contrived "garden path".
    But if you imagine the sentence as an answer to the question, "Which horse fell?",
    expectations are set so that the sentence is quite reasonable.
 <a href="#footnote-7-ref">&#8617;</a></p>
<p id="footnote-8"><b>8.</b>
    See my blog post <a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2018/08/rntz.html">
    "A Haskell challenge"</a>.
 <a href="#footnote-8-ref">&#8617;</a></p>
<p id="footnote-9"><b>9.</b>
        To find out more about Ruby Slippers parsing see the Marpa FAQ,
        <a href="http://savage.net.au/Perl-modules/html/marpa.faq/faq.html#q122">
          questions 122</a>
        and
        <a href="http://savage.net.au/Perl-modules/html/marpa.faq/faq.html#q123">
          123</a>;
        my
        <a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/metapages/annotated.html#PARSE_HTML">
          blog series on parsing HTML</a>;
	  my recent blog post
	  <a
	  href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2018/05/combinator2.html">
	  "Marpa and combinator parsing 2"</a>;
	  and my much older blog post
        <a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2011/11/marpa-and-the-ruby-slippers.html">
          "Marpa and the Ruby Slippers"</a>.
 <a href="#footnote-9-ref">&#8617;</a></p>
<p id="footnote-10"><b>10.</b>
    This, by the way,
    is where I believe parsing theory went wrong,
    beginning in the 1960's.
    In an understandable search for efficiency,
    mainstream parsing theory totally excluded not just ambiguity,
    but non-determinism as well.
    These draconian restrictions limited the
    search for practical parsers to a subset of techniques
    so weak that they cannot
    even duplicate human parsing capabilities.
    This had the bizarre effect of committing
    parsing theory to a form of
    "human exceptionalism" --
    a belief that human beings have a special ability to
    parse that computers cannot emulate.
    For more on this story,
    see my <a href="https://jeffreykegler.github.io/personal/timeline_v3">
    "timeline" of parsing history</a>.
 <a href="#footnote-10-ref">&#8617;</a></p>
<p id="footnote-11"><b>11.</b>
    This example illustrates the efficiency considerations
    involved in the decision to tolerate,
    or to exclude,
    ambiguity.
    If <tt>n</tt> meta-comments occur between an
    <tt>&lt;InterPart&gt;</tt>
    and a <tt>&lt;PrePart&gt;</tt>,
    the dividing line is arbitrary,
    so that there are <tt>n+1</tt> parses.
    This will, in theory, make the processing time quadratic.
    And, in fact, long sequences of meta-comments might occur
    between the inter- and pre-comments,
    so the inefficiency might be real.
 <a href="#footnote-11-ref">&#8617;</a></p>
<p id="footnote-12"><b>12.</b>
    Inter-comments and
    comments that are part of upper risers may start at column 1,
    so that, without special precautions in the lexer,
    an ambiguity between a structural comment
    and a meta-comment is entirely
    possible.
 <a href="#footnote-12-ref">&#8617;</a></p>
<p id="footnote-13"><b>13.</b>
    For the <tt>hoonlint</tt>-embedded form,
    the Marpa grammar is
    <a href="https://github.com/jeffreykegler/yahc/blob/714157124b46492e13968c786e400276017a3b85/Lint/Policy/Test/Whitespace.pm#L19">
    here</a>
    and the code is
    <a href="https://github.com/jeffreykegler/yahc/blob/714157124b46492e13968c786e400276017a3b85/Lint/Policy/Test/Whitespace.pm#L341">
    here</a>.
    These are snapshots -- permalinks.
    The application is under development,
    and probably will change considerably.
    Documentation is absent
    and testing is minimal,
    so that this pre-alpha embedded form of the code will mainly be useful
    for those who want to take a quick glance at the
    comment linter in context.
 <a href="#footnote-13-ref">&#8617;</a></p>
  </body>
</html>
<br />
<p>posted at: 13:03 |
<a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2019/06/vgap.html">direct link to this entry</a>
</p>
<div style="color:#38B0C0;padding:1px;text-align:center;">
&sect;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&sect;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&sect;
</div>
<h3>Sun, 31 Mar 2019</h3>
<br />
<center><a name="methodology"> <h2>Sherlock Holmes and the Case of the Missing Parsing Solution</h2> </a>
</center>
<html>
  <head>
  </head>
  <body>
    <!--
      marpa_r2_html_fmt --no-added-tag-comment --no-ws-ok-after-start-tag
      -->
    <blockquote>Always approach a case with an absolutely blank mind.
      It is always an advantage.
      Form no theories, just simply observe and draw inferences from your observations.
      &mdash;
        Sherlock Holmes, quoted in "The Adventure of the Cardboard Box".
        </blockquote>
    <blockquote>It is a capital mistake to theorize before one has data.
      &mdash;
        Holmes, in "A Scandal in Bohemia".
    </blockquote>
    <blockquote>I make a point of never having any prejudices, and of following docilely wherever fact may lead me.
      &mdash;
      Holmes, in "The Reigate Puzzle".
    </blockquote>
    <blockquote>When you have eliminated the impossible, whatever remains, no matter how improbable, must be the truth.
      &mdash;
      Holmes, in "The Sign of Four".
    </blockquote>
    <blockquote>
      In imagination there
      exists the perfect
      mystery story.
      Such a story presents
      the essential clues, and compels us to form our own
      theory of the case.
      If we
      follow the plot carefully, we arrive at the complete
      solution for ourselves just before the author's disclosure
      at the end of the book. The solution itself, contrary to
      those of inferior mysteries, does not disappoint us;
      moreover, it appears at the very moment we expect it.
      Can we liken the reader of such a book to the scientists,
      who throughout successive generations continue to seek
      solutions of the mysteries in the book of nature? The
      comparison is false and will have to be abandoned later,
      but it has a modicum of justification which may be
      extended and modified to make it more appropriate to
      the endeavour of science to solve the mystery of the
      universe.
      &mdash;
      Albert Einstein
        and Leopold Infeld. <a id="footnote-1-ref" href="#footnote-1">[1]</a>
    </blockquote>
    <h2>The Sherlock Holmes approach</h2>
    <p>My
    <a href="https://jeffreykegler.github.io/personal/timeline_v3">
    timeline history of parsing theory</a>
      is my most popular writing, but
      it is not without its critics.
      Many of them accuse the timeline of lack of objectivity or of bias.
    </p>
    <p>
      Einstein assumed his reader's idea of methods of proper investigation,
      in science as elsewhere,
      would be similar to those of Conan Doyle's Sherlock Holmes.
      I will follow Einstein's lead in starting there.
    </p>
    <p>
      The deductions recorded in the Holmes canon
      often involve
      <b>a lot</b>
      of theorizing.
      To make it a matter of significance what the dogs in "Silver Blaze" did in the night,
      Holmes needs a theory of canine behavior,
      and Holmes' theory sometimes outpaces its pack of facts by a considerable distance.
      Is it really true that only dangerous people own
      dangerous dogs?<a id="footnote-2-ref" href="#footnote-2">[2]</a>
    </p>
    <p>
      Holmes's methods, at least as stated in the Conan Doyle stories,
      are incapable of solving anything
      but the fictional problems he encounters.
      In real life, a "blank mind" can observe nothing.
      There is no "data" without theory, just white noise.
      Every "fact" gathered relies on many prejudgements about what is
      relevant and what is not.
      And you certainly cannot characterize anything as "impossible",
      unless you have, in advance, a theory about what is possible.
    </p>
    <h2>The Einstein approach</h2>
    <p>Einstein, in his popular account
    of the evolution of physics,
      finds the Doyle stories "admirable"<a id="footnote-3-ref" href="#footnote-3">[3]</a>.
      But to solve real-life mysteries, more is needed.
      Einstein begins his description of his methods at the start
      of his Chapter II:
    </p><blockquote>
      The following pages contain a dull report of
      some very simple experiments.
      The account will be boring
      not only because the description of experiments is uninteresting
      in comparison with their actual performance,
      but also because the meaning of the experiments does
      not become apparent until theory makes it so. Our
      purpose is to furnish a striking example of the role of
      theory in physics.
      <a id="footnote-4-ref" href="#footnote-4">[4]</a>
    </blockquote>
    <p>Einstein follows with a series of the kind of experiments
      that are performed in high school physics classes.
      One might imagine these experiments allowing an observer
      to deduce the basics of electromagnetism
      using materials and techniques available for centuries.
    </p>
    <p>But, and this is Einstein's point,
      this is not how it happened.
      The theory came
      <b>first</b>,
      and the experiments were devised afterwards.
    </p>
    <blockquote>
      In the first pages
      of our book we compared the role
      of an investigator
      to that of a detective who, after
      gathering the requisite facts, finds the right solution
      by pure thinking. In one essential this comparison must
      be regarded as highly superficial. Both in life and in
      detective novels the crime is given. The detective must
      look for letters, fingerprints, bullets, guns, but at least
      he knows that a murder has been committed. This is
      not so for a scientist. It should not be difficult to
      imagine someone who knows absolutely nothing about
      electricity, since all the ancients lived happily enough
      without any knowledge of it. Let this man be given
      metal, gold foil, bottles, hard-rubber rod, flannel, in
      short, all the material required for performing our
      three experiments. He may be a very
      cultured person,
      but he will probably put wine into the bottles, use the
      flannel for cleaning, and never once entertain the idea
      of doing the things we have described.
      For the detective
      the crime is given, the problem formulated: who
      killed Cock Robin?
      The scientist must, at least in part,
      commit his own crime, as well as carry out the investigation.
      Moreover, his task is not to explain just one
      case, but all phenomena which have happened
      or may
      still happen. &mdash; Einstein and Infeld <a id="footnote-5-ref" href="#footnote-5">[5]</a>
    </blockquote>
    <h2>Committing our own crime</h2>
    <p>If then,
      we must commit the crime of theorizing before the facts,
      where does our theory come from?
    </p>
    <blockquote>
    Science is not just a collection of laws,
    a catalogue of unrelated facts.
    It is a creation of the human mind,
    with its freely invented ideas and concepts.
    Physical theories try to form a picture of reality
    and to establish its connection
    with the wide world of sense impressions.
    Thus the only justification for our mental structures
    is whether and in what way our theories form such
    a link. &mdash; Einstein and Infeld <a id="footnote-6-ref" href="#footnote-6">[6]</a>
    </blockquote>
    <blockquote>
      In the case of planets moving around the sun
      it is found that the system of mechanics works
      splendidly.
      Nevertheless we can well imagine that another system,
      based on different assumptions,
      might work just as well.
      <br>
      Physical concepts are free creations
      of the human mind, and are not,
      however it may seem,
      uniquely determined by the external world.
      In our endeavor to understand reality
      we are somewhat like a man trying
      to understand the mechanism of a closed watch.
      He sees the face and the moving hands,
      even hears its ticking,
      but he has no way of opening the case.
      If he is ingenious
      he may form some picture of a mechanism
      which could be responsible
      for all the things he observes,
      but he may never be quite sure
      his picture is the only one
      which could explain his observations.
      He will never be able
      to compare his picture with the real mechanism
      and he cannot even imagine the possibility
      or the meaning of such a comparison.
      But he certainly believes that,
      as his knowledge increases,
      his picture of reality will become
      simpler and simpler
      and will explain a wider and wider range
      of his sensuous impressions.
      He may also believe in the existence
      of the ideal limit of knowledge
      and that it is approached
      by the human mind.
      He may call this ideal limit
      the objective truth. -- Einstein and Infeld <a id="footnote-7-ref" href="#footnote-7">[7]</a>
    </blockquote>
    <p>It may sound as if Einstein believed that the soundness of
    our theories is a matter of faith.
    In fact, Einstein was quite comfortable with putting it
    exactly that way:
    <blockquote>However, it must be admitted
    that our knowledge of these laws is only imperfect
    and fragmentary, so that,
    actually the belief
    in the existence of basic all-embracing laws
    in Nature also rests on a sort of faith.
    All the same this faith has been largely
    justified so far by the success of
    scientific research. &mdash; Einstein <a id="footnote-8-ref" href="#footnote-8">[8]</a>
    </blockquote>
    <blockquote>
    I believe that every true theorist
    is a kind of tamed metaphysicist,
    no matter how pure a "positivist" he may
    fancy himself.
    The metaphysicist believes that the logically
    simple is also the real.
    The tamed metaphysicist believes
    that not all that is logically simple
    is embodied in experienced reality,
    but that the totality of all sensory experience
    can be "comprehended" on the basis of a
    conceptual system built on premises of great
    simplicity.
    The skeptic will say this is a "miracle creed."
    Admittedly so, but it is a miracle creed
    which has been borne out to an amazing extent by
    the development of science. &mdash; Einstein <a id="footnote-9-ref" href="#footnote-9">[9]</a>
    </blockquote>
    <blockquote>
    The liberty of choice, however,
    is of a special kind;
    it is not in any way similar to the liberty of a
    writer of fiction.
    Rather, it is similar to that of a man engaged
    in solving a well-designed puzzle.
    He may, it is true, propose
    any word as the solution;
    but, there is only <i>one</i>
    word which really solves the puzzle in all its
    parts.
    It is a matter of faith that nature
    &mdash;
    as she is perceptible to our five senses
    &mdash;
    takes the character of such a
    well-formulated puzzle.
    The successes reaped up to now
    by science do,
    it is true,
    give a certain encouragement for this faith. --
    Einstein <a id="footnote-10-ref" href="#footnote-10">[10]</a>
    </blockquote>
    <p>The puzzle metaphor of the last quote is revealing.
    Einstein believes there is a single truth,
    but that we will never know what it is &mdash;
    even its existence can only be taken as a matter of faith.
    Existence is a crossword puzzle whose answer we will never
    know.
    Even the existence of an answer must be taken as
    a matter of faith.
    </p>
    <blockquote>
    The very fact that the totality of our sense experience
    is such that by means of thinking
    (operations with concepts,
    and the creation and use of definite functional relations
    between them,
    and the coordination of sense experiences to these concepts)
    it can be put in order,
    this fact is one which leaves us in awe,
    but which we shall never understand.
    One may say that
    "the eternal mystery of the world
    is its comprehensibility". &mdash; Einstein <a id="footnote-11-ref" href="#footnote-11">[11]</a>
    </blockquote>
    <blockquote>
    In my opinion,
    nothing can be said <i>a priori</i>
    concerning the manner in which the concepts
    are to be formed and connected,
    and how we are to coordinate them to sense experiences.
    In guiding us in the creation of such an order
    of sense experiences,
    success alone is the determining factor.
    All that is necessary is to fix a set of rules,
    since without such rules the acquisition
    of knowledge in the desired sense would be impossible.
    One may compare these rules with the rules of a game
    in which,
    while the rules themselves are arbitrary,
    it is their rigidity alone which
    makes the game possible.
    However, the fixation will never be final.
    It will have validity only for a special field
    of application. &mdash; Einstein <a id="footnote-12-ref" href="#footnote-12">[12]</a>
    </blockquote>
    <blockquote>
    There are no eternal theories in science.
    It always happens that some of the facts
    predicted by a theory
    are disproved by experiment.
    Every theory has its period of
    gradual development and triumph,
    after which it may experience a
    rapid decline. &mdash; Einstein and Infeld
    <a id="footnote-13-ref" href="#footnote-13">[13]</a>
    </blockquote>
    </p>
    <blockquote>
    In our great mystery story there are no problems
    wholly solved and settled for all time. &mdash; Einstein and Infeld
    <a id="footnote-14-ref" href="#footnote-14">[14]</a>
    </blockquote>
    <blockquote>
      This great mystery story
      is still
      unsolved.
      We
      cannot
      even be sure that it has a final solution. &mdash;
      Einstein and Infeld <a id="footnote-15-ref" href="#footnote-15">[15]</a>
    </blockquote>
    <h2>Choosing a "highway"</h2>
    In most of the above,
    Einstein is focusing on his work in a "hard" science: physics.
    Are his methods relevant to "softer" fields of study?
    Einstein thinks so:
    <blockquote>
      The whole of science is nothing
      more than a refinement of everyday thinking.
      It is for this reason that the critical thinking
      of the physicist cannot possibly be restricted to
      the examination of the concepts of his own
      specific field.
      He cannot proceed without considering critically
      a much more difficult problem,
      the problem of analyzing the nature of everyday
      thinking. &mdash; Einstein
      <a id="footnote-16-ref" href="#footnote-16">[16]</a>
    </blockquote>
    Einstein's collaboration with Infeld was, like the "Timeline",
    a description of the evolution of ideas,
    and in the Einstein&ndash;Infeld book they describe their approach:
    <blockquote>
      Through the maze of
      facts and concepts we had to choose some highway
      which seemed to us most characteristic and significant.
      Facts and theories not reached by this road had to be
      omitted. We were forced, by our general aim, to make
      a definite choice of facts and ideas. The importance of a
      problem should not be judged by the number of pages
      devoted to it. Some essential lines of thought have been
      left out, not because they seemed to us unimportant,
      but because they do not lie along the road we have
      chosen. &mdash; Einstein and Infeld <a id="footnote-17-ref" href="#footnote-17">[17]</a>
    </blockquote>
    <h2>Truth and success</h2>
    <p>Einstein says that objective truth, while
    it exists, is not to be attained in the hard sciences,
    so it is not likely he thought that a historical
    account could outdo physics in this respect.
    For Einstein, as quoted above,
    "success alone is the determining factor".
    </p>
    <p>Success, of course, varies with what the audience
    for a theory wants.
    In a very real sense,
    I consider a theory that can predict the
    stock market more successful than
    one which can predict perturbations of planetary orbits
    invisible to the naked eye.
    But this is not a reasonable expectation when applied
    to the theory of general relativity.
    </p>
    Among the expectations reasonable for a timeline of parsing
    might be these:
    <ul>
    <li>It helps choose the right parsing algorithm for practical
    applications.
    <li>It helps a reader to understand articles in the
    literature of parsing.
    <li>It helps guide future research.
    <li>It predicts the outcome of future research.
    </ul>
    <p>When I wrote the first version of <cite>Timeline</cite>,
    its goal was none of these.
    Instead I intended it to explain the sources behind my own
    research in the Earley/Leo lineage.
    </p>
    <p>
    With such a criterion of "success",
    I wondered if <cite>Timeline</cite> would have an audience
    much larger than one,
    and was quite surprised when it started getting thousands of
    web hits a day.
      The large audience <cite>Timeline 1.0</cite> drew
      was a sign that there is a large appetite
      out there for
      accounts of parsing theory,
      an appetite so strong that anything resembling
      a coherent account
      was quickly devoured.
    </p>
    <p>In response to the unexpectedly large audience,
    later versions of the <cite>Timeline</cite> widened
    their focus.
      <cite>Timeline 3.1</cite>
      was broadened to give good coverage
      of mainstream parsing practice
      including a lot of new material and original analysis.
      This brought in a lot of material on topics
      which had little or no influence on my Earley/Leo work.
      The parsing of arithmetic expressions,
      for example,
      is trivial in the Earley/Leo context,
      and before my research for <cite>Timeline 3.0</cite>
      I had devoted little attention to
      approaches that I felt amounted to
      needlessly doing things the hard way.
      But arithmetic expressions are at the borderline of power
      for traditional approaches
      and parsing arithmetic expressions was a central motivation
      for the authors of the algorithms that have so far
      been most influential on mainstream parsing.
      So in
      <cite>Timeline 3.1</cite>
      arithmetic expressions became a recurring theme,
      being brought back for detailed examination time and time again.
    </p>
    <h2>Is the "Timeline" false?</h2>
    <p>
      Is the "Timeline" false?
      The answer is yes, in three increasingly practical senses.
    </p>
    <p>As Einstein makes clear,
    every theory that is about reality
    will eventually be proved false.
    The best a theory can hope for is the fate of
    Newton's physics &mdash;
    to be shown to be a subcase of a larger theory.
    </p>
    <p>In a more specific sense,
    the truth of any theory of parsing history depends
    on its degree of success in explaining the facts.
    This means that the truth of the "Timeline" depends on which facts
    you require it to explain.
    If arbitrary choices of facts to be explained are allowed,
    the "Timeline" will certainly be seen to be false.
    <p>But can the "Timeline" be shown to be false
    for criteria of success which are non-arbitrary?
    In the next section, I will describe four non-arbitrary
    criteria of success,
    all of which are of practical interest,
    and for all of which the "Timeline" is false.
    </p>
    <h2>The Forever Five</h2>
    <p>"Success" depends a lot on judgement,
    but my studies have led me to conclude that all but five algorithms
    are "unsuccessful" in the sense that,
    for everything that they do,
    at least one other algorithm does it better in practice.
    But this means there are five algorithms which <b>do</b> solve
    some practical problems
    better than any other algorithm,
    including each of the other four.
    I call these the "forever five" because,
    if I am correct,
    these algorithms will be of permanent interest.
    </p>
    <p>
      My "Forever Five" are regular expressions, recursive descent, PEG, Earley/Leo and Sakai's
      algorithm.<a id="footnote-18-ref" href="#footnote-18">[18]</a>
      Earley/Leo is the focus of my
      <cite>Timeline</cite>, so that an effective
      critique of my "Timeline"
      could be a parsing historiography centering on any of the other four.
    </p>
    <p>For example, of the five, regular expressions are the most limited in parsing power.
      On the other hand, most of the parsing problems you encounter in practice
      are handled quite nicely by regular expressions.<a id="footnote-19-ref" href="#footnote-19">[19]</a>
      Good implementations of regular expressions are widely available.
      And, for speed, they are literally unbeatable -- if a parsing problem is a
      regular expression, no other algorithm will beat a dedicated regular expression
      engine for parsing it.
    </p>
    <p>Could a
      <cite>Timeline</cite>
      competitor be written which
      centered on regular expressions?
      Certainly.
      And if immediate usefulness to the average programmer is the criterion
      (and it is a very good criterion),
      then the
      <cite>Regular Expressions Timeline</cite>
      would certainly give
      my timeline a run for the money.
    </p>
    <h2>What about a PEG Timeline?</h2>
    <p>
      The immediate impetus for this article was
      <a href="https://groups.google.com/d/msg/marpa-parser/8EEq92TjR4E/dIzCnsITBQAJ">a very collegial inquiry</a>
      from Nicolas Laurent, a researcher whose main interest is PEG.
      Could a
      <cite>PEG Timeline</cite>
      challenge mine?
      Again, very certainly.
    </p>
    <p>Because there are at least some
      problems for which PEG is superior to everything else,
      my own Earley/Leo approach included.
      As one example, PEG
      could be a more powerful alternative to regular expressions.
    </p>
    <p>That does not mean that I might not come back with
    a counter-critique.
    Among the questions that I might ask:
    <ul>
    <li>
      Is the PEG algorithm being proposed a future,
      or does it have an implementation?
    </li>
    <li>What claims of speed and time complexity are made?
      Is there a way of determining in advance of runtime how fast
      your algorithm will run?
      Or is the expectation of practical speed
      on an "implement and pray" basis?
    </li>
    <li>Does the proposed PEG algorithm match human parsing
      capabilities?
      If not, it is a claim for human exceptionalism,
      of a kind not usually accepted in modern computer science.
      How is exceptionalism justified in this case?
    </li>
    </ul>
    <blockquote>
    The search for truth is more precious
    than its possession. -- Einstein, quoting Lessing<a id="footnote-20-ref" href="#footnote-20">[20]</a>
    </blockquote>
    <h2>Comments, etc.</h2>
    <p>
      The background material for this post is in my
      <a href="https://jeffreykegler.github.io/personal/timeline_v3">
        Parsing: a timeline 3.0</a>,
      and this post may be considered a supplement to "Timeline".
      To learn about Marpa,
      my Earley/Leo-based parsing project,
      there is the
      <a href="http://savage.net.au/Marpa.html">semi-official web site, maintained by Ron Savage</a>.
      The official, but more limited, Marpa website
      <a href="http://jeffreykegler.github.io/Marpa-web-site/">is my personal one</a>.
      Comments on this post can be made in
      <a href="http://groups.google.com/group/marpa-parser">
        Marpa's Google group</a>,
      or on our IRC channel: #marpa at freenode.net.
    </p>
    <h2>Footnotes</h2>
<p id="footnote-1"><b>1.</b>
      Einstein, Albert and Infeld, Leopold,
        <cite>The Evolution of Physics</cite>,
        Simon and Schuster, 2007, p. 3
 <a href="#footnote-1-ref">&#8617;</a></p>
<p id="footnote-2"><b>2.</b>
        "A dog reflects the family life.
        Whoever saw a frisky dog in a gloomy family, or a sad dog in a happy one?
        Snarling people have snarling dogs, dangerous people have dangerous ones."
        From "The Adventure of the Creeping Man".
 <a href="#footnote-2-ref">&#8617;</a></p>
<p id="footnote-3"><b>3.</b>
      Einstein and Infeld, p. 4.
 <a href="#footnote-3-ref">&#8617;</a></p>
<p id="footnote-4"><b>4.</b>
      Einstein and Infeld, p. 71.
 <a href="#footnote-4-ref">&#8617;</a></p>
<p id="footnote-5"><b>5.</b>
        Einstein and Infeld, p. 78.
 <a href="#footnote-5-ref">&#8617;</a></p>
<p id="footnote-6"><b>6.</b>
    Einstein and Infeld, p. 294.
 <a href="#footnote-6-ref">&#8617;</a></p>
<p id="footnote-7"><b>7.</b>
      Einstein and Infeld, p. 31.
        See also Einstein,
	"On the Method of Theoretical Physics",
        <cite>Ideas and Opinions</cite>,
	Wings Books, New York,
	no publication date, p. 272.
 <a href="#footnote-7-ref">&#8617;</a></p>
<p id="footnote-8"><b>8.</b>
    Dukas and Hoffmann,
    <cite>Albert Einstein: The Human Side</cite>,
    Princeton University Press, 2013,
    pp. 32-33.
 <a href="#footnote-8-ref">&#8617;</a></p>
<p id="footnote-9"><b>9.</b>
    "On the Generalized Theory of Gravitation", in
    <cite>Ideas and Opinions</cite>, p. 342.
 <a href="#footnote-9-ref">&#8617;</a></p>
<p id="footnote-10"><b>10.</b>
    "Physics and Reality", in
    <cite>Ideas and Opinions</cite>, pp. 294-295.
 <a href="#footnote-10-ref">&#8617;</a></p>
<p id="footnote-11"><b>11.</b>
    "Physics and Reality", in
    <cite>Ideas and Opinions</cite>,
    p. 292.
 <a href="#footnote-11-ref">&#8617;</a></p>
<p id="footnote-12"><b>12.</b>
    "Physics and Reality", in
    <cite>Ideas and Opinions</cite>,
    p. 292.
 <a href="#footnote-12-ref">&#8617;</a></p>
<p id="footnote-13"><b>13.</b>
    Einstein and Infeld, p. 75.
 <a href="#footnote-13-ref">&#8617;</a></p>
<p id="footnote-14"><b>14.</b>
    Einstein and Infeld, p. 35.
 <a href="#footnote-14-ref">&#8617;</a></p>
<p id="footnote-15"><b>15.</b>
      Einstein and Infeld, pp. 7-8
 <a href="#footnote-15-ref">&#8617;</a></p>
<p id="footnote-16"><b>16.</b>
	"Physics and Reality",
        <cite>Ideas and Opinions</cite>, p. 290.
 <a href="#footnote-16-ref">&#8617;</a></p>
<p id="footnote-17"><b>17.</b>
        Einstein and Infeld, p. 78.
 <a href="#footnote-17-ref">&#8617;</a></p>
<p id="footnote-18"><b>18.</b>
        Three quibbles:
        Regular expressions do not find structure,
        so pedantically they are recognizers,
        not parsers.
        Recursive descent is a technique for creating a family of algorithms,
        not an algorithm.
        And the algorithm first described by Sakai is more commonly
        called CYK, from the initials of three other researchers who re-discovered
        it over the years.
 <a href="#footnote-18-ref">&#8617;</a></p>
<p id="footnote-19"><b>19.</b>
      A lot of this is because programmers learn to formulate problems in
      ways which avoid complex parsing so that,
      in practice,
      the alternatives are
      using regular expressions or rationalizing away the
      need for parsing.
 <a href="#footnote-19-ref">&#8617;</a></p>
<p id="footnote-20"><b>20.</b>
    "The Fundaments of Theoretical Physics", in
    <cite>Ideas and Opinions</cite>, p. 335.
 <a href="#footnote-20-ref">&#8617;</a></p>
  </body>
</html>
<br />
<p>posted at: 21:31 |
<a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2019/03/methodology.html">direct link to this entry</a>
</p>
<div style="color:#38B0C0;padding:1px;text-align:center;">
&sect;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&sect;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&sect;
</div>
<h3>Sun, 14 Oct 2018</h3>
<br />
<center><a name="timeline_3_1"> <h2>Parsing Timeline 3.1</h2> </a>
</center>
<html>
  <head>
  </head>
  <body style="max-width:850px">
    <!--
      marpa_r2_html_fmt --no-added-tag-comment --no-ws-ok-after-start-tag
      -->
    <p>
    <h2>Announcing Timeline 3.1</h2>
    <p>I have just released
    <a href=
    "https://jeffreykegler.github.io/personal/timeline_v3">
    version 3.1 of my Parsing Timeline</a>.
    It is a painless introduction to
    a fascinating and important story
    which is scattered among
    one of the most
    forbidding literatures in computer science.
    Previous versions of this timeline have been,
    by far,
    the most popular of my writings.
    </p>
    <p>A third of Timeline 3.1 is new,
    added since the 3.0 version.
    Much of the new material is adapted from previous
    blog posts, both old and recent.
    Other material is completely new.
    The sections that are not new with 3.1
    have been carefully reviewed and
    heavily revised.
    </p>
    <h2>Comments, etc.</h2>
    <p>My interest in parsing stems from my 
    own approach to it -- a parser in the Earley/Leo
    lineage named Marpa.
    To learn more about Marpa,
      a good first stop is the
      <a href="http://savage.net.au/Marpa.html">semi-official web site, maintained by Ron Savage</a>.
      The official, but more limited, Marpa website
      <a href="http://jeffreykegler.github.io/Marpa-web-site/">is my personal one</a>.
      Comments on this post can be made in
      <a href="http://groups.google.com/group/marpa-parser">
        Marpa's Google group</a>,
      or on our IRC channel: #marpa at freenode.net.
    </p>
    <!--
    No footnotes in this one !!!
    <h2>Footnotes</h2>
    -->
  </body>
</html>
<br />
<p>posted at: 18:22 |
<a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2018/10/timeline_3_1.html">direct link to this entry</a>
</p>
<div style="color:#38B0C0;padding:1px;text-align:center;">
&sect;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&sect;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&sect;
</div>
<h3>Tue, 02 Oct 2018</h3>
<br />
<center><a name="popularity"> <h2>Measuring language popularity</h2> </a>
</center>
<html>
  <head>
  </head>
  <body style="max-width:850px">
    <!--
      marpa_r2_html_fmt --no-added-tag-comment --no-ws-ok-after-start-tag
      -->
    <h2>Language popularity</h2>
    <p>
      <a href="https://github.com/github/linguist">Github's
        linguist</a>
      is seen as the most trustworthy tool
      for estimating language popularity<a id="footnote-1-ref" href="#footnote-1">[1]</a>,
      in large part because it reports its result as
      the proportion of code in a very large dataset,
      instead of web hits or searches.<a id="footnote-2-ref" href="#footnote-2">[2]</a>
      It is ironic, in this context,
      that
      <tt>linguist</tt>
      avoids looking at the code,
      preferring to use
      metadata -- file name and the vim and shebang lines.
      Scanning the actual code is <tt>linguist</tt>'s last resort.<a id="footnote-3-ref" href="#footnote-3">[3]</a>
    </p>
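    <p>As a rough illustration of a metadata-first strategy,
      here is a minimal Python sketch.
      It is not <tt>linguist</tt>'s code:
      the lookup tables, the function name <tt>classify</tt>,
      and the order in which the three kinds of metadata are consulted
      are all assumptions made for this example.
    </p>

```python
import os
import re

# Hypothetical, tiny lookup tables, invented for this sketch.
EXTENSIONS = {".c": "C", ".h": "C", ".lua": "Lua", ".md": "Markdown", ".pl": "Perl"}
INTERPRETERS = {"perl": "Perl", "lua": "Lua", "python3": "Python", "sh": "Shell"}
VIM_FILETYPES = {"c": "C", "lua": "Lua", "perl": "Perl", "markdown": "Markdown"}


def classify(filename, text):
    """Guess a language from metadata only; return (language, evidence)."""
    lines = text.splitlines()
    # 1. A vim modeline such as "vim: set ft=lua:" near the top or bottom.
    for line in lines[:5] + lines[-5:]:
        found = re.search(r"\bvim:.*\b(?:ft|filetype)=(\w+)", line)
        if found and found.group(1) in VIM_FILETYPES:
            return VIM_FILETYPES[found.group(1)], "modeline"
    # 2. A shebang line, with or without /usr/bin/env.
    if lines and lines[0].startswith("#!"):
        words = lines[0][2:].split()
        if words:
            program = os.path.basename(words[0])
            if program == "env" and len(words) > 1:
                program = words[1]
            if program in INTERPRETERS:
                return INTERPRETERS[program], "shebang"
    # 3. The file name extension.
    extension = os.path.splitext(filename)[1].lower()
    if extension in EXTENSIONS:
        return EXTENSIONS[extension], "extension"
    # Only at this point would the code itself have to be examined.
    return None, "would need to scan the code"
```

    <p>With these assumed tables, a file named <tt>kollos.md</tt>
      is reported as Markdown on the strength of its extension alone,
      no matter how much C or Lua it contains.
    </p>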
    <p>How accurate is this?
      For files that are mostly in a single programming language,
      currently the majority of them,
      <tt>linguist</tt>'s methods are probably very accurate.
    </p>
    <p>But literate programming often requires mixing languages.
      It is perhaps an extreme example,
      but much of the code used in this blog post
      comes from a Markdown file, which contains both C and Lua.
      This code is "untangled" from the Lua by ad-hoc scripts<a id="footnote-4-ref" href="#footnote-4">[4]</a>.
      In my codebase,
      <tt>linguist</tt>
      identifies this code simply
      as Markdown.<a id="footnote-5-ref" href="#footnote-5">[5]</a>
      <tt>linguist</tt>
      then ignores it,
      as it does all documentation files.<a id="footnote-6-ref" href="#footnote-6">[6]</a>
    </p>
    <p>Currently, this kind of homegrown
      literate programming may be so rare
      that it is not worth taking into account.
      But if literate programming becomes more popular,
      that trend might well slip under
      <tt>linguist</tt>'s radar.
      And even those with a lot of faith in
      <tt>linguist</tt>'s numbers should be happy to
      know they could be confirmed by more careful methods.
    </p>
    <h2>Token-by-token versus line-by-line</h2>
    <p><tt>linguist</tt> avoids reporting results based on looking at the code,
    because careful line counting for multiple languages
      cannot be done with traditional parsing methods.<a id="footnote-7-ref" href="#footnote-7">[7]</a>
      To do careful line counting,
      a parser must be able to handle ambiguity in several forms --
      ambiguous parses, ambiguous tokens, and overlapping variable-length tokens.
    </p>
    <p>
      The ability to deal with
      "overlapping variable-length tokens" may sound like a bizarre requirement,
      but it is not.
      Line-by-line languages (BASIC, FORTRAN, JSON, .ini files, Markdown)
      and token-by-token languages (C, Java, Javascript, HTML)
      are both common,
      and even today commonly occur in the same file (POD and Perl,
      Haskell's Bird notation, Knuth's CWeb).
    </p>
    <p>
      Deterministic parsing can switch back and forth,
      though at the cost of some very hack-ish code.
      But for careful line counting,
      you need to parse line-by-line and token-by-token
      simultaneously.
      Consider this example:
    </p>
    <pre><tt>
    int fn () { /* for later
\begin{code}
   */ int fn2(); int a = fn2();
   int b = 42;
   return  a + b; /* for later
\end{code}
*/ }
    </tt></pre>
    <p>A reader can imagine that this code is part of a test case using code
      pulled from a LaTeX file.
      The programmer wanted to indicate the copied portion of code,
      and did so by commenting out its original LaTeX delimiters.
      GCC compiles this code without warnings.
    </p>
    <p>It is not really the case that LaTeX is a line-by-line language.
      But in literate programming systems<a id="footnote-8-ref" href="#footnote-8">[8]</a>,
      it is usually required that the
      <tt>\begin{code}</tt>
      and
      <tt>\end{code}</tt>
      delimiters begin at column 0,
      and that the code block between them be a set of whole lines so,
      for our purposes in this post,
      we can treat LaTeX as line-by-line.
      For LaTeX, our parser finds
    </p><pre><tt>
  L1c1-L1c29 LaTeX line: "    int fn () { /* for later"
  L2c1-L2c13 \begin{code}
  L3c1-L5c31 [A CODE BLOCK]
  L6c1-L6c10 \end{code}
  L7c1-L7c5 LaTeX line: "*/ }"<a id="footnote-9-ref" href="#footnote-9">[9]</a>
</tt></pre><p>
      Note that in the LaTeX parse, line alignment is respected perfectly:
      The first and last are ordinary LaTeX lines,
      the 2nd and 6th are commands bounding the code,
      and lines 3 through 5 are a code block.
    </p>
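    <p>The line-by-line view can be sketched in a few lines of Python.
      This is an illustration only, not the Marpa-based parser used for this post:
      the function name <tt>latex_line_view</tt> and the span labels
      are assumptions of the sketch,
      and it reports whole lines rather than line and column pairs.
    </p>

```python
EXAMPLE = (
    "    int fn () { /* for later\n"
    "\\begin{code}\n"
    "   */ int fn2(); int a = fn2();\n"
    "   int b = 42;\n"
    "   return  a + b; /* for later\n"
    "\\end{code}\n"
    "*/ }\n"
)


def latex_line_view(text):
    """Return (first_line, last_line, kind) spans, 1-based and inclusive."""
    spans = []
    in_code = False
    block_start = None  # first line of the code block being collected
    number = 0
    for number, line in enumerate(text.splitlines(), start=1):
        # The delimiters only count when they start at the left margin.
        if not in_code and line.startswith("\\begin{code}"):
            spans.append((number, number, "begin"))
            in_code = True
            block_start = None
        elif in_code and line.startswith("\\end{code}"):
            if block_start is not None:
                spans.append((block_start, number - 1, "code block"))
            spans.append((number, number, "end"))
            in_code = False
        elif in_code:
            if block_start is None:
                block_start = number
        else:
            spans.append((number, number, "LaTeX line"))
    # An unterminated block is still reported.
    if in_code and block_start is not None:
        spans.append((block_start, number, "code block"))
    return spans


for span in latex_line_view(EXAMPLE):
    print(span)
```

    <p>Run on the example, the sketch reports lines 1 and 7 as ordinary LaTeX lines,
      lines 2 and 6 as the delimiters,
      and lines 3 through 5 as one code block.
    </p>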
    <p>
      The C tokenization, on the other hand,
      shows no respect for lines.
      Most tokens are a small part of their line,
      and the two comments start in the middle of
      a line and end in the middle of one.
      For example, the first comment starts at column 17
      of line 1 and ends at column 5 of line 3.<a id="footnote-10-ref" href="#footnote-10">[10]</a>
    </p>
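    <p>For comparison, here is a minimal Python sketch of the token-by-token view,
      restricted to C block comments.
      It is a hand-written scanner, not the Marpa grammar from the test file,
      and the name <tt>c_comment_spans</tt> is an assumption of the sketch.
      It ignores string literals and <tt>//</tt> comments,
      which a real C lexer would have to handle.
    </p>

```python
EXAMPLE = (
    "    int fn () { /* for later\n"
    "\\begin{code}\n"
    "   */ int fn2(); int a = fn2();\n"
    "   int b = 42;\n"
    "   return  a + b; /* for later\n"
    "\\end{code}\n"
    "*/ }\n"
)


def c_comment_spans(text):
    """Return ((line, col), (line, col)) for each block comment.

    Positions are 1-based; the second pair is the comment's last character.
    """
    spans = []
    line, col = 1, 1
    start = None  # position of the "/*" of the comment we are inside, if any
    i = 0
    while i < len(text):
        two = text[i:i + 2]
        if start is None and two == "/*":
            start = (line, col)
            i += 2
            col += 2
        elif start is not None and two == "*/":
            spans.append((start, (line, col + 1)))
            start = None
            i += 2
            col += 2
        else:
            if text[i] == "\n":
                line += 1
                col = 1
            else:
                col += 1
            i += 1
    return spans


for begin, end in c_comment_spans(EXAMPLE):
    print(begin, end)
```

    <p>On the example this finds two comments.
      The first runs from line 1, column 17 to line 3, column 5,
      so it swallows the <tt>\begin{code}</tt> line whole
      and pays no attention to line boundaries.
    </p>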
    <p>What language is our example in?
    Our example is long enough to justify classification,
    and it compiles as C code.
    So it seems best to classify this example as C code<a id="footnote-11-ref" href="#footnote-11">[11]</a>.
    Our parses give us enough data for a heuristic
    to make a decision capturing this intuition.<a id="footnote-12-ref" href="#footnote-12">[12]</a>
    </p>
    <h2>Earley/Leo parsing and combinators</h2>
    <p>In a series of previous posts<a id="footnote-13-ref" href="#footnote-13">[13]</a>,
      I have been developing a parsing method that
      integrates
      Earley/Leo parsing and combinator parsing.
      Everything in my previous posts is available
      in <a href=
      "https://metacpan.org/pod/distribution/Marpa-R2/pod/Marpa_R2.pod"
      >Marpa::R2</a>,
      which was Debian stable as of jessie.
    </p>
    <p>
      The final piece, added in this post, is the
      ability to use variable length subparsing<a id="footnote-14-ref" href="#footnote-14">[14]</a>,
      which I have just added to Marpa::R3,
      Marpa::R2's successor.
      Releases of <a href=
      "https://metacpan.org/pod/release/JKEGL/Marpa-R3-4.001_053/pod/Marpa_R3.pod"
      >Marpa::R3</a>
      pass a full test suite,
      and the documentation is kept up to date,
      but R3 is alpha, and the usual cautions<a id="footnote-15-ref" href="#footnote-15">[15]</a>
      apply.
    </p>
    <p>Earley/Leo parsing is linear for a superset
    of the LR-regular grammars,
    which includes all other grammar classes in practical use,
    and Earley/Leo allows the equivalent of infinite lookahead.<a id="footnote-16-ref" href="#footnote-16">[16]</a>
    When the power of Earley/Leo gives out,
    Marpa allows combinators (subparsers)
    to be invoked.
    The subparsers can be anything, including
    other Earley/Leo parsers,
    and they can be called recursively<a id="footnote-17-ref" href="#footnote-17">[17]</a>.
    Rare will be the grammar of practical interest that
    cannot be parsed with this combination of methods.
    </p>
    <h2>The example</h2>
    <p>The code that ran this example is <a href=
    "https://github.com/jeffreykegler/Marpa--R3/tree/08fa873687130fcfbe199a5f573375ad11322f3a/pub/varlex"
    >available on Github</a>.
      In previous posts,
      we gave larger examples<a id="footnote-18-ref" href="#footnote-18">[18]</a>,
      and our tools and techniques have scaled.
      We expect that the variable-length subparsing
      feature will also scale -- while it was not available in
      Marpa::R2, it is not in itself new.
      Variable-length tokens have been available in other Marpa interfaces for
      years and they were described in Marpa's theory paper.<a id="footnote-19-ref" href="#footnote-19">[19]</a>
    </p>
    <p>
      The grammars used in the example of this post are minimal.
      Only enough LaTeX is implemented
      to recognize code blocks; and
      only enough C syntax is implemented to recognize comments.
    </p>
    <h2>The code, comments, etc.</h2>
    <p>To learn more about Marpa,
      a good first stop is the
      <a href="http://savage.net.au/Marpa.html">semi-official web site, maintained by Ron Savage</a>.
      The official, but more limited, Marpa website
      <a href="http://jeffreykegler.github.io/Marpa-web-site/">is my personal one</a>.
      Comments on this post can be made in
      <a href="http://groups.google.com/group/marpa-parser">
        Marpa's Google group</a>,
      or on our IRC channel: #marpa at freenode.net.
    </p>
    <h2>Footnotes</h2>
<p id="footnote-1"><b>1.</b>
	The Github repo for <tt>linguist</tt> is <a href=
	"https://github.com/github/linguist/"
	>https://github.com/github/linguist/</a>.
 <a href="#footnote-1-ref">&#8617;</a></p>
<p id="footnote-2"><b>2.</b>
	Their methodology is often left vague,
	but it seems safe to say the careful line-by-line counting
	discussed in this post
	goes well beyond the techniques used in
	the widely-publicized lists of "most popular programming
	languages". 
	<br><br>
	In fact, it seems likely these measures do not use line
	counts at all,
	but instead report the sum of blob sizes.
	Github's <tt>linguist</tt> does give a line count but
	Github does not vouch for its accuracy:
"if you really need to know the lines of code of an entire repo, there are much better tools for this than Linguist."
        (Quoted from
        <a href=
	"https://github.com/github/linguist/issues/3131"
	>the resolution of
	Github linguist issue #3131</a>.)
	The Github API's <tt>list-languages</tt> command reports language sizes
	in bytes.
	The <a href=
	  "https://developer.github.com/v3/repos/#list-languages"
	>API documentation</a>
	is vague, but it seems the counts are the
	sum of blob sizes,
	with each blob classed as one and only one language.
	<br><br>
	Some tallies seem even more coarsely grained than this --
	they are not even blob-by-blob,
	but assign entire repos to the "primary language".
	For more, see
        <a href="https://techcrunch.com/2018/09/30/what-the-heck-is-going-on-with-measures-of-programming-language-popularity/">
          Jon Evans's
          <cite>Techcrunch</cite>
          article</a>;
	  and <a href=
	  "https://www.benfrederickson.com/ranking-programming-languages-by-github-users/"
	  >Ben Frederickson's project</a>.
 <a href="#footnote-2-ref">&#8617;</a></p>
<p id="footnote-3"><b>3.</b>
        <tt>linguist</tt>'s methodology is described in its README.md
	(<a href=
	"https://github.com/github/linguist/blob/8cd9d744caa7bd3920c0cb8f9ca494ce7d8dc206/README.md"
	>permalink as of 30 September 2018</a>).
 <a href="#footnote-3-ref">&#8617;</a></p>
<p id="footnote-4"><b>4.</b>
        This custom literate programming system is not documented or packaged,
	but those who cannot resist taking a look can find the Markdown
	file it processes <a href=
	"https://github.com/jeffreykegler/Marpa--R3/blob/f16ef5798986da69fa8b437edc3930ce2cebd498/cpan/kollos/kollos.md"
	>here</a>,
	and its own code <a href=
	"https://github.com/jeffreykegler/Marpa--R3/blob/f16ef5798986da69fa8b437edc3930ce2cebd498/cpan/kollos/miranda">
	here</a>
	(permalinks accessed 2 October 2018).
 <a href="#footnote-4-ref">&#8617;</a></p>
<p id="footnote-5"><b>5.</b>
        For those who care about getting
        <tt>linguist</tt>
        as
        accurate as possible,
        there is a workaround:
        the
        <tt>linguist-language</tt>
        git attribute.
        This still requires that each blob be 
	reported as containing lines of only one language.
 <a href="#footnote-5-ref">&#8617;</a></p>
<p id="footnote-6"><b>6.</b>
        For the treatment of Markdown, see
        <tt>linguist</tt>
        <a href="https://github.com/github/linguist/blob/8cd9d744caa7bd3920c0cb8f9ca494ce7d8dc206/README.md#my-repository-isnt-showing-my-language">README.md</a>
        (permalink accessed as of 30 September 2018).
 <a href="#footnote-6-ref">&#8617;</a></p>
<p id="footnote-7"><b>7.</b>
        Another possibility is a multi-scan approach -- one
        pass per language.
        But that is likely to be expensive.
        At last count there were 381 languages in
        <tt>linguist</tt>'s
        database.
        Worse, it won't solve the problem:
        "liberal" recognition even of a single language
        requires more power than available from
        traditional parsers.
 <a href="#footnote-7-ref">&#8617;</a></p>
<p id="footnote-8"><b>8.</b>
      For example, these line-alignment requirements match 
      those in
      <a href=
      "https://www.haskell.org/onlinereport/haskell2010/haskellch10.html"
      >Section 10.4</a> of the 2010 Haskell Language Report.
 <a href="#footnote-8-ref">&#8617;</a></p>
<p id="footnote-9"><b>9.</b>
  Adapted from
  <a href=
  "https://github.com/jeffreykegler/Marpa--R3/blob/08fa873687130fcfbe199a5f573375ad11322f3a/pub/varlex/idlit_ex2.t#L83"
  >test code in Github repo</a>, permalink accessed 2 October 2018.
 <a href="#footnote-9-ref">&#8617;</a></p>
<p id="footnote-10"><b>10.</b>
      See the <a href=
      "https://github.com/jeffreykegler/Marpa--R3/blob/08fa873687130fcfbe199a5f573375ad11322f3a/pub/varlex/idlit_ex2.t#L44"
      >test file</a>
      on Github.
 <a href="#footnote-10-ref">&#8617;</a></p>
<p id="footnote-11"><b>11.</b>
    Some might think the two LaTeX lines should be counted as LaTeX and,
    using subparsing of comments, that heuristic can be implemented.
 <a href="#footnote-11-ref">&#8617;</a></p>
<p id="footnote-12"><b>12.</b>
    To be sure, a useful tool would want to include considerably more of
    C's syntax.
    It is perhaps not necessary to be sure that a file compiles
    before concluding it is C.
    And we might want to class a file as C in spite of a
    fleeting failure to compile.
    But we do want to lower the probability of a false positive.
 <a href="#footnote-12-ref">&#8617;</a></p>
<p id="footnote-13"><b>13.</b>
    <a href=
    "http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2018/05/csg.html"
    >Marpa and procedural parsing</a>;
    <a href=
    "http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2018/05/combinator.html"
    >Marpa and combinator parsing</a>;
    and <a href=
    "http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2018/05/combinator2.html"
    >Marpa and combinator parsing 2</a>
 <a href="#footnote-13-ref">&#8617;</a></p>
<p id="footnote-14"><b>14.</b>
      There is <a href=
      "https://metacpan.org/pod/distribution/Marpa-R2/pod/Marpa_R2.pod"
      >documentation of the interface</a>,
      but it is not a good starting point
      for a reader who has just started to look at the Marpa::R3 project.
      Once a user is familiar with Marpa::R3's standard DSL-based
      interface,
      they can start to learn about its alternatives <a href=
      "https://metacpan.org/pod/release/JKEGL/Marpa-R3-4.001_053/pod/External/Basic.pod"
      >here</a>.
 <a href="#footnote-14-ref">&#8617;</a></p>
<p id="footnote-15"><b>15.</b>
        Specifically,
	since Marpa::R3 is alpha,
	its features are subject
        to change without notice, even between micro releases,
        and changes are made without concern for backward compatibility.
        This makes R3 unsuitable for a production application.
        Add to this that,
	while R3 is tested, it has seen much less
        usage and testing than R2, which has been very stable for
        some time.
 <a href="#footnote-15-ref">&#8617;</a></p>
<p id="footnote-16"><b>16.</b>
    Technically, a grammar is LR-regular if it can be parsed
    deterministically using a regular set as its lookahead.
    A "regular set" is a set of regular expressions.
    The regular set itself must be finite,
    but the regular expressions it contains
    can match lookaheads of arbitrary length.
 <a href="#footnote-16-ref">&#8617;</a></p>
<p id="footnote-17"><b>17.</b>
    See <a href=
    "http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2018/05/combinator2.html"
    >Marpa and combinator parsing 2</a>
 <a href="#footnote-17-ref">&#8617;</a></p>
<p id="footnote-18"><b>18.</b>
    The largest example is in <a href=
    "http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2018/05/combinator2.html"
    >Marpa and combinator parsing 2</a>
 <a href="#footnote-18-ref">&#8617;</a></p>
<p id="footnote-19"><b>19.</b>
 Kegler, Jeffrey. <cite>Marpa, A Practical General Parser: The Recognizer</cite>.
 <a href=
 "http://dinhe.net/~aredridel/.notmine/PDFs/Parsing/KEGLER,%20Jeffrey%20-%20Marpa,%20a%20practical%20general%20parser:%20the%20recognizer.pdf"
>Online version accessed 24 April 2018</a>.
The link is to the 19 June 2013 revision of the 2012 original.
 <a href="#footnote-19-ref">&#8617;</a></p>
  </body>
</html>
<br />
<p>posted at: 20:16 |
<a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2018/10/popularity.html">direct link to this entry</a>
</p>
<div style="color:#38B0C0;padding:1px;text-align:center;">
&sect;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&sect;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&sect;
</div>
<h3>Tue, 28 Aug 2018</h3>
<br />
<center><a name="rntz"> <h2>A Haskell challenge</h2> </a>
</center>
<html>
  <head>
  </head>
  <body style="max-width:850px">
    <!--
      marpa_r2_html_fmt --no-added-tag-comment --no-ws-ok-after-start-tag
      -->
    <h2>The challenge</h2>
    <p>
    A <a href="http://www.rntz.net/post/2018-07-10-parsing-list-comprehensions.html">recent
    blog post by Michael Arntzenius</a> ended with a friendly challenge to Marpa.
    Haskell list comprehensions are something that
    Haskell's own parser handles only with difficulty.
    A point of Michael's critique of Haskell's parsing was
    that Haskell's list comprehension could be even more powerful if not
    for these syntactic limits.
    </p>
    <p>Michael wondered aloud if Marpa could do better.
    It can.
    </p>
    <p>The problem syntax occurs with the "guards",
    a very powerful facility of
    Haskell's list comprehension.
    Haskell allows several kinds of "guards".
    Two of these "guards" can have the same prefix,
    and these ambiguous prefixes can
    be of arbitrary length.
    In other words,
    parsing Haskell's list comprehension requires
    either lookahead of arbitrary length,
    or its equivalent.
    </p>
    <p>To answer Michael's challenge,
    I extended my Haskell subset parser to deal with
    list comprehension.
    That parser, with its test examples, is online.<a id="footnote-1-ref" href="#footnote-1">[1]</a>
    I have run it for examples thousands of tokens long and,
    more to the point,
    have checked the Earley sets to ensure that Marpa
    will stay linear,
    no matter how long the ambiguous prefix gets.<a id="footnote-2-ref" href="#footnote-2">[2]</a>
    </p>
    <p>Earley parsing, which Marpa uses,
    accomplishes the seemingly impossible here.
    It does the equivalent of infinite lookahead efficiently,
    without actually doing any lookahead or
    backtracking.
    That Earley's algorithm can do this has been a settled
    fact in the literature for some time.
    But today Earley's algorithm is little known even
    among those well acquainted with parsing,
    and to many, claiming the equivalent of infinite lookahead,
    without actually doing any lookahead at all,
    sounds like a boast of magical powers.
    </p>
    <p>
    In the rest of this blog post,
    I hope to indicate how Earley parsing follows more than
    one potential parse at a time.
    I will not describe Earley's algorithm in full.<a id="footnote-3-ref" href="#footnote-3">[3]</a>
    But I will show that no magic is involved,
    and that in fact the basic ideas behind Earley's method
    are intuitive and reasonable.
    </p>
    <h2>A quick cheat sheet on list comprehension</h2>
    <p>
    List comprehension in Haskell is impressive.
    Haskell allows
    you to build a list using a series of "guards",
    which can be of several kinds.
    The parsing issue arises because two of the guard types --
    generators and boolean expressions --
    must be treated quite differently,
    but can look the same over an arbitrarily long prefix.
    </p>
    <h3>Generators</h3>
    <p>Here is one example of a Haskell generator,
    from the test case for this blog post:
    </p>
    <pre><tt>
          list = [ x | [x, 1729,
		      -- insert more here
		      99
		   ] &lt;- xss ] </tt><a id="footnote-4-ref" href="#footnote-4">[4]</a></pre>
    <p>
    This says to build a list of <tt>x</tt>'s
    such that the guard
    <tt>[x, 1729, 99 ] &lt;- xss</tt>
    holds.
    The clue that this guard is a generator is the
    <tt>&lt;-</tt> operator.
    The <tt>&lt;-</tt> operator
    will appear in every generator,
    and means "draw from".
    </p>
    <p>
    The LHS of the <tt>&lt;-</tt> operator is a pattern
    and the RHS is an expression.
    This generator draws all the elements from <tt>xss</tt>
    which match the pattern <tt>[x, 1729, 99 ]</tt>.
    In other words, it draws out
    all the elements of <tt>xss</tt>,
    and tests that they
    are lists of length 3
    whose last two subelements are 1729 and 99.
    </p>
    <p>The variable <tt>x</tt> is set to the 1st subelement.
    <tt>list</tt> will be a list of all those <tt>x</tt>'s.
    In the test suite, we have
    </p>
    <pre><tt>
    xss = [ [ 42, 1729, 99 ] ] </tt><a id="footnote-5-ref" href="#footnote-5">[5]</a></pre>
    <p>
    so that <tt>list</tt> becomes <tt>[42]</tt> -- a list
    of one element whose value is 42.
    </p>
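    <p>For readers more at home in Python,
    the meaning of the generator can be approximated as follows.
    This is a sketch of the semantics only:
    the helper <tt>matches</tt> is a name invented for this example,
    and the result is called <tt>result</tt> because <tt>list</tt>
    is a built-in name in Python.
    </p>

```python
xss = [[42, 1729, 99]]


def matches(element):
    """True when element has the shape [x, 1729, 99]."""
    return len(element) == 3 and element[1:] == [1729, 99]


# Counterpart of the generator: keep the first subelement of every
# element of xss that matches the pattern, and skip the others.
result = [element[0] for element in xss if matches(element)]
print(result)
```

    <p>Elements of <tt>xss</tt> that do not match the pattern are skipped,
    not treated as errors,
    which is also how a failed pattern match behaves in a Haskell generator.
    </p>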
    <h3>Boolean guards</h3>
    <p>Generators can share very long prefixes with Boolean guards.
    </p>
    <pre><tt>
	list2 = [ x | [x, 1729, 99] &lt;- xss,
               [x, 1729,
                  -- insert more here
                  99
               ] == ys,
             [ 42, 1729, 99 ] &lt;- xss
             ] </tt><a id="footnote-6-ref" href="#footnote-6">[6]</a></pre>
    <p>The expression defining <tt>list2</tt>
    has 3 comma-separated guards:
    The first guard is a generator,
    the same one as in the previous example.
    The last guard is also a generator.
    </p>
    <p>
    The middle guard is of a new type: it is a Boolean:
    <tt>[x, 1729, 99 ] == ys</tt>.
    This guard insists that <tt>x</tt> be such that the triple
    <tt>[x, 1729, 99 ]</tt> is equal to <tt>ys</tt>.
    </p>
    <p>
    In the test suite, we have
    </p>
    <pre><tt>
    ys = [ 42, 1729, 99 ] </tt><a id="footnote-7-ref" href="#footnote-7">[7]</a></pre>
    <p>
    so that <tt>list2</tt> is also
    <tt>[42]</tt>.
    </p>
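    <p>The three guards of <tt>list2</tt> can be approximated in Python
    in the same spirit.
    Again this is only a sketch of the semantics:
    the helper <tt>draw</tt> is an invented name,
    and Python's <tt>for</tt> and <tt>if</tt> clauses stand in for
    Haskell's generators and Boolean guards.
    </p>

```python
xss = [[42, 1729, 99]]
ys = [42, 1729, 99]


def draw(pattern_tail, source):
    """Yield the first subelement of each 3-element list ending in pattern_tail."""
    for element in source:
        if len(element) == 3 and element[1:] == pattern_tail:
            yield element[0]


list2 = [
    x
    for x in draw([1729, 99], xss)   # first guard, a generator
    if [x, 1729, 99] == ys           # second guard, a Boolean
    for element in xss               # third guard, a generator whose
    if element == [42, 1729, 99]     # pattern binds no variables
]
print(list2)
```

    <p>Note that in Python the two kinds of guard are introduced by different keywords,
    so a Python parser knows which one it is reading from the first token.
    Haskell gives no such early hint.
    </p>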
    <h2>Boolean guards versus generators</h2>
    <p>From the parser's point of view, Boolean guards
    and generators start out looking the same --
    in the examples above, three of our guards start out
    the same -- with the string <tt>[x, 1729, 99 ]</tt>,
    but
    <ul>
    <li>in one case (the Boolean guard),
    <tt>[x, 1729, 99 ]</tt> is the beginning of an expression; and </li>
    <li>in the other two cases (the generators),
    <tt>[x, 1729, 99 ]</tt> is a pattern.</li>
    </ul>
    Clearly patterns and expressions can look identical.
    And they can look identical for an arbitrarily long time --
    I tested the <a href="https://www.haskell.org/ghc/">Glasgow Haskell Compiler</a>
    (GHC)
    with identical expression/pattern prefixes
    thousands of tokens in length.
    My virtual memory eventually gives out,
    but GHC itself never complains.<a id="footnote-8-ref" href="#footnote-8">[8]</a>
    (The comments "<tt>insert more here</tt>" show the points at which the
    comma-separated lists of integers can be extended.)
    </p>
    <h2>The problem for parsers</h2>
    <p>So Haskell list comprehension presents a problem for parsers.
    A parser must determine whether it is parsing an expression or
    a pattern, but it cannot know this for an arbitrarily long time.
    A parser must keep track of two possibilities at once --
    something traditional parsing has refused to do.
    As I have pointed out<a id="footnote-9-ref" href="#footnote-9">[9]</a>,
    belief that traditional parsing "solves" the parsing problem is
    belief in human exceptionalism --
    that humans have calculating abilities that Turing machines do not.
    Keeping two possibilities in mind for a long time is trivial for
    human beings -- in one form we call it worrying,
    and try to prevent ourselves from doing it obsessively.
    But it has been the orthodoxy that practical parsing algorithms
    cannot do this.
    </p>
    <p>Arntzenius has a nice summary of the attempts to parse this
    construct while only allowing one possibility at a time --
    that is, deterministically.
    Lookahead clearly cannot work -- it would have to be arbitrarily
    long.
    Backtracking can work, but can be very costly
    and is a major obstacle to quality error reporting.
    </p>
    <p>
    GHC avoids the problems with backtracking by using post-processing.
    At parsing time, GHC treats an ambiguous guard as a
    Boolean.
    Then, if it turns out that it is a generator,
    it rewrites it in post-processing.
    This inelegance incurs some real technical debt --
    either a pattern must <b>always</b> be a valid expression,
    or even more trickery must be resorted to.<a id="footnote-10-ref" href="#footnote-10">[10]</a>
    </p>
    <h2>The Earley solution</h2>
    <p>Earley parsing deals with this issue by doing what 
    a human would do --
    keeping both possibilities in mind at once.
    Jay Earley's innovation was to discover a way for a computer
    to track multiple possible parses
    that is compact,
    efficient to create,
    and efficient to read.
    </p>
    <p>
    Earley's algorithm maintains an "Earley table"
    which contains "Earley sets",
    one for each token.
    Each Earley set contains "Earley items".
    Here are some Earley items from Earley set 25 
    in one of our test cases:
    </p>
    <pre><tt>
	origin = 22; &lt;atomic expression&gt; ::=   '[' &lt;expression&gt; '|' . &lt;guards&gt; ']'
	origin = 25; &lt;guards&gt; ::= . &lt;guard&gt;
	origin = 25; &lt;guards&gt; ::= . &lt;guards&gt; ',' &lt;guard&gt;
	origin = 25; &lt;guard&gt;  ::= . &lt;pattern&gt; '&lt;-' &lt;expression&gt;
	origin = 25; &lt;guard&gt;  ::= . &lt;expression&gt; </tt><a id="footnote-11-ref" href="#footnote-11">[11]</a></pre>
     <p>
     In the code, these represent the state of the parse just after
     the pipe symbol ("<tt>|</tt>") on line 4 of our test code.
    </p>
    <p>Each Earley item describes progress in one rule of the grammar.
    There is a dot ("<tt>.</tt>") in each rule,
    which indicates how far the parse
    has progressed inside the rule.
    One of the rules has the dot just after the pipe symbol,
    as you would expect, since we have just seen a pipe symbol.
    </p>
    <p>
    The other four rules have the dot at the beginning of the RHS.
    These four rules are "predictions" -- none of their symbols
    have been parsed yet, but we know that these rules might occur,
    starting at the location of this Earley set.
    </p>
    <p>
    Each item also records an "origin": the location in the input where
    the rule described in the item began.
    For predictions the origin is always the same as the Earley set.
    For the first Earley item, the origin is 3 tokens earlier,
    in Earley set 22.
    </p>
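    <p>To make the three parts of an Earley item concrete
    (a rule, a dot position and an origin),
    here is a minimal Python sketch of the five items shown above.
    The class <tt>EarleyItem</tt> and its method names are inventions of this example,
    not Marpa's internal representation,
    the symbol names are written without angle brackets,
    and <tt>ARROW</tt> is a stand-in for the draw-from operator.
    </p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EarleyItem:
    origin: int   # Earley set in which this instance of the rule began
    lhs: str      # left-hand side of the rule
    rhs: tuple    # right-hand side symbols
    dot: int      # number of RHS symbols already recognized

    def is_prediction(self):
        return self.dot == 0

    def next_symbol(self):
        """The symbol just after the dot, or None if the rule is complete."""
        return self.rhs[self.dot] if self.dot < len(self.rhs) else None

    def __str__(self):
        symbols = list(self.rhs)
        symbols.insert(self.dot, ".")
        return "origin = %d; %s ::= %s" % (self.origin, self.lhs, " ".join(symbols))


# The items of Earley set 25, transcribed by hand from the listing.
SET_25 = [
    EarleyItem(22, "atomic_expression", ("'['", "expression", "'|'", "guards", "']'"), 3),
    EarleyItem(25, "guards", ("guard",), 0),
    EarleyItem(25, "guards", ("guards", "','", "guard"), 0),
    EarleyItem(25, "guard", ("pattern", "ARROW", "expression"), 0),
    EarleyItem(25, "guard", ("expression",), 0),
]

for item in SET_25:
    kind = "prediction" if item.is_prediction() else "in progress"
    print("%-11s %s" % (kind, item))
```

    <p>Four of the five items report themselves as predictions,
    and each prediction has an origin equal to the number of its own Earley set.
    </p>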
    <h2>The "secret" of non-determinism</h2>
    <p>
    And now we have come to the secret of efficient non-deterministic parsing --
    a "secret"
    which I hope to convince the reader is not magic,
    or even much of a mystery.
    Here, again, are two of the items from Earley set 25:</p>
    <pre><tt>
	origin = 25; &lt;guard&gt;  ::= . &lt;pattern&gt; '&lt;-' &lt;expression&gt;
	origin = 25; &lt;guard&gt;  ::= . &lt;expression&gt; </tt> <a id="footnote-12-ref" href="#footnote-12">[12]</a></pre>
    <p>At this point there are two possibilities going forward --
    a generator guard or a Boolean expression guard.
    And there is an Earley item for each of these possibilities in the Earley set.
    </p>
    <p>
    That is the basic idea -- that is all there is to it.
    Going forward in the parse, for as long as both possibilities stay
    live, Earley items for both will appear in the Earley sets.
    </p>
    <p>From this point of view,
    it should now be clear why the Earley algorithm can keep track
    of several possibilities without lookahead or backtracking.
    No lookahead is needed because all possibilities are in the
    Earley set, and selection among them will take place as the
    rest of the input is read.
    And no backtracking is needed because every possibility
    was already recorded -- there is nothing new to be found
    by backtracking.
    </p>
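    <p>The idea can be tried out with a toy recognizer.
    The Python sketch below is a textbook Earley recognizer without Leo's improvement,
    not Marpa,
    and it assumes a grammar with no empty rules.
    The grammar, the token names and all function names are inventions of this example;
    the grammar is a drastically reduced stand-in for Haskell's guards,
    in which a guard is either <tt>pattern ARROW expression</tt>
    or an <tt>expression</tt>.
    </p>

```python
GRAMMAR = [
    ("guard", ("pattern", "ARROW", "expression")),
    ("guard", ("expression",)),
    ("pattern", ("LB", "pitems", "RB")),
    ("pitems", ("pitem",)),
    ("pitems", ("pitems", "COMMA", "pitem")),
    ("pitem", ("NAME",)),
    ("pitem", ("NUM",)),
    ("expression", ("term",)),
    ("expression", ("term", "EQ", "term")),
    ("term", ("NAME",)),
    ("term", ("NUM",)),
    ("term", ("LB", "eitems", "RB")),
    ("eitems", ("term",)),
    ("eitems", ("eitems", "COMMA", "term")),
]
PATTERN_SIDE = {"pattern", "pitems", "pitem"}
EXPRESSION_SIDE = {"expression", "term", "eitems"}


def earley_sets(grammar, start, tokens):
    """Build the Earley sets; an item is the tuple (lhs, rhs, dot, origin)."""
    nonterminals = {lhs for lhs, _ in grammar}
    sets = [[] for _ in range(len(tokens) + 1)]

    def add(position, item):
        if item not in sets[position]:
            sets[position].append(item)

    for lhs, rhs in grammar:
        if lhs == start:
            add(0, (lhs, rhs, 0, 0))
    for position in range(len(tokens) + 1):
        index = 0
        while index < len(sets[position]):  # the set grows while we walk it
            lhs, rhs, dot, origin = sets[position][index]
            index += 1
            if dot == len(rhs):
                # Completion: advance every item that was waiting for lhs.
                for plhs, prhs, pdot, porigin in list(sets[origin]):
                    if pdot < len(prhs) and prhs[pdot] == lhs:
                        add(position, (plhs, prhs, pdot + 1, porigin))
            elif rhs[dot] in nonterminals:
                # Prediction: add every rule for the symbol after the dot.
                for rule_lhs, rule_rhs in grammar:
                    if rule_lhs == rhs[dot]:
                        add(position, (rule_lhs, rule_rhs, 0, position))
            elif position < len(tokens) and tokens[position] == rhs[dot]:
                # Scanning: the next token is the one this item expects.
                add(position + 1, (lhs, rhs, dot + 1, origin))
    return sets


def guard_tokens(extra_numbers, operator):
    """Tokens for a guard like [x, 1729, 99] OP name, with a longer list on request."""
    return ["LB", "NAME"] + ["COMMA", "NUM"] * (2 + extra_numbers) + ["RB", operator, "NAME"]


def sides(earley_set):
    """Report which readings of the prefix have items in this Earley set."""
    found = set()
    for lhs, _rhs, _dot, _origin in earley_set:
        if lhs in PATTERN_SIDE:
            found.add("pattern")
        elif lhs in EXPRESSION_SIDE:
            found.add("expression")
    return found


def completed_guards(sets):
    """Right-hand sides of the guard rules that span the whole input."""
    return [rhs for lhs, rhs, dot, origin in sets[-1]
            if lhs == "guard" and dot == len(rhs) and origin == 0]


tokens = guard_tokens(0, "ARROW")
sets = earley_sets(GRAMMAR, "guard", tokens)
for position in range(1, tokens.index("RB") + 2):
    print(position, len(sets[position]), sorted(sides(sets[position])))
print(completed_guards(sets))
```

    <p>While the bracketed prefix is being read,
    every Earley set holds items for both the pattern reading and the expression reading.
    Only the token after the closing bracket decides between them,
    and making the list longer repeats the same set sizes instead of growing them.
    </p>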
    <p>It may also be clearer why I claim that Marpa is left-eidetic,
    and how the Ruby Slippers work.<a id="footnote-13-ref" href="#footnote-13">[13]</a>
    Marpa has perfect knowledge of everything in the parse so far,
    because it is all in the Earley tables.
    And, given left-eidetic knowledge, Marpa also knows what
    terminals are expected at the current location,
    and can "wish" them into existence as necessary.
    </p>
    <h2>The code, comments, etc.</h2>
    <p>A permalink to the
    full code and a test suite for this prototype,
    as described in this blog post,
    is
    <a href="https://github.com/jeffreykegler/Ocean-of-Awareness-blog/tree/0df0aef7d6cb8590d3a33f857619e75f84786dd7/code/haskell">
    on Github</a>.
    In particular,
    the permalink of the
    test suite file for list comprehension is
    <a href="https://github.com/jeffreykegler/Ocean-of-Awareness-blog/blob/0df0aef7d6cb8590d3a33f857619e75f84786dd7/code/haskell/listcomp.t">
    here</a>.
    I expect to update this code,
    and the latest commit can be found
    <a href="https://github.com/jeffreykegler/Ocean-of-Awareness-blog/tree/gh-pages/code/haskell">
    here</a>.
    </p>
    <p>
      To learn more about Marpa,
      a good first stop is the
      <a href="http://savage.net.au/Marpa.html">semi-official web site, maintained by Ron Savage</a>.
      The official, but more limited, Marpa website
      <a href="http://jeffreykegler.github.io/Marpa-web-site/">is my personal one</a>.
      Comments on this post can be made in
      <a href="http://groups.google.com/group/marpa-parser">
        Marpa's Google group</a>,
      or on our IRC channel: #marpa at freenode.net.
    </p>
    <h2>Footnotes</h2>
<p id="footnote-1"><b>1.</b>
    If you are interested in my Marpa-driven Haskell subset parser,
    <a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2018/05/combinator2.html">
    this blog post</a>
    may be the best introduction.
    The code is
    <a href="https://github.com/jeffreykegler/Ocean-of-Awareness-blog/tree/gh-pages/code/haskell">
    on Github</a>.
 <a href="#footnote-1-ref">&#8617;</a></p>
<p id="footnote-2"><b>2.</b>
    The Earley sets for the ambiguous prefix immediately reach a size
    of 46 items, and then stay at that level.
    This is experimental evidence that the Earley set
    sizes stay constant.
    <br><br>
    And, if the Earley items are examined,
    and their derivations traced,
    it can be seen that
    they must repeat the same Earley item count
    for as long as the ambiguous prefix continues.
    The traces I examined are
    <a href="https://github.com/jeffreykegler/Ocean-of-Awareness-blog/blob/0df0aef7d6cb8590d3a33f857619e75f84786dd7/code/haskell/listcomp_trace.out">here</a>,
    and the code which generated them is
    <a href="https://github.com/jeffreykegler/Ocean-of-Awareness-blog/blob/0df0aef7d6cb8590d3a33f857619e75f84786dd7/code/haskell/listcomp_ex.pl">here</a>,
    for the
    reader who wants to convince himself.
    <br><br>
    The guard prefixes of Haskell are ambiguous,
    but (modulo mistakes in the standards)
    the overall Haskell grammar is not.
    In the literature on Earley's,
    it has been shown that for an unambiguous grammar,
    each Earley item has a constant amortized cost in time.
    Therefore,
    if a parse produces
    Earley sets that are all of less than a constant size,
    it must have linear time complexity.
 <a href="#footnote-2-ref">&#8617;</a></p>
<p id="footnote-3"><b>3.</b>
    There are many descriptions of Earley's algorithm out there.
    <a href="https://en.wikipedia.org/wiki/Earley_parser">The
    Wikipedia page on Earley's algorithm</a>
    (accessed 27 August 2018)
    is one good place to start.
    I did
    another very simple introduction to Earley's in
    <a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2010/06/jay-earleys-idea.html">an
    earlier blog post</a>,
    which may be worth looking at.
    Note that Marpa contains
    <a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2011/11/what-is-the-marpa-algorithm.html">
    improvements to Earley's algorithm</a>.
    In particular, to fulfill Marpa's claim of linear time for all
    LR-regular grammars, Marpa uses Joop Leo's speed-up.
    But Joop's improvement is <b>not</b> necessary or useful
    for parsing
    Haskell list comprehension,
    is not used in this example,
    and will not be described in this post.
 <a href="#footnote-3-ref">&#8617;</a></p>
<p id="footnote-4"><b>4.</b>
    <a href="https://github.com/jeffreykegler/Ocean-of-Awareness-blog/blob/0df0aef7d6cb8590d3a33f857619e75f84786dd7/code/haskell/listcomp.t#L30">
    Permalink to this code</a>,
    accessed 27 August 2018.
 <a href="#footnote-4-ref">&#8617;</a></p>
<p id="footnote-5"><b>5.</b>
    <a href="https://github.com/jeffreykegler/Ocean-of-Awareness-blog/blob/0df0aef7d6cb8590d3a33f857619e75f84786dd7/code/haskell/listcomp.t#L25">
    Permalink to this code</a>,
    accessed 27 August 2018.
 <a href="#footnote-5-ref">&#8617;</a></p>
<p id="footnote-6"><b>6.</b>
    <a href="https://github.com/jeffreykegler/Ocean-of-Awareness-blog/blob/0df0aef7d6cb8590d3a33f857619e75f84786dd7/code/haskell/listcomp.t#L35">
    Permalink to this code</a>,
    accessed 27 August 2018.
 <a href="#footnote-6-ref">&#8617;</a></p>
<p id="footnote-7"><b>7.</b>
    <a href="https://github.com/jeffreykegler/Ocean-of-Awareness-blog/blob/0df0aef7d6cb8590d3a33f857619e75f84786dd7/code/haskell/listcomp.t#L28">
    Permalink to this code</a>,
    accessed 27 August 2018.
 <a href="#footnote-7-ref">&#8617;</a></p>
<p id="footnote-8"><b>8.</b>
    Note that if the list is extended, the pattern matches and Boolean
    tests fail, so that 42 is no longer the answer.
    From the parsing point of view, this is immaterial.
 <a href="#footnote-8-ref">&#8617;</a></p>
<p id="footnote-9"><b>9.</b>
    In several places, including
    <a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2018/07/knuth_1965_2.html">
    this blog post</a>.
 <a href="#footnote-9-ref">&#8617;</a></p>
<p id="footnote-10"><b>10.</b>
    This account of the state of the art summarizes
    <a href="http://www.rntz.net/post/2018-07-10-parsing-list-comprehensions.html">
    Arntzenius's recent post</a>,
    which should be consulted for the details.
 <a href="#footnote-10-ref">&#8617;</a></p>
<p id="footnote-11"><b>11.</b>
     Adapted from
     <a href="https://github.com/jeffreykegler/Ocean-of-Awareness-blog/blob/0df0aef7d6cb8590d3a33f857619e75f84786dd7/code/haskell/listcomp_trace.out#L811">
     this trace output</a>,
     accessed 27 August 2018.
 <a href="#footnote-11-ref">&#8617;</a></p>
<p id="footnote-12"><b>12.</b>
     Adapted from
     <a href="https://github.com/jeffreykegler/Ocean-of-Awareness-blog/blob/0df0aef7d6cb8590d3a33f857619e75f84786dd7/code/haskell/listcomp_trace.out#L811">
     this trace output</a>,
     accessed 27 August 2018.
 <a href="#footnote-12-ref">&#8617;</a></p>
<p id="footnote-13"><b>13.</b>
    For more on the Ruby Slippers see
    my <a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2018/05/combinator2.html">
    previous blog post</a>.
 <a href="#footnote-13-ref">&#8617;</a></p>
  </body>
</html>
<br />
<p>posted at: 07:30 |
<a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2018/08/rntz.html">direct link to this entry</a>
</p>
<div style="color:#38B0C0;padding:1px;text-align:center;">
&sect;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&sect;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&sect;
</div>
</div>
</div>
<div id="footer" style="border-top:thick solid #38B0C0;clear:left;padding:1em;">
<p>This is Ocean of Awareness's
  new home.  This blog has been hosted at
  <a href="http://blogs.perl.org/users/jeffrey_kegler/">blogs.perl.org</a>
  but I have succumbed to the lure of static blogging.</p>
</div>
	<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0];if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src="//platform.twitter.com/widgets.js";fjs.parentNode.insertBefore(js,fjs);}}(document,"script","twitter-wjs");</script>
              <script type="text/javascript">
            var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
          </script>
          <script type="text/javascript">
            try {
              var pageTracker = _gat._getTracker("UA-33430331-1");
            pageTracker._trackPageview();
            } catch(err) {}
          </script>
</body></html>