-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathindex.html
2426 lines (2421 loc) · 99.1 KB
/
index.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
<html>
<head>
<link rel="alternate" title="Ocean of Awareness RSS" type="application/rss+xml" title="RSS" href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/index.rss" />
<title>Ocean of Awareness</title>
<style type="text/css">
strong {font-weight: 700;}
</style>
</head>
<body>
<div
style="color:white;background-color:#38B0C0;padding:1em;clear:left;text-align:center;">
<h1>Ocean of Awareness</h1>
</div>
<div style="margin:0;padding:10px 30px 10px 10px;width:150px;float:left;border-right:2px solid #38B0C0">
<p>
<strong>Jeffrey Kegler's blog</strong>
about Marpa, his new parsing algorithm,
and other topics of interest</p>
<p><a href="http://jeffreykegler.github.io/personal/">Jeffrey's personal website</a></p>
<p>
<a href="https://twitter.com/jeffreykegler" class="twitter-follow-button" data-show-count="false">Follow @jeffreykegler</a>
</p>
<p style="text-align:center">
<!-- Place this code where you want the badge to render. -->
<a href="//plus.google.com/101567692867247957860?prsrc=3" rel="publisher" style="text-decoration:none;">
<img src="//ssl.gstatic.com/images/icons/gplus-32.png" alt="Google+" style="border:0;width:32px;height:32px;"/></a>
</p>
<h3>Marpa resources</h3>
<p><a href="http://jeffreykegler.github.io/Marpa-web-site/">The Marpa website</a></p>
<p>The Ocean of Awareness blog: <a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog">home page</a>,
<a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/metapages/chronological.html">chronological index</a>,
and
<a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/metapages/annotated.html">annotated index</a>.
</p>
</div>
<div style="margin-left:190px;border-left:2px solid #38B0C0;padding:25px;">
<h3>Sat, 01 Jun 2019</h3>
<br />
<center><a name="vgap"> <h2>Infinite Lookahead and Ruby Slippers</h2> </a>
</center>
<html>
<head>
</head>
<body>
<!--
marpa_r2_html_fmt --no-added-tag-comment --no-ws-ok-after-start-tag
-->
<h2>About this post</h2>
<p>This post presents a practical, compact example which
demonstrates a use case for both infinite lookahead
and Ruby Slippers parsing.
While the example itself is very simple,
this post may not be a good first tutorial --
it focuses on Marpa implementation strategy,
instead of basics.
</p>
<h2>About Urbit</h2>
<p>The example described in this post is one part of
<tt>hoonlint</tt>.
<tt>hoonlint</tt>, currently under development,
will be a "lint" program for a language called Hoon.
</p>
<p>
Hoon is part of
<a href="https://urbit.org/">the Urbit project</a>.
Urbit is an effort to return control of the Internet
experience to the individual user.
(The Urbit community has, generously, been supporting my work on Hoon.)
</p>
<p>
The original Internet and its predecessors were cosy places.
Users controlled their experience.
Authority was so light you
could forget it was there,
but so adequate to its task that you could forget why it
was necessary.
What we old timers do remember of the early Internet was the feeling of entering
into a "brave new world".
</p>
<p>
The Internet grew beyond our imaginings,
and our pure wonder of decades ago now seems ridiculous.
But the price has been a shift
of power which should be no laughing matter.
Control of our Internet experience now resides in
servers,
run by entities which make no secret of having their own interests.
Less overt, but increasingly obvious, is the single-mindedness with which they pursue
those interests.
</p>
<p>
And the stakes have risen.
In the early days,
we used the Internet as a supplement in our intellectual lives.
Today, we depend on it in our financial and social lives.
Today, the server-sphere can be a hostile place.
Going forward it may well become a theater of war.
</p>
We could try to solve this problem by running our own servers.
But this is a lot of work, and only leaves us in touch with those
willing and able to do that. In practice, this seems to be nobody.
</p>
<p>
Urbit seeks to solve these problems with
hassle-free personal servers, called urbits.
Urbits are journaling databases, so they are incorruptable.
To make sure they can be run anywhere in the cloud<a id="footnote-1-ref" href="#footnote-1">[1]</a>,
they are based on a tiny virtual machine, called Nock.
To keep urbits compact and secure,
Urbit takes on code bloat directly --
Urbit is an original design from a clean slate,
with a new protocol stack.
</p>
<h2>About Hoon</h2>
<p>
Nock's "machine language" takes the form of trees of arbitrary precision integers.
The integers can be interpreted as strings, floats, etc.,
as desired.
And the trees can be interpreted as lists,
giving Nock a resemblance to a LISP VM.
Nock does its own memory management
and takes care of its own garbage collection.<a id="footnote-2-ref" href="#footnote-2">[2]</a>
</p>
<p>
Traditionally, there are two ways to enter machine language,
<ul>
<li>Physically, for example,
by toggling it into a machine's front panel.
Originally, entering it physically was the only way.
</li>
<li>Indirectly, using
assembler or some higher-level language, like C.
Once these indirect methods existed, they
rapidly took over as the most common way to create machine language.
</li>
</ul>
Like traditional
machine language, Nock cannot be written directly.
Hoon is Urbit's equivalent of C -- it is Urbit's
"close to the metal" higher level language.
</p>
<p>
Not that Hoon looks much like C,
or anything else you've ever seen.
This is a Hoon program that takes an integer argument,
call it <tt>n</tt>,
and returns the first <tt>n</tt> counting numbers:
<pre><tt>
|= end=@ :: 1
=/ count=@ 1 :: 2
|- :: 3
^- (list @) :: 4
?: =(end count) :: 5
~ :: 6
:- count :: 7
$(count (add 1 count)) :: 8
</tt></pre>
</p>
<p>
Hoon comments begin with a "<tt>::</tt>" and run until the next
newline.
The above Hoon sample uses comments to show line numbers.
</p>
<p>
The example for this post will be
a <tt>hoonlint</tt> subset: a multi-line comment linter.
Multi-line comments are the only Hoon syntax we will talk about.
(For those who want to know more about Hoon,
<a href="https://urbit.org/docs/learn/hoon/">there is a tutorial</a>.)
</p>
<p>
</p>
<h2>About Hoon comments</h2>
<p>
In basic Hoon syntax, multi-line comments are free-form.
In practice, Hoon authors tend to follow a set of conventions.
</p>
<h3>Pre-comments</h3>
<p>
In the simplest case, a comment must precede the code it
describes, and be at the same indent.
These simple cases are called "pre-comments".<a id="footnote-3-ref" href="#footnote-3">[3]</a>
For example, this code contains a pre-comment:
<pre><tt>
:: pre-comment 1
[20 (mug bod)]
</tt></pre>
<p>
<h3>Inter-comments</h3>
Hoon multi-line comments may also
contain "inter-comments".
The inter-comments are aligned depending on the syntax.
In the display below, the inter-comments are aligned with the "rune" of the enclosing sequence.
A "rune" is Hoon's rough equivalent of a "keyword".
Runes are always digraphs of special ASCII characters.
The rune in the following code is
<tt>:~</tt>,
and the sequence it introduces
includes pre-comments, inter-comments and meta-comments.
</p>
<pre><tt>
:~ [3 7]
::
:: pre-comment 1
[20 (mug bod)]
::
:: pre-comment 2
[2 yax]
::
:: pre-comment 3
[2 qax]
::::
:: :: pre-comment 4
:: [4 qax]
::
:: pre-comment 5
[5 tay]
==
</tt></pre>
<p>
When inter-comments are empty, as they are in the above,
they are called "breathing comments", because they serve to separate,
or allow some "air" between, elements of a sequence.
For clarity,
the pre-comments in the above are further indicated:
all and only pre-comments contain the text "<tt>pre-comment</tt>".
</p>
<h3>Meta-comments</h3>
<p>
The above code also contains a third kind of comment -- meta-comments.
Meta-comments must occur at the far left margin -- at column 1.
These are called meta-comments, because they are allowed
to be outside the syntax structure.
One common use for meta-comments is "commenting out" other syntax.
In the above display, the meta-comments "comment out"
the comment labeled "<tt>pre-comment 4</tt>"
and its associated code.
</p>
<h3>Staircase comments</h3>
<p>Finally, there are "staircase comments", which are used
to indicate the larger structure of Hoon sequences and other
code.
For example,
<pre><tt>
:: ::
:::: 3e: AES encryption (XX removed) ::
:: ::
::
:: ::
:::: 3f: scrambling ::
:: ::
:: ob ::
::
</tt> </pre>
<p>
Each staircase consists of three parts.
In lexical order, these parts are
an upper riser,
a tread, and a lower riser.
The upper riser is a sequence of comments at the same
alignment as an inter-comment.
The tread is also at the inter-comment alignment,
but must be 4 colons ("<tt>::::</tt>") followed
by whitespace.
The lower riser is a sequence of comments
indented two spaces more than the tread.
</p>
<h2>Hoon comment conventions</h2>
<p>Hoon's basic syntax allows comments to be free-form.
In practice, there are strict conventions for these comments,
conventions we would like to enforce with <tt>hoonlint</tt>.
<ol>
<li>A multi-line comment may contain
an "inter-part", a "pre-part",
or both.
</li>
<li>If both an inter-part and a pre-part are present,
the inter-part must precede the pre-part.
</li>
<li>The inter-part is a non-empty sequence of inter-comments
and staircases.
</li>
<li>A pre-part is a non-empty sequence of pre-comments.
</li>
<li>Meta-comments may be inserted anywhere in either the pre-part
or the inter-part.
</li>
<li>Comments which do not obey the above rules are
<b>bad comments</b>.
A <b>good comment</b> is any comment which is not a bad comment.
</li>
<li>A comment is not regarded as a meta-comment
if it can be parsed as structural comment.
An <b>structural comment</b> is any good comment which is
not a meta-comment.
</li>
</ol>
<h2>Grammar</h2>
<p>We will implement these conventions using the BNF
of this section.
The sections to follow outline the strategy behind the BNF.
<pre><tt>
:start ::= gapComments
gapComments ::= OptExceptions Body
gapComments ::= OptExceptions
Body ::= InterPart PrePart
Body ::= InterPart
Body ::= PrePart
InterPart ::= InterComponent
InterPart ::= InterruptedInterComponents
InterPart ::= InterruptedInterComponents InterComponent
InterruptedInterComponents ::= InterruptedInterComponent+
InterruptedInterComponent ::= InterComponent Exceptions
InterComponent ::= Staircases
InterComponent ::= Staircases InterComments
InterComponent ::= InterComments
InterComments ::= InterComment+
Staircases ::= Staircase+
Staircase ::= UpperRisers Tread LowerRisers
UpperRisers ::= UpperRiser+
LowerRisers ::= LowerRiser+
PrePart ::= ProperPreComponent OptPreComponents
ProperPreComponent ::= PreComment
OptPreComponents ::= PreComponent*
PreComponent ::= ProperPreComponent
PreComponent ::= Exception
OptExceptions ::= Exception*
Exceptions ::= Exception+
Exception ::= MetaComment
Exception ::= BadComment
Exception ::= BlankLine
</tt></pre>
<h2>Technique: Combinator</h2>
Our comment linter is implemented as a combinator.
The main <tt>hoonlint</tt> parser invokes this combinator when it encounters
a multi-line comment.
Because of the main parser,
we do not have to worry about confusing comments with
Hoon's various string and in-line text syntaxes.
</p>
<p>Note that while combinator parsing is useful,
it is a technique that can be oversold.
Combinators have been much talked about in the functional programming
literature<a id="footnote-4-ref" href="#footnote-4">[4]</a>,
but the current flagship functional programming language compiler,
the Glasgow Haskell Compiler,
does not use combinators to parse its version of the Haskell --
instead it uses a parser in the yacc lineage.<a id="footnote-5-ref" href="#footnote-5">[5]</a>
As a parsing technique on its own,
the use of combinators is simply another way of packaging recursive
descent with backtracking,
and the two techniques share the same power,
the same performance,
and the same downsides.
</p>
<p>Marpa is much more powerful than either LALR (yacc-lineage) parsers or combinators,
so we can save combinator parsing for those cases where
combinator parsing really is helpful.
One such case is lexer mismatch.
</p>
<h3>Lexer mismatch</h3>
<p>The first programming languages, like BASIC and FORTRAN,
were line-structured -- designed to be parsed line-by-line.<a id="footnote-6-ref" href="#footnote-6">[6]</a>
After ALGOL, new languages were usually block-structured.
Blocks can start or end in the middle of a line,
and can span multiple lines.
And blocks are often nested.
</p>
<p>A line-structured language requires its lexer to think in
terms of lines,
but this approach is completely useless for a block-structured
language.
Combining both line-structured and block-structured logic in the same lexer
usually turns the lexer's code into a rat's nest.
</p>
<p>Calling a combinator every time
a line-structured block is encountered eliminates the problem.
The main lexer can assume that the code is block-structured,
and all the line-by-line logic can go into combinators.
</p>
<h2>Technique: Non-determinism</h2>
<p>
Our grammar is non-deterministic,
but unambiguous.
It is unambiguous because,
for every input,
it will produce no more than one parse.
</p>
<p>
It is non-deterministic because there is a case
where it tracks two possible parses at once.
The comment linter cannot immediately distinguish between
a prefix of the upper riser of a staircase,
and a prefix of a sequence of inter-comments.
When a tread and lower riser is encountered,
the parser knows it has found a staircase,
but not until then.
And if the parse is of an inter-comment sequence,
the comment linter will
not be sure of this until the end of the sequence.
</p>
<h2>Technique: Infinite lookahead</h2>
<p>
As just pointed out,
the comment linter does not know whether it is parsing a staircase or
an inter-comment sequence until either
<ul>
<li>it finds a tread and lower riser, in which case
it knows the correct parse will be a staircase; or
</li>
<li>it successfully reaches the end of the inter-comment sequence,
in which case it knows the correct parse is an inter-comment sequence.
</ul>
To determine which of these two choices is the correct parse,
the linter needs to read
an arbitrarily long sequence of tokens --
in other words, the linter needs to perform infinite lookahead.
</p>
<p>Humans deal with infinite lookaheads all the time --
natural languages are full of situations that require them.<a id="footnote-7-ref" href="#footnote-7">[7]</a>
Modern language designers labor to avoid the need
for infinite lookahead,
but even so
cases where it is desirable pop up.<a id="footnote-8-ref" href="#footnote-8">[8]</a>
</p>
<p>
Fortunately, in 1991, Joop Leo published a method that
allows computers to emulate infinite lookahead efficiently.
Marpa uses Joop's technique.
Joop's algorithm is complex,
but the basic idea is to do what humans do in the same circumstance --
keep all the possibilities in mind until the evidence comes in.
</p>
<p>
</p>
<h2>Technique: the Ruby Slippers</h2>
<p>Recall that, according to our conventions,
our parser does not recognize a meta-comment unless
no structural comment can be recognized.
We could implement this in BNF,
but it is much more elegant to use the Ruby Slippers.<a id="footnote-9-ref" href="#footnote-9">[9]</a>
</p>
<p>As those already familiar with Marpa may recall,
the Ruby Slippers are invoked when a Marpa parser finds itself
unable to proceed with its current set of input tokens.
At this point, the lexer can ask the Marpa parser what token it <b>does</b> want.
Once the lexer is told what the "wished-for" token is,
it can concoct one, out of nowhere if necessary, and pass it to the Marpa parser,
which then proceeds happily.
In effect, the lexer acts like Glenda the Good Witch of Oz,
while the Marpa parser plays the role of Dorothy.
</p>
<p>In our implementation, the Marpa parser, by default,
looks only for structural comments.
If the Marpa parser of our
comment linter finds that the current input line is not
a structural comment,
the Marpa parser halts
and tells the lexer that there is a problem.
The lexer then asks the Marpa parser what it is looking for.
In this case, the answer will always be the same:
the Marpa parser will be looking for a meta-comment.
The lexer checks to see if the current line is a comment
starting at column 1.
If there is a comment starting at column 1,
the lexer tells the Marpa parser that its wish has come true --
there is a meta-comment.
</p>
<p>Another way to view the Ruby Slippers is as a kind of exception
mechanism for grammars.
In this application, we treat inability to read an structural
comment as an exception.
When the exception occurs,
if possible, we read a meta-comment.
</p>
<h2>Technique: Error Tokens</h2>
<p><b>Error tokens</b> are a specialized use of the Ruby Slippers.
The application for this parser is "linting" --
checking that the comments follow conventions.
As such, the main product of the parser is not the parse --
it is the list of errors gathered along the way.
So stopping the parser at the first error does not make sense.
</p>
<p>
What is desirable is to treat all inputs as valid,
so that the parsing always runs to the end of input,
in the process producing a list of the errors.
To do this, we want to set up the parser so that it reads
special "error tokens" whenever it encounters a reportable error.
</p>
<p>This is perfect for the Ruby Slippers.
If an "exception" occurs,
as above described for meta-comments,
but no meta-comment is available,
we treat it as a second level exception.
</p>
<p>When would no meta-comment be available?
There are two cases:
<ul><li>The line read is a comment,
but it does not start at column 1.
</li>
<li>The line read is a blank line (all whitespace).
</li>
</ul>
<p>On the second exception level, the current line
will be read as either a <tt><BlankLine></tt>,
or a <tt><BadComment></tt>.
We know that every line must lex as either a
<tt><BlankLine></tt>
or a <tt><BadComment></tt> because our comment linter
is called as a combinator,
and the parent Marpa parser guarantees this.
</p>
<h2>Technique: Ambiguity</h2>
<p>Marpa allows ambiguity,
which could have been exploited as a technique.
For example, in a simpler BNF than that we used above,
it might be ambiguous whether a meta-comment belongs to an <tt><InterPart></tt>
which immediately precedes it;
or to a <tt><PrePart></tt> which immediately follows it.
We could solve the dilemma by noting that it does not matter:
All we care about is spotting bad comments and blank lines,
so that picking one of two ambiguous parses at random will work fine.
</p>
<p>
But efficiency issues are sometimes a problem with ambiguity
and unambiguity can be a good way of avoiding them.<a id="footnote-10-ref" href="#footnote-10">[10]</a>
Also, requiring the grammar to be unambiguous allows
an additional check that is useful in the development phase.
In our code we test each parse for ambiguity.
If we find one, we know that <tt>hoonlint</tt> has a coding error.
</p>
<p>
Keeping the parser unambiguous makes the BNF
less elegant than it could be.
To avoid ambiguity,
we introduced extra symbols;
introduced extra rules;
and restricted the use of ambiguous tokens.
</p>
<p>Recall that I am using the term "ambiguous" in the strict technical
sense that it has in parsing theory, so that a parser is only ambiguous
if it can produce two valid parses for one string.
An unambiguous parser
can allow non-deterministism and
can have ambiguous tokens.
In fact, our example grammar does both of these things,
but is nonetheless unambiguous.
</p>
<h3>Extra symbols</h3>
One example of an extra symbol introduced to make this parser
unambiguous is <tt><ProperPreComment></tt>.
<tt><ProperPreComment></tt>
is used to ensure that a
<tt><PrePart></tt>
never begins with a meta-comment.<a id="footnote-11-ref" href="#footnote-11">[11]</a>
</p>
<p>The BNF requires that the first line of a
<tt><PrePart></tt>
must be a
<tt><ProperPreComment></tt>.
This means that, if a
<tt><MetaComment></tt> is found
at the boundary between an
<tt><InterPart></tt>
and a
<tt><PrePart></tt>,
it cannot be the first line of the
<tt><PrePart></tt>
and so must be the last line of the
<tt><InterPart></tt>.
</p>
</p>
<h3>Extra rules</h3>
<p>In our informal explanation of the comment conventions,
we stated that an inter-part is a sequence, each element of
which is an inter-comment or a staircase.
While BNF that directly implemented this rule would be correct,
it would also be highly ambiguous:
If an inter-comment occurs before a tread or an upper riser line,
it could also be parsed as part of the upper riser.
</p>
<p>To eliminate the ambiguity,
we stipulate that if comment <b>can</b> be parsed as part of a staircase,
then it <b>must</b> be parsed as part of a staircase.
This stipulation still leaves the grammar non-deterministic --
we may not know if our comment could be part of a staircase until
many lines later.
</p>
<p>With our stipulation we know that, if an
<tt><InterComponent></tt>
contains
a staircase, then that staircase must come before any of the inter-comments.
In an <tt><InterComponent></tt>
both staircases and inter-comments are optional, so the
unambiguous representation of
<tt><InterComponent></tt>
is
<pre><tt>
InterComponent ::= Staircases
InterComponent ::= Staircases InterComments
InterComponent ::= InterComments
</tt></pre>
Notice that, although
both staircases and inter-comments are optional,
we do not include the case where both are omitted.
This is because we insist that an
<tt><InterComponent></tt>
contain at least one line.
</p>
<h3>Ambiguous tokens</h3>
<p>Our parser is not ambiguous, but
it <b>does</b> allow ambiguous tokens.
For example, a comment with inter-comment alignment
could be either an
<tt><InterComment></tt>
or an
<tt><UpperRiser></tt>;
and our lexer returns both.
The parser remains unambiguous, however, because
only one of these two tokens will wind up in the
final parse.
</p>
<p>Call the set of tokens returned
by our parser for a single line,
a "token set".
If the token set contains more than one token,
the tokenization is ambiguous for that line.
If the token set contains only one token,
the token set is called a "singleton",
and tokenization is unambiguous for that line.
</p>
<p>
To keep
this parser unambiguous, we restrict the
ambiguity at the lexer level.
For example,
our lexer is set up so
that a meta-comment is never one of the alternatives
in a lexical ambiguity.
If a token set contains a
<tt><MetaComment></tt>,
that token set must be a singleton.
The Ruby Slippers are used to enforce this.<a id="footnote-12-ref" href="#footnote-12">[12]</a>
Similarly, the Ruby Slippers are used to guarantee that
any set of tokens containing either
a <tt><BadComment></tt>
or a
<tt><BlankLine></tt> is a singleton.
</p>
<h2>Code</h2>
<p>This post did not walk the reader through the code.
Instead, we talked in terms of strategy.
<a href="https://github.com/jeffreykegler/Ocean-of-Awareness-blog/tree/gh-pages/code/vgap">
The code is available on Github</a>
in unit test form.
For those who want to see the comment-linter combinator in a context,
a version of the code embedded in <tt>hoonlint</tt>
in also on Github.<a id="footnote-13-ref" href="#footnote-13">[13]</a>
<h2>Comments on this blog post, etc.</h2>
<p>
To learn about Marpa,
my Earley/Leo-based parser,
there is the
<a href="http://savage.net.au/Marpa.html">semi-official web site, maintained by Ron Savage</a>.
The official, but more limited, Marpa website
<a href="http://jeffreykegler.github.io/Marpa-web-site/">is my personal one</a>.
Comments on this post can be made in
<a href="http://groups.google.com/group/marpa-parser">
Marpa's Google group</a>,
or on our IRC channel: #marpa at freenode.net.
</p>
<h2>Footnotes</h2>
<p id="footnote-1"><b>1.</b>
In their present form, urbits run on top of Unix and UDP.
<a href="#footnote-1-ref">↩</a></p>
<p id="footnote-2"><b>2.</b>
Garbage collection and arbitrary precision may seem too high-level
for something considered a "machine language",
but our concepts evolve.
The earliest machine languages required programmers to
write their own memory caching logic
and to create their own floating
point representations,
both things we now regard as much too low-level
to deal with even at the lowest software level.
<a href="#footnote-2-ref">↩</a></p>
<p id="footnote-3"><b>3.</b>
This post attempts to follow standard Hoon terminology, but
for some details of Hoon's whitespace conventions,
there is no settled terminology,
and I have invented terms as necessary.
The term "pre-comment" is one of those inventions.
<a href="#footnote-3-ref">↩</a></p>
<p id="footnote-4"><b>4.</b>
For a brief survey of this literature,
see the entries from 1990 to 1996
in my <a href="https://jeffreykegler.github.io/personal/timeline_v3">
"timeline" of parsing history</a>.
<a href="#footnote-4-ref">↩</a></p>
<p id="footnote-5"><b>5.</b>
<a
href="https://github.com/ghc/ghc/blob/master/compiler/parser/Parser.y">This
is the LALR grammar for GHC</a>, from GHC's Github mirror.
<a href="#footnote-5-ref">↩</a></p>
<p id="footnote-6"><b>6.</b>
This is simplified.
There were provisions for line continuation, etc.
But, nonetheless, the lexers for these languages worked in
terms of lines, and had no true concept of a "block".
<a href="#footnote-6-ref">↩</a></p>
<p id="footnote-7"><b>7.</b>
An example of a requirement for infinite lookahead
is the sentence "The horse raced past the barn fell".
Yes, this sentence is not, in fact, infinitely long,
but the subclause "raced past the barn" could be anything,
and therefore could be arbitrarily long.
In isolation, this example sentence may seem unnatural,
a contrived "garden path".
But if you imagine the sentence as an answer to the question, "Which horse fell?",
expectations are set so that the sentence is quite reasonable.
<a href="#footnote-7-ref">↩</a></p>
<p id="footnote-8"><b>8.</b>
See my blog post <a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2018/08/rntz.html">
"A Haskell challenge"</a>.
<a href="#footnote-8-ref">↩</a></p>
<p id="footnote-9"><b>9.</b>
To find out more about Ruby Slippers parsing see the Marpa FAQ,
<a href="http://savage.net.au/Perl-modules/html/marpa.faq/faq.html#q122">
questions 122</a>
and
<a href="http://savage.net.au/Perl-modules/html/marpa.faq/faq.html#q123">
123</a>;
my
<a href="file:///mnt2/new/projects/Ocean-of-Awareness-blog/metapages/annotated.html#PARSE_HTML">
blog series on parsing HTML</a>;
my recent blog post
<a
href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2018/05/combinator2.html">
"Marpa and combinator parsing 2"</a>;
and my much older blog post
<a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2011/11/marpa-and-the-ruby-slippers.html">
"Marpa and the Ruby Slippers"</a>.
<a href="#footnote-9-ref">↩</a></p>
<p id="footnote-10"><b>10.</b>
This, by the way,
is where I believe parsing theory went wrong,
beginning in the 1960's.
In an understandable search for efficiency,
mainstream parsing theory totally excluded not just ambiguity,
but non-determinism as well.
These draconian restrictions limited the
search for practical parsers to a subset of techniques
so weak that they cannot
even duplicate human parsing capabilities.
This had the bizarre effect of committing
parsing theory to a form of
"human exceptionalism" --
a belief that human beings have a special ability to
parse that computers cannot emulate.
For more on this story,
see my <a href="https://jeffreykegler.github.io/personal/timeline_v3">
"timeline" of parsing history</a>.
<a href="#footnote-10-ref">↩</a></p>
<p id="footnote-11"><b>11.</b>
This example illustrates the efficiency considerations
involved in the decision to tolerate,
or to exclude,
efficiency.
If <tt>n</tt> meta-comments occur between a
<tt><InterPart></tt>
and a <tt><PrePart></tt>,
the dividing line is arbitrary,
so that there are <tt>n+1</tt> parses.
This will, in theory, make the processing time quadratic.
And, in fact, long sequences of meta-comments might occur
between the inter- and pre-comments,
so the inefficiency might be real.
<a href="#footnote-11-ref">↩</a></p>
<p id="footnote-12"><b>12.</b>
Inter-comments and
comments that are part of upper risers may start at column 1,
so that, without special precautions in the lexer,
an ambiguity between a structural comment
and a meta-comment is entirely
possible.
<a href="#footnote-12-ref">↩</a></p>
<p id="footnote-13"><b>13.</b>
For the <tt>hoonlint</tt>-embedded form,
the Marpa grammar is
<a href="https://github.com/jeffreykegler/yahc/blob/714157124b46492e13968c786e400276017a3b85/Lint/Policy/Test/Whitespace.pm#L19">
here</a>
and the code is
<a href="https://github.com/jeffreykegler/yahc/blob/714157124b46492e13968c786e400276017a3b85/Lint/Policy/Test/Whitespace.pm#L341">
here</a>.
These are snapshots -- permalinks.
The application is under development,
and probably will change considerably.
Documentation is absent
and testing is minimal,
so that this pre-alpha embedded form of the code will mainly be useful
for those who want to take a quick glance at the
comment linter in context.
<a href="#footnote-13-ref">↩</a></p>
</body>
</html>
<br />
<p>posted at: 13:03 |
<a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2019/06/vgap.html">direct link to this entry</a>
</p>
<div style="color:#38B0C0;padding:1px;text-align:center;">
§
§
§
</div>
<h3>Sun, 31 Mar 2019</h3>
<br />
<center><a name="methodology"> <h2>Sherlock Holmes and the Case of the Missing Parsing Solution</h2> </a>
</center>
<html>
<head>
</head>
<body>
<!--
marpa_r2_html_fmt --no-added-tag-comment --no-ws-ok-after-start-tag
-->
<blockquote>Always approach a case with an absolutely blank mind.
It is always an advantage.
Form no theories, just simply observe and draw inferences from your observations.
—
Sherlock Holmes, quoted in "The Adventure of the Cardboard Box".
</blockquote>
<blockquote>It is a capital mistake to theorize before one has data.
—
Holmes, in "A Scandal in Bohemia".
</blockquote>
<blockquote>I make a point of never having any prejudices, and of following docilely wherever fact may lead me.
—
Holmes, in "The Reigate Puzzle".
</blockquote>
<blockquote>When you have eliminated the impossible, whatever remains, no matter how improbable, must be the truth.
—
Holmes, in "The Sign of Four".
</blockquote>
<blockquote>
In imagination there
exists the perfect
mystery story.
Such a story presents
the essential clues, and compels us to form our own
theory of the case.
If we
follow the plot carefully, we arrive at the complete
solution for ourselves just before the author's disclosure
at the end of the book. The solution itself, contrary to
those of inferior mysteries, does not disappoint us;
moreover, it appears at the very moment we expect it.
Can we liken the reader of such a book to the scientists,
who throughout successive generations continue to seek
solutions of the mysteries in the book of nature? The
comparison is false and will have to be abandoned later,
but it has a modicum of justification which may be
extended and modified to make it more appropriate to
the endeavour of science to solve the mystery of the
universe.
—
Albert Einstein
and Leopold Infeld. <a id="footnote-1-ref" href="#footnote-1">[1]</a>
</blockquote>
<h2>The Sherlock Holmes approach</h2>
<p>My
<a href="https://jeffreykegler.github.io/personal/timeline_v3">
timeline history of parsing theory</a>
is my most popular writing, but
it is not without its critics.
Many of them accuse the timeline of lack of objectivity or of bias.
</p>
<p>
Einstein assumed his reader's idea of methods of proper investigation,
in science as elsewhere,
would be similar to those Conan Doyle's Sherlock Holmes.
I will follow Einstein's lead in starting there.
</p>
<p>
The deductions recorded in the Holmes' canon
often involve
<b>a lot</b>
of theorizing.
To make it a matter of significance what the dogs in "Silver Blaze" did in the night,
Holmes needs a theory of canine behavior,
and Holmes' theory sometimes outpaces its pack of facts by a considerable distance.
Is it really true that only dangerous people own
dangerous dogs?<a id="footnote-2-ref" href="#footnote-2">[2]</a>
</p>
<p>
Holmes's methods, at least as stated in the Conan Doyle stories,
are incapable of solving anything
but the fictional problems he encounters.
In real life, a "blank mind" can observe nothing.
There is no "data" without theory, just white noise.
Every "fact" gathered relies on many prejudgements about what is
relevant and what is not.
And you certainly cannot characterize anything as "impossible",
unless you have, in advance, a theory about what is possible.
</p>
<h2>The Einstein approach</h2>
<p>Einstein, in his popular account
of the evolution of physics,
finds the Doyle stories "admirable"<a id="footnote-3-ref" href="#footnote-3">[3]</a>.
But to solve real-life mysteries, more is needed.
Einstein begins his description of his methods at the start
of his Chapter II:
</p><blockquote>
The following pages contain a dull report of
some very simple experiments.
The account will be boring
not only because the description of experiments is uninteresting
in comparison with their actual performance,
but also because the meaning of the experiments does
not become apparent until theory makes it so. Our
purpose is to furnish a striking example of the role of
theory in physics.
<a id="footnote-4-ref" href="#footnote-4">[4]</a>
</blockquote>
<p>Einstein follows with a series of the kind of experiments
that are performed in high school physics classes.
One might imagine these experiments allowing an observer
to deduce the basics of electromagnetism
using materials and techniques available for centuries.
</p>
<p>But, and this is Einstein's point,
this is not how it happened.
The theory came
<b>first</b>,
and the experiments were devised afterwards.
</p>
<blockquote>
In the first pages
of our book we compared the role
of an investigator
to that of a detective who, after
gathering the requisite facts, finds the right solution
by pure thinking. In one essential this comparison must
be regarded as highly superficial. Both in life and in
detective novels the crime is given. The detective must
look for letters, fingerprints, bullets, guns, but at least
he knows that a murder has been committed. This is
not so for a scientist. It should not be difficult to
imagine someone who knows absolutely nothing about
electricity, since all the ancients lived happily enough
without any knowledge of it. Let this man be given
metal, gold foil, bottles, hard-rubber rod, flannel, in
short, all the material required for performing our
three experiments. He may be a very
cultured person,
but he will probably put wine into the bottles, use the
flannel for cleaning, and never once entertain the idea
of doing the things we have described.
For the detective
the crime is given, the problem formulated: who
killed Cock Robin?
The scientist must, at least in part,
commit his own crime, as well as carry out the investigation.
Moreover, his task is not to explain just one
case, but all phenomena which have happened
or may
still happen. — Einstein and Infeld <a id="footnote-5-ref" href="#footnote-5">[5]</a>
</blockquote>
<h2>Commiting our own crime</h2>
<p>If then,
we must commit the crime of theorizing before the facts,
where does out theory come from?
</p>
<blockquote>
Science is not just a collection of laws,
a catalogue of unrelated facts.
It is a creation of the human mind,
with its freely invented ideas and concepts.
Physical theories try to form a picture of reality
and to establish its connection
with the wide world of sense impressions.
Thus the only justification for our mental structures
is whether and in what way our theories form such
a link. — Einstein and Infeld <a id="footnote-6-ref" href="#footnote-6">[6]</a>
</blockquote>
<blockquote>
In the case of planets moving around the sun
it is found that the system of mechanics works
splendidly.
Nevertheless we can well imagine that another system,
based on different assumptions,
might work just as well.
<br>
Physical concepts are free creations
of the human mind, and are not,