<html>
<head>
<link rel="alternate" title="Ocean of Awareness RSS" type="application/rss+xml" title="RSS" href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/index.rss" />
<title>Ocean of Awareness</title>
<style type="text/css">
strong {font-weight: 700;}
</style>
</head>
<body>
<div
style="color:white;background-color:#38B0C0;padding:1em;clear:left;text-align:center;">
<h1>Ocean of Awareness</h1>
</div>
<div style="margin:0;padding:10px 30px 10px 10px;width:150px;float:left;border-right:2px solid #38B0C0">
<p>
<strong>Jeffrey Kegler's blog</strong>
about Marpa, his new parsing algorithm,
and other topics of interest</p>
<p><a href="http://www.jeffreykegler.com/">Jeffrey's personal website</a></p>
<p>
<a href="https://twitter.com/jeffreykegler" class="twitter-follow-button" data-show-count="false">Follow @jeffreykegler</a>
</p>
<p style="text-align:center">
<!-- Place this code where you want the badge to render. -->
<a href="//plus.google.com/101567692867247957860?prsrc=3" rel="publisher" style="text-decoration:none;">
<img src="//ssl.gstatic.com/images/icons/gplus-32.png" alt="Google+" style="border:0;width:32px;height:32px;"/></a>
</p>
<h3>Marpa resources</h3>
<p><a href="http://jeffreykegler.github.com/Marpa-web-site/">The Marpa website</a></p>
<p>The Ocean of Awareness blog: <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog">home page</a>,
<a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/metapages/chronological.html">chronological index</a>,
and
<a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/metapages/annotated.html">annotated index</a>.
</p>
</div>
<div style="margin-left:190px;border-left:2px solid #38B0C0;padding:25px;">
<h3>Sun, 20 Dec 2015</h3>
<br />
<center><a name="topdown"> <h2>Top-down parsing is guessing</h2> </a>
</center>
<html>
<head>
</head>
<body><p>
<!--
marpa_r2_html_fmt --no-added-tag-comment --no-ws-ok-after-start-tag
-->
</p>
<p>
Top-down parsing is guessing. Literally.
Bottom-up parsing is looking.
</p>
<p>The way you'll often hear that phrased is that top-down parsing is
looking, starting at the top,
and bottom-up parsing is looking, starting at the bottom.
But that is misleading, because the input is at the bottom --
at the top there is nothing to look at.
A usable top-down parser
<b>must</b>
have a bottom-up component,
even if that component is just lookahead.
</p>
<p>A more generous, but still accurate, way to describe the top-down
component of parsers is "prediction".
And prediction is, indeed, a very useful component of a parser,
when used in combination with other techniques.
</p>
<p>Of course, if a parser does nothing but predict,
it can predict only one input.
Top-down parsing must always be combined with a bottom-up
component.
This bottom-up component may be as modest as lookahead, but it
<b>must</b>
be there or else top-down parsing is really not parsing at all.
</p>
<h2>So why is top-down parsing used so much?</h2>
<p>Top-down parsing may be unusable in its pure form,
but from one point of view that is irrelevant.
Top-down parsing's biggest advantage is that it is highly flexible --
there's no reason to stick to its "pure" form.
</p>
<p>A top-down parser can be written as a series of subroutine calls --
a technique called recursive descent.
Recursive descent
allows you to hook in custom-written bottom-up logic at every
top-down choice point,
and it is a technique which is
completely understandable to programmers with little or no training
in parsing theory.
When dealing with recursive descent parsers,
it is more useful to be a seasoned, far-thinking programmer
than it is to be a mathematician.
This makes recursive descent very appealing to
seasoned, far-thinking programmers,
and they are the audience that counts.
</p>
<h2>Switching techniques</h2>
<p>You can even use the flexibility of top-down to switch
away from top-down parsing.
For example, you could claim that a top-down parser could do anything my
own parser
(<a href="http://jeffreykegler.github.io/Marpa-web-site/">Marpa</a>)
could do, because a recursive descent parser can call
a Marpa parser.
</p>
<p>
A less dramatic switchoff, and one that still leaves the parser with a good claim to be
basically top-down, is very common.
Arithmetic expressions are essential for a computer language.
But they are also
among the many things
top-down parsing cannot handle, even with ordinary lookahead.
Even so, most computer languages these days are parsed top-down --
by recursive descent.
These recursive descent parsers deal with expressions
by temporarily handing control over to a bottom-up operator
precedence parser.
Neither of these parsers is
extremely smart about the hand-over and hand-back
-- it is up to the programmer to make sure the two play together nicely.
But used with caution, this approach works.
</p>
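<p>
To make the hand-over concrete, here is a minimal sketch of the technique
(my own illustration, not code from this post; the token format and all names
are made up for the example).
A recursive descent routine parses the statement level,
and delegates expressions to a bottom-up precedence-climbing routine:
</p><blockquote><pre>
use strict;
use warnings;

# Hypothetical token stream for the input "x = 2 + 3 * 4 ;"
my @tokens = ( ['ID','x'], ['=','='], ['NUM',2], ['+','+'],
               ['NUM',3], ['*','*'], ['NUM',4], [';',';'] );
my %prec = ( '+' => 1, '-' => 1, '*' => 2, '/' => 2 );

sub peek       { return $tokens[0] }
sub next_token { return shift @tokens }

# Top-down part: one hand-written subroutine per rule.
sub parse_assignment {
    my $name = next_token();           # ID
    next_token();                      # '='
    my $value = parse_expression(0);   # hand control to the bottom-up part
    next_token();                      # ';'
    return [ 'assign', $name->[1], $value ];
}

# Bottom-up part: operator precedence, by "precedence climbing".
sub parse_expression {
    my ($min_prec) = @_;
    my $left = next_token()->[1];      # a NUM, in this toy grammar
    while ( defined peek()
            and exists $prec{ peek()->[0] }
            and $prec{ peek()->[0] } >= $min_prec )
    {
        my $op    = next_token()->[0];
        my $right = parse_expression( $prec{$op} + 1 );
        $left = [ $op, $left, $right ];
    }
    return $left;
}

my $tree = parse_assignment();
# $tree is [ 'assign', 'x', [ '+', 2, [ '*', 3, 4 ] ] ]
</pre></blockquote>
<p>
The point of the sketch is the single call in the middle:
the top-down parser decides
<b>where</b>
an expression starts,
and the operator precedence parser decides
<b>what</b>
it contains.
Keeping that boundary clean is the part left to the programmer.
</p>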
<h2>Top-down parsing and language-oriented programming</h2>
<p>But what about taking top-down methods into the future of language-oriented programming,
extensible languages, and grammars which write grammars?
Here we are forced to confront the reality -- that the effectiveness of
top-down parsing comes entirely from the foreign elements that are added to it.
Starting from a basis of top-down parsing is literally starting
with nothing.
As I have shown in more detail
<a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2015/12/composable.html">elsewhere</a>,
top-down techniques simply do not have enough horsepower to deal with grammar-driven programming.
</p>
<p>Perl 6 grammars are top-down -- PEG with lots of extensions.
These extensions
include backtracking, backtracking control,
a new style of tie-breaking and lots of opportunity
for the programmer to intervene and customize everything.
But behind it all is a top-down parse engine.
</p>
<p>One aspect of Perl 6 grammars might be seen as breaking
out of the top-down trap.
That trick of switching over to a
bottom-up operator precedence parser for expressions,
which I mentioned above,
is built into Perl 6 and semi-automated.
(I say semi-automated because making sure the two parsers "play nice"
with each other is not automated -- that's still up to the programmer.)
</p>
<p>As far as I know, this semi-automation of expression handling
is new with Perl 6 grammars, and it
may prove handy for duplicating what is done
in recursive descent parsers.
But it adds no new technique to those already in use.
And features
like
</p><ul>
<li>multiple types of expression, which can be told apart
based on their context,</li>
<li><i>n</i>-ary expressions for arbitrary
<i>n</i>, and</li>
<li>the autogeneration of multiple rules, each allowing
a different precedence scheme,
for expressions of arbitrary arity and associativity,
</li>
</ul>
<p>
all of which are available and in current use in
<a href="http://jeffreykegler.github.io/Marpa-web-site/">Marpa</a>,
are impossible for the technology behind Perl 6 grammars.
</p>
<p>I am a fan of the Perl 6 effort.
Obviously, I have doubts about one specific set of hopes for Perl 6 grammars.
But these hopes have not been central to the Perl 6 effort,
and I will be an eager student of the Perl 6 team's work over the coming months.
</p>
<h2>Comments</h2>
<p>
To learn more about Marpa,
there's
<a href="http://savage.net.au/Marpa.html">the
official web site maintained by Ron Savage</a>.
I also have
<a href="http://jeffreykegler.github.io/Marpa-web-site/">a Marpa web site</a>.
Comments on this post can be made in
<a href="http://groups.google.com/group/marpa-parser">
Marpa's Google group</a>,
or on our IRC channel: #marpa at freenode.net.
</p>
</body>
</html>
<br />
<p>posted at: 15:21 |
<a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2015/12/topdown.html">direct link to this entry</a>
</p>
<div style="color:#38B0C0;padding:1px;text-align:center;">
§
§
§
</div>
<h3>Sun, 13 Dec 2015</h3>
<br />
<center><a name="composable"> <h2>Grammar reuse</h2> </a>
</center>
<html>
<head>
</head>
<body><p>
<!--
marpa_r2_html_fmt --no-added-tag-comment --no-ws-ok-after-start-tag
-->
</p>
<p>Every year the Perl 6 community creates an "Advent" series of posts.
I always follow these, but
<a href="https://perl6advent.wordpress.com/2015/12/08/day-8-grammars-generating-grammars/">one
in particular</a>
caught my attention this year.
It presents a vision of a future where programming is language-driven.
A vision that I share.
The post went on to encourage its readers to follow up on this vision,
and suggested an approach.
But I do not think the particular approach suggested would be fruitful.
In this post I'll explain why.
</p>
<h2>Reuse</h2>
<p>The focus of the Advent post was language-driven programming,
and that is the aspect that excites me most.
But the points that I wish to make are more easily understood if
I root them in a narrower, but more familiar issue
-- grammar reuse.
</p>
<p>
Most programmers will be very familiar with grammar reuse from regular expressions.
In the regular expression ("RE") world,
programming by cutting and pasting
is very practical and often practiced.
</p>
<p>
For this post I will consider grammar reusability to be the ability
to join two grammars and create a third.
This is also sometimes called grammar composition.
For this purpose, I will widen the term "grammar" to include
RE's and PEG parser specifications.
Ideally, when you compose two grammars,
what you get is
</p><ul>
<li>a language you can reasonably predict, and
</li>
<li>if each of the two original grammars can be parsed in reasonable time,
a language that can be parsed in reasonable time.
</li>
</ul>
<p>
Not all language representations are reusable.
RE's are, and BNF is.
PEG looks like a combination of BNF and RE's,
but PEG, in fact, is its own very special form of parser specification.
And PEG parser specifications are
one of the least reusable language representations ever invented.
</p>
<h2>Reuse and regular expressions</h2>
<p>RE's are as well-behaved under
reuse as a language representation can get.
The combination of two RE's is always another RE,
and you can reasonably determine what language the combined RE recognizes by
examining it.
Further, every RE is parseable in linear time.
</p>
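<p>
As a small illustration (mine, not from this post),
composing RE's in Perl is just interpolation,
and the result is itself an RE whose language you can
predict by reading it:
</p><blockquote><pre>
use strict;
use warnings;

# Two small RE's, written independently (illustrative only)
my $name_re   = qr/[A-Za-z_]\w*/;
my $number_re = qr/\d+(?:[.]\d+)?/;

# Their composition is just another RE
my $assignment_re = qr/^\s*$name_re\s*=\s*$number_re\s*$/;

print "match\n" if 'answer = 42' =~ $assignment_re;
</pre></blockquote>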
<p>The one downside, often mentioned by critics, is that RE's
do not scale in terms of readability.
Here, however, the problem is not really one of reusability.
The problem is that RE's are quite limited in their capabilities,
and programmers often exploit the excellent behavior of RE's under reuse
to push them into applications for which RE's just do not have the power.
</p>
<h2>Reuse and PEG</h2>
<p>When programmers first look at PEG syntax, they often think they've encountered
paradise. They see both BNF and RE's, and imagine they'll have the
best of each.
But the convenient behavior of
RE's depends on their unambiguity.
You simply cannot write
an ambiguous RE -- it's impossible.
</p>
<p>
More powerful and more flexible, BNF allows you to describe many more grammars --
including ambiguous ones.
How does PEG resolve this? With a Gordian knot approach.
Whenever it encounters an ambiguity, it throws all but one of the choices away.
The author of the PEG specification gets some control over what is thrown away --
he specifies an order of preference for the choices.
But the degree of control is less than it seems,
and in practice PEG is
the nitroglycerin of parsing -- marvelous when it works,
but tricky and dangerous.
</p>
<p>
Consider these 3 PEG specifications:
</p><blockquote><pre>
("a"|"aa")"a"
("aa"|"a")"a"
A = "a"A"a"/"aa"</pre></blockquote>
<p>
All three clearly accept only strings which are repetitions
of the letter "<tt>a</tt>".
But which strings?
For the answers,
suggestions for dealing with PEG if you are committed to it,
and more,
look at
<a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2015/03/peg.html">
my previous post on PEG</a>.
</p>
<p>
When getting an RE or a BNF grammar to work,
you can go back to the grammar and ask
yourself "Does my grammar look like my intended language?".
With PEG, this is not really possible.
With practice, you might get used to figuring out single line PEG specs
like the first two above.
(If you can get the last one, you're amazing.)
But tracing these through the multiple rule layers required by useful grammars is,
in practice, not really possible.
</p>
<p>
In real life,
PEG specifications are written by hacking them until the test suite works.
And, once you get a PEG specification to pass the test suite for a practical-sized grammar,
you are very happy to leave it alone.
Trying to compose two PEG specifications is rolling the dice with the odds against you.
</p>
<h2>Reuse and the native Perl 6 parser</h2>
<p>
The native Perl 6 parser is an extended PEG parser.
The extensions are very interesting from the PEG point of view.
The PEG "tie breaking" has been changed,
and backtracking can be used.
These features mean the Perl 6 parser can be extended to languages
well beyond what
ordinary PEG parsers can handle.
But, if you use the extra features, reuse will be even trickier than
if you stuck with vanilla PEG.
</p>
<h2>Reuse and general BNF parsing</h2>
<p>
As mentioned, general BNF is reusable, and so general BNF parsers like
<a href="http://savage.net.au/Marpa.html">Marpa</a>
are as reusable as regular expressions, with two caveats.
First, if the two grammars are not doing their own lexing, their lexers will have
to be compatible.
</p>
<p>Second,
with regular expressions you had the advantage that
<b>every</b>
regular expression parses in linear time, so that speed was guaranteed to be acceptable.
Marpa users reuse grammars and pieces of grammars all the time.
The result is always the language specified by the merged BNF,
and I've never heard anyone complain that performance deteriorated.
</p><p>
But, while it may not happen often,
it is possible to combine two Marpa grammars that run in linear time
and end up with one that does not.
You can guarantee your merged Marpa grammar will stay linear if you follow two rules:
</p><ul>
<li>keep the grammar unambiguous;</li>
<li>don't use an unmarked middle recursion.</li>
</ul>
<p>
Unmarked middle recursions are not things you're likely to need a lot: they
are those palindromes where you have to count to find the middle:
grammars like "<tt>A ::= a | a A a</tt>".
If you use a middle recursion at all, it is almost certainly going to
be marked, like "<tt>A ::= b | a A a</tt>",
which generates strings like "<tt>aabaa</tt>".
With Marpa, as with RE's, reuse is easy and practical.
And, as I hope to show in a future post, unlike RE's,
Marpa opens the road to language-driven programming.
</p>
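<p>
As a sketch of what reuse looks like in practice
(my own example, using Marpa::R2's Scanless interface;
the two grammar fragments and all names are made up for illustration),
composition can be as simple as concatenating the BNF sources:
</p><blockquote><pre>
use strict;
use warnings;
use Marpa::R2;

# Two BNF fragments, written separately
my $expression_bnf = q{
    expression ::= number op number
    op         ~ [-+*/]
    number     ~ [\d]+
};
my $assignment_bnf = q{
    assignment ::= name '=' expression
    name       ~ [a-zA-Z]+
};

# Composition: concatenate the BNF and pick a start symbol
my $dsl = q{
    :default ::= action => ::array
    :start ::= assignment
    :discard ~ whitespace
    whitespace ~ [\s]+
} . $assignment_bnf . $expression_bnf;

my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
my $input   = 'total = 2 + 40';
$recce->read( \$input );
my $value = ${ $recce->value() };   # nested arrays for the merged language
</pre></blockquote>
<p>
The resulting language is exactly the one the merged BNF describes,
which is the point of the exercise.
</p>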
<h2>Perl 6</h2>
<p>I'm a fan of the Perl 6 effort.
I certainly
<b>should</b>
be a supporter, after the many favors they've done for me
and the Marpa community over the years.
The considerations of this post
will disappoint some of the
hopes for applications of the native Perl 6 parser.
But these applications have not been central to the Perl 6 effort,
of which I will be an eager student over the coming months.
</p>
<h2>Comments</h2>
<p>
To learn more about Marpa,
there's
<a href="http://savage.net.au/Marpa.html">the
official web site maintained by Ron Savage</a>.
I also have
<a href="http://jeffreykegler.github.io/Marpa-web-site/">a Marpa web site</a>.
Comments on this post can be made in
<a href="http://groups.google.com/group/marpa-parser">
Marpa's Google group</a>,
or on our IRC channel: #marpa at freenode.net.
</p>
</body>
</html>
<br />
<p>posted at: 10:13 |
<a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2015/12/composable.html">direct link to this entry</a>
</p>
<div style="color:#38B0C0;padding:1px;text-align:center;">
§
§
§
</div>
<h3>Sat, 29 Aug 2015</h3>
<br />
<center><a name="fast_handy"> <h2>Fast handy languages</h2> </a>
</center>
<html>
<head>
</head>
<body><p>
<!--
marpa_r2_html_fmt --no-added-tag-comment --no-ws-ok-after-start-tag
-->
</p>
<p>Back around 1980, I had access to UNIX and a language I wanted to parse.
I knew that UNIX had all the latest CS tools.
So I expected to type in my BNF and "Presto, Language!".
</p>
<p>Not so easy, I was told.
Languages were difficult things created with complex tools
written by experts who understood the issues.
I recall thinking that,
while English has a syntax that is
as hard as they come,
toddlers manage to parse it
just fine.
But experts are experts,
and more so at second-hand.
</p>
<p>I was steered to an LALR-based parser called yacc.
(Readers may be more familiar with bison, a yacc successor.)
LALR had extended the class of quickly parseable grammars a bit
beyond recursive descent.
But recursive descent was simple in principle,
and its limits were easy to discover and work around.
LALR, on the other hand, was OK when it worked, but
figuring out why it failed when it failed
was more like decryption than debugging,
and this was the case both with parser development
and run-time errors.
I soon gave up on yacc
and found another way to solve my problem.
</p>
<p>Few people complained about yacc on the Internet.
If you noise it about that you are unable
to figure out how to use
what everybody says is the state-of-the-art tool,
the conclusions drawn may not be the ones you want.
But my experience seems to have been more than common.
</p>
<p>LALR's claim to fame was that it was the basis of the
industry-standard C compiler.
Over three decades,
its maintainers suffered amid the general silence.
But by 2006, they'd had enough.
GCC (the new industry standard)
ripped its LALR engine out.
By then
<a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2014/09/chron.html">the
trend back to recursive descent</a>
was well underway.
</p>
<h3>A surprise discovery</h3>
<p>Back in the 1970's,
there had been more powerful alternatives
to LALR and recursive descent.
But they were
<a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2013/04/fast_enough.html">reputed
to be slow</a>.
</p>
<p>For some applications slow is OK.
In 2007 I decided that a parsing tool that parsed
all context-free languages at state-of-the-art speeds,
slow or fast as the case might be,
would be a useful addition to programmer toolkits.
And I ran into a surprise.
</p>
<p>Hidden in the literature was an amazing discovery --
a 1991 article by Joop Leo that
described how to modify Earley's
algorithm to be fast for every language class in practical use.
(When I say "fast" in this article, I will mean "linear".)
Leo's article had been almost completely ignored --
my project (<a href="http://savage.net.au/Marpa.html">Marpa</a>)
would become its first
practical implementation.
</p>
<h3>Second-order languages</h3>
<p>The implications of Leo's discovery go well beyond speed.
If you can rely on the BNF that you write always producing
a practical parser, you can auto-generate your language.
In fact,
you can write languages which write languages.
</p>
<h3>Which languages are fast?</h3>
<p>The Leo/Earley algorithm is not fast
for every BNF-expressible language.
BNF is powerful, and you can write exponentially
ambiguous languages in it.
But programmers these days
mostly care about unambiguous languages --
we are accustomed to tools and techniques
that parse only a subset of these.
</p>
<p>
As I've said, Marpa is fast for every language
class in practical use today.
Marpa is almost certainly fast for any language
that a modern programmer has in mind.
Unless you peek ahead at the hints I am about to give you,
in fact, it is actually
<b>hard</b>
to write an unambiguous
grammar that goes non-linear on Marpa.
Simply mixing up lots of left, right and middle recursions
will
<b>not</b>
be enough to make an
unambiguous grammar go non-linear.
You will also need to violate a rule
in the set that
I am about to give you.
</p>
<p>To guarantee that Marpa is fast for your BNF language,
follow three rules:
</p>
<ul>
<li>Rule 1: Your BNF must be unambiguous.
</li>
<li>Rule 2: Your BNF must have no "unmarked" middle recursions.
</li>
<li>Rule 3: All of the right-recursive symbols
in your BNF must be dedicated
to right recursion.
</li>
</ul>
<p>Rule 3 turns out to be very easy to obey.
I discuss it in detail in the next section,
which will be about how to break these rules and
get away with it.
</p>
<p>Before we do that,
let's look at what an "unmarked" middle recursion is.
Here's an example of a "marked" middle recursion:
</p><blockquote><pre>
M ::= 'b'
M ::= 'a' M 'a'
</pre></blockquote><p>
Here the "b" symbol is the marker.
This marked middle recursion generates sequences like
</p><blockquote><pre>
b
a b a
a a b a a
</pre></blockquote>
<p>Now for an "unmarked" middle recursion:
</p><blockquote><pre>
M ::= 'a' 'a'
M ::= 'a' M 'a'
</pre></blockquote><p>
This unmarked middle recursion generates sequences like
</p><blockquote><pre>
a a
a a a a
a a a a a a
</pre></blockquote><p>
In this middle recursion there is no marker.
To know where the middle is,
you have to scan all the way to the end,
and then count back.
</p>
<p>A rule of thumb is that if you can "eyeball" the middle
of a long sequence,
the recursion is marked.
If you can't, it is unmarked.
Unfortunately, we can't characterize exactly what a marker
must look like -- a marker can encode the moves of a Turing machine,
so marker detection is undecidable.
</p>
<h3>How to get away with breaking the rules</h3>
<p>The rules about ambiguity and recursions are "soft".
If you only use limited ambiguity and
keep your rule-breaking recursions short,
your parser will stay fast.
</p>
<p>Above, I promised to explain rule 3, which insisted that
a right recursive symbol be "dedicated".
A right recursive symbol is "dedicated" if it appears only
as the recursive symbol in a right recursion.
If your grammar is unambiguous, but you've used an "undedicated"
right-recursive symbol, that is easy to fix.
Just rewrite the grammar, replacing the "undedicated" symbol
with two different symbols.
Dedicate one of the two to the right recursion,
and use the other symbol everywhere else.
</p>
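<p>
As an illustration of that rewrite
(my own reading of the rule, not an example from this post),
suppose the right-recursive symbol
"<tt>item_list</tt>"
is also used in other rules:
</p><blockquote><pre>
statement ::= 'print' item_list
statement ::= item_list '=' item_list
item_list ::= item
item_list ::= item item_list
</pre></blockquote><p>
Splitting the symbol in two,
and dedicating one of the new symbols to the right recursion,
might look like this:
</p><blockquote><pre>
statement ::= 'print' item_list
statement ::= item_list '=' item_list
item_list ::= item_seq
item_seq  ::= item
item_seq  ::= item item_seq
</pre></blockquote><p>
Here
"<tt>item_seq</tt>"
appears only at the right edge of rules,
dedicated to the recursion,
while
"<tt>item_list</tt>"
is the symbol used everywhere else.
</p>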
<h3>When NOT to use Marpa</h3>
<p>The languages I have described as "fast" for Marpa
include all those in practical use and many more.
But do you really want to use Marpa for all of them?
Here are four cases for which Marpa is probably not
your best alternative.
</p>
<p>The first case: a language that parses easily with a regular
expression.
The regular expression will be much faster.
Don't walk away from a good thing.
</p>
<p>The second case:
a language
that is easily parsed using a single
loop and some state that fits into constant space.
This parser might be very easy to write and maintain.
If you are using a much slower higher level language,
Marpa's optimized C language
may be a win on CPU speed.
But, as before, if you have a good thing,
don't walk away from it.
</p><p>The third case:
a variation on the second.
Here your single loop might be getting out of hand,
making you yearn for the syntax-driven convenience
of Marpa,
but your state still fits into constant space.
In its current implementation, Marpa keeps all of its
parse tables forever, so Marpa does
<b>not</b>
parse in constant space.
Keeping the tables
allows Marpa to deal with the full structure of its
input, in a way that SAX-ish approaches cannot.
But if space requirements are an issue,
and your application allows a simplified constant-space
approach,
Marpa's power and convenience may not be enough to
make up for that.
</p>
<p>The fourth case:
a language that
</p><ul>
<li>is very small;
</li>
<li>changes slowly or not at all, and does not grow in complexity;
</li>
<li>merits careful hand-optimization, and has available the staff
to do it;
</li>
<li>merits and has available the kind of on-going support that will
keep your code optimized under changing circumstances; and
</li>
<li>is easily parseable via recursive descent.</li>
</ul>
<p>
It is rare that all of these are the case,
but when that happens,
recursive descent is often preferable to Marpa.
Lua and JSON
are two languages which meet the above criteria.
In Lua's case, it targets platforms with very restricted memories,
which is an additional reason to prefer recursive descent --
Marpa has a relatively large footprint.
</p>
<p>It was not good luck that made
both Lua and JSON good targets for recursive descent --
they were designed around its limits.
JSON is a favorite test target of Marpa for just these reasons.
There are carefully hand-optimized C language parsers for us to
benchmark against.
</p>
<p>We get closer and closer,
but Marpa will
never beat small hand-optimized JSON parsers in software.
However, while recursive descent is a technique for hand-writing parsers,
Marpa is a mathematical algorithm.
Someday,
instructions for manipulating Earley items could be implemented directly
in silicon.
When and if that day comes,
Earley's algorithm will beat recursive descent even at
parsing the grammars that were designed for it.
</p>
<h3>Comments</h3>
<p>Comments on this post can be made in
<a href="http://groups.google.com/group/marpa-parser">
Marpa's Google group</a>,
or on our IRC channel: #marpa at freenode.net.
To learn more about Marpa,
there's
<a href="http://savage.net.au/Marpa.html">the
official web site maintained by Ron Savage</a>.
I also have
<a href="http://jeffreykegler.github.io/Marpa-web-site/">a Marpa web site</a>.
</p>
</body>
</html>
<br />
<p>posted at: 16:36 |
<a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2015/08/fast_handy.html">direct link to this entry</a>
</p>
<div style="color:#38B0C0;padding:1px;text-align:center;">
§
§
§
</div>
<h3>Tue, 10 Mar 2015</h3>
<br />
<center><a name="linear"> <h2>Linear? Yeah right.</h2> </a>
</center>
<html>
<head>
</head>
<body><p>
<!--
marpa_r2_html_fmt --no-added-tag-comment --no-ws-ok-after-start-tag
-->
</p>
<h3>Linear?</h3>
<p>I have claimed that my new parser,
<a href="http://savage.net.au/Marpa.html">Marpa</a>,
is linear for vast classes of grammars,
going well beyond what the traditional parsers can do.
But skepticism is justified.
When it comes to parsing algorithms,
there have been a lot of time complexity claims
that are hand-wavy, misleading or just
plain false.
This post describes how someone,
who
is exercising the appropriate degree of skepticism,
might conclude that believing Marpa's claims is a reasonable
and prudent thing to do.
</p>
<p>
Marpa's linearity claims seem to be,
in comparison with the other parsers in practical use
today,
bold.
Marpa claims linearity,
not just for every class of grammar for which
yacc/bison, PEG and recursive descent currently claim linearity,
but for considerably more.
(The mathematical details of these claims
are in a
<a href="#DETAILS">section at the end</a>.)
It seems too good to be true.
</p>
<h3>Why should I believe you?</h3>
<p>The most important thing to realize,
in assessing the believability of Marpa's time complexity claims,
is that they are not new.
They were already proved in a long-accepted paper in the refereed literature.
They are the time complexity claims proved by Joop Leo for his algorithm
in 1991, over two decades ago.
Marpa is derived from Leo's algorithm, and its time complexity claims
are those proved for Leo's algorithm.
</p>
<p>Above I said that Marpa's time complexity claims "seem" bold.
On any objective assessment, they are in fact a bit of a yawn.
The claims
<b>seem</b>
surprising only because a lot of people
are unaware of Leo's results.
That is, they are surprising in the same sense that someone who
had avoided hearing about radio waves would be surprised
to learn that he can
communicate instantly
with someone on the other side of the world.
</p>
<p>So, if there's so little to prove, why does the
<a href="https://www.academia.edu/10341474/Marpa_A_practical_general_parser_the_recognizer">
Marpa paper</a>
have proofs?
In Marpa, I made many implementation decisions about,
and some changes to, the Leo/Earley algorithm.
None of my changes produced better time complexity results --
my only claim is that I did not change the Leo/Earley
algorithm in a way that slowed it down.
To convince myself of this claim, I reworked
the original proofs of
Leo and Earley,
changing them to reflect my changes,
and demonstrated that the results
that Leo had obtained still held.
</p>
<p>Proofs of this kind,
which introduce no new mathematical techniques,
but simply take a previous result and march from
here to there by well-known means,
are called "tedious".
In journals, where there's a need to conserve space,
they are usually omitted,
especially if,
as is the case with Marpa's time complexity proofs,
the results are intuitively quite plausible.
</p>
<h3>Getting from plausible to near-certain</h3>
<p>So let's say you are not going to work through every line
of Marpa's admittedly tedious proofs.
We've seen that the results are intuitively plausible,
as long as you don't reject the previous literature.
But can we do better than merely "plausible"?
</p><p>As an aside, many people misunderstand the phrase
"mathematically proven", especially as it applies to branches
of math like parsing theory.
The fact is that proofs in papers often contain errors.
Usually these are minor,
and don't affect the result.
On the other hand, Jay Earley's paper,
while one of the best Computer Science papers ever published,
also contained a very
serious error.
And this error slipped past his Ph.D. committee and
his referees.
Mathematical arguments and proofs do not allow us to achieve
absolute certainty.
They can only improve the degree of certainty.
</p>
<p>There's a second way to dramatically increase
your degree of conviction
in Marpa's linearity claims, and it is quite simple.
Create examples of problematic grammars,
run them and time them.
This is not as satisfying as a mathematical proof,
because no set of test grammars can be exhaustive.
But if you can't find a counter-example
to Marpa's linearity claims among the grammars of
most interest to you,
that should help lift
your level of certainty to "certain for
all practical purposes".
</p>
<p>Much of this increase in certainty can be
achieved without bothering to run your own tests.
Marpa is in wide use at this point.
If Marpa was going quadratic on grammars
for which it claimed to be linear,
and these were grammars of practical interest,
that would very likely have been noticed by now.
</p>
<h3>I'm still not convinced</h3>
<p>Let's suppose all this has not brought you to
the level of certainty you need to use Marpa.
That means the reasonable thing is to continue to
struggle to work with the restrictions of the
traditional algorithms, right?
No, absolutely not.
</p>
<p>OK, so you don't believe that Marpa preserves
the advances in power and speed made by Leo.
Does that mean that parsers have to stay underpowered?
No, it simply means that there should be a
more direct implementation of Leo's 1991 algorithm,
bypassing Marpa.
</p><p>But if you are looking for an implementation of Leo's
1991 algorithm,
I think you may end up coming back to Marpa as the most
reasonable choice.
Marpa's additional features
include the ability to use custom,
procedural logic,
as you can with recursive descent.
And Marpa has worked out a lot of the
implementation details for you.
</p><h3>Comments</h3>
<p>Comments on this post can be made in
<a href="http://groups.google.com/group/marpa-parser">
Marpa's Google group</a>,
or on our IRC channel: #marpa at freenode.net.
To learn more about Marpa,
there's
<a href="http://savage.net.au/Marpa.html">the
official web site maintained by Ron Savage</a>.
I also have
<a href="http://jeffreykegler.github.io/Marpa-web-site/">a Marpa web site</a>.
</p>
<h3>Appendix: Some technical details</h3>
<p>
<a name="DETAILS">Above</a>
I talked about algorithms, classes of grammars and their
linearity claims.
I didn't give details because most folks aren't interested.
For those who are, they are in this section.
</p>
<p>
yacc is linear for a grammar class called LALR,
which is a subset of another grammar class
called LR(1).
If you are willing to hassle with GLR,
bison claims linearity for all of LR(1).
Recursive descent is a technique, not an algorithm,
but it is top-down with look-ahead,
and therefore can be seen as some form of LL(k),
where k depends on how it is implemented.
In practice, I suspect k is never much bigger than 3,
and usually pretty close to 1.
With packratting,
PEG can be made linear for everything it
parses but there is a catch --
<a href="http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2015/03/peg.html">
only in limited cases do you know
what language your PEG grammar actually parses</a>.
In current practice, that means your PEG grammar
must be LL(1).
Some of the PEG literature looks at techniques for
extending this as far as LL-regular, but there are no
implementations, and it remains to be seen if the
algorithms described are practical.
</p>
<p>
The
<a href="https://www.academia.edu/10341474/Marpa_A_practical_general_parser_the_recognizer">
Marpa paper</a>
contains a proof,
based on a proof of the same claim by
Joop Leo,
that Marpa is linear for LR-regular grammars.
The LR-regular grammars
include the LR(k) grammars for every k.
So Marpa is linear for LR(1), LR(2), LR(8675309), etc.
LR-regular also includes LL-regular.
So every class of grammar under discussion
in the PEG literature is
already parsed in linear time by Marpa.
From this,
it is also safe to conclude that,
if a grammar can be parsed by
anything reasonably described as recursive descent,
it can be parsed in linear time by Marpa.
</p>
</body>
</html>
<br />
<p>posted at: 22:01 |
<a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2015/03/linear.html">direct link to this entry</a>
</p>
<div style="color:#38B0C0;padding:1px;text-align:center;">
§
§
§
</div>
<h3>Mon, 02 Mar 2015</h3>
<br />
<center><a name="peg"> <h2>PEG: Ambiguity, precision and confusion</h2> </a>
</center>
<html>
<head>
</head>
<body><p>
<!--
marpa_r2_html_fmt --no-added-tag-comment --no-ws-ok-after-start-tag
-->
</p>
<h3>Precise?</h3>
<p><a href="http://bford.info/packrat/">PEG parsing</a>
is a new notation
for a notoriously tricky algorithm that goes back
to the earliest computers.
In its PEG form,
this algorithm acquired a seductive new interface,
one that looks like the best of
extended BNF combined with the best of regular expressions.
Looking at a sample of it, you are tempted to imagine
that writing a parser has suddenly become a very straightforward
matter. Not so.
</p>
<p>For those not yet in the know on this,
I'll illustrate with a pair of examples from
<a href="http://www.romanredz.se/papers/FI2008.pdf">
an excellent 2008 paper by Redziejowski</a>.
Let's start with these two PEG specifications.
</p><blockquote><pre>
("a"|"aa")"a"
("aa"|"a")"a"
</pre></blockquote><p>
One of these two PEG grammars accepts
the string "<tt>aaa</tt>" but not the string "<tt>aa</tt>".
The other does the opposite -- it accepts
the string "<tt>aa</tt>" but not the string "<tt>aaa</tt>".
Can you tell which one?
(For the answer,
see page 4 of
<a href="http://www.romanredz.se/papers/FI2008.pdf">
Redziejowski 2008</a>.)
</p><p>
Here is another example:
</p><blockquote><pre>
A = "a"A"a"/"aa"
</pre></blockquote><p>