-
Notifications
You must be signed in to change notification settings - Fork 0
/
lr0.ltx
2535 lines (2295 loc) · 81.8 KB
/
lr0.ltx
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
% Copyright 2022 Jeffrey Kegler
% This document is licensed under
% a Creative Commons Attribution-NoDerivs 3.0 United States License.
\documentclass[12pt]{amsart}
\usepackage{amssymb}
\usepackage{xspace}
\usepackage{url}
\usepackage{placeins}
\usepackage{needspace}
\usepackage{float}
% \usepackage[obeyDraft]{todonotes}
\usepackage{todonotes}
\usepackage{pgf}
\usepackage{tikz}
\usetikzlibrary{arrows,automata}
\usetikzlibrary{positioning}
\tikzset{
state/.style={
rectangle,
rounded corners,
draw=black, thick,
minimum height=2em,
inner sep=2pt,
text centered,
},
}
% We deal with underfull vbox's by eyeball, or not at all.
\raggedbottom
\newcommand{\dfn}[1]{{\bf #1}}
\newcommand{\inference}[2]{\genfrac{}{}{1pt}{}{#1}{#2}}
\newcommand{\mymathop}[1]{\mathop{\texttt{#1}}}
\newcommand{\V}[1]{\ensuremath{\texttt{\mbox{#1}}}}
\newcommand{\type}[1]{\ensuremath{\texttt{#1}}}
\newcommand{\isWellDef}{\mathop{\downarrow}}
\newcommand{\isIllDef}{\mathop{\uparrow}}
\newcommand{\wellDefRel}{\mathrel{\downarrow}}
\newcommand{\illDefRel}{\mathrel{\uparrow}}
\newcommand{\wellDefB}[2]{#1\wellDefRel#2}
\newcommand{\illDefB}[2]{#1\illDefRel#2}
\newcommand{\qeq}{\simeq}
\newcommand{\nqeq}{\not\simeq}
\newcommand{\unicorn}{\ensuremath{\bot}\xspace}
\newcommand{\TRUE}{\ensuremath{\mathbb T}\xspace}
\newcommand{\FALSE}{\ensuremath{\mathbb F}\xspace}
\newcommand{\tuple}[1]{\ensuremath{\left\langle #1 \right\rangle}}
\newcommand{\seq}[1]{\ensuremath{\left[ #1 \right]}}
\newcommand{\seqB}[2]{\seq{#1 \mcoloncolon #2}}
\newcommand{\typed}[2]{\ensuremath{#1\mcolon#2}}
\newcommand{\Vtyped}[2]{\ensuremath{\V{#1}\mcolon#2}}
\newcommand{\VTtyped}[2]{\ensuremath{\V{#1}\mcolon\type{#2}}}
\newcommand{\mname}[1]{\mbox{\sf #1}}
% [O]perator [A]pplication
\newcommand{\Oname}[1]{\mname{#1}}
% The \mathopen{} and \mathclose{} eliminate unwanted extra space
\newcommand{\OA}[2]{\ensuremath{%
\mathop{\Oname{#1}}\mathopen{}\left(#2\right)\mathclose{}%
}}
\newcommand{\VOA}[2]{\OA{#1}{\V{#2}}}
% [F]unction [A]pplication, replaces \myfn
\newcommand{\smallrawfn}[2]{\ensuremath{\mathop{#1}(#2)}}
% The \mathopen{} and \mathclose{} eliminate unwanted extra space
\newcommand{\rawfn}[2]{\ensuremath{\mathop{#1}\mathopen{}\left(#2\right)\mathclose{}}}
\newcommand{\Fname}[1]{\V{#1}}
\newcommand{\FA}[2]{\ensuremath{\rawfn{\mathop{\Fname{#1}}}{#2}}}
\newcommand{\VFA}[2]{\FA{#1}{\V{#2}}}
% List operators
\newcommand{\Hd}[1]{\OA{hd}{#1}}
% With the list operators,
% is looks better if we don't increase paren size as we nest
\newcommand{\Tl}[1]{\Oname{tl}({#1})}
\newcommand{\HTl}[1]{\Hd{\Tl{#1}}}
\newcommand{\TTl}[1]{\Tl{\Tl{#1}}}
\newcommand{\HTTl}[1]{\Hd{\Tl{\Tl{#1}}}}
\newcommand{\TTTl}[1]{\Tl{\TTl{#1}}}
\newcommand{\defined}{\underset{\text{def}}{\equiv}}
\newcommand{\naturals}{\ensuremath{\mathbb{N}}\xspace}
\newcommand{\cat}{\mathbin{\scalebox{2}{.}}}
\newcommand{\expon}{\mathbin{{*}{*}}}
\newcommand{\mult}{\mathbin{\ast}}
\newcommand{\order}[1]{\ensuremath{{\mathcal O}(#1)}\xspace}
\newcommand{\size}[1]{\ensuremath{\left | {#1} \right |}}
\newcommand{\Vsize}[1]{\ensuremath{\size{\V{#1}}}}
\newcommand{\de}{\mathrel{::=}}
\newcommand{\derives}{\mathrel{\Rightarrow}}
\newcommand{\destar}
{\mathrel{\mbox{$\stackrel{\!{\ast}}{\Rightarrow\!}$}}}
\newcommand{\deplus}
{\mathrel{\mbox{$\stackrel{\!{+}}{\Rightarrow\!}$}}}
\newcommand{\mydot}{\raisebox{.05em}{$\,\bullet\,$}}
\newcommand{\lastix}[1]{\ensuremath{\##1}}
\newcommand{\Vlastix}[1]{\ensuremath{\lastix{\V{#1}}}}
\newcommand{\VlastElement}[1]{\Velement{#1}{\Vlastix{#1}}}
\newcommand{\element}[2]{\ensuremath{\mathop{#1}
\mathopen{}\left[#2\right]\mathclose{}}}
\newcommand{\Velement}[2]{\ensuremath{\mathop{\V{#1}}
\mathopen{}\left[#2\right]\mathclose{}}}
\newcommand{\VVelement}[2]{\ensuremath{\mathop{\V{#1}}
\mathopen{}\left[ \V{#2}
\right]\mathclose{} }}
\newcommand{\mcolon}{\mathrel:}
\newcommand{\mcoloncolon}{\mathrel{\vcenter{\hbox{$::$}}}}
\newcommand{\suchthat}{\mcoloncolon}
\newcommand{\quantify}[3]{% quantifier, binding, predicate
\ensuremath{\mathrel{#1}#2 \; \suchthat \; #3}%
}
\newcommand{\rIota}{\mathop{\scalebox{1.5}{\rotatebox[origin=c]{180}{$\iota$}}}}
\newcommand{\iotaQOp}{\mathop{\rIota}}
\newcommand{\iotaQ}[2]{\quantify{\iotaQOp}{#1}{#2}}
\newcommand{\epsQOp}{\mathop{\varepsilon}}
\newcommand{\epsQ}[2]{\quantify{\epsQOp}{#1}{#2}}
\newcommand{\lambdaQOp}{\mathop{\lambda}}
\newcommand{\lambdaQ}[2]{\quantify{\lambdaQOp}{#1}{#2}}
% collection builder variant, or descriptor
\newcommand{\setQvOp}{\mathop{\mathsection}}
\newcommand{\setQv}[2]{\quantify{\setQOp}{#1}{#2}}
\newcommand{\setB}[2]{\lbrace #1 \; \suchthat \; #2 \rbrace}
\newcommand{\existUniqQOp}{\mathop{\exists{!}}}
\newcommand{\existUniqQ}[2]{\quantify{\existUniqQOp}{#1}{#2}}
% extra {} in existQOP and univQOp to raise to baseline
\newcommand{\existQOp}{\mathop{\exists{}}}
\newcommand{\existQ}[2]{\quantify{\existQOp}{#1}{#2}}
\newcommand{\nexistQOp}{\mathop{\nexists{}}}
\newcommand{\nexistQ}[2]{\quantify{\nexistQOp}{#1}{#2}}
% extra {} in existQOP and univQOp to raise to baseline
\newcommand{\univQOp}{\mathop{\forall{}\,}}
\newcommand{\univQ}[2]{\quantify{\univQOp}{#1}{#2}}
% [f]u[n]ction [M]ap operator
\newcommand{\fnMap}{\rightharpoonup}
% [t]otal [f]u[n]ction [M]ap operator
\newcommand{\tfnMap}{\rightarrow}
% [t]otal [f]u[n]ction [M]ultidomain [M]ap operator
% \hbox shifts from math to text vertical alignment, which is what we want
\newcommand{\tfnMMap}{\ensuremath{
\mathrel{\raise.75pt\hbox{\ensuremath{
\raise1.19pt\hbox{\scalebox{.60}{$\ni$}} \mkern-12mu \rightarrow}}}
}}
\newcommand{\nat}[1]{\ensuremath{#1_\naturals}}
\newcommand{\Vnat}[1]{\nat{\V{#1}}}
\newcommand{\ah}[1]{#1_{\type{AH}}}
\newcommand{\Vah}[1]{\ensuremath{\V{#1}_{\type{AH}}}}
\newcommand{\dr}[1]{#1_{\type{DR}}}
\newcommand{\Vdr}[1]{\ensuremath{\V{#1}_{\type{DR}}}}
\newcommand{\orig}[1]{#1_{\type{ORIG}}}
\newcommand{\Vorig}[1]{\ensuremath{\V{#1}_{\type{ORIG}}}}
\newcommand{\setOf}[1]{{{#1}^{\mbox{\normalsize $\ast$}}}}
\newcommand{\Vrule}[1]{\ensuremath{\V{#1}_{\type{RULE}}}}
\newcommand{\Vruleset}[1]{\ensuremath{\V{#1}_{\setOf{\type{RULE}}}}}
\newcommand{\Veim}[1]{\ensuremath{\V{#1}_{\type{EIM}}}}
\newcommand{\Vahem}[1]{\ensuremath{\V{#1}_{\type{AHEM}}}}
\newcommand{\es}[1]{\ensuremath{#1_{\type{ES}}}}
\newcommand{\Ves}[1]{\ensuremath{\V{#1}_{\type{ES}}}}
\newcommand{\Vstr}[1]{\ensuremath{\V{#1}_{\type{STR}}}}
\newcommand{\sym}[1]{#1_{\type{SYM}}}
\newcommand{\Vsym}[1]{\ensuremath{\V{#1}_{\type{SYM}}}}
\newcommand{\Vsymset}[1]{\ensuremath{\V{#1}_{\setOf{\type{SYM}}}}}
\newcommand{\Eloc}[1]{\ensuremath{{#1}_{\type{LOC}}}}
\newcommand{\Vloc}[1]{\Eloc{\V{#1}}}
\newcommand{\Cfa}{\V{fa}}
\newcommand{\Cg}{\V{g}}
\newcommand{\Cw}{\V{w}}
\newcommand{\Crules}{\V{rules}}
\newcommand{\Nulling}[1]{\mymathop{Nulling}(#1)}
\newcommand{\Nullable}[1]{\mymathop{Nullable}(#1)}
\newcommand{\GOTO}[2]{\mymathop{GOTO}(#1, #2)}
\newcommand{\Next}[1]{\mymathop{Next}(#1)}
\newcommand{\Predict}[1]{\mymathop{Predict}(#1)}
\newcommand{\Postdot}[1]{\mymathop{Postdot}(#1)}
\newcommand{\Pos}[1]{\mymathop{Pos}(#1)}
\newcommand{\Rule}[1]{\mymathop{Rule}(#1)}
\newcommand{\LHS}[1]{\mymathop{LHS}(#1)}
\newcommand{\RHS}[1]{\mymathop{RHS}(#1)}
\newcommand{\Rules}[1]{\ensuremath{\mymathop{Rules}(#1)}}
\newcommand{\Vocab}[1]{\ensuremath{\mymathop{Vocab}(#1)}}
\newcommand{\Terminals}[1]{\ensuremath{\mymathop{Terminals}(#1)}}
\newcommand{\NT}[1]{\ensuremath{\mymathop{NT}(#1)}}
\newcommand{\Accept}[1]{\ensuremath{\mymathop{Accept}(#1)}}
\newcommand{\DR}[1]{\mymathop{DR}(#1)}
\newcommand{\AH}[1]{\mymathop{AH}(#1)}
\newcommand{\Transition}[1]{\mymathop{Transition}(#1)}
\newcommand{\Origin}[1]{\mymathop{Origin}(#1)}
\newcommand{\Current}[1]{\mymathop{Current}(#1)}
\newcommand{\myL}[1]{\mymathop{L}(#1)}
\newcommand\Etable[1]{\ensuremath{\mymathop{table}[#1]}}
\newcommand\bigEtable[1]{\ensuremath{\mymathop{table}\bigl[#1\bigr]}}
\newcommand\Vtable[1]{\Etable{\V{#1}}}
\newsavebox{\myBox}
\newlength{\myHt}
\newlength{\myDp}
\newlength{\myWd}
\newcommand{\padBoxC}[3]{%
\savebox{\myBox}{\ignorespaces#3}%
\usebox{\myBox}%
\myHt=\ht\myBox%
\myDp=\dp\myBox%
\vrule height\dimexpr\myHt+#1 depth\dimexpr\myDp+#2 width0pt%
}
\newcommand{\padBox}[2]{\padBoxC{#1}{#1}{#2}}
\newcommand{\cellboxWidth}[1]{4.5in}
% non-optional-argument version of cellbox
\newcommand{\cellboxB}[2]{%
\parbox{#1}{\ignorespaces#2}%
}
% Padded cellbox
\newcommand{\cellbox}[2][\cellboxWidth]{%
\cellboxB{#1}{%
\padBoxC{4pt}{0pt}{\vphantom{(}}%
{\ignorespaces#2}%
\padBoxC{0pt}{4pt}{\vphantom{(}}%
}%
}
% I like to silence underfull hbox messages.
% This is syntactic sugar for doing that.
\newenvironment{MYsloppy}[1][3em]{\par\tolerance9999 \emergencystretch#1\relax}{\par}
% This form allows the resetting of hbadness, which may be necessary to
% shut up an "underfull" warning.
\newenvironment{MYsloppyB}[2]{\par\tolerance9999 \emergencystretch#1 \hbadness#2\relax}{\par}
% This form is for a trick: Tolerances are evaluated line by line, so a tolerance
% less than 9999 may produce a better looking paragraph and reduce the "badness".
\newenvironment{MYsloppyC}[3]{\par\tolerance#3 \emergencystretch#1 \hbadness#2\relax}{\par}
\begin{document}
\date{\today}
\title{Marpa and nullable symbols}
\author{Jeffrey Kegler}
\thanks{%
Copyright \copyright\ 2023 Jeffrey Kegler. Version 1.
}
\thanks{%
This document is licensed under
a Creative Commons Attribution-NoDerivs 3.0 United States License.
}
\begin{abstract}
Marpa~\cite{Marpa2023} was intended
to make the best results
in the academic literature
on Earley's algorithm
available as
a practical general parser.
Earley-based parsers have had issues handling
nullable symbols.
Initially, we dealt with nullable symbols by following the approach
in Aycock and Horspool's 2002 paper~\cite{AH2002}.
This paper reports our experience with ~\cite{AH2002},
and the approach to handling nullables that we settled on
in reaction to that experience.
\end{abstract}
\maketitle
\section{Overview}
The Marpa recognizer~\cite{Marpa2023} was intended
to make the best results
in the academic literature
on Earley's algorithm
available as
a practical general parser.
Accordingly, when Marpa was first released in 2011,
it included the improved handling of nullable symbols
in Aycock and Horspool's 2002 paper~\cite{AH2002}.
Section \ref{sec:preliminaries}
deals with notation and other conventions.
Sections \ref{sec:earley}
and \ref{sec:earley-ops}
describe the Earley algorithm.
Sections \ref{sec:aycock-horspool-ideal-solution},
\ref{sec:aycock-horspool-finite-automata},
and \ref{sec:nihilist-normal-form}
describe the Aycock-Horspool version of Earley's
algorithm~\cite{AH2002}.
Sections \ref{sec:pitfall-of-counting-earley-items}
and \ref{sec:ahfa-states-are-not-disjoint}
raise some theoretical issues about \cite{AH2002}.
Section \ref{sec:problems-with-the-ahfa}
discusses the practical issues we encountered in
implementing~\cite{AH2002}.
In 2014 we extensively modified Marpa's usage of
the ideas from \cite{AH2002}.
Section
\ref{sec:marpa-s-approach-to-nullable-symbols}
describes the new approach
we took in Marpa to deal with nullable symbols.
Section \ref{sec:conclusion} summarizes our conclusions.
\section{Preliminaries}
\label{sec:preliminaries}
Readers should be familiar with Marpa~\cite{Marpa2023}
and with standard grammar notation.
It will also be useful to be familiar with Earley parsing,
and with Aycock and Horspool's 2002 paper~\cite{AH2002}.
We use the type system
of Farmer~2012~\cite{Farmer2012},
without needing most of its apparatus.\footnote{%
Types in \cite{Farmer2012} are
collections of classes (``superclasses'').
But in this paper, every explicitly stated type will be a ZF set.
}
Let \V{exp} be an expression,
and let \type{T} be its type.
Type may be indicated in a ``wide'' notation:
\[ \VTtyped{exp}{T} \]
More often, this paper will use the ``narrow''
notation,
where a subscripts indicates type:
\[ \V{exp}_\type{T} \]
A noteworthy feature we adopt from Farmer~2012~\cite{Farmer2012} is his notion of ill-definedness.
For example, the value of partial functions may be ill-defined for
some arguments in their domain.
Farmer's handling of ill-definedness is the traditional one,
and was well-entrenched,
but he was first to describe and formalize it.\footnote{%
See Farmer 2004~\cite{Farmer2004}.
Note that Farmer refers to ill-defined values as ``undefined''.
We found this problematic.
For example, a partial function may not have a value for
every argument in its domain.
Saying that the value of the partial function for these arguments
is defined as ``undefined'' is confusing.
In this paper we say that all the values of partial functions
are defined, but that some may not have values in the codomain,
and are therefore ill-defined.
}
We write \TRUE for ``true''; \FALSE for ``false'';
and $\unicorn$ for ill-defined.
A value is \dfn{well-defined} iff it is not ill-defined.
We write $\V{x}\isWellDef$ to say that \V{x} is well-defined,
and we write $\V{x}\isIllDef$ to say that \V{x} is ill-defined.
Traditionally, and in this paper,
any formula with an ill-defined operand is false.
This means that an equality both of whose operands are ill-defined
is false, so that $\neg(\unicorn = \unicorn)$.
For cases where this is inconvenient,
we introduce a new relation, $\simeq$, such that
$$
\V{a} \simeq \V{b} \defined \V{a} = \V{b} \lor (\V{a} \isIllDef \land \V{b} \isIllDef ).
$$
We often abbreviate ``if and only if'' to ``iff''.
We also often substitute the more prominent double colon ($\mcoloncolon$)
for the ``mid'' divider ($\mid$).
We define the natural numbers, \naturals, to include zero.
$\V{f} \circ \V{g}$, pronounced ``\V{f} after \V{g}'',
indicates the composition of the functions \V{f} and \V{g},
so that
$\rawfn{(\V{f}\circ\V{g})}{\V{x}} = \VFA{f}{\VFA{g}{\V{x}}}$.
The difference of the two sets, \V{S1} and \V{S2}, is written
\( \V{S1} \setminus \V{S2}. \)
We write tuples using angle brackets: $ \tuple{\V{a},\V{b},\V{c}} $.
The head of a tuple \V{S} can be written \Hd{S}.
The tail of a tuple \V{S} can be written \Tl{S}.
For example, where
\[
\V{S} = \tuple{ 42, 1729, 42 },
\]
then \V{S} is a 3-tuple, or triple,
\begin{gather*}
\Hd{S}=42, \\
\Tl{S}=\tuple{1729,42}, \\
\HTl{S}=1729, \\
\TTl{S}=42, \text{ and} \\
\TTTl{S}\isIllDef.
\end{gather*}
We define the natural numbers, \naturals, to include zero.
$\V{D} \fnMap \V{C}$ is the set of partial functions from domain \V{D} to codomain \V{C}.
$\V{D} \tfnMap \V{C}$ is the set of total functions from domain \V{D} to codomain \V{C}.
It follows that $\naturals \tfnMap \V{C}$ is the set of
infinite sequences of terms from the set \V{C};
and that $42 \tfnMap \V{C}$ is the set of
sequences of length 42 of terms from the set \V{C}.
We say that
$$ \V{domSet} \tfnMMap \V{C} \defined
\bigl\{ \V{fn} \bigm|
\left( \exists \; \V{D} \in \V{domSet} \; \middle| \; \V{fn} \in \V{D} \tfnMap \V{C}
\right) \bigr\}
$$
so that $\naturals \tfnMMap \V{C}$
is the set of finite sequences of terms from the set \V{C}.
We write \Vsize{\V{seq}} for the cardinality, or length,
of the sequence \V{seq}.
\VVelement{seq}{i} is the \V{i}'th term of the sequence \V{seq},
and is well-defined when
$0 \le \V{i} < \Vsize{seq}$.
We often specify a sequence by giving its terms inside
brackets.
For example,
$$\seq{\, \V{a}, \; 42, \; \seq{} \, }$$
is the sequence of length
3 whose terms are, in order, \V{a}, the number 42, and the empty sequence.
The last index of the sequence \V{seq} is \Vlastix{seq},
so that \VlastElement{seq} is the last term of \V{seq}.
\Vlastix{seq} is ill-defined for the empty sequence,
that is, if $\Vsize{seq} = 0$.
If \V{seq} is not the empty sequence, then
$\Vsize{seq} = \Vlastix{seq} + 1$.
Let \type{SYM} be the type for symbols, which
will will treat as opaque,
except that
\( \emptyset \notin \type{SYM} \).
The string type, \type{STR}, is the set of all finite sequences of
symbols:
\[ \type{STR} = \naturals \tfnMMap \type{SYM}. \]
The empty string is \( \emptyset \), but we let
\( \emptyset = \epsilon \)
and we usually write the empty string as \( \epsilon \).
We also use \(\epsilon\) as
the reserved non-symbol: \( \epsilon \notin \type{SYM} \).
A rule (type \type{RULE}) is a duple:
\[ \type{RULE} = \type{SYM} \times \type{STR}. \]
\[ \begin{gathered}
\LHS{\VTtyped{r}{RULE}} = \Hd{\V{r}}. \\
\RHS{\VTtyped{r}{RULE}} = \Tl{\V{r}}.
\end{gathered} \]
We usually write a rule \V{r} in
the form $\tuple{\Vsym{lhs} \de \Vstr{rhs}}$,
where $\V{lhs} = \LHS{\V{r}}$
and $\V{rhs} = \RHS{\V{r}}$.
\Vsym{lhs} is referred to as the left hand side (LHS)
of \Vrule{r}.
\Vstr{rhs} is referred to as the right hand side (RHS)
of \Vrule{r}.
A grammar is a 4-tuple, type \type{G},
\[
\type{G} = \setOf{\type{SYM}} \times \setOf{\type{SYM}} \times
\setOf{\type{RULE}} \times \type{SYM},
\]
where, if \VTtyped{g}{G} is a grammar, then
\[ \begin{gathered}
\Vocab{\V{g}} \defined \Hd{\V{g}}, \\
\Terminals{\V{g}} \defined \HTl{\V{g}}, \\
\Terminals{\V{g}} \subset \Vocab{\V{g}}, \\
\NT{\V{g}} \defined \Vocab{\V{g}} \setminus \Terminals{\V{g}}, \\
\Rules{\V{g}} \defined \HTTl{\V{g}}, \\
\V{r} \in \Rules{\V{g}} \implies \LHS{\V{r}} \in \NT{\V{g}}, \\
\V{r} \in \Rules{\V{g}} \implies \RHS{\V{r}} \in (\naturals \tfnMMap \Vocab{\V{g}}), \\
\Accept{\V{g}} \defined \TTTl{\V{g}},
\text{ and }\\
\Accept{\V{g}} \in \NT{\V{g}}. \\
\end{gathered} \]
The rules of a grammar imply the traditional rewriting system,
in which
\begin{itemize}
\item $\Vstr{x} \derives \Vstr{y}$
states that \V{x} derives \V{y} in exactly one step;
\item $\Vstr{x} \deplus \Vstr{y}$
states that \V{x} derives \V{y} in one or more steps;
and
\item $\Vstr{x} \destar \Vstr{y}$
states that \V{x} derives \V{y} in zero or more steps.
\end{itemize}
We call these rewrites \dfn{derivation steps}.
A sequence of zero or more derivation steps,
in which the left hand side of all but the first
is the right hand side of its predecessor,
is a \dfn{derivation}.
We say that the symbol \V{x} \dfn{induces}
the string of length 1 whose only term is that symbol,
that is, the string \seq{\Vsym{x}}.
Pedantically, the terms of derivations,
and the arguments of concatenations like
$\Vstr{s1} \cat \Vstr{s2}$,
must be strings.
But in concatenations and derivation steps we often
write the symbol to represent the string it induces so
that
$$ \Vstr{a} \cat \Vsym{b} \cat \Vstr{c}
= \Vstr{a} \cat \seq{\Vsym{b}} \cat \Vstr{c}.
$$
A \dfn{sentence} of \V{g} is a string of terminals derivable
from \Accept{\V{g}}.
The language of a grammar \V{g}, $\myL{\V{g}}$,
is its set of sentences:
\[ \begin{gathered}
\myL{\V{g}} = \setB{ \VTtyped{sentence}{STR} }{ \\
\V{sentence} \in \setOf{\Terminals{\V{g}}} \land \Accept{\V{g}} \destar \V{sentence} }
\end{gathered} \]
We say that a string \V{x} is \dfn{nullable},
$\Nullable{\Vstr{x}}$, iff the empty string can be
derived from it:
$$\Nullable{\Vstr{x}} \defined \V{x} \destar \epsilon.$$
We say that a string \V{x} is \dfn{nulling},
$\Nulling{\Vstr{x}}$,
iff it always eventually derives the null
string:
\[
\Nulling{\Vstr{s}} \defined
\forall \; \Vstr{y} \;\mid\; \V{x} \destar \V{y} \implies \V{y} \destar \epsilon.
\]
We say that symbols are nulling or nullable based on the string they
induce:
\begin{gather*}
\Nullable{\Vsym{x}} \defined \Nullable{\seq{\V{x}}}. \\
\Nulling{\Vsym{x}} \defined \Nulling{\seq{\V{x}}}.
\end{gather*}
A string or symbol is
\begin{itemize}
\item \dfn{non-nullable} iff it is not nullable;
\item \dfn{properly nullable} iff it is nullable,
but not nulling; and
\item \dfn{non-nulling} iff it is not nulling.
\end{itemize}
We often refer to nullable symbols as \dfn{nullables},
and to properly nullable symbols as \dfn{proper nullables},
We assume the grammars in this paper are ``augmented''.
We say that a grammar \V{g} is \dfn{augmented} iff
\begin{itemize}
\item there is an \dfn{accept rule},
\begin{equation}
\label{eq:accept-rule}
\Vrule{accept} = \tuple{ \Vsym{accept} \de \Vsym{start} },
\end{equation}
where \( \Vsym{start} \in \Vocab{\V{g}} \)
and \( \Vsym{accept} = \Accept{\V{g}} \) is the \dfn{accept symbol};
\item the accept symbol does not appear as a RHS symbol in any rule,
\begin{equation*}
\begin{gathered}
\forall \; \V{x} \in \Rules{\V{g}} \mid
\; \nexists \; \Vstr{pre}, \; \Vstr{post} \; \mid \\
\V{pre} \cat \Vsym{accept} \cat \V{post} = \RHS{\Vrule{x}}; \text{ and} \\
\end{gathered}
\end{equation*}
\item the accept rule is unique,
\[ \Vsym{accept} = \LHS{\V{x}} \implies \Vrule{accept} = \V{x}. \]
\end{itemize}
Let the input to
the parse be $\VTtyped{w}{STR}$.
Locations in the input will be of type \type{LOC},
where \( \type{LOC} = \naturals \).
When we state our complexity results later,
they will often be in terms of $\V{n}$,
where $\V{n} = \Vsize{w}$.
\typed{\VVelement{w}{i}}{\type{SYM}} is the \Vloc{i}'th
character
of the input,
and is well-defined when
$0 \le \Vloc{i} < \Vsize{w}$.
\section{Earley's algorithm}
\label{sec:earley}
A dotted rule (type \type{DR}) is a duple,
\[ \type{DR} = \type{RULE} \times \naturals, \]
such that, if \V{dr} is of type \type{DR},
\[ \begin{gathered}
\Rule{\V{dr}} \defined \Hd{\V{dr}}, \\
\Pos{\V{dr}} \defined \Tl{\V{dr}}, \text{ and} \\
\Pos{\V{dr}} \le \size{\Rule{\V{dr}}}.
\end{gathered} \]
We say that $\Rule{\V{dr}}$ is the \dfn{rule} of \V{dr}
and $\Pos{\V{dr}}$ is the \dfn{dot position} of \V{dr}.
The dot position of a dotted rule indicates the extent to which
the rule has been recognized,
and is represented with a large raised dot,
so that if
\begin{equation*}
\tuple{ \Vsym{A} \de \Vsym{X} \cat \Vsym{Y} \cat \Vsym{Z}}
\end{equation*}
is a rule,
\begin{equation*}
\tuple{ \Vsym{A} \de \V{X} \cat \V{Y} \mydot \V{Z}}
\end{equation*}
is the dotted rule with the dot at
$\V{pos} = 2$,
between \Vsym{Y} and \Vsym{Z}.
Every rule concept, when applied to a dotted rule,
is applied to the rule of the dotted rule.
The following are examples:
\[
\begin{gathered}
\LHS{\Vdr{x}} \defined \LHS{\Rule{\V{x}}}. \\
\RHS{\Vdr{x}} \defined \RHS{\Rule{\V{x}}}.
\end{gathered}
\]
We also make the following definitions:
\[
\Postdot{\Vdr{x}} \defined
\begin{cases}
\Vsym{next}, \text{ if } \existQ{
\Vstr{pre}, \; \Vstr{post}, \; \Vsym{A}}{ \\
\qquad \V{x} = \tuple{ \V{A} \de \V{pre} \mydot \V{next} \cat \V{post}}}, \\
\unicorn, \text{ otherwise.}
\end{cases} \\
\]
\[
\Next{\Vdr{x}} \defined
\begin{cases}
\VTtyped{\tuple{ \Vsym{A} \de \Vstr{pre} \cat \Vsym{next} \mydot \Vstr{post}}}{DR}, \\
\qquad \text{if } \existQ{
\Vstr{pre}, \; \Vstr{post}, \; \Vsym{A}}{ \\
\qquad \qquad \V{x} = \tuple{ \V{A} \de \V{pre} \mydot \V{next} \cat \V{post}} }, \\
\unicorn, \text{ otherwise.}
\end{cases} \\
\]
The \dfn{initial dotted rule} is
\begin{equation}
\label{eq:initial-dr}
\Vdr{initial} = \tuple{\Vsym{accept} \de \mydot \Vsym{start} },
\end{equation}
where \Vsym{accept} and \Vsym{start} are as in
the accept rule,
\eqref{eq:accept-rule}
on page \pageref{eq:accept-rule}.
A \dfn{predicted dotted rule} is a dotted rule,
other than the initial dotted rule,
with a dot position of zero,
for example,
\begin{equation*}
\Vdr{predicted} = \tuple{\Vsym{A} \de \mydot \Vstr{alpha} }.
\end{equation*}
A \dfn{confirmed dotted rule}
is the initial dotted rule,
or a dotted rule
with a dot position greater than zero.
A \dfn{completed dotted rule} is a dotted rule with its dot
position after the end of its RHS,
for example,
\begin{equation*}
\Vdr{completed} = \tuple{\Vsym{A} \de \Vstr{alpha} \mydot }.
\end{equation*}
Predicted, confirmed and completed dotted rules
are also called, respectively,
\dfn{predictions}, \dfn{confirmations} and \dfn{completions}.
An Earley item (type \type{EIM}) is a triple,\footnote{%
This definition of \type{EIM} departs from tradition.
According to the tradition, the current location is not an element
of the tuples which define \type{EIM}s.
Instead \type{EIM}s
are grouped into sets
that share the same current location.
These sets of co-located \type{EIM}s are called
``Earley sets''.
Membership in an Earley set becomes a ``property''
of each \type{EIM}.
In set theory, by the Axiom of Extensionality,
sets with the same membership (or extension)
are equivalent.
In set theory, non-extensional properties,
such as membership in an Earley set,
do not make sets with the same extension distinct.
It can happen that \type{EIM}s in different Earley sets,
by the Axiom of Extensionality, are equal as sets.
Since in this paper, as in most modern mathematics, equality
means equality as sets,
and since \type{EIM}s at different current locations
are conceptually distinct for almost all purposes,
the omission of current location from the \type{EIM}
definition is awkward.
For this reason,
in this paper, we honor the traditional definition of an \type{EIM}
in the breach.
}
\[ \type{EIM} = \type{DR} \times \type{LOC} \times \type{LOC}, \]
such that, when \V{x} is of type \type{EIM},
\begin{gather*}
\DR{\V{x}} \defined \Hd{\V{x}}, \\
\Origin{\V{x}} \defined \HTl{\V{x}}, \text{ and} \\
\Current{\V{x}} \defined \TTl{\V{x}}.
\end{gather*}
\begin{MYsloppy}
We say that $\DR{\Veim{x}}$ is the \dfn{dotted rule} of \V{x}.
We say that $\Current{\Veim{x}}$
is the \dfn{current location} of \Veim{x}.
The current location of an Earley item \Veim{x} is the location in the input
where the rule $\Rule{\DR{\Veim{x}}}$
was recognized as far as the dot in the dotted rule of \Veim{x}.
\end{MYsloppy}
We say that
$\Origin{\Veim{x}}$
is the \dfn{origin} of \Veim{x}.
The origin of an Earley item
is the location in the input where recognition of $\Rule{\DR{\Veim{x}}}$
started.
For convenience, the type \type{ORIG} will be a synonym
for \type{LOC}, indicating that the variable designates
the origin entry of an Earley item.
We find it convenient to apply dotted rule concepts to
\type{EIM}'s,
so that the concept applied to the \type{EIM} is
the concept applied to the dotted rule of the \type{EIM}.
The following are examples:
\begin{gather*}
\LHS{\Veim{x}} \defined \LHS{\DR{\V{dr}}}. \\
\RHS{\Veim{x}} \defined \RHS{\DR{\V{dr}}}. \\
\Pos{\Veim{x}} \defined \Pos{\DR{\V{dr}}}. \\
\Postdot{\Veim{x}} \defined \Postdot{\DR{\V{dr}}}. \\
\Next{\Veim{x}} \defined \Next{\DR{\V{dr}}}. \\
\Rule{\Veim{x}} \defined \Rule{\DR{\V{dr}}}.
\end{gather*}
An Earley parser builds a table of Earley sets,
\begin{equation*}
\Vtable{i},
\quad \text{where} \quad
0 \le \Vloc{i} \le \size{\Cw}.
\end{equation*}
Earley sets (type \type{ES}) are set of \type{EIM}s,
\[ \type{ES} = \setOf{\type{EIM}}. \]
Earley sets are often named by their location.
That is,
the bijection between \Ves{i} and \Vloc{i}
allows locations to be treated as the ``names'' of Earley sets.
We often write \Ves{i} to mean the Earley set at \Vloc{i},
and \Vloc{x} to mean the location of Earley set \Ves{x}.
The type designator \type{ES} is often omitted to avoid clutter,
especially in cases where the Earley set is not
named by location.
Occasionally the naming location is a expression,
so that
$$ \es{(\Origin{\Veim{x}})} $$
is the Earley set at the origin of the \type{EIM} \Veim{x}.
If \es{\V{working}} is an Earley set,
$\size{\es{\V{working}}}$ is the number of Earley items.
Recalling the accept rule,
\eqref{eq:accept-rule}
on page \pageref{eq:accept-rule},
we say that
the input \Cw{} is accepted if and only if
\begin{equation*}
\tuple{\tuple{\Vsym{accept} \de \Vsym{start} \mydot}, 0} \in \bigEtable{\Vsize{\Cw}}.
\end{equation*}
\section{Operations of the Earley algorithm}
\label{sec:earley-ops}
\textbf{Initialization}:
\begin{equation}
\label{eq:initial}
\inference{
\TRUE
}{
\begin{array}{c}
\VTtyped{ \tuple{ \Vdr{initial}, 0, 0 }}{EIM} \\
\end{array}
}
\end{equation}
Here \Vdr{initial} is
from \eqref{eq:initial-dr} on \pageref{eq:initial-dr}.
Earley {\bf initialization} only takes
place in Earley set 0,
and always adds exactly one \type{EIM}.
\vspace{1ex}
\textbf{Scanning}:
\begin{equation}
\label{eq:scan}
\inference{
\begin{array}{c}
\Postdot{\Veim{mainstem}} = \Cw\bigl[ \Vloc{current} \bigr]
\end{array}
}{
\begin{array}{c}
\VTtyped{ \tuple{ \Next{\Veim{mainstem}}, \Origin{\V{mainstem}},
\V{current}+1
}}{EIM}
\end{array}
}
\end{equation}
\vspace{1ex}
\textbf{Reduction}:
\begin{equation}
\label{eq:reduction}
\inference{
\begin{array}{c}
\Current{\Veim{mainstem}} = \Origin{\Veim{tributary}} \\
\Postdot{\Veim{mainstem}} = \LHS{\Veim{tributary}}
\end{array}
}{
\begin{array}{c}
\left< \begin{gathered}
\Next{\Veim{mainstem}}, \Origin{\V{mainstem}}, \\
\Current{\V{tributary}}
\end{gathered} \right> : \type{EIM}
\end{array}
}
\end{equation}
\vspace{1ex}
\textbf{Prediction}:
\begin{equation}
\label{eq:prediction}
\inference{
\begin{array}{c}
\tuple{ \Vsym{lhs} \de \Vstr{rhs} } \in \Rules{\V{g}} \\
\Postdot{\V{mainstem}} = \Vsym{lhs}
\end{array}
}{
\begin{array}{c}
\left< \begin{gathered}
\tuple{ \Vsym{lhs} \de \mydot \Vstr{rhs} }, \Current{\V{mainstem}}, \\
\Current{\V{mainstem}}
\end{gathered} \right> : \type{EIM}
\end{array}
}
\end{equation}
In traditional implementations,
the operations are applied
to create Earley sets,
in order from 0 to \Vsize{w}.
Duplicate \type{EIM}s are not added.
In typical implementations,
scanning is run
ahead of the other operations
in the sense that, while the other operations are adding
items to the Earley set at \Vloc{i},
scanned items are added to \es{(\V{i}+1)}.
Traditionally, each Earley set is implementated as a list.
The result of a prediction operation,
\eqref{eq:prediction} on page \pageref{eq:prediction},
may be the \V{mainstem} of a reduction operation,
\eqref{eq:reduction} on page \pageref{eq:reduction}.%
\label{text:prediction-completion-cycle}
That reduction operation may, in turn,
produce new predictions.
Typically, an implementation dealt with this
by making repeated passes
through the Earley set,
terminating when no more \type{EIM}s could be added.
\section{The Aycock-Horspool ``ideal'' solution}
\label{sec:aycock-horspool-ideal-solution}
The need to make multiple passes over each Earley set
was long seen as a problem.
Aycock and Horspool experimented with a suggestion
from Jay Earley that required
a dynamically-updated data structure,
but found this solution unsatisfactory~\cite[p. 621]{AH2002}.
Instead, they found a way to revise the prediction step
itself that eliminated the problem.
To present Aycock and Horspool's revised prediction step,
we first rewrite the original prediction operation
\eqref{eq:prediction} on page \pageref{eq:prediction},
in the form of a function,
\[ \Vtyped{Opred}{\type{EIM} \tfnMap \setOf{\type{EIM}}}, \]
such that
\begin{equation}
\label{eq:original-earley-prediction}
\begin{gathered}
\VFA{Opred}{\Veim{x}} =
\left\lbrace
\Veim{pred} \; \middle| \;
\begin{gathered}
\LHS{\V{pred}} = \Postdot{\V{eim}} \\
\land \, \Rule{\V{pred}} \in \Rules{\V{g}} \\
\land \, \Pos{\V{pred}} = 0 \\
\land \, \Current{\V{pred}} = \Current{\Veim{x}} \\
\land \, \Origin{\V{pred}} = \Current{\Veim{x}}
\end{gathered}
\right\rbrace . \end{gathered}
\end{equation}
Let
\( \Vtyped{AHpred}{\type{EIM} \tfnMap \setOf{\type{EIM}}} \)
be the transitive closure of \Fname{Opred}.
More precisely,
\begin{equation}
\label{eq:ah-earley-prediction}
\VFA{AHpred}{\VTtyped{x}{EIM}} = \Fname{predAll} \circ \Fname{Opred},
\end{equation}
where
\begin{equation*}
\begin{gathered}
\VFA{predAll}{\Vtyped{eimSet}{\setOf{\type{EIM}}}} = \\
\setB{\VTtyped{eim}{EIM}}{ \existQ{\Vtyped{i}{\naturals}}{
\V{eim} \in \VFA{predN}{\V{i}, \V{eimSet}} } }, \\[1ex]
\VFA{predN}{0, \Vtyped{eimSet}{\setOf{\type{EIM}}}} = \V{eimSet}, \text{ and} \\[1ex]
\VFA{predN}{(\Vtyped{n}{\naturals})+1, \Vtyped{eimSet}{\setOf{\type{EIM}}}} = \\
\setB{ \VTtyped{eim}{EIM} }{ \existQ{\VTtyped{eim2}{EIM}}{ \\
\V{eim} \in \VFA{Opred}{\V{eim2}} \\
\land \, \V{eim2} \in \VFA{predN}{\V{n},\V{eimSet}}}}.
\end{gathered}
\end{equation*}
Aycock and Horspool proved \cite[pp. 621-622]{AH2002} that,
by replacing the Earley prediction operation
(equation \ref{eq:original-earley-prediction}
on page \pageref{eq:original-earley-prediction}),
with its transitive closure
(equation \ref{eq:ah-earley-prediction}
on page \pageref{eq:ah-earley-prediction}),
they had created an algorithm that successful dealt with
nullable symbols.
Their algorithm required only one pass,
``retain[ed] the elegance of Earley's algorithm'',
and did not require a new, dynamically-updated data structure.
\section{The Aycock-Horspool finite automata}
\label{sec:aycock-horspool-finite-automata}
Aycock and Horspool called the algorithm of
the previous section ``ideal'',
but the quote marks are theirs
\cite[p. 621]{AH2002}.
This seems to reflect a perception on their part that,
while they had an idea
that allowed an Earley-based parse engine
to handle nullable symbols gracefully,
a practical implementation would demand some refinements.
The ``ideal'' solution of~\cite{AH2002},
if naively implemented,
required every prediction operation to compute a transitive closure.
But, during the parse, this transitive closure was a constant ---
it depended only on the grammar and the postdot symbol of the argument EIM.
Clearly, precomputation could eliminate most or all of the runtime
cost of their new prediction operation.
For their precomputation, Aycock and Horspool~\cite{AH2002}
invented a new semi-deterministic finite automata,
which \cite{AH2002} calls a ``split LR(0) $\epsilon$-DFA''.
In this paper,
we calling their ``split LR(0) $\epsilon$-DFA'',
an Aycock-Horspool Finite Automata (AHFA).
The AHFA is based
on a few observations.
\begin{itemize}
\item
In practice, Earley items sharing the same origin,
but having different dotted rules,
often appear together in the same Earley set.
\item
There is in the literature a method
for associating groups of dotted rules that often appear together
when parsing.
This method is the LR(0) DFA used in the much-studied
LALR and LR parsers.
\item
The LR(0) items that are the components of LR(0)
states are, exactly, dotted rules.
\item
By taking into account symbols that derive the
null string, the LR(0) DFA could be turned into an