BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding
Jacob Devlin
Ming-Wei Chang
Kenton Lee
Kristina Toutanova
Google AI Language
{jacobdevlin,mingweichang,kentonl,kristout}@google.com
Abstract
We introduce a new language representa-
tion model called BERT, which stands for
Bidirectional Encoder Representations from
Transformers. Unlike recent language repre-
sentation models (Peters et al., 2018a; Rad-
ford et al., 2018), BERT is designed to pre-
train deep bidirectional representations from
unlabeled text by jointly conditioning on both
left and right context in all layers. As a re-
sult, the pre-trained BERT model can be fine-
tuned with just one additional output layer
to create state-of-the-art models for a wide
range of tasks, such as question answering and
language inference, without substantial task-
specific architecture modifications.
BERT is conceptually simple and empirically
powerful.
It obtains new state-of-the-art re-
sults on eleven natural language processing
tasks, including pushing the GLUE score to
80.5% (7.7% point absolute improvement),
MultiNLI accuracy to 86.7% (4.6% absolute
improvement), SQuAD v1.1 question answer-
ing Test F1 to 93.2 (1.5 point absolute im-
provement) and SQuAD v2.0 Test F1 to 83.1
(5.1 point absolute improvement).
1 Introduction
Language model pre-training has been shown to
be effective for improving many natural language
processing tasks (Dai and Le, 2015; Peters et al.,
2018a; Radford et al., 2018; Howard and Ruder,
2018). These include sentence-level tasks such as
natural language inference (Bowman et al., 2015;
Williams et al., 2018) and paraphrasing (Dolan
and Brockett, 2005), which aim to predict the re-
lationships between sentences by analyzing them
holistically, as well as token-level tasks such as
named entity recognition and question answering,
where models are required to produce fine-grained
output at the token level (Tjong Kim Sang and
De Meulder, 2003; Rajpurkar et al., 2016).
There are two existing strategies for apply-
ing pre-trained language representations to down-
stream tasks: feature-based and fine-tuning. The
feature-based approach, such as ELMo (Peters
et al., 2018a), uses task-specific architectures that
include the pre-trained representations as addi-
tional features. The fine-tuning approach, such as
the Generative Pre-trained Transformer (OpenAI
GPT) (Radford et al., 2018), introduces minimal
task-specific parameters, and is trained on the
downstream tasks by simply fine-tuning all pre-
trained parameters. The two approaches share the
same objective function during pre-training, where
they use unidirectional language models to learn
general language representations.
We argue that current techniques restrict the
power of the pre-trained representations, espe-
cially for the fine-tuning approaches.
The ma-
jor limitation is that standard language models are
unidirectional, and this limits the choice of archi-
tectures that can be used during pre-training. For
example, in OpenAI GPT, the authors use a left-to-
right architecture, where every token can only at-
tend to previous tokens in the self-attention layers
of the Transformer (Vaswani et al., 2017). Such re-
strictions are sub-optimal for sentence-level tasks,
and could be very harmful when applying fine-
tuning based approaches to token-level tasks such
as question answering, where it is crucial to incor-
porate context from both directions.
In this paper, we improve the fine-tuning based
approaches by proposing BERT: Bidirectional Encoder Representations from Transformers.
BERT alleviates the previously mentioned unidi-
rectionality constraint by using a “masked lan-
guage model” (MLM) pre-training objective, in-
spired by the Cloze task (Taylor, 1953).
The
masked language model randomly masks some of
the tokens from the input, and the objective is to
predict the original vocabulary id of the masked
word based only on its context.
Unlike left-to-
right language model pre-training, the MLM ob-
jective enables the representation to fuse the left
and the right context, which allows us to pre-
train a deep bidirectional Transformer. In addi-
tion to the masked language model, we also use
a “next sentence prediction” task that jointly pre-
trains text-pair representations. The contributions
of our paper are as follows:
• We demonstrate the importance of bidirectional
pre-training for language representations. Un-
like Radford et al. (2018), which uses unidirec-
tional language models for pre-training, BERT
uses masked language models to enable pre-
trained deep bidirectional representations. This
is also in contrast to Peters et al. (2018a), which
uses a shallow concatenation of independently
trained left-to-right and right-to-left LMs.
• We show that pre-trained representations reduce
the need for many heavily-engineered task-
specific architectures. BERT is the first fine-
tuning based representation model that achieves
state-of-the-art performance on a large suite
of sentence-level and token-level tasks, outper-
forming many task-specific architectures.
• BERT advances the state of the art for eleven
NLP tasks.
The code and pre-trained mod-
els are available at https://github.com/
google-research/bert.
2 Related Work
There is a long history of pre-training general lan-
guage representations, and we briefly review the
most widely-used approaches in this section.
2.1 Unsupervised Feature-based Approaches
Learning widely applicable representations of
words has been an active area of research for
decades, including non-neural (Brown et al., 1992;
Ando and Zhang, 2005; Blitzer et al., 2006) and
neural (Mikolov et al., 2013; Pennington et al.,
2014) methods.
Pre-trained word embeddings
are an integral part of modern NLP systems, of-
fering significant improvements over embeddings
learned from scratch (Turian et al., 2010). To pre-
train word embedding vectors, left-to-right lan-
guage modeling objectives have been used (Mnih
and Hinton, 2009), as well as objectives to dis-
criminate correct from incorrect words in left and
right context (Mikolov et al., 2013).
These approaches have been generalized to
coarser granularities, such as sentence embed-
dings (Kiros et al., 2015; Logeswaran and Lee,
2018) or paragraph embeddings (Le and Mikolov,
2014).
To train sentence representations, prior
work has used objectives to rank candidate next
sentences (Jernite et al., 2017; Logeswaran and
Lee, 2018), left-to-right generation of next sen-
tence words given a representation of the previous
sentence (Kiros et al., 2015), or denoising auto-
encoder derived objectives (Hill et al., 2016).
ELMo and its predecessor (Peters et al., 2017,
2018a) generalize traditional word embedding re-
search along a different dimension. They extract
context-sensitive features from a left-to-right and a
right-to-left language model. The contextual rep-
resentation of each token is the concatenation of
the left-to-right and right-to-left representations.
When integrating contextual word embeddings
with existing task-specific architectures, ELMo
advances the state of the art for several major NLP
benchmarks (Peters et al., 2018a) including ques-
tion answering (Rajpurkar et al., 2016), sentiment
analysis (Socher et al., 2013), and named entity
recognition (Tjong Kim Sang and De Meulder,
2003). Melamud et al. (2016) proposed learning
contextual representations through a task to pre-
dict a single word from both left and right context
using LSTMs. Similar to ELMo, their model is
feature-based and not deeply bidirectional. Fedus
et al. (2018) shows that the cloze task can be used
to improve the robustness of text generation mod-
els.
2.2 Unsupervised Fine-tuning Approaches
As with the feature-based approaches, the first
works in this direction only pre-trained word em-
bedding parameters from unlabeled text
(Col-
lobert and Weston, 2008).
More recently, sentence or document encoders
which produce contextual token representations
have been pre-trained from unlabeled text and
fine-tuned for a supervised downstream task (Dai
and Le, 2015; Howard and Ruder, 2018; Radford
et al., 2018). The advantage of these approaches
is that few parameters need to be learned from
scratch.
At least partly due to this advantage,
OpenAI GPT (Radford et al., 2018) achieved pre-
viously state-of-the-art results on many sentence-
level tasks from the GLUE benchmark (Wang
et al., 2018a).
Left-to-right language modeling and auto-encoder objectives have been used for pre-training such models (Howard and Ruder, 2018; Radford et al., 2018; Dai and Le, 2015).

Figure 1: Overall pre-training and fine-tuning procedures for BERT. Apart from output layers, the same architectures are used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different down-stream tasks. During fine-tuning, all parameters are fine-tuned. [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token (e.g. separating questions/answers).
2.3 Transfer Learning from Supervised Data
There has also been work showing effective trans-
fer from supervised tasks with large datasets, such
as natural language inference (Conneau et al.,
2017) and machine translation (McCann et al.,
2017). Computer vision research has also demon-
strated the importance of transfer learning from
large pre-trained models, where an effective recipe
is to fine-tune models pre-trained with Ima-
geNet (Deng et al., 2009; Yosinski et al., 2014).
3 BERT
We introduce BERT and its detailed implementa-
tion in this section. There are two steps in our
framework: pre-training and fine-tuning.
Dur-
ing pre-training, the model is trained on unlabeled
data over different pre-training tasks.
For fine-
tuning, the BERT model is first initialized with
the pre-trained parameters, and all of the param-
eters are fine-tuned using labeled data from the
downstream tasks. Each downstream task has sep-
arate fine-tuned models, even though they are ini-
tialized with the same pre-trained parameters. The
question-answering example in Figure 1 will serve
as a running example for this section.
A distinctive feature of BERT is its unified ar-
chitecture across different tasks. There is mini-
mal difference between the pre-trained architec-
ture and the final downstream architecture.
Model Architecture
BERT’s model architec-
ture is a multi-layer bidirectional Transformer en-
coder based on the original implementation de-
scribed in Vaswani et al. (2017) and released in
the tensor2tensor library.1 Because the use
of Transformers has become common and our im-
plementation is almost identical to the original,
we will omit an exhaustive background descrip-
tion of the model architecture and refer readers to
Vaswani et al. (2017) as well as excellent guides
such as “The Annotated Transformer.”2
In this work, we denote the number of layers
(i.e., Transformer blocks) as L, the hidden size as
H, and the number of self-attention heads as A.3
We primarily report results on two model sizes:
BERTBASE (L=12, H=768, A=12, Total Param-
eters=110M) and BERTLARGE (L=24, H=1024,
A=16, Total Parameters=340M).
BERTBASE was chosen to have the same model
size as OpenAI GPT for comparison purposes.
Critically, however, the BERT Transformer uses
bidirectional self-attention, while the GPT Trans-
former uses constrained self-attention where every
token can only attend to context to its left.4
1https://github.com/tensorflow/tensor2tensor
2http://nlp.seas.harvard.edu/2018/04/03/attention.html
3In all cases we set the feed-forward/filter size to be 4H,
i.e., 3072 for the H = 768 and 4096 for the H = 1024.
4We note that in the literature the bidirectional Transformer is often referred to as a “Transformer encoder” while the left-context-only version is referred to as a “Transformer decoder” since it can be used for text generation.
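For reference, the two reported configurations can be summarized as plain dictionaries; this is only a restatement of the numbers above, not the released configuration format.

# Hypothetical summary of the two reported model sizes (not the released config format).
BERT_BASE = dict(num_layers=12, hidden_size=768, num_attention_heads=12,
                 feed_forward_size=4 * 768)     # 3072; ~110M total parameters
BERT_LARGE = dict(num_layers=24, hidden_size=1024, num_attention_heads=16,
                  feed_forward_size=4 * 1024)   # 4096; ~340M total parameters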
Input/Output Representations
To make BERT
handle a variety of down-stream tasks, our input
representation is able to unambiguously represent
both a single sentence and a pair of sentences
(e.g., ⟨ Question, Answer ⟩) in one token sequence.
Throughout this work, a “sentence” can be an arbi-
trary span of contiguous text, rather than an actual
linguistic sentence. A “sequence” refers to the in-
put token sequence to BERT, which may be a sin-
gle sentence or two sentences packed together.
We use WordPiece embeddings (Wu et al.,
2016) with a 30,000 token vocabulary. The first
token of every sequence is always a special clas-
sification token ([CLS]). The final hidden state
corresponding to this token is used as the ag-
gregate sequence representation for classification
tasks. Sentence pairs are packed together into a
single sequence. We differentiate the sentences in
two ways. First, we separate them with a special
token ([SEP]). Second, we add a learned embed-
ding to every token indicating whether it belongs
to sentence A or sentence B. As shown in Figure 1,
we denote input embedding as E, the final hidden
vector of the special [CLS] token as C ∈ RH,
and the final hidden vector for the ith input token
as Ti ∈ RH.
For a given token, its input representation is
constructed by summing the corresponding token,
segment, and position embeddings. A visualiza-
tion of this construction can be seen in Figure 2.
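As a concrete illustration of this sum, the following minimal NumPy sketch builds the input representation for one sequence; the table sizes follow BERT-Base, while the 512-position limit, the random initialization, and the example token ids are assumptions made only for illustration.

import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, HIDDEN = 30000, 768        # WordPiece vocabulary size and BERT-Base hidden size
MAX_POSITIONS, NUM_SEGMENTS = 512, 2   # assumed maximum length; segments A and B

# Learned lookup tables in the real model; randomly initialized here for illustration.
token_table    = rng.normal(scale=0.02, size=(VOCAB_SIZE, HIDDEN))
segment_table  = rng.normal(scale=0.02, size=(NUM_SEGMENTS, HIDDEN))
position_table = rng.normal(scale=0.02, size=(MAX_POSITIONS, HIDDEN))

def input_representation(token_ids, segment_ids):
    """Sum of token, segment, and position embeddings for one token sequence."""
    positions = np.arange(len(token_ids))
    return token_table[token_ids] + segment_table[segment_ids] + position_table[positions]

# Hypothetical ids for "[CLS] my dog [SEP] cute [SEP]"; segment 0 = sentence A, 1 = sentence B.
embeddings = input_representation(np.array([0, 11, 12, 1, 13, 1]),
                                  np.array([0, 0, 0, 0, 1, 1]))
print(embeddings.shape)  # (6, 768)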
3.1 Pre-training BERT
Unlike Peters et al. (2018a) and Radford et al.
(2018), we do not use traditional left-to-right or
right-to-left language models to pre-train BERT.
Instead, we pre-train BERT using two unsuper-
vised tasks, described in this section. This step
is presented in the left part of Figure 1.
Task #1: Masked LM
Intuitively, it is reason-
able to believe that a deep bidirectional model is
strictly more powerful than either a left-to-right
model or the shallow concatenation of a left-to-
right and a right-to-left model.
Unfortunately,
standard conditional language models can only be
trained left-to-right or right-to-left, since bidirec-
tional conditioning would allow each word to in-
directly “see itself”, and the model could trivially
predict the target word in a multi-layered context.
In order to train a deep bidirectional representa-
tion, we simply mask some percentage of the input
tokens at random, and then predict those masked
tokens. We refer to this procedure as a “masked
LM” (MLM), although it is often referred to as a
Cloze task in the literature (Taylor, 1953). In this
case, the final hidden vectors corresponding to the
mask tokens are fed into an output softmax over
the vocabulary, as in a standard LM. In all of our
experiments, we mask 15% of all WordPiece to-
kens in each sequence at random. In contrast to
denoising auto-encoders (Vincent et al., 2008), we
only predict the masked words rather than recon-
structing the entire input.
Although this allows us to obtain a bidirec-
tional pre-trained model, a downside is that we
are creating a mismatch between pre-training and
fine-tuning, since the [MASK] token does not ap-
pear during fine-tuning. To mitigate this, we do
not always replace “masked” words with the ac-
tual [MASK] token. The training data generator
chooses 15% of the token positions at random for
prediction. If the i-th token is chosen, we replace
the i-th token with (1) the [MASK] token 80% of
the time (2) a random token 10% of the time (3)
the unchanged i-th token 10% of the time. Then,
Ti will be used to predict the original token with
cross entropy loss. We compare variations of this
procedure in Appendix C.2.
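A minimal sketch of this corruption step over a list of WordPiece tokens follows; the function name, the toy vocabulary, and the special-token handling are assumptions for illustration, not the released data generator.

import random

MASK_RATE = 0.15  # fraction of token positions chosen for prediction

def mask_tokens(tokens, vocab, mask_token="[MASK]", special=("[CLS]", "[SEP]")):
    """Return (corrupted tokens, labels); labels hold the original token at chosen positions."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    candidates = [i for i, t in enumerate(tokens) if t not in special]
    num_to_predict = max(1, int(len(candidates) * MASK_RATE))
    for i in random.sample(candidates, num_to_predict):
        labels[i] = tokens[i]                       # Ti is trained to predict this token
        r = random.random()
        if r < 0.8:
            corrupted[i] = mask_token               # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = random.choice(vocab)     # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels

corrupted, labels = mask_tokens(
    ["[CLS]", "my", "dog", "is", "cute", "[SEP]"],
    vocab=["my", "dog", "is", "cute", "he", "likes", "play", "##ing"])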
Task #2: Next Sentence Prediction (NSP)
Many important downstream tasks such as Ques-
tion Answering (QA) and Natural Language Infer-
ence (NLI) are based on understanding the rela-
tionship between two sentences, which is not di-
rectly captured by language modeling. In order
to train a model that understands sentence rela-
tionships, we pre-train for a binarized next sen-
tence prediction task that can be trivially gener-
ated from any monolingual corpus. Specifically,
when choosing the sentences A and B for each pre-
training example, 50% of the time B is the actual
next sentence that follows A (labeled as IsNext),
and 50% of the time it is a random sentence from
the corpus (labeled as NotNext).
As we show
in Figure 1, C is used for next sentence predic-
tion (NSP).5 Despite its simplicity, we demon-
strate in Section 5.1 that pre-training towards this
task is very beneficial to both QA and NLI. 6
5The final model achieves 97%-98% accuracy on NSP.
6The vector C is not a meaningful sentence representation
without fine-tuning, since it was trained with NSP.
Figure 2: BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings.
The NSP task is closely related to representation-
learning objectives used in Jernite et al. (2017) and
Logeswaran and Lee (2018). However, in prior
work, only sentence embeddings are transferred to
down-stream tasks, whereas BERT transfers all pa-
rameters to initialize end-task model parameters.
Pre-training data The pre-training procedure
largely follows the existing literature on language
model pre-training. For the pre-training corpus we
use the BooksCorpus (800M words) (Zhu et al.,
2015) and English Wikipedia (2,500M words).
For Wikipedia we extract only the text passages
and ignore lists, tables, and headers. It is criti-
cal to use a document-level corpus rather than a
shuffled sentence-level corpus such as the Billion
Word Benchmark (Chelba et al., 2013) in order to
extract long contiguous sequences.
3.2 Fine-tuning BERT
Fine-tuning is straightforward since the self-
attention mechanism in the Transformer al-
lows BERT to model many downstream tasks—
whether they involve single text or text pairs—by
swapping out the appropriate inputs and outputs.
For applications involving text pairs, a common
pattern is to independently encode text pairs be-
fore applying bidirectional cross attention, such
as Parikh et al. (2016); Seo et al. (2017). BERT
instead uses the self-attention mechanism to unify
these two stages, as encoding a concatenated text
pair with self-attention effectively includes bidi-
rectional cross attention between two sentences.
For each task, we simply plug in the task-
specific inputs and outputs into BERT and fine-
tune all the parameters end-to-end.
At the in-
put, sentence A and sentence B from pre-training
are analogous to (1) sentence pairs in paraphras-
ing, (2) hypothesis-premise pairs in entailment, (3)
question-passage pairs in question answering, and
(4) a degenerate text-∅ pair in text classification
or sequence tagging. At the output, the token rep-
resentations are fed into an output layer for token-
level tasks, such as sequence tagging or question
answering, and the [CLS] representation is fed
into an output layer for classification, such as en-
tailment or sentiment analysis.
Compared to pre-training, fine-tuning is rela-
tively inexpensive. All of the results in the pa-
per can be replicated in at most 1 hour on a sin-
gle Cloud TPU, or a few hours on a GPU, starting
from the exact same pre-trained model.7 We de-
scribe the task-specific details in the correspond-
ing subsections of Section 4. More details can be
found in Appendix A.5.
4 Experiments
In this section, we present BERT fine-tuning re-
sults on 11 NLP tasks.
4.1 GLUE
The General Language Understanding Evaluation
(GLUE) benchmark (Wang et al., 2018a) is a col-
lection of diverse natural language understanding
tasks. Detailed descriptions of GLUE datasets are
included in Appendix B.1.
To fine-tune on GLUE, we represent the input
sequence (for single sentence or sentence pairs)
as described in Section 3, and use the final hid-
den vector C ∈ RH corresponding to the first
input token ([CLS]) as the aggregate representa-
tion. The only new parameters introduced during
fine-tuning are classification layer weights W ∈
RK×H, where K is the number of labels. We com-
pute a standard classification loss with C and W,
i.e., log(softmax(CW^T)).
7For example, the BERT SQuAD model can be trained in
around 30 minutes on a single Cloud TPU to achieve a Dev
F1 score of 91.0%.
8See (10) in https://gluebenchmark.com/faq.
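As a small NumPy sketch of this classification loss (shapes are hypothetical, e.g. K = 3 labels as in MNLI, and C would be the final [CLS] hidden vector):

import numpy as np

def glue_classification_loss(C, W, label):
    """Negative log-likelihood of the gold label, i.e. -log(softmax(C W^T))[label]."""
    logits = W @ C                                      # (K,) one score per label
    logits = logits - logits.max()                      # for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())   # log-softmax over the K labels
    return -log_probs[label]

# Hypothetical sizes: hidden size H = 768, K = 3 labels.
rng = np.random.default_rng(0)
H, K = 768, 3
loss = glue_classification_loss(rng.normal(size=H), 0.02 * rng.normal(size=(K, H)), label=1)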
System              MNLI-(m/mm)  QQP    QNLI   SST-2  CoLA   STS-B  MRPC   RTE    Average
(train examples)    392k         363k   108k   67k    8.5k   5.7k   3.5k   2.5k   -
Pre-OpenAI SOTA     80.6/80.1    66.1   82.3   93.2   35.0   81.0   86.0   61.7   74.0
BiLSTM+ELMo+Attn    76.4/76.1    64.8   79.8   90.4   36.0   73.3   84.9   56.8   71.0
OpenAI GPT          82.1/81.4    70.3   87.4   91.3   45.4   80.0   82.3   56.0   75.1
BERTBASE            84.6/83.4    71.2   90.5   93.5   52.1   85.8   88.9   66.4   79.6
BERTLARGE           86.7/85.9    72.1   92.7   94.9   60.5   86.5   89.3   70.1   82.1
Table 1: GLUE Test results, scored by the evaluation server (https://gluebenchmark.com/leaderboard).
The number below each task denotes the number of training examples. The “Average” column is slightly different
than the official GLUE score, since we exclude the problematic WNLI set.8 BERT and OpenAI GPT are single-
model, single task. F1 scores are reported for QQP and MRPC, Spearman correlations are reported for STS-B, and
accuracy scores are reported for the other tasks. We exclude entries that use BERT as one of their components.
We use a batch size of 32 and fine-tune for 3
epochs over the data for all GLUE tasks. For each
task, we selected the best fine-tuning learning rate
(among 5e-5, 4e-5, 3e-5, and 2e-5) on the Dev set.
Additionally, for BERTLARGE we found that fine-
tuning was sometimes unstable on small datasets,
so we ran several random restarts and selected the
best model on the Dev set. With random restarts,
we use the same pre-trained checkpoint but per-
form different fine-tuning data shuffling and clas-
sifier layer initialization.9
Results are presented in Table 1.
Both
BERTBASE and BERTLARGE outperform all sys-
tems on all tasks by a substantial margin, obtaining
4.5% and 7.0% respective average accuracy im-
provement over the prior state of the art. Note that
BERTBASE and OpenAI GPT are nearly identical
in terms of model architecture apart from the at-
tention masking. For the largest and most widely
reported GLUE task, MNLI, BERT obtains a 4.6%
absolute accuracy improvement. On the official
GLUE leaderboard10, BERTLARGE obtains a score
of 80.5, compared to OpenAI GPT, which obtains
72.8 as of the date of writing.
We find that BERTLARGE significantly outper-
forms BERTBASE across all tasks, especially those
with very little training data. The effect of model
size is explored more thoroughly in Section 5.2.
4.2 SQuAD v1.1
The Stanford Question Answering Dataset (SQuAD v1.1) is a collection of 100k crowd-sourced question/answer pairs (Rajpurkar et al., 2016). Given a question and a passage from Wikipedia containing the answer, the task is to predict the answer text span in the passage.
9The GLUE data set distribution does not include the Test labels, and we only made a single GLUE evaluation server submission for each of BERTBASE and BERTLARGE.
10https://gluebenchmark.com/leaderboard
As shown in Figure 1, in the question answer-
ing task, we represent the input question and pas-
sage as a single packed sequence, with the ques-
tion using the A embedding and the passage using
the B embedding. We only introduce a start vec-
tor S ∈ RH and an end vector E ∈ RH during
fine-tuning. The probability of word i being the
start of the answer span is computed as a dot prod-
uct between Ti and S followed by a softmax over
all of the words in the paragraph: Pi = e^{S·Ti} / Σj e^{S·Tj}.
The analogous formula is used for the end of the
answer span. The score of a candidate span from
position i to position j is defined as S·Ti + E·Tj,
and the maximum scoring span where j ≥ i is
used as a prediction. The training objective is the
sum of the log-likelihoods of the correct start and
end positions. We fine-tune for 3 epochs with a
learning rate of 5e-5 and a batch size of 32.
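The sketch below computes the start probabilities and searches for the best-scoring span; the cap on answer length and the function names are assumptions used only to keep the example small.

import numpy as np

def start_probabilities(T, S):
    """P_i = exp(S·T_i) / sum_j exp(S·T_j) over all paragraph tokens. T: (seq_len, H), S: (H,)."""
    scores = T @ S
    scores = scores - scores.max()
    return np.exp(scores) / np.exp(scores).sum()

def best_answer_span(T, S, E, max_answer_len=30):
    """Return (i, j, score) maximizing S·T_i + E·T_j over spans with j >= i."""
    start_scores, end_scores = T @ S, T @ E
    best = (0, 0, -np.inf)
    for i in range(len(T)):
        for j in range(i, min(i + max_answer_len, len(T))):
            score = start_scores[i] + end_scores[j]
            if score > best[2]:
                best = (i, j, score)
    return best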
Table 2 shows top leaderboard entries as well
as results from top published systems (Seo et al.,
2017; Clark and Gardner, 2018; Peters et al.,
2018a; Hu et al., 2018). The top results from the
SQuAD leaderboard do not have up-to-date public
system descriptions available,11 and are allowed to
use any public data when training their systems.
We therefore use modest data augmentation in
our system by first fine-tuning on TriviaQA (Joshi
et al., 2017) before fine-tuning on SQuAD.
Our best performing system outperforms the top
leaderboard system by +1.5 F1 in ensembling and
+1.3 F1 as a single system. In fact, our single
BERT model outperforms the top ensemble sys-
tem in terms of F1 score. Without TriviaQA fine-tuning data, we only lose 0.1-0.4 F1, still outperforming all existing systems by a wide margin.12
11QANet is described in Yu et al. (2018), but the system has improved substantially after publication.
System                        Dev EM  Dev F1  Test EM  Test F1
Top Leaderboard Systems (Dec 10th, 2018)
Human                         -       -       82.3     91.2
#1 Ensemble - nlnet           -       -       86.0     91.7
#2 Ensemble - QANet           -       -       84.5     90.5
Published
BiDAF+ELMo (Single)           -       85.6    -        85.8
R.M. Reader (Ensemble)        81.2    87.9    82.3     88.5
Ours
BERTBASE (Single)             80.8    88.5    -        -
BERTLARGE (Single)            84.1    90.9    -        -
BERTLARGE (Ensemble)          85.8    91.8    -        -
BERTLARGE (Sgl.+TriviaQA)     84.2    91.1    85.1     91.8
BERTLARGE (Ens.+TriviaQA)     86.2    92.2    87.4     93.2

Table 2: SQuAD 1.1 results. The BERT ensemble is 7x systems which use different pre-training checkpoints and fine-tuning seeds.
System                        Dev EM  Dev F1  Test EM  Test F1
Top Leaderboard Systems (Dec 10th, 2018)
Human                         86.3    89.0    86.9     89.5
#1 Single - MIR-MRC (F-Net)   -       -       74.8     78.0
#2 Single - nlnet             -       -       74.2     77.1
Published
unet (Ensemble)               -       -       71.4     74.9
SLQA+ (Single)                -       -       71.4     74.4
Ours
BERTLARGE (Single)            78.7    81.9    80.0     83.1

Table 3: SQuAD 2.0 results. We exclude entries that use BERT as one of their components.
4.3 SQuAD v2.0
The SQuAD 2.0 task extends the SQuAD 1.1
problem definition by allowing for the possibility
that no short answer exists in the provided para-
graph, making the problem more realistic.
We use a simple approach to extend the SQuAD
v1.1 BERT model for this task. We treat ques-
tions that do not have an answer as having an an-
swer span with start and end at the [CLS] to-
ken. The probability space for the start and end
answer span positions is extended to include the
position of the [CLS] token. For prediction, we
compare the score of the no-answer span, snull = S·C + E·C, to the score of the best non-null span, ŝi,j = max_{j≥i} (S·Ti + E·Tj).
12The TriviaQA data we used consists of paragraphs from
TriviaQA-Wiki formed of the first 400 tokens in documents,
that contain at least one of the provided possible answers.
System                    Dev    Test
ESIM+GloVe                51.9   52.7
ESIM+ELMo                 59.1   59.2
OpenAI GPT                -      78.0
BERTBASE                  81.6   -
BERTLARGE                 86.6   86.3
Human (expert)†           -      85.0
Human (5 annotations)†    -      88.0

Table 4: SWAG Dev and Test accuracies. †Human performance is measured with 100 samples, as reported in the SWAG paper.
We predict a non-null answer when ŝi,j > snull + τ, where the threshold τ is selected on the dev set to maximize F1.
We did not use TriviaQA data for this model. We
fine-tuned for 2 epochs with a learning rate of 5e-5
and a batch size of 48.
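A sketch of this prediction rule, reusing the same hypothetical span-length cap as above:

import numpy as np

def predict_span_or_null(T, C, S, E, tau, max_answer_len=30):
    """Return the best non-null span (i, j) if its score beats s_null + tau, else None."""
    s_null = S @ C + E @ C                       # score of the no-answer span at [CLS]
    start_scores, end_scores = T @ S, T @ E
    best_score, best_span = -np.inf, None
    for i in range(len(T)):
        for j in range(i, min(i + max_answer_len, len(T))):
            score = start_scores[i] + end_scores[j]
            if score > best_score:
                best_score, best_span = score, (i, j)
    return best_span if best_score > s_null + tau else None   # None means "no answer"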
The results compared to prior leaderboard en-
tries and top published work (Sun et al., 2018;
Wang et al., 2018b) are shown in Table 3, exclud-
ing systems that use BERT as one of their com-
ponents. We observe a +5.1 F1 improvement over
the previous best system.
4.4 SWAG
The Situations With Adversarial Generations
(SWAG) dataset contains 113k sentence-pair com-
pletion examples that evaluate grounded common-
sense inference (Zellers et al., 2018). Given a sen-
tence, the task is to choose the most plausible con-
tinuation among four choices.
When fine-tuning on the SWAG dataset, we
construct four input sequences, each containing
the concatenation of the given sentence (sentence
A) and a possible continuation (sentence B). The
only task-specific parameter introduced is a vec-
tor whose dot product with the [CLS] token rep-
resentation C denotes a score for each choice
which is normalized with a softmax layer.
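A sketch of this scoring, assuming the four final [CLS] vectors for the candidate sequences have already been computed:

import numpy as np

def swag_choice_probabilities(cls_vectors, v):
    """cls_vectors: (4, H) [CLS] states, one per candidate ending; v: (H,) task-specific vector."""
    scores = cls_vectors @ v              # one dot product per (sentence, continuation) pair
    scores = scores - scores.max()
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs                          # predict np.argmax(probs)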
We fine-tune the model for 3 epochs with a
learning rate of 2e-5 and a batch size of 16. Re-
sults are presented in Table 4. BERTLARGE out-
performs the authors’ baseline ESIM+ELMo sys-
tem by +27.1% and OpenAI GPT by 8.3%.
5 Ablation Studies
In this section, we perform ablation experiments
over a number of facets of BERT in order to better
understand their relative importance. Additional
ablation studies can be found in Appendix C.

Tasks (Dev Set)   MNLI-m  QNLI   MRPC   SST-2  SQuAD
                  (Acc)   (Acc)  (Acc)  (Acc)  (F1)
BERTBASE          84.4    88.4   86.7   92.7   88.5
No NSP            83.9    84.9   86.5   92.6   87.9
LTR & No NSP      82.1    84.3   77.5   92.1   77.8
+ BiLSTM          82.1    84.1   75.7   91.6   84.9
Table 5: Ablation over the pre-training tasks using the
BERTBASE architecture. “No NSP” is trained without
the next sentence prediction task. “LTR & No NSP” is
trained as a left-to-right LM without the next sentence
prediction, like OpenAI GPT. “+ BiLSTM” adds a ran-
domly initialized BiLSTM on top of the “LTR + No
NSP” model during fine-tuning.
5.1 Effect of Pre-training Tasks
We demonstrate the importance of the deep bidi-
rectionality of BERT by evaluating two pre-
training objectives using exactly the same pre-
training data, fine-tuning scheme, and hyperpa-
rameters as BERTBASE:
No NSP: A bidirectional model which is trained
using the “masked LM” (MLM) but without the
“next sentence prediction” (NSP) task.
LTR & No NSP: A left-context-only model which
is trained using a standard Left-to-Right (LTR)
LM, rather than an MLM. The left-only constraint
was also applied at fine-tuning, because removing
it introduced a pre-train/fine-tune mismatch that
degraded downstream performance. Additionally,
this model was pre-trained without the NSP task.
This is directly comparable to OpenAI GPT, but
using our larger training dataset, our input repre-
sentation, and our fine-tuning scheme.
We first examine the impact brought by the NSP
task.
In Table 5, we show that removing NSP
hurts performance significantly on QNLI, MNLI,
and SQuAD 1.1. Next, we evaluate the impact
of training bidirectional representations by com-
paring “No NSP” to “LTR & No NSP”. The LTR
model performs worse than the MLM model on all
tasks, with large drops on MRPC and SQuAD.
For SQuAD it is intuitively clear that a LTR
model will perform poorly at token predictions,
since the token-level hidden states have no right-
side context. In order to make a good faith at-
tempt at strengthening the LTR system, we added
a randomly initialized BiLSTM on top. This does
significantly improve results on SQuAD, but the
results are still far worse than those of the pre-
trained bidirectional models. The BiLSTM hurts
performance on the GLUE tasks.
We recognize that it would also be possible to
train separate LTR and RTL models and represent
each token as the concatenation of the two mod-
els, as ELMo does. However: (a) this is twice as
expensive as a single bidirectional model; (b) this
is non-intuitive for tasks like QA, since the RTL
model would not be able to condition the answer
on the question; (c) it is strictly less powerful
than a deep bidirectional model, since it can use
both left and right context at every layer.
5.2 Effect of Model Size
In this section, we explore the effect of model size
on fine-tuning task accuracy. We trained a number
of BERT models with a differing number of layers,
hidden units, and attention heads, while otherwise
using the same hyperparameters and training pro-
cedure as described previously.
Results on selected GLUE tasks are shown in