IndicBART: A Pre-trained Model for Indic Natural Language Generation
Raj Dabre1 Himani Shrotriya2 Anoop Kunchukuttan3
Ratish Puduppully4 Mitesh M. Khapra5 Pratyush Kumar6
National Institute of Information and Communications Technology1 IIT Madras2,5,6
Microsoft3,6 University of Edinburgh4
arXiv:2109.02903v2 [cs.CL] 27 Oct 2022
Abstract
In this paper, we study pre-trained sequence-to-sequence models for a group of related languages, with a focus on Indic languages. We present IndicBART, a multilingual, sequence-to-sequence pre-trained model focusing on 11
Indic languages and English. IndicBART utilizes the orthographic similarity between Indic scripts to improve transfer learning between similar Indic languages. We evaluate
IndicBART on two NLG tasks: Neural Machine Translation (NMT) and extreme summarization. Our experiments on NMT and extreme summarization show that a model specific to related languages like IndicBART is
competitive with large pre-trained models like
mBART50 despite being significantly smaller.
It also performs well on very low-resource
translation scenarios where languages are not
included in pre-training or fine-tuning. Script
sharing, multilingual training, and better utilization of limited model capacity contribute
to the good performance of the compact IndicBART model.
1 Introduction
Recently, there has been significant progress in
deep learning based natural language generation
(NLG) for machine translation, abstractive summarization, data-to-text generation, etc. due to the
adoption of attention-based sequence-to-sequence
(S2S) models (conditional language models) (Wu
et al., 2016; Paulus et al., 2018; Puduppully et al.,
2019). Pre-trained S2S models have been shown
to be useful to improve performance on various
NLG tasks (Rothe et al., 2020; Kale and Rastogi,
2020; Lewis et al., 2020). Specifically, multilingual
pre-trained S2S models jointly trained on monolingual corpora from multiple languages such as
mBART25 (Liu et al., 2020), mBART50 (Tang
et al., 2020a) and mT5 (Xue et al., 2021) have seen
increased adoption and low-resource languages
have benefitted from cross-lingual transfer. However, these massively multilingual massive (M3)
models have major limitations. They serve only
a few of the world’s languages (<100 languages),
the pre-training corpora are dominated by high-resource languages, the vocabulary representation
for low-resource languages is inadequate, and the
models are large, making them expensive and slow
to train, fine-tune and decode.
An alternative approach is to build pre-trained
S2S models for a group of related languages. Previous work has shown the benefits of pre-trained
language models as well as NMT models that cater
to a set of related languages (Kakwani et al., 2020;
Tan et al., 2019; Khanuja et al., 2021; Reid et al.,
2021). Owing to their public availability, these
models have seen heavy adoption [1]. However, such
a study on multilingual pre-trained S2S models for
Indic languages is missing in the literature. In this
work, we address this gap by studying multilingual pre-trained S2S models for Indic
languages.
The result of this study is IndicBART, a multilingual pre-trained sequence-to-sequence model
specifically trained for Indic languages, which are
spoken by more than a billion users [2]. It supports English and 11 Indian languages including 7 Indo-Aryan (Assamese, Bengali, Gujarati,
Hindi, Marathi, Oriya, Punjabi) and 4 Dravidian
(Kannada, Malayalam, Tamil, Telugu) languages.
Of these, mBART25, mBART50 and mT5 support
only 2, 7 and 9 languages respectively. There are
linguistic similarities between the two language
families on account of contact relatedness resulting from geographical colocation. Within the two language families, there are genetic relations between languages due to them being derived from common ancestor languages [3][4].

[1] Over 10,000 downloads for MuRIL (https://huggingface.co/google/muril-base-cased) and IndicBERT (https://huggingface.co/ai4bharat/indic-bert).
[2] https://en.wikipedia.org/wiki/Demographics_of_India

Due to this, the
Indian subcontinent is considered to be a linguistic area or sprachbund (Emeneau, 1956). There is
evidence that such contact-relatedness can result
in positive cross-lingual transfer for NLP applications like NMT (Goyal et al., 2020a). Hence, we
train a single model for all Indic languages. It
is a compact model with just 244M parameters,
which is much smaller than the M3 models such as
mBART50 and mT5(-base) which contain 611M
and 580M parameters respectively. We also propose a variant of IndicBART, i.e. IndicALBART,
that is highly compact with just 97M parameters.
We compare IndicBART with M3 models on two
downstream generation tasks: machine translation
and extreme summarization (Narayan et al., 2018).
The results indicate that IndicBART is competitive
or better by up to 2 BLEU/ROUGE compared to
M3 models like mBART50. IndicBART also performs well in the following zero-shot scenarios:
(a) on languages not included in pre-training, and
(b) languages for which there is no fine-tuning data.
The following aspects of the IndicBART model
contribute to its strong performance and increased
language coverage within the Indic group vis-à-vis
M3 models, while being highly compact:
1. It is trained on a smaller set of related languages,
which reduces model capacity requirements. Moreover, available model capacity is effectively utilized, since transfer learning works when languages
share linguistic features and data represents shared
topical themes.
2. It is trained on the largest publicly available
Indic language corpora, IndicCorp (Kakwani et al.,
2020), which includes large, high-quality news
crawls for Indian languages as well as English
content from Indian websites - thus being representative of Indian English and topics.
3. We utilize the orthographic similarity between
Indic scripts (Kunchukuttan et al., 2018) to map all
the Indic language data to a single script, effectively
reducing the number of scripts from 9 to 1 (each
script having approximately 50 characters). This
increases the shared subwords in the vocabulary,
and we observe that single script models enable better cross-lingual transfer while fine-tuning. Since
subword embeddings consume a significant fraction of the parameter space, single script models also better utilize the available vocabulary budget [5].
4. Extremely compressed pre-trained S2S models (IndicALBART) suitable for deployment can
be trained by sharing parameters across the transformer layers. For related languages, we
show compressed pre-trained models are competitive with full models on downstream tasks when
fine-tuned on distilled data.
The IndicBART model and its variants,
along with details on how to fine-tune them,
can be accessed at https://github.com/
AI4Bharat/indic-bart/. We also release
the models on the HuggingFace model hub at
https://huggingface.co/ai4bharat/
IndicBART and https://huggingface.
co/ai4bharat/IndicBARTSS. Models are
available under an MIT license to spur further
innovation in NLG for Indic languages and study
of pre-trained S2S models for related languages.
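As a quick illustration, the released checkpoints can in principle be pulled directly from the HuggingFace hub. The sketch below uses the generic Auto classes and is only a minimal example; the exact tokenizer class, language-token conventions and script handling are documented on the model cards linked above and may differ from this generic usage.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Minimal sketch of loading the released checkpoint from the HuggingFace hub.
# The repository id comes from the links above; whether the generic Auto
# classes resolve to the exact tokenizer/model classes used in the release
# should be checked against the model card.
model_id = "ai4bharat/IndicBART"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Rough parameter count (the paper reports 244M parameters for IndicBART).
print(sum(p.numel() for p in model.parameters()))
```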
[3] https://en.wikipedia.org/wiki/Proto-Indo-Aryan_language
[4] https://en.wikipedia.org/wiki/Proto-Dravidian_language
[5] Where mBART-25 and mBART-50 have vocabularies of 250K subwords to accommodate 25 to 50 languages, IndicBART has a vocabulary of 64K subwords, which is about 4 times smaller.
2 Related Work
Pre-trained models. Pre-trained models learned
using self-supervised objectives and large monolingual corpora have contributed to rapid advances
in NLU (Devlin et al., 2019) and NLG (Lewis
et al., 2020). Following initial work on English pretrained models, multilingual pre-trained models
have been proposed for NLU (Devlin et al., 2019;
Conneau et al., 2020) as well as NLG (Liu et al.,
2020; Tang et al., 2020a; Xue et al., 2021) supporting around 100 languages. These pre-trained
M3 models have proven to be very useful in improving NLG performance in low-resource settings,
especially for applications other than translation.
Language group-specific models. The proposed
IndicBART model is also a multilingual pre-trained
S2S model, similar in architecture and training to
mBART. However, in contrast to mBART and mT5,
the proposed IndicBART caters specifically to Indic
languages. While language-group specific NLU
language models like IndicBERT (Kakwani et al.,
2020) and MuRIL (Khanuja et al., 2021) and NMT
models (Tan et al., 2019) have been proposed, ours
is one of the first efforts to create a pre-trained
S2S model for a specific language group (and the
first for Indic languages). AfroMT (Reid et al., 2021) is a concurrent effort focused on African languages belonging to various language families, in settings with limited monolingual corpora. However, AfroMT heavily relies on synthetic data, which may not reflect the true data distribution across languages. Furthermore, the AfroMT effort focuses only on MT, whereas we investigate IndicBART on an additional NLG task: abstractive summarization.
Interestingly, the publicly available group-specific
language models (IndicBERT and MuRIL) both
cater to Indic languages, pointing to a perceived need for Indic language specific models.
Language relatedness. Language-group specific
models are motivated by previous work that emphasizes the role of language relatedness in cross-lingual transfer for NMT (Nguyen and Chiang,
2017; Dabre et al., 2017; Aharoni et al., 2019;
Kudugunta et al., 2019; Dabre et al., 2020) and
NLU (Kakwani et al., 2020; Khemchandani et al.,
2021; Dhamecha et al., 2021). We use a single
script for representing Indic data since orthographic
similarity between Indic languages has been utilized to represent data in a common script and improve cross-lingual transfer for machine transliteration (Kunchukuttan et al., 2018), machine translation (Dabre et al., 2018; Goyal et al., 2020b;
Ramesh et al., 2021) and NLU (Khemchandani
et al., 2021; Dhamecha et al., 2021).
Parameter Sharing and Distillation. Parameter
sharing across layers has shown promise for NMT
(Dabre and Fujita, 2019) and pre-trained LMs (Lan
et al., 2020) in building compressed models while
maintaining end-task performance. The IndicALBART model proposed in this work is the first
model to explore parameter-sharing across layers
for pre-trained S2S models. For NMT models
trained from scratch, sequence-to-sequence distillation (Kim and Rush, 2016) has been shown as
an effective way to transfer knowledge to smaller
models, while training large models on distilled
data (a form of self-training) has been shown to improve translation quality (Dabre and Fujita, 2020).
Our results indicate that these results hold when
fine-tuning on pre-trained S2S models as well.
3 IndicBART
The IndicBART model is conceptually based on
the mBART25/50 model family of Transformer models (Vaswani et al., 2017) trained on monolingual corpora with a masked span reconstruction objective. We refer the reader to the mBART
literature (Lewis et al., 2020; Liu et al., 2020) for
architectural details and highlight specific details
and differences from the mBART25/50 setup.
3.1 Design Considerations for IndicBART
Considerations that drove our model choices are:
Compactness: The model should be compact
given our focus on a smaller set of related languages, as well as to accelerate training and finetuning. Such a model will be usable by a larger
base of users with limited computational resources.
Content Relevance: In addition to Indian languages, we include English since transfer-learning
from English is a natural use case, and English is
widely used in the Indian subcontinent. We also
use English content from the Indian subcontinent
to reflect relevant content.
Leveraging Relatedness: We utilize orthographic
similarity between Indian languages, most of which
use abugida scripts derived from the Brahmi script.
The logical character set has high overlaps, though
each script has its own code-point range in the
Unicode standard (Kunchukuttan et al., 2018). We
map all the data to Devanagari, enabling better
transfer learning [6] with a more compact vocabulary
compared to mBART.
3.2 Model and Training Details
IndicBART uses (N=) 6 encoder and decoder layers with hidden and filter sizes of 1024 and 4096, respectively, and 16 attention heads (244M parameters). Similar to mBART, we mask (p=) 35% of the words in each sentence by randomly sampling a span length according to a Poisson distribution (λ = 3.5). We use dropouts of 0.1, label smoothing of 0.1, the Adam optimizer with a maximum learning rate of 0.001, weight decay of 0.00001, linear learning rate warm-up and decay with 16,000 warm-up steps, and batch sizes of 4096 tokens. We train for 750,000 iterations on 48 NVIDIA V-100 GPUs, corresponding to roughly 2 epochs, taking around 5 days [7]. In comparison, the mBART25/50 models need much longer (2+ weeks) on 256 GPUs.
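The masking objective can be pictured with a small sketch. The code below is an illustrative approximation of span masking with Poisson-distributed span lengths and a roughly 35% masking budget; it is not the actual pre-training implementation, which operates on subword ids inside YANMTT.

```python
import numpy as np

def mask_spans(tokens, mask_token="<mask>", mask_ratio=0.35, poisson_lambda=3.5, seed=0):
    """Illustrative mBART-style span masking (a sketch, not the training code):
    repeatedly pick a start position, draw a span length from Poisson(lambda),
    and replace that span with a single mask token until roughly mask_ratio of
    the original tokens have been masked."""
    rng = np.random.default_rng(seed)
    tokens = list(tokens)
    budget = int(round(mask_ratio * len(tokens)))
    masked = 0
    while masked < budget and tokens:
        span = max(1, int(rng.poisson(poisson_lambda)))
        start = int(rng.integers(0, len(tokens)))
        span = min(span, len(tokens) - start)
        # Replace the whole span with one mask token, as in BART-style infilling.
        tokens[start:start + span] = [mask_token]
        masked += span
    return tokens

# The pre-training objective is then to reconstruct the original sentence from
# the masked one, e.g. (Hindi example sentence for illustration only):
print(mask_spans("यह एक उदाहरण वाक्य है".split()))
```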
To explore more compressed pre-trained models, we train IndicALBART, a variant of IndicBART with cross-layer parameter sharing, i.e., sharing parameters across layers. For ablation studies on the impact of single script representation, we also train a variant of IndicBART with a 64K vocabulary using the original scripts, which we call separate script IndicBART (SSIndicBART).

[6] There is a substantial amount of shared vocabulary between Indian languages written in different scripts. Mapping scripts to Devanagari enables direct sharing of vocabulary, leading to improved transfer learning.
[7] Longer training was limited by the availability of many GPUs simultaneously.
The models have been trained with the YANMTT toolkit [8] (Dabre and Sumita, 2021), which is based on the mBART implementation of the HuggingFace Transformers library (Wolf et al., 2020).

[8] https://github.com/prajdabre/yanmtt
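The idea behind IndicALBART's cross-layer parameter sharing can be sketched as follows. This is a conceptual PyTorch illustration, not the YANMTT implementation; the real model presumably shares parameters in both the encoder and the decoder, which is how the count drops from 244M to 97M parameters.

```python
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Sketch of cross-layer parameter sharing (ALBERT-style): one Transformer
    layer is instantiated once and applied N times, so the parameter count no
    longer grows with depth. Sizes follow Section 3.2 (hidden 1024, filter 4096,
    16 heads, 6 layers)."""

    def __init__(self, d_model=1024, nhead=16, dim_feedforward=4096, num_layers=6):
        super().__init__()
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead,
            dim_feedforward=dim_feedforward, batch_first=True,
        )
        self.num_layers = num_layers

    def forward(self, x, src_key_padding_mask=None):
        for _ in range(self.num_layers):
            # The same weights are reused at every depth.
            x = self.shared_layer(x, src_key_padding_mask=src_key_padding_mask)
        return x
```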
3.3 Training Data and Pre-processing
We train the IndicBART model on the IndicCorp (IC) dataset (Kakwani et al., 2020), which contains 11 Indic languages and English. The Indic languages are: Assamese (as), Bengali (bn), Gujarati (gu), Hindi (hi), Kannada (kn), Malayalam (ml), Marathi (mr), Oriya (or), Punjabi (pa), Tamil (ta) and Telugu (te). The corpora statistics are given in Table 7 of the appendix. We train the model on a total of approximately 450 million sentences and 9 billion tokens, where corpora sizes are balanced with temperature (T=5) based sampling (Arivazhagan et al., 2019). All the Indic language data is represented in a single script, i.e., the Devanagari script, using the IndicNLP library [9] (Kunchukuttan, 2020). We use a vocabulary of 64K subwords learned using SentencePiece (Kudo, 2018; Kudo and Richardson, 2018) on 1M raw sentences randomly sampled from IndicCorp for each language, for a total of 12M sentences. The model is trained at the sentence level, unlike the mBART50 model, which is trained on contiguous text chunks potentially spanning multiple sentences.

[9] https://github.com/anoopkunchukuttan/indic_nlp_library
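The temperature-based balancing mentioned above (T=5, following Arivazhagan et al., 2019) amounts to raising each language's data fraction to the power 1/T and renormalizing. The sketch below illustrates this with made-up corpus sizes, not the actual IndicCorp statistics.

```python
def temperature_sampling_probs(sizes, T=5.0):
    """Temperature-based sampling: a language with data fraction p is sampled
    with probability proportional to p**(1/T). T=1 reproduces the natural data
    distribution; larger T flattens it so low-resource languages are seen more
    often during pre-training."""
    total = sum(sizes.values())
    weights = {lang: (n / total) ** (1.0 / T) for lang, n in sizes.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# Hypothetical sentence counts (not the real IndicCorp statistics), showing the
# flattening effect of T=5:
example_sizes = {"hi": 60_000_000, "bn": 30_000_000, "as": 1_500_000}
print(temperature_sampling_probs(example_sizes, T=5.0))
```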
4 Experiments: NMT
Machine translation is a standard, popular, cross-lingual generation task on which various pre-trained models are evaluated. We compare IndicBART and its variants with mBART50, which
should be the most directly comparable model. We
study their performance in: (a) low-resource, (b)
multilingual and (c) zero-shot training settings.
4.1 Models Compared
We study IndicBART via the following models:
Models trained from scratch: We train bilingual
(Bi) as well as multilingual many-to-one (M2O)
and one-to-many (O2M) transformer models.
Fine-tuned models: We fine-tune mBART50 (MB50), IndicBART (IB) and its variants, namely IndicALBART (IALB) and separate script IndicBART (SSIB). The type of fine-tuning is indicated by +type, which can be Bi, O2M or M2O. If needed, the corpus is indicated by +corpus.
Distilled models: We use the multilingually fine-tuned IndicBART model and translate the training
data source sentences, which yields distillation data
(Kim and Rush, 2016). We use this data to train
M2O and O2M models from scratch, as well as
by fine-tuning on mBART50, IndicBART and IndicALBART. This was motivated by Dabre and Fujita
(2020) who show that the distillation data generated using models employing transfer learning significantly improves the performance of compact
models for low-resource languages.
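A hedged sketch of how such sequence-level distillation data (Kim and Rush, 2016) could be produced with a fine-tuned HuggingFace checkpoint is shown below; the model name, generation settings and any language-token conventions are placeholders rather than the exact pipeline used here.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

def build_distillation_data(teacher_name, src_sentences, beam=4, max_length=256):
    """Sequence-level distillation sketch: decode the training sources with a
    fine-tuned teacher and pair its translations with the original sources as
    targets for a smaller student. Any language-token or script-conversion
    conventions required by a specific checkpoint (e.g. IndicBART's) would have
    to be added around the tokenizer call."""
    tok = AutoTokenizer.from_pretrained(teacher_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(teacher_name)
    distilled = []
    for src in src_sentences:
        inputs = tok(src, return_tensors="pt")
        out = model.generate(**inputs, num_beams=beam, max_length=max_length)
        distilled.append((src, tok.decode(out[0], skip_special_tokens=True)))
    return distilled  # list of (original source, teacher translation) pairs
```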
4.2 Datasets and Preprocessing
The statistics of training corpora are in Table 7 in
the appendix.
Training: For a low-resource setting (LR), we use
the PMI subset (Haddow and Kirefu, 2020) of the
WAT 2021 MultiIndicMT [10] (Nakazawa et al., 2021) training set for fine-tuning. This represents an extremely low-resource parallel corpus setting where
we expect IndicBART to be the most helpful. We
experiment with extending the PMI data (approximately 326K pairs) with the CVIT-PIB (henceforth
PIB: 930K pairs) data (Siripragrada et al., 2020)
which is similar in domain to the former. We also
use the high-resource, general domain Samanantar corpus (Ramesh et al., 2021) (46.2M pairs) to
compare against the generalization capabilities of pre-trained models that are fine-tuned with small corpora (PMI, PIB).
Testing: We use the WAT 2021 MultiIndicMT test set and the FLORES101 devtest (Goyal et al., 2021)
for evaluation of our models. Both these test sets
are n-way parallel (2,390 and 1,012 sentences respectively). The WAT 2021 test set shares the same
domain as the training set. The FLORES devtest
comes from a different, general domain. We rely
on the FLORES dataset to evaluate performance of
models trained on the PMI/PIB domain on a more
general domain.
Validation: We use the WAT2021 development set
of 1,000 sentences.
Preprocessing: For IndicBART and IndicALBART, we use the Indic NLP library to convert the Indic side of the parallel data to the Devanagari script. For mBART50, only the Kannada, Punjabi and Oriya scripts are converted to Devanagari, as mBART50 does not support these languages; results for these are italicized. For separate script IndicBART, we do not do script conversion.

[10] http://lotus.kuee.kyoto-u.ac.jp/WAT/indic-multilingual
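For reference, this script conversion step can be sketched with the Indic NLP library's rule-based transliterator as below. The specific helper shown (UnicodeIndicTransliterator) is an assumption about the library's API and should be checked against its documentation; the exact invocation used for the experiments is not spelled out in this paper.

```python
# Sketch of mapping Indic-script text to Devanagari so all languages share one
# script; the module path and call are assumptions to verify against the
# Indic NLP library documentation.
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator

def to_devanagari(line, lang):
    """Map a line written in another Indic script (e.g., Bengali) into the
    Devanagari code-point range, exploiting the parallel layout of Brahmi-derived
    scripts in Unicode."""
    return UnicodeIndicTransliterator.transliterate(line, lang, "hi")

# Example (Bengali to Devanagari); the sample sentence is illustrative only.
print(to_devanagari("আমি বই পড়ি", "bn"))
```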
With this setup, we study the benefits of pre-training in low-resource settings (fine-tuned on
PMI and PIB) and compare it with high-resource
settings (trained on Samanantar) on in-domain
(WAT2021) and general (FLORES) test sets. Unless explicitly mentioned, our models are assumed
to be trained/fine-tuned/distilled with the PMI training data.
4.3 Model Training Settings
We use a single GPU for bilingual and 8 GPUs for
multilingual models, all of which are Transformers.
Multilingual models are trained using the approach
in Johnson et al. (2017) where corpora for various
language pairs are first balanced according to their
size, then concatenated after appending target language indicator tokens, and finally fed to the NMT
model for training. Wherever possible and applicable, we tuned hyperparameters such as hidden
sizes, dropout, label smoothing, warm-up, tokens
per batch per GPU, learning rate and weight decay. The Adam optimizer was used. We train our models until convergence of the BLEU score (Papineni et al., 2002) on the development set. We decode train/test sets using beam search with a beam of size 4 and a length penalty of 0.8. We report BLEU scores on the decoded results computed using sacreBLEU [11] (Post, 2018). For additional details, refer to Section B in the appendix.

[11] BLEU+case.mixed+numrefs.1+smooth.exp+tok.13a+version.1.5.1
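For concreteness, a minimal sketch of computing corpus-level BLEU with the sacrebleu package, matching the default signature in footnote [11], is given below; the file names are placeholders.

```python
import sacrebleu

def corpus_bleu_score(hypotheses, references):
    """Corpus-level BLEU with sacrebleu's defaults (13a tokenization, exp
    smoothing, mixed case), matching the signature in footnote [11].
    `hypotheses` is a list of decoded sentences; `references` is a parallel
    list of reference translations (one reference per sentence here)."""
    return sacrebleu.corpus_bleu(hypotheses, [references]).score

# Hypothetical usage with placeholder file names:
# hyps = open("test.hyp.en").read().splitlines()
# refs = open("test.ref.en").read().splitlines()
# print(corpus_bleu_score(hyps, refs))
```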
4.4 Comparison of Pre-trained Models
We first describe the main results of using IndicBART and its variants for machine translation
and compare them with other relevant models. Table 1
shows results for models trained on the PMI corpus
and evaluated on the WAT21 test set.
Language specific models are compact and
competitive: Considering bilingual models, IndicBART outperforms models trained from scratch
and gives competitive results when compared
to mBART50. For Indic to English translation,
mBART50 tends to be better, but this is not surprising because it is trained on far larger amounts of
English data in addition to being almost 3 times
larger than IndicBART. For English to Indic translation, both models tend to give similar scores. In the case of multilingual models, IndicBART is,
once again, vastly better than its counterpart trained from scratch, and the gap that existed with respect to mBART50 in the bilingual settings disappears and sometimes reverses in favor of IndicBART. In both cases, IndicBART outperforms mBART50 for Kannada, Punjabi and Oriya, which the latter is not trained for. This shows that a compact, language-group-specific model can be competitive with, if not better than, a general purpose model trained on a larger number of languages, while having only one-third the number of parameters.
Extreme compression has its downside: Comparing the performance of IndicBART and
mBART50 against IndicALBART in multilingual
settings, it seems that a 60% and 84% reduction
of parameters, respectively, has a negative impact
on the translation quality, which results in drops of
up to 3 BLEU. However, this may be considered a reasonable tradeoff given the high levels of compression achieved. In particular, since IndicALBART is 84% smaller than mBART50, large-capacity GPUs (which not everyone has easy access to) may not be needed. Furthermore, the drops in quality can be addressed via distillation.
Distillation successfully transfers performance
from large to smaller models: We see that fine-tuning the pre-trained IndicALBART on distilled
data from IndicBART can match the performance
of the IndicBART model. Fine-tuning pre-trained
IndicALBART performs better than training a randomly initialized model on the same distilled data
in the XX-En direction. On the other hand, both the
approaches are competitive in the En-XX direction.
Self-training on distilled data is beneficial:
When IndicBART and MB50 are fine-tuned on
distillation data generated from a previously fine-tuned model, we see significant improvements in
the XX-En direction, and modest improvements
in the En-XX directions. These observations are
mostly in line with Dabre and Fujita (2020).
In summary, compact language-group-specific pre-trained models are competitive with large universal language models. This can result in reasonable savings in fine-tuning time for multilingual models (3.3-3.5 hours for IndicBART variants vs. 4.7-5 hours for mBART50) and significantly reduce the memory footprint (97-244M vs. 611M parameters) for deployment.
XX-En
Model       #Params  bn    gu    hi    kn    ml    mr    or    pa    ta    te
Bilingual Models
Bi          78M      13.5  27.4  30.9  22.5  16.5  18.4  18.4  27.1  17.1  16.5
MB50+Bi     611M     23.2  35.4  38.3  26.8  29.2  27.7  27.8  35.8  27.1  30.8
IB+Bi       244M     23.6  35.5  36.8  31.6  27.9  26.8  28.3  36.3  27.0  29.9
Multilingual Models
M2O         78M      18.9  24.8  27.8  23.8  21.6  20.7  21.2  26.4  20.6  21.8
MB50+M2O    611M     24.8  33.9  36.8  30.1  28.8  28.1  27.5  34.5  27.0  29.2
IB+M2O      244M     24.8  33.9  37.2  32.4  28.5  28.5  28.8  35.7  27.3  29.5
IALB+M2O    97M      23.1  33.2  34.4  29.5  27.1  27.0  27.3  34.1  25.2  27.4
Distilled Large Models
MB50+M2O    611M     26.1  35.9  38.3  32.9  29.6  29.3  30.1  37.1  28.5  31.7
IB+M2O      244M     26.0  35.9  38.0  33.7  29.9  29.4  30.3  37.4  28.4  31.6
Distilled Compact Models
M2O         78M      23.6  33.3  36.0  30.2  26.0  26.9  27.7  34.0  25.6  27.8
IALB+M2O    97M      24.9  34.4  36.6  31.9  27.7  28.1  28.6  35.5  26.5  29.0

En-XX
Model       #Params  bn    gu    hi    kn    ml    mr    or    pa    ta    te
Bilingual Models
Bi          78M      4.5   17.9  21.7  12.1  3.9   10.0  9.2   17.9  7.2   2.1
MB50+Bi     611M     8.6   23.5  27.0  17.4  6.0   15.8  11.6  24.5  11.2  3.3
IB+Bi       244M     8.2   23.6  26.9  17.7  6.0   15.8  11.8  25.1  10.8  3.6
Multilingual Models
O2M         78M      7.4   22.5  25.9  16.2  5.6   14.7  11.4  21.9  10.0  2.7
MB50+O2M    611M     8.9   22.8  27.5  18.1  6.5   16.3  12.0  25.1  11.6  3.7
IB+O2M      244M     9.1   24.0  27.3  18.5  6.7   16.7  12.9  26.4  11.6  3.7
IALB+O2M    97M      8.1   22.3  26.3  17.0  5.8   15.3  11.6  24.2  10.5  3.2
Distilled Large Models
MB50+O2M    611M     9.4   24.5  27.5  17.5  6.1   16.4  12.8  26.3  11.6  2.9
IB+O2M      244M     9.3   25.0  28.2  19.2  6.7   17.0  13.2  26.5  11.8  3.7
Distilled Compact Models
O2M         78M      8.9   24.1  27.5  18.2  6.3   16.0  12.5  25.6  11.0  3.2
IALB+O2M    97M      8.9   23.4  27.2  17.8  6.3   16.2  12.7  25.3  11.3  3.1

Table 1: Comparison of IndicBART with other models. Scores are reported on the WAT 2021 test set.
Model      bn    hi    ml    or    ta
XX-En
IB+M2O     24.8  37.2  28.5  28.8  27.3
SSIB+M2O   24.1  35.5  27.9  28.1  26.9
En-XX
IB+O2M     9.1   27.3  6.7   16.9  11.6
SSIB+O2M   9.3   27.3  6.2   16.6  11.4

Table 2: Ablation studies on the impact of multilingualism and script unification on downstream performance of IndicBART. Scores are on the WAT 2021 test set.
4.5 Ablation Studies
We now perform ablation experiments to study
the (a.) impact of script unification on translation,
(b.) impact of corpora sizes and domains on translation, (c.) translation quality for languages unseen
during fine-tuning, and (d.) translation quality on
languages unseen during pre-training. Although
we train models on all languages, we only report on
a subset due to lack of space. Please see Sections C,
D in the appendix for more detailed results.
4.5.1 Impact of Script Unification
Table 2 contains the ablation tests, giving the results for the impact of script unification with multilingual fine-tuning. Comparing the scores of models fine-tuned from unified script IndicBART (IB+M2O/O2M) against separate script IndicBART (SSIB+M2O/O2M), it is clear that, overall, the former is better than the latter, which could indicate that script unification enables languages to better benefit from each other.
Model          bn    hi    ml    or    ta
Test Set: WAT 2021
IB+PMI         24.8  37.2  28.5  28.8  27.3
IB+PMI+PIB     28.9  41.7  33.2  33.2  32.0
Samanantar     27.9  41.8  32.7  32.9  31.2
IB+Samanantar  27.1  41.0  31.6  32.3  30.1
Test Set: FLORES
IB+PMI         10.4  14.8  8.1   11.2  10.5
IB+PMI+PIB     13.0  22.0  12.7  15.1  13.8
Samanantar     30.7  36.0  30.4  28.6  27.7
IB+Samanantar  30.1  35.3  29.1  28.5  26.6

Table 3: Ablation study of the impact of using different fine-tuning corpora sizes (PMI+PIB) and their comparison against a model trained from scratch as well as fine-tuned on a general domain corpus (Samanantar). We evaluate Indic to English translation on the WAT 2021 as well as the FLORES test sets.
The case of Kannada, Punjabi and Oriya further illustrates the utility of script unification. The results for these languages are italicized in the rows labelled MB50+Bi and MB50+O2M/M2O in Table 1. mBART50 was not pre-trained on these languages, so we converted the training data in these languages to the Devanagari script [12]. With this trick, we still managed to get large performance improvements over the baselines trained from scratch, and these improvements are often close to those exhibited by IndicBART. This shows that we may not need to pre-train on all languages. However, explicitly training on the languages of interest should lead to better translation quality (Tang et al., 2020b).
4.5.2 Impact of Corpora Size and Domain
Table 3 shows the impact of corpora sizes as well
as training data domain on some Indic to English
pairs (complete results in Appendix D). All models are multilingual (M2O), have the same size
and are trained on unified script data. In order
to clearly assess the impact of domains, we evaluate on the WAT 2021 as well as the FLORES
test sets. Regardless of the test set or testing domain, comparing rows IB+PMI and IB+PMI+PIB, it is clear that increasing the amount of fine-tuning data has a positive impact on the final translation quality. However, the PMI+PIB data is in-domain for the WAT 2021 test set but out-of-domain for the FLORES test set, and the performance on the latter test set still improves.
[12] None of the pre-training languages use the same script as kn, pa, or.
Setting     M2O             O2M
            kn-en  pa-en    en-kn  en-pa
IB+Full     32.4   35.7     18.5   26.4
IB+Zero     27.5   31.5     6.1    10.4
SSIB+Zero   24.0   28.2     3.9    7.4

Table 4: Evaluation of Kannada and Punjabi to/from English translation, which aren't seen when fine-tuning.
Furthermore, comparing rows
IB+PMI+PIB and Samanantar, we can see widely
different results depending on the test set. For the
WAT 2021 test set, fine-tuning on the PMI+PIB
dataset is comparable to training on Samanantar
from scratch, indicating that for domain specific
models, having a small in-domain fine-tuning data
is sufficient. On the other hand, on the more general domain FLORES test sets training on the more
diverse Samanantar data is clearly better. Finally,
the scores in the row IB+Samanantar show that
pre-training has minimal impact when the parallel
corpora are large, an observation in line with Liu
et al. (2020).
4.5.3 Unseen Languages During Fine-Tuning
We evaluate Kannada and Punjabi to/from English
translation where the IndicBART model, with and
without script unification, is fine-tuned on the multilingual PMI data where the training data for these
languages is missing (denoted by “Zero”). We compare against a setting where the training data is used
(denoted by “Full”). Table 4 shows what happens
when languages are seen during pre-training but
not during fine-tuning. There are two critical observations: First, despite not having seen any training
data for the given language pairs, we still obtain reasonable translation quality when translating into English.
However, the quality of translation from English
is poor due to the decoder not having seen those
specific Indic languages during fine-tuning. Incorporating a monolingual de-noising objective for
unseen target languages during fine-tuning could
alleviate this problem. Second, script unification
has a large impact on the final performance, often
improving performance by up to 3.5 BLEU over a
separate script model.
4.5.4 Unseen Languages During Pre-Training
We study Nepali (ne) and Sinhala (si) to English
translation using the parallel training data from
Model               ne-en  si-en
Bi (Scratch)        5.2    4.3
IB+Bi               10.5   8.5
(Liu et al., 2020)  14.5   13.7

Table 5: Evaluation of Nepali and Sinhala to English translation, where IndicBART hasn't seen Nepali and Sinhala during pre-training.
Guzmán et al. (2019) (also used in Liu et al. (2020))
for bilingual fine-tuning, and evaluate on the FLORES devtest set [13]. Note that for Sinhala we have to
resort to script mapping into Devanagari. Table 5
shows what happens when we perform fine-tuning
for languages that IndicBART is not trained on.
The baselines, trained using the unified script IndicBART vocabulary, will seem weaker than what
is reported in previous work, but it should be noted
that the vocabulary was not actually trained for
Nepali and Sinhala. Regardless, fine-tuning leads
to substantial improvements in translation quality,
which indicates the utility of IndicBART even for
unseen languages. Comparing against Liu et al. (2020), who use the same fine-tuning data as us but whose mBART model is pre-trained on both languages, we can see that our models are not too far
behind.
5 Experiments: Extreme Summarization
We compare the performance of fine-tuning IndicBART, its variants and mBART50 on the challenging extreme summarization task (Narayan et al.,
2018) for Indic languages. The small datasets enable a good study of the utility of pre-training.
5.1 Models Trained
We fine-tune and compare the mBART50 (MB50), IndicBART (IB), IndicALBART (IALB) and separate script IndicBART (SSIB) models.
Punjabi is not present in mBART50 and has its
script mapped to Devanagari before fine-tuning
(italicized results).
5.2 Datasets and Preprocessing
We used the multilingual XL-Sum dataset (Hasan
et al., 2021) for our experiments. The Indic languages we focus on for evaluating our IndicBART
models are: Bengali, Gujarati, Hindi, Marathi, Punjabi, Tamil and Telugu. We use the updated splits
[13] https://github.com/facebookresearch/flores
Lang  MB50   IB  SSIB  IALB
bn    21.87
gu
hi
mr
pa
ta
te