fixed markdown syntax and typo #86

Open · wants to merge 1 commit into base: main
115 changes: 69 additions & 46 deletions README.md
@@ -2,9 +2,9 @@

[📜 Paper](https://arxiv.org/abs/2305.16307) | [🌐 Website](https://ai4bharat.iitm.ac.in/indic-trans2) | [▶️ Demo](https://models.ai4bharat.org/#/nmt/v2) | [🤗 HF Interface](https://github.com/AI4Bharat/IndicTrans2/tree/main/huggingface_interface) | [![colab link](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AI4Bharat/IndicTrans2/blob/main/huggingface_interface/colab_inference.ipynb)

IndicTrans2 is the first open-source transformer-based multilingual NMT model that supports high-quality translations across all the 22 scheduled Indic languages — including multiple scripts for low-resouce languages like Kashmiri, Manipuri and Sindhi. It adopts script unification wherever feasible to leverage transfer learning by lexical sharing between languages. Overall, the model supports five scripts Perso-Arabic (Kashmiri, Sindhi, Urdu), Ol Chiki (Santali), Meitei (Manipuri), Latin (English), and Devanagari (used for all the remaining languages).
IndicTrans2 is the first open-source transformer-based multilingual NMT model that supports high-quality translations across all the 22 scheduled Indic languages — including multiple scripts for low-resource languages like Kashmiri, Manipuri and Sindhi. It adopts script unification wherever feasible to leverage transfer learning by lexical sharing between languages. Overall, the model supports five scripts: Perso-Arabic (Kashmiri, Sindhi, Urdu), Ol Chiki (Santali), Meitei (Manipuri), Latin (English), and Devanagari (used for all the remaining languages).

We open-souce all our training dataset (BPCC), back-translation data (BPCC-BT), final IndicTrans2 models, evaluation benchmarks (IN22, which includes IN22-Gen and IN22-Conv) and training and inference scripts for easier use and adoption within the research community. We hope that this will foster even more research in low-resource Indic languages, leading to further improvements in the quality of low-resource translation through contributions from the research community.
We open-source all our training dataset (BPCC), back-translation data (BPCC-BT), final IndicTrans2 models, evaluation benchmarks (IN22, which includes IN22-Gen and IN22-Conv) and training and inference scripts for easier use and adoption within the research community. We hope that this will foster even more research in low-resource Indic languages, leading to further improvements in the quality of low-resource translation through contributions from the research community.

This code repository contains instructions for downloading the artifacts associated with IndicTrans2, as well as the code for training/fine-tuning the multilingual NMT models.

@@ -68,48 +68,60 @@ Here is the list of languages supported by the IndicTrans2 models:

## Tables of Contents

- [Download Models and Other Artifacts](#download-models-and-other-artifacts)
- [Multilingual Translation Models](#multilingual-translation-models)
- [Training Data](#training-data)
- [Evaluation Data](#evaluation-data)
- [Installation](#installation)
- [Data](#data)
- [Training](#training)
- [Evaluation](#evaluation)
- [Preparing Data for Training](#preparing-data-for-training)
- [Using our SPM model and Fairseq dictionary](#using-our-spm-model-and-fairseq-dictionary)
- [Training your own SPM models and learning Fairseq dictionary](#training-your-own-spm-models-and-learning-fairseq-dictionary)
- [Training / Fine-tuning](#training--fine-tuning)
- [Inference](#inference)
- [Fairseq Inference](#fairseq-inference)
- [CT2 Inference](#ct2-inference)
- [Evaluations](#evaluations)
- [Baseline Evaluation](#baseline-evaluation)
- [LICENSE](#license)
- [Citation](#citation)
- [IndicTrans2](#indictrans2)
- [Updates](#updates)
- [Tables of Contents](#tables-of-contents)
- [Download Models and Other Artifacts](#download-models-and-other-artifacts)
- [Multilingual Translation Models](#multilingual-translation-models)
- [Training Data](#training-data)
- [Evaluation Data](#evaluation-data)
- [Installation](#installation)
- [Additional notes about Installation](#additional-notes-about-installation)
- [Data](#data)
- [Training](#training)
- [Evaluation](#evaluation)
- [Preparing Data for Training](#preparing-data-for-training)
- [Using our SPM model and Fairseq dictionary](#using-our-spm-model-and-fairseq-dictionary)
- [Training your own SPM models and learning Fairseq dictionary](#training-your-own-spm-models-and-learning-fairseq-dictionary)
- [Training / Fine-tuning](#training--fine-tuning)
- [Inference](#inference)
- [Fairseq Inference](#fairseq-inference)
- [CT2 Inference](#ct2-inference)
- [Evaluations](#evaluations)
- [Baseline Evaluation](#baseline-evaluation)
- [LICENSE](#license)
- [Citation](#citation)

## Download Models and Other Artifacts

### Multilingual Translation Models

| Model | En-Indic | Indic-En | Indic-Indic | Evaluations |
| ---------------------------- | ----------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Base (used for benchmarking) | [download](https://indictrans2-public.objectstore.e2enetworks.net/it2_preprint_ckpts/en-indic-preprint.zip) | [download](https://indictrans2-public.objectstore.e2enetworks.net/it2_preprint_ckpts/indic-en-preprint.zip) | [download](https://indictrans2-public.objectstore.e2enetworks.net/it2_preprint_ckpts/indic-indic.zip) | [translations](https://indictrans2-public.objectstore.e2enetworks.net/translation_outputs.zip) (as of May 10, 2023), [metrics](https://drive.google.com/drive/folders/1lOOdaU0VdRSBgJEsNav5zC7wwLBis9NI?usp=sharing) |
| Distilled | [download](https://indictrans2-public.objectstore.e2enetworks.net/it2_distilled_ckpts/en-indic.zip) | [download](https://indictrans2-public.objectstore.e2enetworks.net/it2_distilled_ckpts/indic-en.zip) | [download](https://indictrans2-public.objectstore.e2enetworks.net/it2_distilled_ckpts/indic-indic.zip) |
Click a link below to download a zip archive of the corresponding model.

| Model | En-Indic | Indic-En | Indic-Indic | Evaluations |
| ---------------------------- | ------------------------ | --------------------------- | ----------------------------- | ----------------------------------------------------------------------------- |
| Base (used for benchmarking) | [base-en-indic :arrow_down:][base-en-indic] | [base-indic-en :arrow_down:][base-indic-en] | [base-indic-indic :arrow_down:][base-indic-indic] | [base-translations :arrow_down:][base-translation] (as of May 10, 2023), [base-metrics :arrow_down:][base-metric] |
| Distilled | [distilled-en-indic :arrow_down:][distilled-en-indic] | [distilled-indic-en :arrow_down:][distilled-indic-en] | [distilled-indic-indic :arrow_down:][distilled-indic-indic] | |
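
For example, fetching and unpacking the distilled En-Indic checkpoint from the table above might look like the following sketch (any of the other linked zips can be substituted; the output directory name is illustrative):

```bash
# Download and extract the distilled En-Indic checkpoint
# (URL taken from the table above; output directory name is illustrative)
wget https://indictrans2-public.objectstore.e2enetworks.net/it2_distilled_ckpts/en-indic.zip
unzip en-indic.zip -d en-indic-distilled
```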

### Training Data

| Data | URL |
| ---------------------------------------- | ------------------------------------------------------------------------------ |
| Bharat Parallel Corpus Collection (BPCC) | [download](https://indictrans2-public.objectstore.e2enetworks.net/BPCC.zip) |
| Back-translation (BPCC-BT) | [download](https://indictrans2-public.objectstore.e2enetworks.net/BT_data.zip) |
| Data | URL |
| ---------------------------------------- | -------------------------------- |
| Bharat Parallel Corpus Collection (BPCC) | [BPCC :arrow_down:][BPCC] |
| Back-translation (BPCC-BT) | [BPCC-BT :arrow_down:][BPCC-BT] |

[BPCC]: https://indictrans2-public.objectstore.e2enetworks.net/BPCC.zip
[BPCC-BT]: https://indictrans2-public.objectstore.e2enetworks.net/BT_data.zip

### Evaluation Data

| Data | URL |
| ----------------------- | ------------------------------------------------------------------------------------ |
| IN22 test set | [download](https://indictrans2-public.objectstore.e2enetworks.net/IN22_testset.zip) |
| FLORES-22 Indic dev set | [download](https://indictrans2-public.objectstore.e2enetworks.net/flores-22_dev.zip) |
| IN22 test set | [IN22 Test Set :arrow_down:][IN22-TEST-SET] |
| FLORES-22 Indic dev set | [FLORES 22 Test Set :arrow_down:][FLORES22-TEST-SET] |

[IN22-TEST-SET]: https://indictrans2-public.objectstore.e2enetworks.net/IN22_testset.zip
[FLORES22-TEST-SET]: https://indictrans2-public.objectstore.e2enetworks.net/flores-22_dev.zip

## Installation

@@ -124,13 +136,14 @@ cd IndicTrans2
source install.sh
```

Note: We recommend creating a virtual environment with python>=3.7.
Note: We recommend creating a virtual environment with *python>=3.7.x*
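
A minimal sketch of that setup on a Linux machine could look like the following (the environment name is illustrative, and `install.sh` is the repository's installation script shown above):

```bash
# Create and activate a virtual environment before running the install script
# (the environment name "it2_env" is illustrative)
python3 -m venv it2_env
source it2_env/bin/activate

# Run the repository's installation script from the IndicTrans2 directory
source install.sh
```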

### Additional notes about Installation
The ``prepare_data_joint_finetuning.sh`` and ``prepare_data_joint_training.sh`` scripts expect that the sentencepiece commandline utility and GNU parallel are installed.
1. To install the sentencepiece command line utility, please follow the instructions [here](https://github.com/google/sentencepiece?tab=readme-ov-file#build-and-install-sentencepiece-command-line-tools-from-c-source).
2. Please check if GNU parallel is installed, if not please install the same or alternatively in case of installation issues, remove ``parallel --pipe --keep-order`` from the respective training / finetuning script as well as ``apply_sentence_piece.sh``.

The `prepare_data_joint_finetuning.sh` and `prepare_data_joint_training.sh` scripts expect that the sentencepiece command line utility and GNU parallel are installed.

1. To install the sentencepiece command line utility, please follow the instructions [here](https://github.com/google/sentencepiece?tab=readme-ov-file#build-and-install-sentencepiece-command-line-tools-from-c-source).
2. Please check whether GNU parallel is installed; if not, please install it. Alternatively, if you run into installation issues, remove `parallel --pipe --keep-order` from the respective training / fine-tuning script as well as from `apply_sentence_piece.sh` (see the sketch below).
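
As a hedged sanity check (the exact package names depend on your distribution), something like the following can confirm that both tools are on your `PATH` before running the data preparation scripts:

```bash
# Check that the sentencepiece CLI and GNU parallel are available;
# spm_encode is one of the sentencepiece command line tools
command -v spm_encode >/dev/null || echo "sentencepiece CLI not found"
command -v parallel   >/dev/null || echo "GNU parallel not found"

# On Debian/Ubuntu, for example, GNU parallel can typically be installed with:
# sudo apt-get install parallel
```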

## Data

@@ -202,7 +215,7 @@ Additionally, we provide augmented back-translation data generated by our interm
</tbody>
</table>

<br>
---

### Evaluation

@@ -254,7 +267,7 @@ python3 scripts/dedup_benchmark.py <in_data_dir> <out_data_dir> <benchmark_dir>

Once you complete the deduplication of the training data with the available benchmarks, you can preprocess and binarize the data for training models. Please download our trained SPM model and learned Fairseq dictionary using the following links for your experiments.

| | En-Indic | Indic-En | Indic-Indic |
| # | En-Indic | Indic-En | Indic-Indic |
| ------------------ | -------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------- |
| SPM model | [download](https://indictrans2-public.objectstore.e2enetworks.net/en-indic-spm.zip) | [download](https://indictrans2-public.objectstore.e2enetworks.net/indic-en-spm.zip) | [download](https://indictrans2-public.objectstore.e2enetworks.net/indic-indic-spm.zip) |
| Fairseq dictionary | [download](https://indictrans2-public.objectstore.e2enetworks.net/en-indic-fairseq-dict.zip) | [download](https://indictrans2-public.objectstore.e2enetworks.net/indic-en-fairseq-dict.zip) | [download](https://indictrans2-public.objectstore.e2enetworks.net/indic-indic-fairseq-dict.zip) |
@@ -311,15 +324,15 @@ If you want to train your own SPM model and learn Fairseq dictionary, then pleas
2. Perform script unification for Indic languages wherever possible using `scripts/preprocess_translate.py` and concatenate all Indic data into a single file.
3. Train two SPM models, one for English and other for Indic side using the following:

```bash
spm_train --input=train.indic --model_prefix=<model_name> --vocab_size=<vocab_size> --character_coverage=1.0 --model_type=BPE
```
```bash
spm_train --input=train.indic --model_prefix=<model_name> --vocab_size=<vocab_size> --character_coverage=1.0 --model_type=BPE
```

4. Copy the trained SPM models in the experiment directory mentioned earlier and learn the Fairseq dictionary using the following:

```bash
bash prepare_data_joint_training.sh <exp_dir>
```
```bash
bash prepare_data_joint_training.sh <exp_dir>
```

5. You will need to use the same Fairseq dictionary for any subsequent fine-tuning experiments and refer to the steps described above ([link](#using-our-spm-model-and-fairseq-dictionary)).

@@ -344,11 +357,11 @@ bash finetune.sh <exp_dir> <model_arch> <pretrained_ckpt>
- `<model_arch>`: custom transformer architecture used for model training
- `transformer_18_18` - For IT2 Base models
- `transformer_base18L` - For IT2 Distilled models
- `<pretrained_ckpt>`: path to the fairseq model checkpoint to be loaded for further fine-tuning
- `<pre-trained_ckpt>`: path to the fairseq model checkpoint to be loaded for further fine-tuning
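
For example, an invocation could look like the sketch below (the experiment directory and checkpoint path are placeholders for your own setup):

```bash
# Fine-tune the IT2 base architecture starting from a released checkpoint;
# both paths are placeholders
bash finetune.sh ../my_finetune_exp transformer_18_18 ../en-indic-preprint/fairseq_model/checkpoint_best.pt
```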

You can download the model artifacts released as a part of this work from the [following section](#download-models-and-other-artifacts).

The pretrained checkpoints have 3 directories, a fairseq model directory and 2 CT-ported model directories. Please note that the CT2 models are provided only for efficient inference. For fine-tuning purposes you should use the `fairseq_model`. Post that you can use the [fairseq-ct2-converter](https://opennmt.net/CTranslate2/guides/fairseq.html) to port your fine-tuned checkpoints to CT2 for faster inference.
The pre-trained checkpoints have 3 directories: a fairseq model directory and 2 CT2-ported model directories. Please note that the CT2 models are provided only for efficient inference. For fine-tuning purposes you should use the `fairseq_model`. After fine-tuning, you can use the [fairseq-ct2-converter](https://opennmt.net/CTranslate2/guides/fairseq.html) to port your fine-tuned checkpoints to CT2 for faster inference.
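
As a rough sketch of that conversion (all paths below are placeholders), the CTranslate2 converter can be run along these lines:

```bash
# Convert a fine-tuned fairseq checkpoint to CTranslate2 format for faster inference;
# model, data, and output paths are placeholders for your own experiment
ct2-fairseq-converter \
    --model_path <exp_dir>/model/checkpoint_best.pt \
    --data_dir <exp_dir>/final_bin \
    --output_dir <exp_dir>/ct2_model \
    --quantization int8   # optional weight quantization
```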

## Inference

@@ -449,7 +462,7 @@ bash eval_rev.sh <devtest_data_dir> <ckpt_dir> <system>
- `<ckpt_dir>`: path to the fairseq model checkpoint directory
- `<system>`: system name suffix to store the predictions in the format `test.{lang}.pred.{system}`

**_Note: You don’t need to reverse the test set directions for each language pair._**
***Note: You don’t need to reverse the test set directions for each language pair.***
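
For instance, an invocation might look like this sketch (the devtest directory, checkpoint directory, and system name are placeholders):

```bash
# Evaluate a fairseq checkpoint; all three arguments are placeholders
bash eval_rev.sh BPCC/devtest/all ../en-indic-preprint/fairseq_model it2_base
```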

In case of Indic-Indic evaluation, please use the following command:

@@ -521,3 +534,13 @@ url={https://openreview.net/forum?id=vfT4YuzAYA},
note={}
}
```
<!-- Reference -->

[base-en-indic]: https://indictrans2-public.objectstore.e2enetworks.net/it2_preprint_ckpts/en-indic-preprint.zip
[base-indic-en]: https://indictrans2-public.objectstore.e2enetworks.net/it2_preprint_ckpts/indic-en-preprint.zip
[base-indic-indic]: https://indictrans2-public.objectstore.e2enetworks.net/it2_preprint_ckpts/indic-indic.zip
[base-translation]: https://indictrans2-public.objectstore.e2enetworks.net/translation_outputs.zip
[base-metric]: https://drive.google.com/drive/folders/1lOOdaU0VdRSBgJEsNav5zC7wwLBis9NI?usp=sharing
[distilled-en-indic]: https://indictrans2-public.objectstore.e2enetworks.net/it2_distilled_ckpts/en-indic.zip
[distilled-indic-en]: https://indictrans2-public.objectstore.e2enetworks.net/it2_distilled_ckpts/indic-en.zip
[distilled-indic-indic]: https://indictrans2-public.objectstore.e2enetworks.net/it2_distilled_ckpts/indic-indic.zip