Skip to content
View LRLNMT's full-sized avatar

Block or report LRLNMT

Block user

Prevent this user from interacting with your repositories and sending you notifications. Learn more about blocking users.

You must be logged in to block users.

Please don't include any personal information such as legal names or email addresses. Maximum 100 characters, markdown supported. This note will be visible to only you.
Report abuse

Contact GitHub support about this user’s behavior. Learn more about reporting abuse.

Report abuse
LRLNMT/README.md

Sequence-to-Sequence Multilingual Pre-Trained Models: A Hope for Low-Resource Language Translation?

Enviroment Setup

source <<conda_path_need_to_add>>/etc/profile.d/conda.sh
conda create -n lrl_nmt_mbart50_ft  python=3.7.11
conda activate lrl_nmt_mbart50_ft
echo $CONDA_PREFIX

pip install -r requirements.txt 

1. Open-Domain Training/Fine-tuning on cc_aligned

1.a. 100k cc_aligned English → XX

bash run_opendomain_ft_100k_English_to_xx.sh -l "<<language_name>>" > logs_run_opendomain_ft_100k_English_to_xx

Language (-l) parameter value can be:

  • Afrikaans
  • French
  • Hindi
  • Kannada
  • Sinhala
  • Tamil
  • Xhosa
  • Yoruba
  • Irish

if you wish to train all (8 languages) suppported language pairs, pass wildcard syombol as -l "*":

bash run_opendomain_ft_100k_English_to_xx.sh -l "*" > logs_run_opendomain_ft_100k_English_to_xx

1.b. 100k cc_aligned XX → English

bash run_opendomain_ft_100k_xx-to-English.sh -l "<<language_name>>" > logs_run_opendomain_ft_100k_xx-to-English

Language (-l) parameter value can be:

  • Afrikaans
  • French
  • Hindi
  • Kannada
  • Sinhala
  • Tamil
  • Xhosa
  • Yoruba
  • Irish

if you wish to train all (8 languages) suppported language pairs, pass wildcard syombol as -l "*":

bash run_opendomain_ft_100k_xx-to-English.sh -l "*" > logs_run_opendomain_ft_100k_xx-to-English

1.c. 25k cc_aligned English → XX

bash run_opendomain_ft_25k_English_to_xx.sh -l "<<language_name>>" > logs_run_opendomain_ft_25k_English_to_xx

Language (-l) parameter value can be:

  • Afrikaans
  • Assamese
  • French
  • Hindi
  • Kannada
  • Sinhala
  • Tamil
  • Xhosa
  • Yoruba
  • Irish

if you wish to train all (9 languages) suppported language pairs, pass wildcard syombol as -l "*":

bash run_opendomain_ft_25k_English_to_xx.sh -l "*" > logs_run_opendomain_ft_25k_English_to_xx

1.d. 25k cc_aligned XX → English

bash run_opendomain_ft_25k_xx-to-English.sh -l "<<language_name>>" > logs_run_opendomain_ft_25k_xx-to-English

Language (-l) parameter value can be:

  • Afrikaans
  • Assamese
  • French
  • Hindi
  • Kannada
  • Sinhala
  • Tamil
  • Xhosa
  • Yoruba
  • Irish

if you wish to train all (9 languages) suppported language pairs, pass wildcard syombol as -l "*":

bash run_opendomain_ft_25k_xx-to-English.sh -l "*" > logs_run_opendomain_ft_25k_xx-to-English

2. Domain specific Training/Fine-tuning.

2.a. English → XX

bash domain-specific_ft_English_to_xx.sh -l "<<language_name>>" -d "<<domain>>"   > logs_run_domain-specific-English-to-xx

Possible combination of domain (-d) and language (-l) parameter values can be:

Domain (-d) languages (-l)
bible Afrikaans, Assamese, French, Hindi, Irish, Kannada, Sinhala, Tamil, Xhosa, Yoruba
jw300 Afrikaans, Xhosa, Yoruba
government Sinhala, Tamil
PrimeMinisterCorpus Assamese, Hindi, Kannada
dgt French, Irish

if you wish to train all suppported language pairs for each domains:

bash domain-specific_ft_English_to_xx.sh -l "*" -d "bible" > logs_bible_en-xx

bash domain-specific_ft_English_to_xx.sh -l "*" -d "jw300" > logs_jw300_en-xx

bash domain-specific_ft_English_to_xx.sh -l "*" -d "government" > logs_gov_en-xx

bash domain-specific_ft_English_to_xx.sh -l "*" -d "PrimeMinisterCorpus" > logs_pmc_en-xx

bash domain-specific_ft_English_to_xx.sh -l "*" -d "dgt" > logs_dgt_en-xx

Please note wildcard (*) is only supported for (-l) languges not for (-d) domain, you must select a (-d) domain.

2.a. XX → English

bash domain-specific_ft_xx_to_English.sh -l "<<language_name>>" -d "<<domain>>"   > logs_run_domain-specific-xx-to-English

Possible combination of domain (-d) and language (-l) parameter values can be:

Domain (-d) languages (-l)
bible Afrikaans, Assamese, French, Hindi, Irish, Kannada, Sinhala, Tamil, Xhosa, Yoruba
jw300 Afrikaans, Xhosa, Yoruba
government Sinhala, Tamil
PrimeMinisterCorpus Assamese, Hindi, Kannada
dgt French, Irish

if you wish to train all suppported language pairs for each domains:

bash domain-specific_ft_xx_to_English.sh -l "*" -d "bible" > logs_bible_xx-en

bash domain-specific_ft_xx_to_English.sh -l "*" -d "jw300" > logs_jw300_xx-en

bash domain-specific_ft_xx_to_English.sh -l "*" -d "government" > logs_gov_xx-en

bash domain-specific_ft_xx_to_English.sh -l "*" -d "PrimeMinisterCorpus" > logs_pmc_xx-en

bash domain-specific_ft_xx_to_English.sh -l "*" -d "dgt" > logs_dgt_xx-en

Please note wildcard (*) is only supported for (-l) languges not for (-d) domain, you must select a (-d) domain.

Popular repositories Loading

  1. LRLNMT LRLNMT Public

    Python 4 2

  2. DomainAdaptExt DomainAdaptExt Public

    Shell

  3. BibleCorpus BibleCorpus Public

    Bible Corpus

    Python 1