This repository is based on mission-impossible-language-models.
First, clone the two repositories:

```bash
git clone https://github.com/xiulinyang/multilingual-LM.git
cd multilingual-LM
rm -rf mistral
git clone https://github.com/xiulinyang/mistral.git
```
Second, create two virtual environments:

```bash
conda create -n mission python=3.9
conda activate mission
pip install -r requirements.txt
```

and

```bash
cd mistral
conda create -n mistral python=3.8.12 pytorch=1.11.0 torchdata cudatoolkit=11.3 -c pytorch
conda activate mistral
pip install -r setup/pip-requirements.txt
```
Third, update the file and root paths in the following files:
- utils.py
- training/conf/template/gpt2-small-template.yaml
- training/conf/template/multilingual_dataset_template.yaml
- training/conf/template/multilingual_train_template.yaml (including the wandb settings)
Then tag the data:

```bash
cd data
python tag.py path/to/language/file -b batch_size -l LANG
# e.g., python tag.py multilingual/RU/train/RU.train -b 2 -l RU
# Use a small batch size: if Stanza fails to parse a sentence, the whole batch is discarded.
```
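The batch-discarding behavior described above can be sketched as follows. This is an illustrative sketch, not the repository's actual code: `parse_batch` and its `parse_fn` argument (a stand-in for the Stanza pipeline call) are hypothetical names.

```python
def parse_batch(sentences, parse_fn):
    """Parse a batch of sentences; discard the whole batch if any sentence fails.

    parse_fn stands in for the Stanza pipeline call; it should return None
    (or raise) when it cannot parse a sentence.
    """
    results = []
    for sent in sentences:
        parsed = parse_fn(sent)
        if parsed is None:
            return None  # one failure discards the entire batch
        results.append(parsed)
    return results

# With a small batch size, a single unparseable sentence loses less data:
ok = parse_batch(["a b", "c d"], lambda s: s.split())        # both parse
bad = parse_batch(["a b", ""], lambda s: s.split() or None)  # "" fails -> None
```

This is why a small `-b` value is the safer default: the cost of one parse failure is bounded by the batch size.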
You can then run an experiment with:

```bash
bash run.sh LANG PERTURBATION RANDOM_SEED
# e.g., bash run.sh NL shuffle_deterministic21 41
```
Parameters:
- LANG: Language code (e.g., EN, DE, etc.).
- PERTURBATION: Perturbation type (defined in the FUNCTION_MAPS in utils.py).
- RANDOM_SEED: Random seed (in our experiments, we use 41, 53, 81).
If you want to experiment with additional languages or apply perturbations beyond those discussed in our paper, follow these steps:
1. Add your language data: place the new language files in the data/ folder, maintaining the existing data structure.
2. Update language references:
   - Add the new language name to utils.py and training/multilingual_dataset.py.
   - Update the tokenizer configuration in mistral/conf/models/.
3. Add your perturbation: update utils.py and training/multilingual_dataset.py with your new perturbation function.
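As a rough sketch, a new perturbation could look like the following. The function name, its signature, and the `FUNCTION_MAPS` layout shown here are assumptions for illustration; check the existing entries in utils.py for the actual interface the repository expects.

```python
import random

def shuffle_local(tokens, window=3, seed=41):
    """Hypothetical perturbation: shuffle tokens within fixed-size windows,
    deterministically given the seed."""
    rng = random.Random(seed)
    out = []
    for i in range(0, len(tokens), window):
        chunk = tokens[i:i + window]
        rng.shuffle(chunk)
        out.extend(chunk)
    return out

# Registering it would then be a matter of adding an entry such as:
FUNCTION_MAPS = {
    "shuffle_local": shuffle_local,
    # ... existing perturbations ...
}
```

Keeping the perturbation deterministic under a fixed seed matters here, since the same transform must be applied consistently across train/dev/test splits and across the three random seeds used for training.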
If you need the OPUS12 and OPUS30 corpora, or have any questions, feel free to open an issue or contact me at [email protected].