This repository contains the scripts and files that define the data mix used for training.
Run the following script to generate the data path templates (data/train_data_paths.txt, data/valid_data_paths.txt, and data/test_data_paths.txt):
python scripts/generate_data_args.py
To obtain the final files that the training script can use, run the following commands, which substitute the value of DATA_PATH into each template:
export DATA_PATH=/path/to/tokenized/datasets
envsubst < data/train_data_paths.txt > data/train_data_paths.txt.tmp
envsubst < data/valid_data_paths.txt > data/valid_data_paths.txt.tmp
envsubst < data/test_data_paths.txt > data/test_data_paths.txt.tmp
In Megatron, pass the following arguments:
--train-weighted-split-paths-path /path/to/train_data_paths.txt.tmp \
--valid-weighted-split-paths-path /path/to/valid_data_paths.txt.tmp \