bigcode-project/bigcode-data-mix

bigcode-data-mix

This repository contains the scripts and data-path files that define the data mix used for training.

1 (optional) - Generate the templates

Run the following script to generate the data templates (data/train_data_paths.txt, data/valid_data_paths.txt, and data/test_data_paths.txt):

python scripts/generate_data_args.py

2 - Substitute the data path

To obtain the final files that the training script can use, set DATA_PATH and run envsubst on each template:

export DATA_PATH=/path/to/tokenized/datasets
envsubst < data/train_data_paths.txt > data/train_data_paths.txt.tmp
envsubst < data/valid_data_paths.txt > data/valid_data_paths.txt.tmp
envsubst < data/test_data_paths.txt > data/test_data_paths.txt.tmp
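
On systems where envsubst is not installed, the same substitution can be approximated in Python. The sketch below is not part of this repository: the function name `substitute_data_path` is hypothetical, and it assumes the templates reference the variable as `$DATA_PATH` or `${DATA_PATH}`. Unlike envsubst, `string.Template.safe_substitute` leaves any other `$NAME` text untouched and treats `$$` as an escaped dollar sign.

```python
from pathlib import Path
from string import Template


def substitute_data_path(template_path, data_path):
    """Write `<template_path>.tmp` with $DATA_PATH / ${DATA_PATH} filled in.

    Hypothetical helper mirroring `envsubst < file > file.tmp` for the
    single DATA_PATH variable; returns the path of the written file.
    """
    template_path = Path(template_path)
    text = template_path.read_text()
    # safe_substitute replaces only the names it is given and leaves
    # every other "$NAME" occurrence as-is instead of raising.
    filled = Template(text).safe_substitute(DATA_PATH=data_path)
    out_path = template_path.with_name(template_path.name + ".tmp")
    out_path.write_text(filled)
    return out_path
```

Calling `substitute_data_path("data/train_data_paths.txt", "/path/to/tokenized/datasets")` once per split would produce the same `.tmp` file names as the commands above.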

In Megatron, pass the substituted files with the following arguments:

--train-weighted-split-paths-path /path/to/train_data_paths.txt.tmp \
--valid-weighted-split-paths-path /path/to/valid_data_paths.txt.tmp \
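
One easy mistake is to pass a template rather than its substituted `.tmp` copy. A quick check for leftover placeholders can catch that before launching training. This is a minimal sketch with a hypothetical helper name, `unresolved_lines`, assuming the placeholder is written as `$DATA_PATH` or `${DATA_PATH}`:

```python
import re
from pathlib import Path

# Matches ${DATA_PATH} or a bare $DATA_PATH that is not part of a longer name.
PLACEHOLDER = re.compile(r"\$(?:\{DATA_PATH\}|DATA_PATH\b)")


def unresolved_lines(path):
    """Return (line_number, line) pairs that still contain the placeholder."""
    lines = Path(path).read_text().splitlines()
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if PLACEHOLDER.search(line)
    ]
```

An empty list means the file contains no remaining `DATA_PATH` placeholder. It does not verify that the substituted paths exist on disk.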
