This folder contains example configuration files for reproducing the processing flow of the ROOTS dataset, which the BigScience initiative created to train the BLOOM models.
Download the raw data files as described in BLOOM/Oscar, then use bloom-oscar.yaml to run the full processing flow.
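
As a rough illustration of the step above, the sketch below assembles a command line that passes bloom-oscar.yaml to a processing script. This README does not name the script or its options, so the entry point `process_data.py` and the `--config` flag are assumptions; substitute whatever your installation actually provides.

```python
import sys
from pathlib import Path

# Hypothetical entry point: this README does not name the processing script.
ENTRY_POINT = "process_data.py"


def build_command(config_path, entry_point=ENTRY_POINT):
    """Return the argument list for running the processing with one config file.

    The "--config" flag is an assumption, not a documented option.
    """
    return [sys.executable, entry_point, "--config", str(Path(config_path))]


def config_exists(config_path):
    """Return True if the config file is present, so a run can fail early otherwise."""
    return Path(config_path).is_file()
```

The list returned by `build_command("bloom-oscar.yaml")` can be handed to `subprocess.run(..., check=True)` once `config_exists` confirms the file is in place and the raw data has been downloaded.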
An analysis of our reproduction will be published soon.