Exploration of BERT-like models trained on The Stack.
- Code used to train StarEncoder.
- StarEncoder was fine-tuned for PII detection to pre-process the data used to train StarCoder.

This repo also contains functionality to train encoders with contrastive objectives. After installing requirements, training can be launched via the example launcher script `./launcher.sh`.
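
Flags described in the list below can be appended to that command, e.g. `./launcher.sh --train_data_name <dataset>` (the placeholder is illustrative; substitute a dataset name your setup provides).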

- `--train_data_name` can be used to set the training dataset.
- Hyperparameters can be changed in `exp_configs.py`.
- The tokenizer to be used is treated as a hyperparameter and must also be set in `exp_configs.py` (see the config sketch after this list).
- `alpha` weighs the BERT losses (NSP+MLM) against the contrastive objective (see the loss sketch after this list).
- Setting `alpha` to 1 corresponds to the standard BERT objective.
- Token masking probabilities are set as separate hyperparameters, one for MLM and another for the contrastive loss.
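
A minimal sketch of how such an `alpha`-weighted objective could combine the three losses (the function and variable names here are assumptions for illustration, not the repo's actual API):

```python
import torch

def combined_loss(
    mlm_loss: torch.Tensor,
    nsp_loss: torch.Tensor,
    contrastive_loss: torch.Tensor,
    alpha: float = 1.0,
) -> torch.Tensor:
    """Weigh the BERT losses (NSP+MLM) against the contrastive objective.

    alpha = 1.0 recovers the standard BERT objective; smaller values
    shift weight toward the contrastive term.
    """
    return alpha * (mlm_loss + nsp_loss) + (1.0 - alpha) * contrastive_loss
```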
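
And a rough illustration of how the hyperparameters above might be grouped in `exp_configs.py` (the key names and values are hypothetical, shown only to make the knobs concrete; check the file itself for the actual schema):

```python
# Hypothetical config entry; key names are illustrative, not the repo's schema.
EXAMPLE_CONFIG = {
    "train_data_name": "<dataset>",    # also settable via --train_data_name
    "tokenizer_name": "<tokenizer>",   # the tokenizer is itself a hyperparameter
    "alpha": 1.0,                      # 1.0 = standard BERT objective (NSP+MLM only)
    "mlm_masking_prob": 0.15,          # masking probability for the MLM loss
    "contrastive_masking_prob": 0.3,   # separate masking probability for the contrastive loss
}
```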