Some update to tr10 config #20
base: master
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -46,24 +46,28 @@ GLOBAL_BATCH_SIZE=2048 | |
|
||
NLAYERS=40 | ||
NHIDDEN=5120 | ||
NHEADS=32 | ||
NHEADS=40 | ||
SEQ_LEN=2048 | ||
VOCAB_SIZE=150000 | ||
|
||
TRAIN_TOKENS=300_000_000_000 | ||
TRAIN_SAMPLES=$(python -c "print($TRAIN_TOKENS // $SEQ_LEN)") | ||
|
||
SAVE_INTERVAL=300 | ||
|
||
OPTIMIZER_ARGS=" \ | ||
--optimizer adam \ | ||
--adam-beta1 0.9 \ | ||
--adam-beta2 0.95 \ | ||
--adam-eps 1e-8 \ | ||
--lr 6e-5 \ | ||
--lr 1e-4 \ | ||
The GPT-3 paper suggests a higher learning rate. Is there a reason why we would use |
||
--min-lr 6e-6 \ | ||
--lr-decay-style cosine \ | ||
--lr-decay-samples 126_953_125 \ | ||
You removed this one without any commentary?
The original tr1-13B said:
I was looking at setting it by default to the entire number of samples we have. We have been using this in arch/scaling. However, I've just re-read the GPT-3 paper and they do it for 260B ... so not sure here. cc @TevenLeScao
Thank you for the note, Thomas. It's crucial that we leave a note trail; otherwise we have no idea why some config was added or removed. |
||
--lr-warmup-samples 216_320 \ | ||
--clip-grad 1.0 \ | ||
--weight-decay 1e-1 \ | ||
--hidden-dropout 0.0 \ | ||
--attention-dropout 0.0 \ | ||
Comment on lines +69 to +70
https://arxiv.org/abs/2010.11934 showed a strong performance loss when using dropout (Table 4). Although that was an encoder-decoder architecture, there is probably no reason dropout would benefit our decoder-only architecture. We are currently evaluating this at the 1B3 scale: https://huggingface.co/bigscience/tr3o-1B3-pile-no-dropout-logs |
||
" | ||
|
||
EXIT_OPTS=" \ | ||
|
@@ -80,7 +84,7 @@ GPT_ARGS=" \ | |
--micro-batch-size $MICRO_BATCH_SIZE \ | ||
--rampup-batch-size 16 16 6_000_000 \ | ||
--global-batch-size $GLOBAL_BATCH_SIZE \ | ||
--train-samples 300_000_000 \ | ||
--train-samples $TRAIN_SAMPLES \ | ||
--tokenizer-type PretrainedFromHF \ | ||
--tokenizer-name-or-path $TOKENIZER_NAME \ | ||
--loss-scale 12 \ | ||
|
@@ -165,7 +169,7 @@ export CMD=" \ | |
--load $CHECKPOINT_PATH \ | ||
--data-path $DATA_PATH \ | ||
--data-impl mmap \ | ||
--split 900,100,0 \ | ||
--split 950,50,0 \ | ||
I'm currently using a small dataset, so I had to give valid a larger chunk. For the real training, this needs to be restored to the above split. |
||
--distributed-backend nccl \ | ||
$DEEPSPEED_ARGS \ | ||
" | ||
|
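The sample and token arithmetic discussed in the review comments above can be checked with a short sketch. The first part repeats the integer division the config performs for `TRAIN_SAMPLES` and converts the removed `--lr-decay-samples` value back to tokens. The `cosine_lr` function is a hypothetical helper written for illustration: it assumes a generic linear-warmup plus cosine-decay schedule and is not claimed to match the scheduler in the training code line for line. Defaulting the decay length to all training samples is only the option floated in the thread, not a settled choice.

```python
import math

SEQ_LEN = 2048
TRAIN_TOKENS = 300_000_000_000

# Same integer division the config performs with `python -c`.
TRAIN_SAMPLES = TRAIN_TOKENS // SEQ_LEN

# The removed --lr-decay-samples value, converted back to tokens.
OLD_LR_DECAY_SAMPLES = 126_953_125
OLD_LR_DECAY_TOKENS = OLD_LR_DECAY_SAMPLES * SEQ_LEN


def cosine_lr(consumed_samples, max_lr=1e-4, min_lr=6e-6,
              warmup_samples=216_320, decay_samples=TRAIN_SAMPLES):
    """Illustrative schedule (assumption): linear warmup, then cosine decay to min_lr."""
    if warmup_samples > 0 and consumed_samples <= warmup_samples:
        return max_lr * consumed_samples / warmup_samples
    if consumed_samples >= decay_samples:
        return min_lr
    progress = (consumed_samples - warmup_samples) / (decay_samples - warmup_samples)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)


print(f"TRAIN_SAMPLES        = {TRAIN_SAMPLES:_}")
print(f"old decay, in tokens = {OLD_LR_DECAY_TOKENS:_}")

# Learning rate at the point where the old decay window would have ended,
# under the old decay length and under a decay over all training samples.
for decay in (OLD_LR_DECAY_SAMPLES, TRAIN_SAMPLES):
    lr = cosine_lr(OLD_LR_DECAY_SAMPLES, decay_samples=decay)
    print(f"decay over {decay:_} samples -> lr = {lr:.2e}")
```

Under these assumptions, the old decay window corresponds to 260B tokens, which is the figure the thread attributes to the GPT-3 paper. Decaying over all 300B tokens instead leaves the learning rate slightly above `--min-lr` at that point.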
I don't know why we chose 32. We seem to have updated the `NHIDDEN` value to 5120 because it was divisible by 128, and `5120 // 128 = 40`.
https://huggingface.slack.com/archives/C01NHER1JLS/p1627034738272600?thread_ts=1626827659.189400&cid=C01NHER1JLS
cc @VictorSanh @stas00 @mryab (people who were involved in the original post)
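The head-size arithmetic in this comment can be checked with a short sketch. `head_dim` is a hypothetical helper, not part of the training script; it assumes the hidden size is split evenly across attention heads, which is the usual multi-head attention convention.

```python
NHIDDEN = 5120


def head_dim(nhidden, nheads):
    """Per-head dimension, assuming the hidden size is split evenly across heads."""
    if nhidden % nheads != 0:
        raise ValueError(f"NHIDDEN={nhidden} is not divisible by NHEADS={nheads}")
    return nhidden // nheads


# Compare the old and the proposed NHEADS values from the diff.
for nheads in (32, 40):
    print(f"NHEADS={nheads} -> head dimension {head_dim(NHIDDEN, nheads)}")
```

With `NHIDDEN=5120`, 32 heads give a head dimension of 160, while 40 heads give 128, the divisor mentioned in the comment.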
FWIW, 530B training used:
So the same proportion as 32 and 5120
Also, @TevenLeScao shared elsewhere a research paper showing that many heads were found to be quite redundant anyway.
I'm not sure whether there is any research comparing head size against number of heads in terms of performance.