Detail about BERT-based Training #1

Open
SivilTaram opened this issue Mar 9, 2020 · 4 comments

@SivilTaram

Thanks for the great work on reproducing the T-Ptr-λ model! I have reproduced the non-BERT result following your kind instructions. However, when I tried to combine the model with a pretrained Chinese BERT model (the official Google bert-chinese-uncased), the model seems not to converge. Could you kindly provide more details about your BERT-based training for reference (e.g. learning_rate, warmup_steps, and training epochs)? Any suggestion is also welcome.

Thanks a lot, Qian.

@liu-nlper
Owner

@SivilTaram
Hi Qian, I used the same settings as the non-BERT model when training the BERT-based model (L-6_H-256_A-8).
I have not trained with the official 12-layer BERT model yet. I guess the 18k training examples are too few, which makes the model difficult to converge. Maybe you can try the following strategies:

  1. Reduce the number of layers, i.e. use only the first few layers of the pre-trained BERT model.
  2. Freeze the encoder for the first few epochs and then train the whole model (a rough sketch of 1 and 2 is given after this list).
  3. Design some special unsupervised pretraining tasks for the copy model, pretraining the encoder and decoder at the same time.
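
A minimal sketch of strategies 1 and 2, assuming HuggingFace transformers and PyTorch; the surrounding rewriter model and training loop are omitted, and this is not code from this repository:

```python
# Rough sketch of strategies 1 and 2 (assumed setup: HuggingFace
# `transformers` + PyTorch; not code from this repository).
import torch.nn as nn
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-chinese")

# 1. Keep only the first N transformer layers of the pre-trained BERT
#    and use the truncated model as the encoder.
N_LAYERS = 6
bert.encoder.layer = nn.ModuleList(bert.encoder.layer[:N_LAYERS])
bert.config.num_hidden_layers = N_LAYERS

# 2. Freeze the BERT encoder for the first few epochs, then unfreeze it
#    and fine-tune the whole model.
def set_bert_trainable(model, trainable):
    for p in model.parameters():
        p.requires_grad = trainable

set_bert_trainable(bert, False)   # warm-up epochs: train the decoder only
# ... train a few epochs ...
set_bert_trainable(bert, True)    # then fine-tune everything end-to-end
```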

@SivilTaram
Author

@liu-nlper Thanks for your quick response! I will try again following your kind suggestions. If it works out, I will come back and report the experimental results.

@SivilTaram
Author

SivilTaram commented Mar 18, 2020

After struggling for a few days, I finally have to admit that it is difficult to incorporate the official 12-layer Chinese BERT into the rewrite task (whether for the reproduced T-Ptr-Net, T-Ptr-λ, or even L-Ptr-λ). I have tried the following settings, but none of them showed improvements over the non-BERT baseline:

  • 12-layer encoder, 12-layer decoder (encoder initialized by BERT, fine-tuned with learning rate from 0.1 to 1.5)
  • 12-layer encoder, 6-layer decoder, hidden 768 (encoder initialized by BERT, fine-tuned with learning rate from 0.1 to 1.5)
  • 6-layer encoder, 6-layer decoder, hidden 256 (BERT as encoder embedding; sketched below)
  • LSTM encoder, LSTM decoder, hidden 512 (BERT as encoder embedding)
  • 6-layer encoder, 6-layer decoder, hidden 768 (encoder initialized by the first 6 layers of BERT)

I post the above results for reference. If any reader has successfully employed BERT (Google's 12-layer Chinese model) in this task, please feel free to contact me (qian dot liu at buaa.edu.cn), thanks :)
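
For anyone trying to replicate these settings, the "BERT as encoder embedding" variants were wired roughly as follows. This is a minimal sketch under my own assumptions (HuggingFace transformers plus plain PyTorch modules); the class name and shapes are illustrative, not code from this repository:

```python
# "BERT as encoder embedding": BERT runs as a frozen feature extractor,
# and its hidden states replace the token embeddings fed to a small
# Transformer encoder (hidden 256, 6 layers, 8 heads in my setting).
import torch
import torch.nn as nn
from transformers import BertModel

class BertEmbeddingEncoder(nn.Module):          # illustrative name
    def __init__(self, d_model=256, n_layers=6, n_heads=8):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        for p in self.bert.parameters():        # keep BERT frozen
            p.requires_grad = False
        self.proj = nn.Linear(self.bert.config.hidden_size, d_model)  # 768 -> 256
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, input_ids, attention_mask):
        with torch.no_grad():
            feats = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        x = self.proj(feats)
        return self.encoder(x, src_key_padding_mask=~attention_mask.bool())
```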

@zjwzcn07

zjwzcn07 commented Mar 19, 2020

@SivilTaram

I also tried using these BERT models to initialize the transformer layers, but they didn't show improvements. The models were as follows (a layer-selection sketch is given after this list):

  • L3H8
  • L6H8 (1st, 2nd, 3rd, 4th, 5th, 6th layers from BERT)
  • L6H8 (1st, 3rd, 5th, 7th, 9th, 11th layers from BERT)
  • L12H8
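
The layer-selection variants can be set up roughly like this (again a sketch assuming HuggingFace transformers; the truncated BertModel stands in for whatever encoder the model actually uses):

```python
# Initialize a 6-layer encoder from selected BERT layers, e.g. the
# alternating layers 0, 2, 4, 6, 8, 10 (the "1st, 3rd, 5th, ..." setting).
# Assumes HuggingFace `transformers`; adapt the copy step to your encoder.
import copy
import torch.nn as nn
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-chinese")

KEEP = [0, 2, 4, 6, 8, 10]                       # 0-based layer indices
bert.encoder.layer = nn.ModuleList(
    copy.deepcopy(bert.encoder.layer[i]) for i in KEEP
)
bert.config.num_hidden_layers = len(KEEP)
```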

But I find that the BERT-based model performs better on another dev dataset.
Could you add me on WeChat? My WeChat ID is CHNyouqh. Thanks :)
