v0.7.0
News
- GluonNLP will be featured at KDD 2019 in Alaska! Check out our tutorial: From Shallow to Deep Language Representations: Pre-training, Fine-tuning, and Beyond.
- GluonNLP was featured at JSALT 2019 in Montreal on June 14, 2019! Check out https://jsalt19.mxnet.io.
Models and Scripts
BERT
- BERT model pre-trained on the OpenWebText Corpus, BooksCorpus, and English Wikipedia. Test scores on the GLUE benchmark and SQuAD 1.1 are reported below, and a sketch of loading the checkpoints follows this list. Usability of the BERT pre-training script is also improved: on-the-fly training data generation, SentencePiece support, Horovod support, etc. (#799, #687, #806, #669, #665). Thank you @davisliang
| Source | GluonNLP | google-research/bert | google-research/bert |
|---|---|---|---|
| Model | bert_12_768_12 | bert_12_768_12 | bert_24_1024_16 |
| Dataset | openwebtext_book_corpus_wiki_en_uncased | book_corpus_wiki_en_uncased | book_corpus_wiki_en_uncased |
| SST-2 | 95.3 | 93.5 | 94.9 |
| RTE | 73.6 | 66.4 | 70.1 |
| QQP | 72.3 | 71.2 | 72.1 |
| SQuAD 1.1 (F1/EM) | 91.0/84.4 | 88.5/80.8 | 90.9/84.1 |
| STS-B | 87.5 | 85.8 | 86.5 |
| MNLI-m/mm | 85.3/84.9 | 84.6/83.4 | 86.7/85.9 |
- The SciBERT model introduced by Iz Beltagy, Arman Cohan, and Kyle Lo in "SciBERT: Pretrained Contextualized Embeddings for Scientific Text". The model checkpoints are converted from the original AllenAI repository and are available with the following datasets (#735):
  - scibert_scivocab_uncased
  - scibert_scivocab_cased
  - scibert_basevocab_uncased
  - scibert_basevocab_cased
- The BioBERT model introduced by Lee, Jinhyuk, et al. in "BioBERT: a pre-trained biomedical language representation model for biomedical text mining". The model checkpoints are converted from the original repository and are available with the following datasets (#735):
  - biobert_v1.0_pmc_cased
  - biobert_v1.0_pubmed_cased
  - biobert_v1.0_pubmed_pmc_cased
  - biobert_v1.1_pubmed_cased
- The ClinicalBERT model introduced by Kexin Huang, Jaan Altosaar, and Rajesh Ranganath in "ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission". The model checkpoints are converted from the original repository and are available with the clinicalbert_uncased dataset (#735).
- The ERNIE model introduced by Sun, Yu, et al. in "ERNIE: Enhanced Representation through Knowledge Integration". The model checkpoints are converted from the original repository and can be obtained with model.get_model("ernie_12_768_12", "baidu_ernie_uncased") (#759) thanks @paperplanet
- BERT fine-tuning script for named entity recognition on CoNLL2003, reaching test F1 92.2 (#612).
- BERT fine-tuning script for the Chinese XNLI dataset, reaching 78.3% validation accuracy (#759) thanks @paperplanet
- BERT fine-tuning script for intent classification and slot labeling on ATIS (95.9 F1) and SNIPS (95.9 F1) (#817).
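As referenced above, here is a minimal sketch of loading one of these converted checkpoints through the model zoo. It assumes GluonNLP 0.7 with network access for the first download, and uses the standard get_model keyword arguments:

```python
import gluonnlp as nlp

# Load the 12-layer BERT base encoder with the SciBERT weights and vocabulary (#735).
# The decoder/classifier heads are disabled since only the encoder outputs are needed.
bert, vocab = nlp.model.get_model(
    'bert_12_768_12',
    dataset_name='scibert_scivocab_uncased',
    pretrained=True,
    use_decoder=False,
    use_classifier=False)
print(bert)  # inspect the encoder structure
```

Swapping dataset_name for one of the BioBERT or ClinicalBERT tags above loads those checkpoints with the same call.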
GPT-2
- The GPT-2 language model introduced by Radford, Alec, et al. in "Language Models are Unsupervised Multitask Learners". The model checkpoints (gpt2_117m, gpt2_345m) trained on the openai_webtext dataset are converted from the original repository, and a script to generate text from a GPT-2 model is included (#761).
ESIM
- The ESIM model for text matching introduced by Chen, Qian, et al. in "Enhanced LSTM for Natural Language Inference". (#689)
Data
- Natural language understanding datasets from the GLUE benchmark: CoLA, SST-2, MRPC, STS-B, MNLI, QQP, QNLI, WNLI, RTE (#682); see the loading sketch after this list.
- Sentiment analysis datasets: CR, MPQA (#663)
- Intent classification and slot labeling datasets: ATIS and SNIPS (#816)
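A minimal sketch of loading one of the new datasets (SST-2 shown; assumes the default download root and network access on first use):

```python
import gluonnlp as nlp

# Download and load the SST-2 development split from the GLUE benchmark (#682).
sst_dev = nlp.data.GlueSST2(segment='dev')
print(len(sst_dev))  # number of samples
print(sst_dev[0])    # a [sentence, label] pair
```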
New Features
- [Feature] Support saving model/trainer states to S3 (#700)
- [Feature] Support loading model/trainer states from S3 (#702)
- [Feature] Add SentencePieceTokenizer for BERT (#669)
- [FEATURE] Flexible vocabulary (#732)
- [API] Moving MaskedSoftmaxCELoss and LabelSmoothing to model API (#754) thanks @ThomasDelteil
- [Feature] add the List batchify function (see the sketch after this list) (#812) thanks @ThomasDelteil
- [FEATURE] Add LAMB optimizer (#733)
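As noted above, a minimal sketch of the new List batchify function combined with the existing Tuple and Pad helpers; the toy samples are made up for illustration:

```python
import gluonnlp as nlp

# Pad token ids into a batch tensor, but keep the raw strings as a plain Python list.
batchify_fn = nlp.data.batchify.Tuple(
    nlp.data.batchify.Pad(pad_val=0),  # token ids -> padded NDArray
    nlp.data.batchify.List())          # raw text  -> Python list (new in #812)

samples = [([1, 2, 3], 'first sentence'), ([4, 5], 'second one')]
token_ids, texts = batchify_fn(samples)
print(token_ids.shape)  # (2, 3) after padding
print(texts)            # ['first sentence', 'second one']
```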
Bug Fixes
- [BUGFIX] Fixes for BERT embedding, pretraining scripts (#640) thanks @Deseaus
- [BUGFIX] Update hash of wiki_cn_cased and wiki_multilingual_cased vocab (#655)
- fix bert forward call parameter mismatch (#695) thanks @paperplanet
- [BUGFIX] Fix mlm_loss reporting for eval dataset (#696)
- Fix _get_rnn_cell (#648) thanks @MarisaKirisame
- [BUGFIX] fix mrpc dataset idx (#708)
- [bugfix] fix hybrid beam search sampler (#710)
- [BUGFIX] [DOC] Update nlp.model.get_model documentation and get_model API (#734)
- [BUGFIX] Fix handling of duplicate special tokens in Vocabulary (#749)
- [BUGFIX] Fix TokenEmbedding serialization with emb[emb.unknown_token] != 0 (#763)
- [BUGFIX] Fix GLUE test result serialization (#773)
- [BUGFIX] Fix init bug for multilevel BiLMEncoder (#783) thanks @Ishitori
API Changes
- [API] Dropping support for wiki_multilingual and wiki_cn (#764)
- [API] Remove get_bert_model from the public API list (#767)
Enhancements
- [FEATURE] offer load_w2v_binary method to load w2v binary file (#620)
- [Script] Add inference function for BERT classification (#639) thanks @TaoLv
- [SCRIPT] Add static BERT base export script (for use with MXNet Module API) (#672)
- [Enhancement] One script to export bert for classification/regression/QA (#705)
- [enhancement] refactor bert finetuning script (#692)
- [Enhancement] only use the best model for inference for bert classification (#716)
- [Dataset] redistribute conll2004 (#719)
- [Enhancement] add periodic evaluation for BERT pre-training (#720)
- [FEATURE] Add XNLI task (#717)
- [refactor] Refactor BERT script folder (#744)
- [Enhancement] BERT pre-training data generation from sentencepiece vocab (#743)
- [REFACTOR] Refactor TokenEmbedding to reduce number of places that initialize internals (#750)
- [Refactor] Refactor BERT SQuAD inference code (#758)
- [Enhancement] Fix dtype conversion, add sentencepiece support for SQuAD (#766)
- [Dataset] Move MRPC dataset to API (#780)
- [BiDAF-QANet] Common data processing logic for BiDAF and QANet (#739) thanks @Ishitori
- [DATASET] add LCQMC, ChnSentiCorp dataset (#774) thanks @paperplanet
- [Improvement] Implement parser evaluation in Python (#772)
- [Enhancement] Add whole word masking for BERT (#770) thanks @basicv8vc
- [Enhancement] Mixed precision support for BERT fine-tuning (#793)
- Generate BERT training samples in compressed format (#651)
Minor Fixes
- Various documentation fixes: #635, #637, #647, #656, #664, #667, #670, #676, #678, #681, #698, #704, #731, #745, #746, #762, #771, #778, #800, #807, #810, #814 thanks @rongruosong @crcrpar @mrchypark @xwind-h
- Fix BERT multiprocessing data creation bug which caused unnecessary dispatching to a single worker (#649)
- [BUGFIX] Update BERT test and pre-train script (#661)
- update url for ws353 (#701)
- bump up version (#742)
- [DOC] Update textCNN results (#737)
- padding value warning (#747)
- [TUTORIAL][DOC] Tutorial Updates (#802) thanks @faramarzmunshi
Continuous Integration
- skip failing tests in mxnet master (#685)
- [CI] update nodes for CI (#686)
- [CI] CI refactoring to speed up tests (#566)
- [CI] fix codecov (#693)
- use fixture for squad dataset tests (#699)
- [CI] create zipped notebooks for link check (#712)
- Fix test infrastructure for pytest > 4 and bump CI pytest version (#728)
- [CI] set root in BERT tests (#738)
- Fix conftest.py function_scope_seed (#748)
- [CI] Fix links in contribute.rst (#752)
- [CI] Update CI dependencies (#756)
- Revert "[CI] Update CI dependencies (#756)" (#769)
- [CI] AWS Batch serverless CI Pipeline for parallel notebook execution during website build step (#791)
- [CI] Don't exit pipeline before displaying AWS Batch logfiles (#801)
- [CI] Fix for "Don't exit pipeline before displaying AWS Batch logfiles" (#803)
- add license checker (#804)
- enable timeout (#813)
- Fix website build on master branch (#819)