v0.3.3
GluonNLP v0.3 contains many exciting new features.
(depends on MXNet 1.3.0b20180725)
Models
- Language Models
- The Cache Language Model of Grave, E., et al. "Improving neural language models with a continuous cache" (ICLR 2017) is now part of gluonnlp.model.train (#110)
- The Activation Regularizer and Temporal Activation Regularizer of Merity, S., et al. "Regularizing and optimizing LSTM language models" (ICLR 2018) are now part of gluonnlp.loss (#110); see the first sketch after this list
- Machine Translation
- The Transformer Model of Vaswani, Ashish, et al. "Attention is all you need" (Advances in Neural Information Processing Systems, 2017) is now part of the gluonnlp NMT scripts (#133)
- Word embeddings
- Trainable word embedding models are introduced as part of gluonnlp.model.train (#136); pretrained vectors can be attached to a Vocab (see the second sketch below)
- Word2Vec by Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111-3119).
- fastText by Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5, 135-146.
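A minimal sketch of the new regularizers, assuming the class names ActivationRegularizationLoss and TemporalActivationRegularizationLoss under gluonnlp.loss and illustrative tensor shapes (in AWD-LSTM training the states come from the RNN outputs):

```python
import mxnet as mx
import gluonnlp as nlp

# AR penalizes large hidden activations; TAR penalizes large changes
# between consecutive time steps (Merity et al., ICLR 2018).
ar = nlp.loss.ActivationRegularizationLoss(alpha=2)
tar = nlp.loss.TemporalActivationRegularizationLoss(beta=1)

# Illustrative RNN states: (sequence_length, batch_size, hidden_size)
states = mx.nd.random.uniform(shape=(35, 20, 200))
extra_loss = ar(states) + tar(states)
print(extra_loss)
```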
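And a minimal sketch of the consumer side: building a Vocab and attaching pretrained fastText vectors through TokenEmbedding (wiki.simple is one of the smaller pretrained sources; its vectors are 300-dimensional):

```python
import gluonnlp as nlp

# Build a vocabulary from token counts.
counter = nlp.data.count_tokens('the quick brown fox jumps over the lazy dog'.split())
vocab = nlp.Vocab(counter)

# Attach pretrained fastText vectors (Bojanowski et al., 2017).
vocab.set_embedding(nlp.embedding.create('fasttext', source='wiki.simple'))
print(vocab.embedding['fox'].shape)  # (300,)
```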
New Datasets
- Machine Translation
- WMT2014BPE (#135) (#177) (#180)
- Question Answering
- Stanford Question Answering Dataset (SQuAD): Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 2383-2392). (#113) See the loading sketch after this list.
- Word Embeddings
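Loading the new datasets follows the usual gluonnlp.data pattern; a minimal sketch (argument names follow the documented defaults, and the download location honors the MXNET_HOME variable noted under API changes):

```python
import gluonnlp as nlp

# SQuAD: each record bundles a question with its context paragraph and answers.
squad_train = nlp.data.SQuAD(segment='train')
print(len(squad_train))

# WMT 2014 English-German with byte-pair encoded tokens.
wmt_train = nlp.data.WMT2014BPE('train', src_lang='en', tgt_lang='de')
print(len(wmt_train))
```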
API changes
- The download directory for datasets and other artifacts can now be specified via the MXNET_HOME environment variable (#106)
- The TokenEmbedding class now exposes the inverse vocab as well (#123)
- SortedSampler now supports use_average_length option (#135)
- Add more strategies for bucket creation (#145)
- Add a tokenizer option to the BLEU computation (#154)
- Add ConvolutionalEncoder and Highway blocks (#129) (#186); see the last sketch after this list
- Add plain text of translation data (#158)
- Use Sherlock Holmes dataset instead of PTB for language model notebook (#174)
- Add classes JiebaTokenizer and NLTKStanfordSegmenter for Chinese word segmentation (#164); see the second sketch after this list
- Allow toggling output and prompt in documentation website (#184)
- Add shape assertion statements for better user experience to some attention cells (#201)
- Add support for computing word embeddings for unknown words in the TokenEmbedding class (#185)
- Distribute subword vectors for pretrained fastText embeddings, enabling embeddings for unknown words (#185); see the first sketch after this list
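A minimal sketch of the unknown-word support in TokenEmbedding; the load_ngrams keyword used to request the distributed subword vectors is an assumption here — check the gluonnlp.embedding documentation for the exact switch:

```python
import gluonnlp as nlp

# Hypothetical flag: load_ngrams requests the distributed fastText subword
# vectors so out-of-vocabulary words receive composed embeddings.
emb = nlp.embedding.create('fasttext', source='wiki.simple', load_ngrams=True)

print(emb['hello'].shape)    # in-vocabulary lookup
print(emb['hellooo'].shape)  # composed from subword n-gram vectors (#185)
print(emb.idx_to_token[:5])  # the newly exposed inverse vocab (#123)
```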
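The new Chinese word segmentation classes are plain callables; a minimal sketch (JiebaTokenizer requires the jieba package, and NLTKStanfordSegmenter additionally needs the Stanford segmenter jar):

```python
import gluonnlp as nlp

# Jieba-backed word segmentation for Chinese text.
tokenizer = nlp.data.JiebaTokenizer()
print(tokenizer('我来到北京清华大学'))  # e.g. ['我', '来到', '北京', '清华大学']
```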
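The new encoder blocks compose like any other Gluon block; a minimal sketch with the default filter configuration (the (sequence_length, batch_size, embed_size) input layout and the default embed_size of 15 follow the class defaults; treat the exact defaults as assumptions):

```python
import mxnet as mx
import gluonnlp as nlp

# Character-level CNN encoder followed by Highway layers.
encoder = nlp.model.ConvolutionalEncoder()
encoder.initialize()

# (character_sequence_length, batch_size, character_embedding_size)
x = mx.nd.random.uniform(shape=(8, 2, 15))
print(encoder(x).shape)  # (batch_size, total_number_of_filters)
```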
Fixes & Small Changes
- Fixed bptt_batchify sometimes returning an invalid last batch (#120)
- Fixed wrong PPL calculation in word language model script for multi-GPU (#150)
- Fix compound word splitting and the WMT16 results (#151)
- Adapt pretrained word embeddings example notebook for nd.topk change in mxnet 1.3 (#153)
- Fix beam search script (#175)
- Fix small bugs in parser (#183)
- TokenEmbedding: Skip lines with invalid bytes instead of crashing (#188)
- Fix excessive memory use in TokenEmbedding serialization/deserialization when some tokens are very large (e.g. 50k characters) (#187)
- Remove duplicates in WordSim353 when combining segments (#192)