This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

v0.3.3

@leezu leezu released this 13 Jun 05:26
· 676 commits to master since this release

GluonNLP v0.3 contains many exciting new features.
(This release depends on MXNet 1.3.0b20180725.)

Models

  • Language Models
  • Machine Translation
    • The Transformer model from Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems, 2017, is added as part of the gluonnlp nmt scripts (#133)
  • Word embeddings
    • Trainable word embedding models are introduced as part of gluonnlp.model.train (#136)
      • Word2Vec by Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).
      • FastText models by Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5, 135-146.
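The fastText models above build a word's vector from its character n-grams, which is what later enables embeddings for unknown words. The following is a minimal, self-contained sketch of that idea; all names, dimensions, and the hashing scheme here are illustrative and are not the gluonnlp API.

```python
# Sketch of the fastText idea (Bojanowski et al., 2017): a word vector is the
# sum of vectors for its character n-grams, so a vector can be composed even
# for words never seen during training. Sizes are illustrative.
import random
import zlib

DIM = 8          # embedding dimension (illustrative)
BUCKETS = 1000   # number of n-gram hash buckets (illustrative)

rng = random.Random(0)
# One vector per hash bucket, standing in for fastText's subword table.
ngram_table = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(BUCKETS)]

def char_ngrams(word, n_min=3, n_max=6):
    """Extract character n-grams of a word wrapped in boundary markers."""
    w = '<' + word + '>'
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def word_vector(word):
    """Sum the bucket vectors of all character n-grams of the word."""
    vec = [0.0] * DIM
    for ng in char_ngrams(word):
        bucket = zlib.crc32(ng.encode('utf-8')) % BUCKETS
        for d in range(DIM):
            vec[d] += ngram_table[bucket][d]
    return vec
```

Because the lookup depends only on character n-grams, `word_vector` returns a non-trivial vector for any string, including out-of-vocabulary words.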

New Datasets

API changes

  • The download directory for datasets and other artifacts can now be specified
    via the MXNET_HOME environment variable. (#106)
  • The TokenEmbedding class now also exposes the inverse vocabulary (#123)
  • SortedSampler now supports use_average_length option (#135)
  • Add more strategies for bucket creation (#145)
  • Add a tokenizer option to the BLEU computation (#154)
  • Add Convolutional Encoder and Highway Layer (#129) (#186)
  • Add plain text versions of translation data (#158)
  • Use Sherlock Holmes dataset instead of PTB for language model notebook (#174)
  • Add classes JiebaTokenizer and NLTKStanfordSegmenter for Chinese word segmentation (#164)
  • Allow toggling output and prompt in documentation website (#184)
  • Add shape assertion statements for better user experience to some attention cells (#201)
  • Add support for computation of word embeddings for unknown words in TokenEmbedding class (#185)
  • Distribute subword vectors for pretrained fastText embeddings, enabling embeddings for unknown words (#185)
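Per the first item above, the download directory is read from the MXNET_HOME environment variable. A minimal shell sketch; the path is illustrative:

```shell
# Redirect MXNet/GluonNLP downloads (datasets, pretrained artifacts)
# to a custom location; the path below is illustrative.
export MXNET_HOME=/data/mxnet-cache
echo "$MXNET_HOME"   # subsequent downloads land under this directory
```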
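The bucket-creation item above refers to grouping sequences of similar length so each batch pads to roughly the same length. The sketch below illustrates one such strategy (fixed-width length buckets) in plain Python; the function name and parameters are illustrative, not the gluonnlp sampler API.

```python
# Minimal sketch of length-based bucketing: assign each sample index to a
# bucket keyed by length // bucket_width, so samples within a bucket need
# similar amounts of padding. Illustrative only, not the gluonnlp API.
from collections import defaultdict

def fixed_width_buckets(lengths, bucket_width=5):
    """Group sample indices into fixed-width length buckets."""
    buckets = defaultdict(list)
    for idx, length in enumerate(lengths):
        buckets[length // bucket_width].append(idx)
    return dict(buckets)

lengths = [3, 4, 12, 13, 27, 5, 14]
buckets = fixed_width_buckets(lengths)
# lengths 3 and 4 share bucket 0; 12, 13, 14 share bucket 2; 5 and 27
# land in buckets 1 and 5 respectively
```

Other strategies vary how bucket boundaries are chosen (e.g. by quantiles of the length distribution rather than fixed widths), trading bucket balance against padding waste.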

Fixes & Small Changes

  • Fixed bptt_batchify sometimes returning an invalid last batch (#120)
  • Fixed wrong PPL calculation in word language model script for multi-GPU (#150)
  • Fix compound-word splitting and WMT16 results (#151)
  • Adapt pretrained word embeddings example notebook for nd.topk change in mxnet 1.3 (#153)
  • Fix beam search script (#175)
  • Fix small bugs in parser (#183)
  • TokenEmbedding: Skip lines with invalid bytes instead of crashing (#188)
  • Fix excessive memory use in TokenEmbedding serialization/deserialization when some tokens are very large (e.g., 50k characters) (#187)
  • Remove duplicates in WordSim353 when combining segments (#192)

See all commits