

Character-level Language Modeling

Overview

In character-level language modeling, each sequence is split into individual characters, so at each time step the model must predict the next character. We evaluate the temporal convolutional network (TCN) as a character-level language model on the PennTreebank (PTB) and text8 datasets.
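Concretely, the training target at each position is just the input shifted by one character. A minimal sketch of this framing (illustrative only, not the repository's code):

text = "hello world"
alphabet = sorted(set(text))
char2idx = {c: i for i, c in enumerate(alphabet)}

ids = [char2idx[c] for c in text]
inputs, targets = ids[:-1], ids[1:]  # at step t, predict the character at t+1

for x, y in zip(inputs, targets):
    print(f"{alphabet[x]!r} -> {alphabet[y]!r}")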

Data

  • PennTreebank: When used as a character-level language corpus, PTB contains 5,059K characters for training, 396K for validation, and 446K for testing, with an alphabet size of 50. PennTreebank is a well-studied (but relatively small) language dataset.

  • text8: text8 is about 20 times larger than PTB, with about 100M characters from Wikipedia (90M for training, 5M for validation, and 5M for testing). The corpus has an alphabet of 27 unique characters.

See data_generator in utils.py. We download the language corpus using the observations package in Python.
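For illustration, a minimal sketch of this loading step (assuming, per the observations documentation, that ptb(path) downloads the data if absent and returns the train, test, and validation splits as raw strings; this is not the repository's exact code):

import numpy as np
from observations import ptb

# Assumption: ptb(path) returns (train, test, valid) as raw character strings.
train_text, test_text, valid_text = ptb("./data")

# Build a character-to-index mapping over the full corpus.
alphabet = sorted(set(train_text + valid_text + test_text))
char2idx = {c: i for i, c in enumerate(alphabet)}

def to_ids(text):
    # Encode a string as a 1-D array of integer character ids.
    return np.array([char2idx[c] for c in text], dtype=np.int64)

train, valid, test = map(to_ids, (train_text, valid_text, test_text))
print("alphabet size:", len(alphabet))  # ~50 for PTB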

Note

  • Just as it is common in recurrent network implementations to repackage hidden units when a new sequence begins, we pass the TCN a sequence T consisting of two parts: 1) the effective history L1, and 2) the valid sequence L2:
Sequence [---------T---------] = [--L1-- -----L2-----]

In the forward pass, the whole sequence T is passed through the TCN, but only the L2 portion is used to compute the loss; see the masked-loss sketch at the end of this note. This ensures that every training target is conditioned on sufficient history. The sizes of T and L2 can be adjusted via the seq_len and validseqlen flags.

  • The dataset to use can be specified via the --dataset flag. For instance, running
python char_cnn_test.py --dataset ptb

would train on the PennTreebank (PTB) dataset, downloading it first if the data is not found locally.

  • Empirically, we found that Adam works better than SGD on the text8 dataset.
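To illustrate the masked loss from the first bullet above, here is a minimal PyTorch sketch (shapes and values are hypothetical, not the repository's exact code): the model produces logits for every position in T, but cross-entropy is taken only over the last validseqlen positions, so the leading L1 positions act purely as history.

import torch
import torch.nn.functional as F

seq_len, validseqlen = 400, 320       # |T| and |L2|; hypothetical values
batch, vocab = 16, 50                 # PTB's alphabet size is ~50

# Stand-in for the TCN output: one logit vector per position in T.
logits = torch.randn(batch, seq_len, vocab)        # (B, T, V)
targets = torch.randint(vocab, (batch, seq_len))   # next-character ids, (B, T)

# Score only the last L2 positions; the first L1 = seq_len - validseqlen
# positions provide context but contribute nothing to the loss.
eff_logits = logits[:, -validseqlen:, :].reshape(-1, vocab)
eff_targets = targets[:, -validseqlen:].reshape(-1)

loss = F.cross_entropy(eff_logits, eff_targets)
print(loss.item())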