In character-level language modeling, each sequence is split into individual characters, so at each time step the model must predict the next character. We evaluate the temporal convolutional network as a character-level language model on the PennTreebank dataset and the text8 dataset.
- PennTreebank: When used as a character-level language corpus, PTB contains 5,059K characters for training, 396K for validation, and 446K for testing, with an alphabet size of 50. PennTreebank is a well-studied (but relatively small) language dataset.
- text8: text8 is about 20 times larger than PTB, with about 100M characters from Wikipedia (90M for training, 5M for validation, and 5M for testing). The alphabet consists of 27 unique characters.
See `data_generator` in `utils.py`. We download the language corpora using the `observations` package in Python.
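The actual loading code lives in `data_generator`; as a rough, self-contained sketch of what a character-level pipeline does (building a character vocabulary and shifting the input by one position to form next-character targets), one might write something like the following. The helper names `build_vocab` and `encode` are made up for illustration and are not the repository's API.

```python
import torch

def build_vocab(text):
    """Map each unique character to an integer index (illustrative helper)."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}

def encode(text, char2idx):
    """Turn a string into a 1-D LongTensor of character indices."""
    return torch.tensor([char2idx[ch] for ch in text], dtype=torch.long)

# Toy usage: for next-character prediction, the target is the input shifted by one.
raw = "the quick brown fox"
char2idx = build_vocab(raw)
data = encode(raw, char2idx)
inputs, targets = data[:-1], data[1:]
print(len(char2idx), inputs.shape, targets.shape)
```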
- Just like in a recurrent network implementation, where it is common to repackage hidden units when a new sequence begins, we pass into the TCN a sequence `T` consisting of two parts: 1) effective history `L1`, and 2) valid sequence `L2`:

      Sequence [---------T---------] = [--L1-- -----L2-----]

  In the forward pass, the whole sequence is passed into the TCN, but only the `L2` portion is used for training. This ensures that the training data are also provided with sufficient history. The sizes of `T` and `L2` can be adjusted via the flags `seq_len` and `validseqlen`.
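For concreteness, a minimal sketch of this idea is shown below: the full window is fed through the network, but the loss covers only the last `validseqlen` positions. The function name, tensor shapes, and the assumption that the model returns per-position logits are illustrative, not the exact code in `char_cnn_test.py`.

```python
import torch.nn.functional as F

def valid_seq_loss(model, x, y, seq_len, validseqlen):
    """Cross-entropy restricted to the valid (L2) part of each sequence.

    x, y: (batch, seq_len) index tensors, with y equal to x shifted by one step.
    Assumes the model returns logits of shape (batch, seq_len, vocab_size).
    """
    logits = model(x)                       # the full T = L1 + L2 window goes through the TCN
    eff_history = seq_len - validseqlen     # leading positions used only as context (L1)
    valid_logits = logits[:, eff_history:].contiguous()
    valid_targets = y[:, eff_history:].contiguous()
    return F.cross_entropy(valid_logits.view(-1, valid_logits.size(-1)),
                           valid_targets.view(-1))
```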
- The dataset to use can be specified via the `--dataset` flag. For instance, running `python char_cnn_test.py --dataset ptb` would train on the PennTreebank (PTB) dataset (downloading the data first if none is found).
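For readers unfamiliar with how such a flag is typically wired up, here is a minimal `argparse` sketch; it only illustrates the `--dataset` option and does not reproduce the actual argument list of `char_cnn_test.py`.

```python
import argparse

parser = argparse.ArgumentParser(description='Character-level TCN language model')
# The real script defines many more options; only the dataset flag is sketched here.
parser.add_argument('--dataset', type=str, default='ptb',
                    help='which corpus to train on, e.g. ptb or text8')
args = parser.parse_args()
print('Training on', args.dataset)
```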
- Empirically, we found that Adam works better than SGD on the text8 dataset.
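Switching between the two optimizers in PyTorch is a one-line change, as the sketch below shows; the learning rates are placeholders rather than the settings used for the reported results.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 10)   # stand-in for the TCN model; illustrative only

# Placeholder learning rates; the values used in the actual experiments may differ.
use_adam = True             # per the note above, Adam tended to work better on text8
optimizer = (optim.Adam(model.parameters(), lr=2e-3) if use_adam
             else optim.SGD(model.parameters(), lr=4.0))
```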