

Character-level Language Modeling

Overview

In character-level language modeling, each sequence is split into individual characters, so at each time step the model must predict the next character. We evaluate the temporal convolutional network (TCN) as a character-level language model on the PennTreebank (PTB) and text8 datasets.
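Concretely, the training target at each position is just the input shifted by one character. A minimal sketch of this framing (illustrative only, not the repository's code):

text = "hello world"
alphabet = sorted(set(text))
char2idx = {c: i for i, c in enumerate(alphabet)}

ids = [char2idx[c] for c in text]
inputs, targets = ids[:-1], ids[1:]  # at step t, predict the character at t+1

for x, y in zip(inputs, targets):
    print(f"{alphabet[x]!r} -> {alphabet[y]!r}")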

Data

  • PennTreebank: When used as a character-level language corpus, PTB contains 5,059K characters for training, 396K for validation, and 446K for testing, with an alphabet size of 50. PennTreebank is a well-studied (but relatively small) language dataset.

  • text8: text8 is about 20 times larger than PTB, with about 100M characters from Wikipedia (90M for training, 5M for validation, and 5M for testing). The corpus has an alphabet of 27 unique characters.

See data_generator in utils.py. We download the language corpus using the observations package in Python.
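For illustration, a minimal sketch of this loading step (assuming, per the observations documentation, that ptb(path) downloads the data if absent and returns the train, test, and validation splits as raw strings; this is not the repository's exact code):

import numpy as np
from observations import ptb

# Assumption: ptb(path) returns (train, test, valid) as raw character strings.
train_text, test_text, valid_text = ptb("./data")

# Build a character-to-index mapping over the full corpus.
alphabet = sorted(set(train_text + valid_text + test_text))
char2idx = {c: i for i, c in enumerate(alphabet)}

def to_ids(text):
    # Encode a string as a 1-D array of integer character ids.
    return np.array([char2idx[c] for c in text], dtype=np.int64)

train, valid, test = map(to_ids, (train_text, valid_text, test_text))
print("alphabet size:", len(alphabet))  # ~50 for PTB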

Note

  • Just as it is common in recurrent network implementations to repackage hidden units when a new sequence begins, we pass the TCN a sequence T consisting of two parts: 1) the effective history L1, and 2) the valid sequence L2:
Sequence [---------T---------] = [--L1-- -----L2-----]

In the forward pass, the whole sequence T is passed through the TCN, but only the L2 portion is used to compute the loss; see the masked-loss sketch at the end of this note. This ensures that every training target is conditioned on sufficient history. The sizes of T and L2 can be adjusted via the seq_len and validseqlen flags.

  • The dataset to use can be specified via the --dataset flag. For instance, running
python char_cnn_test.py --dataset ptb

would train on the PennTreebank (PTB) dataset, downloading it first if the data is not found locally.

  • Empirically, we found that Adam works better than SGD on the text8 dataset.
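To illustrate the masked loss from the first bullet above, here is a minimal PyTorch sketch (shapes and values are hypothetical, not the repository's exact code): the model produces logits for every position in T, but cross-entropy is taken only over the last validseqlen positions, so the leading L1 positions act purely as history.

import torch
import torch.nn.functional as F

seq_len, validseqlen = 400, 320       # |T| and |L2|; hypothetical values
batch, vocab = 16, 50                 # PTB's alphabet size is ~50

# Stand-in for the TCN output: one logit vector per position in T.
logits = torch.randn(batch, seq_len, vocab)        # (B, T, V)
targets = torch.randint(vocab, (batch, seq_len))   # next-character ids, (B, T)

# Score only the last L2 positions; the first L1 = seq_len - validseqlen
# positions provide context but contribute nothing to the loss.
eff_logits = logits[:, -validseqlen:, :].reshape(-1, vocab)
eff_targets = targets[:, -validseqlen:].reshape(-1)

loss = F.cross_entropy(eff_logits, eff_targets)
print(loss.item())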