ngramModelTrainer

Learns an n-gram language model given a corpus. The corpus should be a text file with a single word per line and no inter-word spaces.

The learned quantities are:

  • Probabilities of unigrams, p(g_i)
  • Probabilities of bigrams, p(g_i | g_{i-1})
  • Probabilities of trigrams, p(g_i | g_{i-1}, g_{i-2})
  • Probabilities of quadgrams, p(g_i | g_{i-1}, g_{i-2}, g_{i-3})
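
As a minimal sketch of what this estimation involves, assuming plain relative-frequency counting (the function and file names below are illustrative, not taken from the repository):

    from collections import Counter

    def ngram_probabilities(words, n):
        """Estimate p(g_i | g_{i-n+1}, ..., g_{i-1}) by relative frequency."""
        ngrams = Counter()
        contexts = Counter()
        for word in words:
            for i in range(len(word) - n + 1):
                gram = word[i:i + n]
                ngrams[gram] += 1          # count the full n-gram
                contexts[gram[:-1]] += 1   # count its (n-1)-gram context
        return {g: count / contexts[g[:-1]] for g, count in ngrams.items()}

    # Corpus format: one word per line, no inter-word spaces.
    with open('corpus.txt') as f:          # hypothetical corpus file
        words = [line.strip() for line in f]

    bigrams = ngram_probabilities(words, 2)   # bigrams['ab'] estimates p('b' | 'a')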

Testing and running

Test the script by running it with no arguments:

python3 ngramModelTrainer

Use the -h flag for details on how to use the tool with proper input:

python3 ngramModelTrainer -h

There are a few example inputs in fixtures/.

The output is saved as four MATLAB matrices:

  • unigrams: u(i) stands for p(i).
  • bigrams: b(i, j) stands for p(j | i).
  • trigrams: t(i, j, k) stands for p(k | j, i).
  • quadgrams (tetragrams): q(i, j, k, l) stands for p(l | k, j, i).
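
The saved matrices can be read back with scipy.io.loadmat. A minimal sketch, assuming the output file is named model.mat and the MATLAB variables carry the names listed above (check the -h output for the actual filename):

    import scipy.io

    model = scipy.io.loadmat('model.mat')   # hypothetical output filename
    u = model['unigrams']    # u[i]          = p(i)
    b = model['bigrams']     # b[i, j]       = p(j | i)
    t = model['trigrams']    # t[i, j, k]    = p(k | j, i)
    q = model['quadgrams']   # q[i, j, k, l] = p(l | k, j, i)

    print(b[0, 1])           # e.g. p('b' | 'a') with the default alphabet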

Alphabet

An alphabet of acceptable unigrams must be defined. By default, an alphabet of 36 possible letters/digits is used. These are held in a Python list called 'alphabet', in the following order:

  • Positions 0-25: lowercase Latin letters, in standard alphabetical order.
  • Positions 26-35: Digits 0-9.
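
Reconstructed from the description above, the default alphabet amounts to the following (the actual list in the source may be written differently):

    import string

    # 36 symbols: 'a'-'z' at positions 0-25, digits '0'-'9' at positions 26-35.
    alphabet = list(string.ascii_lowercase) + list(string.digits)
    assert len(alphabet) == 36
    assert alphabet[0] == 'a' and alphabet[26] == '0'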

'Alternative' alphabets

Non-'standard' versions of the above alphabet may also be used. These include:

  • dutta_extended: a number of extra characters, notably encodings of the characters and punctuation found in the George Washington handwritten document set.
  • sophia: polytonic Greek characters.
  • dummy: a limited testing set of 3 characters.
