N-Gram-Language-Model

Language modeling based on n-gram models and smoothing techniques. This repository includes:

  • Index words
  • Store n-grams in a trie data structure (see the sketch below)
  • Efficiently extract n-grams and their frequencies
  • Compute the out-of-vocabulary (OOV) rate
  • Compute n-gram probabilities with absolute discounting with interpolation smoothing
  • Compute perplexity
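
As an illustration of the trie storage, here is a minimal Python sketch (not the repository's actual implementation; class and method names are hypothetical): each node holds one word, so a root-to-node path spells out an n-gram, and that node stores the n-gram's count.

```python
class TrieNode:
    """One node per word; a root-to-node path spells out an n-gram."""
    def __init__(self):
        self.children = {}  # word -> TrieNode
        self.count = 0      # frequency of the n-gram ending at this node


class NgramTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, ngram):
        """Insert an n-gram (a tuple of words) and increment its count."""
        node = self.root
        for word in ngram:
            node = node.children.setdefault(word, TrieNode())
        node.count += 1

    def get_count(self, ngram):
        """Return the stored frequency of an n-gram, 0 if unseen."""
        node = self.root
        for word in ngram:
            node = node.children.get(word)
            if node is None:
                return 0
        return node.count
```

Sharing prefixes this way means a bigram and all trigrams extending it reuse the same path, which keeps storage compact and makes extracting all n-grams with their frequencies a simple traversal.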

Introduction

A statistical language model is a probabilistic model that assigns a probability to a sequence of words. It can predict the next word in a sequence given a history context represented by the preceding words.

The probability that we want to model can be factorized using the chain rule as follows:

$$p(w_1^N) = \prod_{n=1}^{N} p(w_n \mid w_0^{n-1})$$

where $w_0$ is a special token denoting the start of the sentence.

In practice, we usually use so-called N-gram models, which apply a Markov assumption to limit the history context to the most recent words. Examples of N-grams are:

$$\text{unigram: } p(w_n), \qquad \text{bigram: } p(w_n \mid w_{n-1}), \qquad \text{trigram: } p(w_n \mid w_{n-2}, w_{n-1})$$
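
For instance, extracting the n-grams of a tokenized sentence (with start-of-sentence padding) can be sketched as follows; `extract_ngrams` is an illustrative helper, not necessarily the repository's API:

```python
def extract_ngrams(words, n, start="<s>"):
    """Pad the sentence with n-1 start tokens and yield all its n-grams."""
    padded = [start] * (n - 1) + list(words)
    for i in range(len(padded) - n + 1):
        yield tuple(padded[i:i + n])


# Bigrams of "the cat sat":
print(list(extract_ngrams(["the", "cat", "sat"], 2)))
# [('<s>', 'the'), ('the', 'cat'), ('cat', 'sat')]
```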

Training

Using the maximum likelihood criterion, these probabilities can be estimated from counts. For example, for the bigram model,

$$p(w \mid v) = \frac{N(v, w)}{N(v)}$$

$$N(v) = \sum_{w} N(v, w)$$

where $N(v, w)$ is the number of times the bigram $(v, w)$ occurs in the training corpus.

However, this is problematic for unseen data: the counts will be 0, making the probability 0 (or undefined when the history itself is unseen). To solve this problem, we use smoothing techniques. There are different smoothing techniques; the one used here is called absolute discounting with interpolation.
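
One common formulation of absolute discounting with interpolation for a bigram model (a sketch assuming a single fixed discount $d$ and a unigram back-off distribution; the repository's implementation may differ in detail) is:

$$p(w \mid v) = \frac{\max(N(v, w) - d,\, 0)}{N(v)} + \lambda(v)\, p(w), \qquad \lambda(v) = \frac{d \cdot N_{+}(v, \cdot)}{N(v)}$$

where $N_{+}(v, \cdot)$ is the number of distinct words observed after $v$. In code:

```python
def abs_discount_bigram(v, w, bigram_counts, unigram_counts, unigram_prob, d=0.7):
    """Absolute discounting with interpolation for p(w | v).

    Subtracts a fixed discount d from every seen bigram count and
    redistributes the freed probability mass over the unigram model.
    All argument names here are illustrative.
    """
    n_v = unigram_counts[v]              # N(v): occurrences of the history
    n_vw = bigram_counts.get((v, w), 0)  # N(v, w): bigram count
    # N+(v, .): number of distinct successors of v (precompute in practice)
    n_plus = sum(1 for (h, _) in bigram_counts if h == v)
    lam = d * n_plus / n_v               # interpolation weight lambda(v)
    return max(n_vw - d, 0) / n_v + lam * unigram_prob(w)
```

Because the discounted mass $d \cdot N_{+}(v, \cdot) / N(v)$ is exactly what $\lambda(v)$ redistributes, the smoothed probabilities still sum to 1 over the vocabulary, and unseen bigrams receive nonzero probability through the unigram term.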

Perplexity

To measure the performance of a language model, we compute the perplexity of the test corpus using the trained m-grams:

$$PP = \left[ \prod_{n=1}^{N} p(w_n \mid w_{n-m+1}^{n-1}) \right]^{-1/N} = \exp\left( -\frac{1}{N} \sum_{n=1}^{N} \log p(w_n \mid w_{n-m+1}^{n-1}) \right)$$
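
Equivalently, in code (a sketch; `ngram_prob` stands in for the trained smoothed model and `extract_ngrams` for the helper sketched above):

```python
import math

def perplexity(sentences, ngram_prob, n):
    """PP = exp(-(1/N) * sum of log p(w | history)) over the test corpus."""
    log_sum, num_words = 0.0, 0
    for words in sentences:
        for ngram in extract_ngrams(words, n):
            log_sum += math.log(ngram_prob(ngram))  # smoothed model probability
            num_words += 1
    return math.exp(-log_sum / num_words)
```

Lower perplexity means the model assigns higher probability to the test corpus, i.e. it is less "surprised" by the data.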

Results

The model was tested on the Europarl dataset (see the data directory):

Test PP with bigrams = 130.09

Test PP with trigrams = 94.82
