Ngram Dataset

Martin Trenkmann edited this page Aug 13, 2023 · 12 revisions

The dataset backing NGRAMS is the Google Books Ngram Dataset v3, which is the largest publicly available source of ngram data. It contains word ngrams of length 1 to 5 extracted from books digitized by Google up to and including the year 2019. The dataset was released in February 2020.

At the moment NGRAMS indexes the English, German, and Russian corpora. Contact us if you need support for other languages.

Data Model

Raw Data

The diagram below shows the data model of the raw data. Each corpus is a set of ngrams, each ngram is a sequence of tokens, and each ngram has associated statistical data, with one record per year.

classDiagram
direction LR

Corpus "1" -- "1..*" Ngram : has
Ngram "1" -- "1..*" Stat : has

class Ngram {
    tokens : string[]
}
class Stat {
    year : int
    matchCount : int
    volumeCount : int
}
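
As an illustration of the raw model, the following minimal sketch parses one line of a raw data file into the types from the diagram. It assumes the tab-separated v3 layout, in which the ngram is followed by one `year,match_count,volume_count` record per year. The names `RawNgram` and `parse_raw_line` are hypothetical and not part of NGRAMS.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stat:
    year: int
    match_count: int
    volume_count: int


@dataclass(frozen=True)
class RawNgram:
    tokens: List[str]
    stats: List[Stat]


def parse_raw_line(line: str) -> RawNgram:
    """Parse one raw line into an ngram with its per-year statistics.

    Assumed layout (not confirmed by this page):
    "<token> <token> ...\t<year>,<matches>,<volumes>\t<year>,<matches>,<volumes>..."
    """
    ngram_text, *records = line.rstrip("\n").split("\t")
    # Each record holds three comma-separated integers.
    stats = [Stat(*(int(field) for field in record.split(","))) for record in records]
    return RawNgram(tokens=ngram_text.split(" "), stats=stats)


example = parse_raw_line("quick brown fox\t1999,3,2\t2000,5,4\n")
```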

NGRAMS

Building on the raw model above, NGRAMS uses an extended model, shown in the diagram below. The same types are used in our REST API.

classDiagram
direction LR

Ngram --|> NgramLite : extends
Corpus "1" -- "1..*" Ngram : has
Corpus "1" -- "1" CorpusInfo : has

class CorpusInfo {
    name : string
    label : string
    stats : CorpusStat[]
}
class CorpusStat {
    numNgrams : int64
    minYear : int
    maxYear : int
    minMatchCount : int64
    maxMatchCount : int64
    minTotalMatchCount : int64
    maxTotalMatchCount : int64
}
class Ngram {
    stats : NgramStat[]
}
class NgramLite {
    id : string
    abstract : bool
    absTotalMatchCount : int64
    relTotalMatchCount : double
    tokens : NgramToken[]
}
class NgramStat {
    year : int
    absMatchCount : int
    relMatchCount : double
}
class NgramToken {
    text : string
    type : NgramTokenType
    inserted : bool
    completed : bool
}

NgramLite

raw.property refers to a property in the raw data model.

  • NgramLite.id is an ID generated by NGRAMS.
  • NgramLite.abstract is a flag marking an ngram as abstract. An abstract ngram is an ngram that has been derived from other ngrams by applying a filter operation such as case-folding or collapsing. An abstract ngram has no one-to-one correspondence to any ngram in the raw dataset and hence has no associated statistical data.
  • NgramLite.absTotalMatchCount is the sum of all Ngram.stats[i].absMatchCount values.
  • NgramLite.relTotalMatchCount = NgramLite.absTotalMatchCount / totalMatchCountAllYears(corpus, n) where totalMatchCountAllYears(corpus, n) returns data from the total_counts files.
  • NgramLite.tokens[i].text = raw.Ngram.tokens[i]
  • NgramLite.tokens[i].type is the token's type such as TEXT or TAGGED_NOUN.
  • NgramLite.tokens[i].inserted is a flag marking the token as inserted after application of a wildcard operator. This property is dynamically computed at runtime while processing a user query.
  • NgramLite.tokens[i].completed is a flag marking the token as completed after application of the completion operator. This property is dynamically computed at runtime while processing a user query.
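
The two total counts can be sketched as follows. This is a minimal illustration, assuming that the relative total is the absolute total divided by the all-years total of the corpus for ngrams of the same length. The parameter `total_all_years` is a hypothetical stand-in for the value returned by totalMatchCountAllYears(corpus, n).

```python
from typing import Iterable


def abs_total_match_count(abs_match_counts: Iterable[int]) -> int:
    """Sum of the per-year absMatchCount values of one ngram."""
    return sum(abs_match_counts)


def rel_total_match_count(abs_match_counts: Iterable[int], total_all_years: int) -> float:
    """Share of the ngram among all n-gram matches of the corpus.

    total_all_years is assumed to come from the total_counts files.
    """
    return abs_total_match_count(abs_match_counts) / total_all_years


per_year_counts = [3, 5, 2]  # made-up absMatchCount values
abs_total = abs_total_match_count(per_year_counts)
rel_total = rel_total_match_count(per_year_counts, total_all_years=40)
```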

Ngram

  • Ngram.stats[i].year = raw.Stat[i].year
  • Ngram.stats[i].absMatchCount = raw.Stat[i].matchCount
  • Ngram.stats[i].relMatchCount = raw.Stat[i].matchCount / totalMatchCount(corpus, n, year) where totalMatchCount(corpus, n, year) returns data from the total_counts files.
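
The mapping from raw statistics to Ngram.stats can be sketched like this. It is a hedged example, not the NGRAMS implementation: raw records are modelled as `(year, match_count, volume_count)` tuples, and the dictionary `total_match_count` is a hypothetical stand-in for totalMatchCount(corpus, n, year), keyed by year.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class NgramStat:
    year: int
    abs_match_count: int
    rel_match_count: float


def to_ngram_stats(
    raw_stats: Iterable[Tuple[int, int, int]],
    total_match_count: Dict[int, int],
) -> List[NgramStat]:
    """Convert raw (year, match_count, volume_count) records to NgramStat.

    The volume count is dropped because the NGRAMS model does not expose it.
    """
    return [
        NgramStat(year, matches, matches / total_match_count[year])
        for year, matches, _volumes in raw_stats
    ]


# Made-up numbers for illustration only.
stats = to_ngram_stats([(1999, 3, 2), (2000, 5, 4)], {1999: 12, 2000: 20})
```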

CorpusInfo

  • CorpusInfo.name is the name of a corpus such as "English".
  • CorpusInfo.label is the short name of a corpus such as "eng".
  • CorpusInfo.stats is statistical data derived from the set of indexed ngrams.

CorpusStat

  • CorpusStat.numNgrams is the number of indexed ngrams.
  • CorpusStat.minYear is the minimum of all Ngram.stats[i].year values.
  • CorpusStat.maxYear is the maximum of all Ngram.stats[i].year values.
  • CorpusStat.minMatchCount is the minimum of all Ngram.stats[i].absMatchCount values.
  • CorpusStat.maxMatchCount is the maximum of all Ngram.stats[i].absMatchCount values.
  • CorpusStat.minTotalMatchCount is the minimum of all NgramLite.absTotalMatchCount values.
  • CorpusStat.maxTotalMatchCount is the maximum of all NgramLite.absTotalMatchCount values.
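
The definitions above amount to a handful of minimum and maximum aggregations. The sketch below computes them for a small in-memory collection, where each ngram is represented only by its list of `(year, absMatchCount)` pairs. The function name `compute_corpus_stat` is hypothetical, and a real index of this size would have to aggregate in a streaming fashion rather than collect everything in lists.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class CorpusStat:
    num_ngrams: int
    min_year: int
    max_year: int
    min_match_count: int
    max_match_count: int
    min_total_match_count: int
    max_total_match_count: int


def compute_corpus_stat(ngrams: Iterable[List[Tuple[int, int]]]) -> CorpusStat:
    """Aggregate corpus statistics; raises ValueError if there is no data."""
    years: List[int] = []
    counts: List[int] = []
    totals: List[int] = []
    for stats in ngrams:
        years.extend(year for year, _ in stats)
        counts.extend(count for _, count in stats)
        # Per-ngram total, i.e. absTotalMatchCount.
        totals.append(sum(count for _, count in stats))
    return CorpusStat(
        num_ngrams=len(totals),
        min_year=min(years),
        max_year=max(years),
        min_match_count=min(counts),
        max_match_count=max(counts),
        min_total_match_count=min(totals),
        max_total_match_count=max(totals),
    )


# Two made-up ngrams with per-year (year, absMatchCount) pairs.
stat = compute_corpus_stat([[(1990, 3), (2000, 5)], [(1985, 10)]])
```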

Ngram Types

There are three types of ngrams in the raw dataset.

  1. With terms only, e.g. the quick brown fox
  2. With part-of-speech tagged terms, e.g. the quick brown fox_NOUN
  3. With standalone part-of-speech tags, e.g. the quick brown _NOUN_

Ngrams of type 2 can have multiple tagged terms, but because of the combinatorial explosion Google did not tag 4- and 5-grams this way. So in fact, the 4-gram the quick brown fox_NOUN does not exist in the dataset, but the 3-gram quick brown fox_NOUN does.
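
To make the three types concrete, here is a minimal classifier sketch. The tag list is an assumption (ten common part-of-speech tags) and may not cover every tag in the dataset. Giving type 3 precedence when an ngram contains both a standalone tag and a tagged term is also a choice made for this example, not something stated by the dataset.

```python
import re
from typing import List

# Assumed tag set; the raw dataset may contain further tags.
POS_TAGS = "NOUN|VERB|ADJ|ADV|PRON|DET|ADP|NUM|CONJ|PRT"
STANDALONE_TAG = re.compile(rf"_({POS_TAGS})_")  # e.g. _NOUN_
TAGGED_TERM = re.compile(rf".+_({POS_TAGS})")    # e.g. fox_NOUN


def ngram_type(tokens: List[str]) -> int:
    """Return 1, 2, or 3 following the numbering of the list above."""
    if any(STANDALONE_TAG.fullmatch(token) for token in tokens):
        return 3
    if any(TAGGED_TERM.fullmatch(token) for token in tokens):
        return 2
    return 1


types = [
    ngram_type("the quick brown fox".split()),
    ngram_type("the quick brown fox_NOUN".split()),
    ngram_type("the quick brown _NOUN_".split()),
]
```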

NGRAMS Index

NGRAMS has its own custom-made NoSQL system tailored for indexing and storing ngram data. Due to the static nature of the data, things have been heavily optimized for rapid read-only access.

The index contains ngrams of types 1 and 2 (see Ngram Types) with complete statistical data as shown in Data Model. It does not contain ngrams of type 3 because the goal of NGRAMS' query language is to replace wildcards with actual words, not with standalone tags.

The following table gives an overview of the number of ngrams that have been indexed (M = million, B = billion).

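| Corpus  | #1grams | #2grams | #3grams | #4grams | #5grams | total  |
|---------|---------|---------|---------|---------|---------|--------|
| English | 76.9 M  | 1.6 B   | 11.8 B  | 5.1 B   | 5.0 B   | 23.6 B |
| German  | 38.8 M  | 686.9 M | 2.8 B   | 699.1 M | 409.2 M | 4.7 B  |
| Russian | 12.8 M  | 313.0 M | 973.9 M | 181.0 M | 97.5 M  | 1.6 B  |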