Ngram Dataset
The dataset backing NGRAMS is the Google Books Ngram Dataset v3, which is the largest publicly available source of ngram data. It contains word ngrams of length 1 to 5 extracted from books digitized by Google up to and including the year 2019. The dataset was released in February 2020.
At the moment NGRAMS indexes the English, German, and Russian corpora. Contact us if you need support for other languages.
The data model of the raw data is displayed in the diagram below. Basically each corpus is a set of ngrams. An ngram is a sequence of tokens. Each ngram has associated statistical data.
classDiagram
direction LR
Corpus "1" -- "1..*" Ngram : has
Ngram "1" -- "1..*" Stat : has
class Ngram {
tokens : string[]
}
class Stat {
year : int
matchCount : int
volumeCount : int
}
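As a rough illustration, the raw data model can be written down as plain Python dataclasses. This is only a sketch: the snake_case field names and the sample numbers are made up for the example and are not part of the dataset or of NGRAMS.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Stat:
    year: int
    match_count: int   # matchCount in the diagram
    volume_count: int  # volumeCount in the diagram


@dataclass
class Ngram:
    tokens: list[str]
    # The diagram requires at least one Stat per ngram ("1..*").
    stats: list[Stat] = field(default_factory=list)


@dataclass
class Corpus:
    # The diagram requires at least one Ngram per corpus ("1..*").
    ngrams: list[Ngram] = field(default_factory=list)


# Hypothetical sample data, for illustration only.
fox = Ngram(
    tokens=["quick", "brown", "fox"],
    stats=[Stat(1999, 12, 7), Stat(2000, 15, 9)],
)
corpus = Corpus(ngrams=[fox])

# Total number of matches of the ngram across all years.
total_matches = sum(s.match_count for s in fox.stats)
```

Here `total_matches` adds up the per-year `match_count` values of a single ngram, which is the kind of aggregate the extended model stores explicitly.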
Based on the model above, NGRAMS employs a more advanced model which is displayed in the diagram below. The types are also used in our REST API.
classDiagram
direction LR
Ngram --|> NgramLite : extends
Corpus "1" -- "1..*" Ngram : has
Corpus "1" -- "1" CorpusInfo : has
class CorpusInfo {
name : string
label : string
stats : CorpusStat[]
}
class CorpusStat {
numNgrams : int64
minYear : int
maxYear : int
minMatchCount : int64
maxMatchCount : int64
minTotalMatchCount : int64
maxTotalMatchCount : int64
}
class Ngram {
stats : NgramStat[]
}
class NgramLite {
id : string
abstract : bool
absTotalMatchCount : int64
relTotalMatchCount : double
tokens : NgramToken[]
}
class NgramStat {
year : int
absMatchCount : int
relMatchCount : double
}
class NgramToken {
text : string
type : NgramTokenType
inserted : bool
completed : bool
}
- `raw.property` refers to a property in the raw data model.
- `NgramLite.id` is an ID generated by NGRAMS.
- `NgramLite.abstract` is a flag marking an ngram as abstract. An abstract ngram is an ngram that has been derived from other ngrams by applying a filter operation such as case-folding or collapsing. An abstract ngram has no one-to-one correspondence to any ngram from the raw dataset and hence has no associated statistical data.
- `NgramLite.absTotalMatchCount` is the sum of all `Ngram.stats[i].absMatchCount` values.
- `NgramLite.relTotalMatchCount = NgramLite.absTotalMatchCount / totalMatchCountAllYears(corpus, n)` where `totalMatchCountAllYears(corpus, n)` returns data from `total_counts` files, e.g.
  - `totalMatchCountAllYears(eng, 1)` returns data from eng/totalcounts-1
  - `totalMatchCountAllYears(eng, 2)` returns data from eng/totalcounts-2
  - and so on
- `NgramLite.tokens[i].text = raw.Ngram.tokens[i]`
- `NgramLite.tokens[i].type` is the token's type such as `TEXT` or `TAGGED_NOUN`.
- `NgramLite.tokens[i].inserted` is a flag marking the token as inserted after application of a wildcard operator. This property is dynamically computed at runtime while processing a user query.
- `NgramLite.tokens[i].completed` is a flag marking the token as completed after application of the completion operator. This property is dynamically computed at runtime while processing a user query.
- `Ngram.stats[i].year = raw.Stat[i].year`
- `Ngram.stats[i].absMatchCount = raw.Stat[i].matchCount`
- `Ngram.stats[i].relMatchCount = raw.Stat[i].matchCount / totalMatchCount(corpus, n, year)` where `totalMatchCount(corpus, n, year)` returns data from `total_counts` files, e.g.
  - `totalMatchCount(eng, 1, year)` returns data from eng/totalcounts-1
  - `totalMatchCount(eng, 2, year)` returns data from eng/totalcounts-2
  - and so on
- `CorpusInfo.name` is the name of a corpus such as "English".
- `CorpusInfo.label` is the short name of a corpus such as "eng".
- `CorpusInfo.stats` is statistical data derived from the set of indexed ngrams.
- `CorpusStat.numNgrams` is the number of indexed ngrams.
- `CorpusStat.minYear` is the minimum of all `Ngram.stats[i].year` values.
- `CorpusStat.maxYear` is the maximum of all `Ngram.stats[i].year` values.
- `CorpusStat.minMatchCount` is the minimum of all `Ngram.stats[i].absMatchCount` values.
- `CorpusStat.maxMatchCount` is the maximum of all `Ngram.stats[i].absMatchCount` values.
- `CorpusStat.minTotalMatchCount` is the minimum of all `NgramLite.absTotalMatchCount` values.
- `CorpusStat.maxTotalMatchCount` is the maximum of all `NgramLite.absTotalMatchCount` values.
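
The derived properties above can be illustrated with a small Python sketch. It is not the NGRAMS implementation: the function names, the dictionary standing in for the `total_counts` files, and all numbers are assumptions made for the example. The sketch reads `relTotalMatchCount` as the total absolute match count divided by the all-years total of the corpus.

```python
from dataclasses import dataclass


@dataclass
class RawStat:
    year: int
    match_count: int  # raw.Stat.matchCount


@dataclass
class NgramStat:
    year: int
    abs_match_count: int    # absMatchCount
    rel_match_count: float  # relMatchCount


def derive_ngram_stats(raw_stats, total_by_year):
    """Map raw per-year stats to NgramStat, dividing by the yearly total."""
    return [
        NgramStat(s.year, s.match_count, s.match_count / total_by_year[s.year])
        for s in raw_stats
    ]


def derive_totals(stats, total_all_years):
    """Return (absTotalMatchCount, relTotalMatchCount) for one ngram."""
    abs_total = sum(s.abs_match_count for s in stats)
    return abs_total, abs_total / total_all_years


def derive_corpus_stat(ngrams):
    """ngrams is a list of NgramStat lists, one list per indexed ngram."""
    all_stats = [s for stats in ngrams for s in stats]
    totals = [sum(s.abs_match_count for s in stats) for stats in ngrams]
    return {
        "numNgrams": len(ngrams),
        "minYear": min(s.year for s in all_stats),
        "maxYear": max(s.year for s in all_stats),
        "minMatchCount": min(s.abs_match_count for s in all_stats),
        "maxMatchCount": max(s.abs_match_count for s in all_stats),
        "minTotalMatchCount": min(totals),
        "maxTotalMatchCount": max(totals),
    }


# Hypothetical stand-ins for totalMatchCount(corpus, n, year) and
# totalMatchCountAllYears(corpus, n); the numbers are made up.
total_by_year = {2000: 1000, 2001: 2000}
total_all_years = 3000

fox = derive_ngram_stats([RawStat(2000, 10), RawStat(2001, 30)], total_by_year)
dog = derive_ngram_stats([RawStat(2001, 4)], total_by_year)

abs_total, rel_total = derive_totals(fox, total_all_years)
corpus_stat = derive_corpus_stat([fox, dog])
```

Passing the totals in as arguments keeps the sketch independent of how the `total_counts` files are actually read, which this document does not specify.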
There are three types of ngrams in the raw dataset.
1. With terms only, e.g. `the quick brown fox`
2. With part-of-speech tagged terms, e.g. `the quick brown fox_NOUN`
3. With standalone part-of-speech tags, e.g. `the quick brown _NOUN_`
Ngrams of type 2 can have multiple tagged terms, but because of the combinatorial explosion Google did not tag 4- and 5-grams this way. So in fact, the 4-gram `the quick brown fox_NOUN` does not exist in the dataset, but the 3-gram `quick brown fox_NOUN` does.
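
A minimal Python sketch of how the three ngram types could be told apart from the token text alone is shown below. The tag set is an illustrative assumption and may not match the dataset's full tag inventory, and the rule that any standalone tag makes an ngram type 3 is likewise an assumption of the sketch.

```python
# Illustrative tag set; an assumption, not the dataset's definitive inventory.
POS_TAGS = {"NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP", "NUM", "CONJ", "PRT"}


def token_kind(token: str) -> str:
    """Classify one token as 'standalone_tag', 'tagged_term', or 'term'."""
    # Standalone tag: a known tag wrapped in underscores, e.g. _NOUN_
    if token.startswith("_") and token.endswith("_") and token[1:-1] in POS_TAGS:
        return "standalone_tag"
    # Tagged term: a non-empty term followed by _TAG, e.g. fox_NOUN
    term, sep, tag = token.rpartition("_")
    if sep and term and tag in POS_TAGS:
        return "tagged_term"
    return "term"


def ngram_type(tokens: list[str]) -> int:
    """Return 1, 2, or 3 following the order in which the types are listed."""
    kinds = {token_kind(t) for t in tokens}
    if "standalone_tag" in kinds:
        return 3
    if "tagged_term" in kinds:
        return 2
    return 1


plain = ngram_type("the quick brown fox".split())
tagged = ngram_type("quick brown fox_NOUN".split())
standalone = ngram_type("the quick brown _NOUN_".split())
```

With these definitions, `plain`, `tagged`, and `standalone` evaluate to 1, 2, and 3 respectively.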
NGRAMS has its own custom-made NoSQL system tailored for indexing and storing ngram data. Because the data is static, the system is heavily optimized for rapid read-only access.
The index contains ngrams of types 1 and 2 (see Ngram Types) with complete statistical data as shown in Data Model. It does not contain ngrams of type 3 because the goal of NGRAMS' query language is to replace wildcards with actual words, not with standalone tags.
The following table gives an overview of the number of ngrams that have been indexed.
| Corpus | #1grams | #2grams | #3grams | #4grams | #5grams | total |
|---|---|---|---|---|---|---|
| English | 76.9 M | 1.6 B | 11.8 B | 5.1 B | 5.0 B | 23.6 B |
| German | 38.8 M | 686.9 M | 2.8 B | 699.1 M | 409.2 M | 4.7 B |
| Russian | 12.8 M | 313.0 M | 973.9 M | 181.0 M | 97.5 M | 1.6 B |