Vocabulary

The Ranger edited this page Feb 11, 2017 · 1 revision

Gathers and stores information about word usage for each user. Primary tokens are lemmas or any other language-agnostic canonical tokens. Where possible, a usage count is stored along with each token. This way, a unique linguistic profile can be collected for every author.

Dictionaries and Indexes

Multiple dictionaries are used that group relevant information together:

| Dictionary | Lookup key | Description |
| --- | --- | --- |
| Dictionary | Lemma | Hash table of all lemmas found. Includes the usage count and a string list of the authors who have used the lemma. |
| Wordbook | Author | Hash table of all authors found. Includes a string list of the lemmas each author has used, along with their usage counts. |

The entities of this data model are described in a Google Protocol Buffers (Protobuf) file, and the Java classes are generated with Protobuf's protoc compiler. Because the two tables would otherwise reference each other circularly, nested model objects have been replaced with string values holding the lookup keys of the companion table.
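
The two tables and their string-key cross-references can be pictured with a minimal sketch in plain Java. This is not the generated Protobuf code: the class name `VocabularySketch`, the entry types `LemmaEntry` and `AuthorEntry`, and the `record` method are hypothetical names chosen for illustration, and ordinary `HashMap`s stand in for the real storage.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class VocabularySketch {
    /** Dictionary entry (hypothetical): total usage count plus author keys stored as strings. */
    static final class LemmaEntry {
        int usageCount;
        final Set<String> authorKeys = new LinkedHashSet<>();
    }

    /** Wordbook entry (hypothetical): lemma keys stored as strings, each with a per-author count. */
    static final class AuthorEntry {
        final Map<String, Integer> lemmaCounts = new HashMap<>();
    }

    // Dictionary: lookup key is the lemma.
    final Map<String, LemmaEntry> dictionary = new HashMap<>();
    // Wordbook: lookup key is the author.
    final Map<String, AuthorEntry> wordbook = new HashMap<>();

    /** Records one use of a lemma by an author in both tables. */
    void record(String author, String lemma) {
        LemmaEntry lemmaEntry = dictionary.computeIfAbsent(lemma, k -> new LemmaEntry());
        lemmaEntry.usageCount++;
        lemmaEntry.authorKeys.add(author);   // a key into the Wordbook, not an AuthorEntry object

        wordbook.computeIfAbsent(author, k -> new AuthorEntry())
                .lemmaCounts.merge(lemma, 1, Integer::sum);   // a key into the Dictionary
    }

    public static void main(String[] args) {
        VocabularySketch vocabulary = new VocabularySketch();
        vocabulary.record("alice", "run");
        vocabulary.record("alice", "run");
        vocabulary.record("bob", "run");

        System.out.println(vocabulary.dictionary.get("run").usageCount);
        System.out.println(vocabulary.dictionary.get("run").authorKeys);
        System.out.println(vocabulary.wordbook.get("alice").lemmaCounts);
    }
}
```

Because each entry holds only the companion table's keys, neither entry type contains the other, which is what avoids the circular reference.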

Data Storage

For quick lookup, data is stored in hash tables. Each table's lookup key is chosen to give the fastest possible lookup. To keep storage size down, data is normalized as much as possible.
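
One consequence of normalization is that details living in the companion table are fetched with a second key lookup instead of being duplicated. The sketch below illustrates this under simplifying assumptions: the Wordbook is reduced to author-to-lemma-keys and the Dictionary to lemma-to-usage-count, and the class and method names (`NormalizedLookupSketch`, `globalCountsFor`) are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NormalizedLookupSketch {
    /**
     * For every lemma key stored under the author in the Wordbook,
     * fetches the lemma's overall usage count from the Dictionary.
     */
    static Map<String, Integer> globalCountsFor(
            String author,
            Map<String, List<String>> wordbook,   // author -> lemma keys (simplified)
            Map<String, Integer> dictionary) {    // lemma -> overall usage count (simplified)
        Map<String, Integer> result = new HashMap<>();
        List<String> lemmaKeys = wordbook.getOrDefault(author, Collections.emptyList());
        for (String lemma : lemmaKeys) {
            Integer count = dictionary.get(lemma);   // second hash lookup, by key
            if (count != null) {
                result.put(lemma, count);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> wordbook = new HashMap<>();
        wordbook.put("alice", Arrays.asList("run", "walk"));

        Map<String, Integer> dictionary = new HashMap<>();
        dictionary.put("run", 3);
        dictionary.put("walk", 1);

        System.out.println(globalCountsFor("alice", wordbook, dictionary));
    }
}
```

The trade-off is one extra hash lookup per lemma in exchange for storing each lemma's details only once.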
