Vocabulary

Gathers and stores word usage information for each user. Primary tokens are lemmas or other language-agnostic canonical tokens. Where possible, a usage count is stored with each token. This way, a unique linguistic profile can be collected for every author.

Dictionaries and Indexes

Multiple dictionaries are used, each grouping related information together:

Dictionary	Lookup key	Description
Dictionary	Lemma	Hash table of all lemmas found. Each entry includes the usage count and a string list of the authors who have used the lemma.
Wordbook	Author	Hash table of all authors found. Each entry includes a string list of the lemmas the author has used, along with their usage counts.

The entities of this data model are described in a Google Protobuf file. Java classes are generated with Protobuf's protoc command. To avoid circular references, nested model objects have been replaced with string values holding the lookup keys of the companion table.
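
The two companion tables can be illustrated with a minimal Java sketch. It is not the generated Protobuf code: the names VocabularySketch, LemmaEntry, AuthorEntry and record are hypothetical and chosen only for illustration. It shows how each table refers to the other through plain string keys rather than nested objects.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: all class and method names are illustrative,
// not the project's generated Protobuf classes.
class VocabularySketch {

    // Entry in the Dictionary table, keyed by lemma.
    static final class LemmaEntry {
        int usageCount;
        // Authors are stored as lookup keys into the Wordbook, not as nested objects.
        final Set<String> authors = new TreeSet<>();
    }

    // Entry in the Wordbook table, keyed by author.
    static final class AuthorEntry {
        // Lemma (a lookup key into the Dictionary) mapped to this author's usage count.
        final Map<String, Integer> lemmaCounts = new HashMap<>();
    }

    final Map<String, LemmaEntry> dictionary = new HashMap<>();
    final Map<String, AuthorEntry> wordbook = new HashMap<>();

    // Records one use of a lemma by an author, updating both tables.
    void record(String author, String lemma) {
        LemmaEntry lemmaEntry = dictionary.computeIfAbsent(lemma, k -> new LemmaEntry());
        lemmaEntry.usageCount++;
        lemmaEntry.authors.add(author);

        AuthorEntry authorEntry = wordbook.computeIfAbsent(author, k -> new AuthorEntry());
        authorEntry.lemmaCounts.merge(lemma, 1, Integer::sum);
    }

    public static void main(String[] args) {
        VocabularySketch vocabulary = new VocabularySketch();
        vocabulary.record("alice", "run");
        vocabulary.record("alice", "run");
        vocabulary.record("bob", "run");
        System.out.println(vocabulary.dictionary.get("run").usageCount);
        System.out.println(vocabulary.wordbook.get("alice").lemmaCounts);
    }
}
```

Because each entry holds only the key of its counterpart, neither class contains the other, so the circular reference between lemmas and authors never appears in the object graph.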

Data Storage

For quick lookup, data is stored in hash tables. The lookup key of each table is chosen to make lookups as fast as possible. To reduce storage size, data is normalized as much as possible.
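
As a rough illustration of how key-based lookup works across normalized tables, the sketch below finds the authors who share at least one lemma with a given author. The table shapes are simplified to plain maps, and the names SharedVocabularyLookup and authorsSharingLemmas are assumptions made for this example, not part of the project.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: simplified table shapes and invented names.
class SharedVocabularyLookup {

    // wordbook:     author -> (lemma -> usage count)
    // lemmaAuthors: lemma  -> authors who used it (keys into the wordbook)
    static Set<String> authorsSharingLemmas(
            String author,
            Map<String, Map<String, Integer>> wordbook,
            Map<String, Set<String>> lemmaAuthors) {
        Set<String> result = new TreeSet<>();
        // One hash lookup by author key yields the lemmas that author used.
        Map<String, Integer> lemmas = wordbook.getOrDefault(author, Map.of());
        for (String lemma : lemmas.keySet()) {
            // One hash lookup per lemma key yields the authors who used it.
            result.addAll(lemmaAuthors.getOrDefault(lemma, Set.of()));
        }
        // The author trivially shares lemmas with themselves, so drop them.
        result.remove(author);
        return result;
    }
}
```

Each step is a single hash lookup by key. Author and lemma details live in exactly one table and are referenced elsewhere only by their string keys, which is the trade-off the normalization aims for.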
