Skip to content

Data structures: Keyword index

Jack Dodds edited this page Sep 25, 2019 · 6 revisions

This page is based on reading code from commit 8511284 dated 2018-02-23 and files generated by it. There may errors or omissions!

Structure

The keyword index is structured as a "posting list". A posting list is a mapping of keywords to messages that contain them.

Each keyword is represented by a keyword hash, a string of 24 characters from the set(['_', '+', '0'..'9', 'a'..'z']).

Each message is represented by a Message Index ID (MSG_MID) from the Metadata Index. It is Base36 - variable length - no leading zeros.

External format

The main index is stored in the subdirectory search. An additional file kw-journal.dat is used to store the "global posting list" which contains only recent additions to the index.

Subdirectory search in turn has subdirectories with one character names from the set(['_', '+', '0'..'9', 'a'..'z']). Each subdirectory contains zero or more files. The file names are strings of one or more (typically two) characters from the same set. Each file name starts with the name of the subdirectory that contains it. File sizes are typically < 100 kB. (See configuration parameter sys.postinglist_kb.)

Examples:
~/.local/share/Mailpile/default/search/m/m
~/.local/share/Mailpile/default/search/g/g_
~/.local/share/Mailpile/default/search/r/rdb
~/.local/share/Mailpile/default/search/g/g6spxziaxht1brplfihefyyo

Each file consists of lines terminated by <LF> = 0Ah. Each line consists of two or more fields delimited by tabs <HT> = 09h.

The fields are:

  1. Keyword hash. All the keyword hash values in a file start with the file name of the keyword index file that contains it (except in the case of kw-journal.dat).
    The lines in a file are not sorted, and more than one line may have the same keyword hash.
    When lines have same keyword hash, the Message Index ID lists from all such lines are combined.
    The specified character set is safe for case-insensitive file names used e.g. in Windows.
  2. Message Index ID. Base36 - variable length - no leading zeros.
  3. (and subsequent fields, if any) Same format as field 2.
    The fields after field 1 comprise a list of emails that contain the keyword. These are not sorted.

In one Mailpile instance (which may not be typical) in which about 50,000 emails were indexed, the kw-journal.dat file contained about 5,000 lines totaling about 200 kB; the search subdirectory contained about 2,000 files totaling about 100 MB.

Internal format

The global posting list is stored internally in mailpile.postinglist.GLOBAL_GPL as a dictionary. Dictionary keys are keyword hashes, values are sets of Message Index IDs. A few example entries displayed by adding print GLOBAL_GPL to the code:

{ u's59n1weaos4rz0lzms+yj25m': set([u'12']),
u'p7mble1uaouodot3oyoejfqt': set([u'C', u'B', u'E', u'7']),
u'h1hexqphvgt4lwzqtgml4yvf': set([u'Q']),
u'nfz4kq09fs0jnnj4gb3ackhm': set([u'1Q', u'1P', u'1S', u'1R', u'1U', u'1T', u'1W', ... ]),
u'pe6sidpfbagfnucp9yxsbuek': set([u'10']),
... }

The global posting list is loaded from kw-journal.dat (after the user has authenticated) by calling mailpile.postinglist.GlobalPostingList() from mailpile.postinglist.search.MailIndex.load().

New messages are added to the global posting list by GlobalPostingList._Append(), which appends lines to kw-journal.dat and then adds their contents to items in mailpile.postinglist.GLOBAL_GPL.

The global posting list is periodically saved to kw-journal.dat by mailpile.postinglist.OldPostingList.save() called from mailpile.postinglist._Optimize()

Parts of the keyword index other than the global posting list are stored internally in the posting list cache mailpile.postinglist.PLC_CACHE as a dictionary. Structurally, each dictionary entry corresponds to a file in search; the keys corresponds to the file name while the values are tuples consisting of a time stamp and a second-level dictionary. The second-level dictionary corresponds structurally to the contents of a file in search and is similar in format to the global posting list dictionary. Example showing both levels of dictionaries:

{'x': [1520274976, {
'xgqxjxggn4c6b0s0efxk8oi3': set([u'Y', u'31', u'3D']]),
'xvecx_f8prs2xzx5nyxjgfkc': set([u'3D']]),
'xo+glm5scnsohyrpzpgpn0sq': set([u'1Q', u'1P', u'1R', ... ]]),
'xreotnzskhx9mbczywafnzuw': set([u'12']]),
... } ],
'r': [1520274885, { ... } ],
'e': [1520274885, { ... } ], ...
'l': [1520274851, { ... } ]
}

The entries in the posting list cache are loaded from external storage, or created, as and when needed.

Entries are merged incrementally into the posting list cache from the global posting list by mailpile.postinglist.GlobalPostingList._migrate() called from ... ._Optimize()

Entries in the posting list cache are flushed progressively to external storage by threads created by mailpile.postinglist.PLC_CACHE_FlushAndClean() called from ... ._Optimize().

Security

To improve speed, the keyword index is unencrypted by default. Only the hashes of keywords appear in the index, providing a degree of security even without encryption. By default, dictionary attacks are hindered by appending a (secret) fixed string to each keyword before hashing. Even with this precaution, relations and patterns of emails could conceivably be determined by analysis of the Message Index IDs that contain the same keyword. If it is desired to prevent even this type of attack, the keyword index can optionally be encrypted. (See configuration parameters prefs.encrypt_index, prefs.obfuscate_index.)

Keyword extraction

Keywords are unicode strings of two types:

  • words copied from the body and headers of a message with some processing (e.g. conversion to lower case), referred to here as literal keywords
  • strings in the form [item]:[item label] where [item label] identifies the type of information encoded in [item] (typically metadata) and ':' prevents collisions with literal keywords; known here as labeled keywords.
Labeled keywords

The following is an incomplete list of keyword labels. A message may have more than one keyword with the same label.

[date]:u - hex UTC date (days since the epoch) - there are keywords for the message time stamp (MSG_DATE in the metadata index) and for the time at which Mailpile processed the message.

Clone this wiki locally