Commit

Updated readme
tomfran committed Feb 3, 2024
1 parent 24bdbc1 commit d6cb144
Showing 2 changed files with 7 additions and 7 deletions.
14 changes: 7 additions & 7 deletions README.md
@@ -18,7 +18,7 @@ of document IDs.
This allows us to quickly find documents containing a given word.

More specifically, for each term we save a postings list as follows:
-$$\text{n}\;|\;(\text{id}_i, f_i, [p_0, \dots, p_m]), \dots$$
+$$\text{n}\\;|\\;(\text{id}_i, f_i, [p_0, \dots, p_m]), \dots$$

Where $n$ is the number of documents the term appears in, id is the
doc id, $f$ is the frequency, and $p_j$ are the positions where
@@ -34,8 +34,8 @@ The vocabulary is written on disk using prefix compression.
The idea is to sort the terms and then write each one as the length of the prefix it shares with the previous term, followed by the remaining suffix.

Here is an example with three words:
-$$\text{watermelon}\;\text{waterfall}\;\text{waterfront}$$
-$$0\;\text{watermelon}\;5\;\text{fall}\;6\;\text{ront}$$
+$$\text{watermelon}\\;\text{waterfall}\\;\text{waterfront}$$
+$$0\\;\text{watermelon}\\;5\\;\text{fall}\\;6\\;\text{ront}$$
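
The prefix compression described above can be sketched in a few lines of Python. This is only an illustration, not the repository's implementation: the function names `encode` and `decode` are hypothetical, and the terms are kept in the order the example gives them so the output matches it.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def encode(terms: list[str]) -> list[tuple[int, str]]:
    """Store each term as (length of prefix shared with the previous term, suffix)."""
    out = []
    prev = ""
    for term in terms:
        k = common_prefix_len(prev, term)
        out.append((k, term[k:]))
        prev = term
    return out


def decode(pairs: list[tuple[int, str]]) -> list[str]:
    """Rebuild the terms by reusing the first k characters of the previous term."""
    terms = []
    prev = ""
    for k, suffix in pairs:
        term = prev[:k] + suffix
        terms.append(term)
        prev = term
    return terms


words = ["watermelon", "waterfall", "waterfront"]
encoded = encode(words)
print(encoded)  # [(0, 'watermelon'), (5, 'fall'), (6, 'ront')]
print(decode(encoded) == words)  # True
```

Sorting the terms first, as the text suggests, tends to place words with long shared prefixes next to each other, which is where the scheme saves the most space.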

Spelling correction is used before answering queries. Given a
word $w$, we use a trigram index to find terms in the vocabulary
@@ -44,8 +44,8 @@ We then select the one with the lowest [Levenshtein Distance](https://en.wikiped

$$
\text{lev}(a, b) = \begin{cases}
-|a| & \text{if}\;|b| = 0, \\
-|b| & \text{if}\;|a| = 0, \\
+|a| & \text{if}\\;|b| = 0, \\
+|b| & \text{if}\\;|a| = 0, \\
1 + \text{min} \begin{cases}
\text{lev}(\text{tail}(a), b) \\
\text{lev}(a, \text{tail}(b)) \\
@@ -57,11 +57,11 @@ $$
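
The Levenshtein recurrence in the hunk above is cut off by the diff context, so the following Python sketch follows the standard definition of the distance rather than anything confirmed by the repository. It uses the usual iterative two-row formulation instead of the recursive one, and `closest` is a hypothetical helper that assumes the candidate terms have already been retrieved (for example from the trigram index).

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def closest(word: str, candidates: list[str]) -> str:
    """Pick the candidate with the lowest edit distance to word (hypothetical helper)."""
    return min(candidates, key=lambda term: levenshtein(word, term))


print(levenshtein("kitten", "sitting"))  # 3
print(closest("waterfal", ["watermelon", "waterfall", "waterfront"]))  # waterfall
```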
### Query processing

You can query the index with boolean or free text queries. In the first case, you can use the usual boolean operators to compose a query, such as:
-$$\text{gun}\;\text{AND}\;\text{control}$$
+$$\text{gun}\\;\text{AND}\\;\text{control}$$

In the second case, you just enter a phrase and receive a ranked collection of documents matching the query, ordered by [BM25 score](https://en.wikipedia.org/wiki/Okapi_BM25).

-$$\text{BM25}(D, Q) = \sum_{i = 1}^{n} \; \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \Big (1 - b + b \cdot \frac{|D|}{\text{avgdl}} \Big )}$$
+$$\text{BM25}(D, Q) = \sum_{i = 1}^{n} \\; \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \Big (1 - b + b \cdot \frac{|D|}{\text{avgdl}} \Big )}$$

$$\text{IDF}(q_i) = \ln \Bigg ( \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1 \Bigg )$$
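
As a rough illustration of the two formulas above, here is a minimal Python sketch of the scoring step. The function and parameter names are hypothetical, the term frequencies and document frequencies are assumed to be available as plain dictionaries, and the defaults `k1=1.2` and `b=0.75` are commonly used values, not ones taken from the repository.

```python
import math


def idf(num_docs: int, doc_freq: int) -> float:
    """IDF(q_i) = ln((N - n(q_i) + 0.5) / (n(q_i) + 0.5) + 1)."""
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)


def bm25(query_terms, term_freqs, doc_len, avgdl, num_docs, doc_freqs,
         k1=1.2, b=0.75):
    """Score one document D against a query Q.

    term_freqs: term -> f(q_i, D), the frequency of the term in this document
    doc_freqs:  term -> n(q_i), the number of documents containing the term
    """
    # Length normalisation: 1 - b + b * |D| / avgdl
    norm = 1 - b + b * doc_len / avgdl
    score = 0.0
    for q in query_terms:
        f = term_freqs.get(q, 0)
        if f == 0:
            continue  # a term missing from the document contributes nothing
        score += idf(num_docs, doc_freqs[q]) * f * (k1 + 1) / (f + k1 * norm)
    return score


# Toy numbers, made up for the example.
score = bm25(
    query_terms=["gun", "control"],
    term_freqs={"gun": 2, "control": 1},
    doc_len=120,
    avgdl=100.0,
    num_docs=1000,
    doc_freqs={"gun": 50, "control": 200},
)
print(score)
```

Ranking a free text query then amounts to computing this score for every candidate document and sorting in descending order.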

Binary file modified misc/web.png
