Updated readme

tomfran · Feb 3, 2024 · 2ff8d47
1 parent 29cd732

Showing 3 changed files with 77 additions and 6 deletions.
diff --git a/.github/workflows/rust.yml b/.github/workflows/rust.yml
@@ -1,4 +1,4 @@
-name: Rust
+name: CI (main)
 
 on:
   push:

diff --git a/README.md b/README.md
@@ -1,6 +1,79 @@
-# Search <img alt="Rust's Crab" width="25px" src="https://rustacean.net/assets/rustacean-flat-noshadow.png"/>
+![web.png](misc%2Fweb.png)
+
+# Search-rs
+An on-disk Search Engine with boolean and free text queries and spelling correction.
+
+[![CI (main)](https://github.com/tomfran/search-rs/actions/workflows/rust.yml/badge.svg)](https://github.com/tomfran/search-rs/actions/workflows/rust.yml)
+
+## Architecture
+
+Here is a high-level overview of the project architecture. You can
+find a more detailed presentation in this [Medium article](https://medium.com/itnext/building-a-search-engine-in-rust-c945b6e638f8).
+
+### Inverted index
+
+The backbone of the engine is an inverted index. The main 
+idea is to have, for each word appearing in the documents, a list
+of document IDs. 
+This allows us to quickly find documents containing a given word.
+
+More specifically, for each term we save a postings list as follows:
+$$\text{n}\\;|\\;(\text{id}_i, f_i, [p_0, \dots, p_m]), \dots$$
+
+Where $n$ is the number of documents the term appears in, $\text{id}_i$ is the
+ID of the $i$-th such document, $f_i$ is the term frequency in that document,
+and $p_j$ are the positions where the term appears in it.
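
As a rough illustration of this layout, the sketch below builds in-memory postings lists from a handful of documents. The `Posting` struct, the `build_index` function, whitespace tokenization, and lowercasing are all assumptions made for the example, not the project's actual types. Here $n$ corresponds to the length of each list and $f$ to `positions.len()`.

```rust
use std::collections::BTreeMap;

/// One entry of a postings list: a document and the positions of the term in it.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
struct Posting {
    doc_id: u32,
    positions: Vec<u32>,
}

/// Build term -> postings list, assuming documents are numbered in input order
/// and tokens are separated by whitespace.
fn build_index(docs: &[&str]) -> BTreeMap<String, Vec<Posting>> {
    let mut index: BTreeMap<String, Vec<Posting>> = BTreeMap::new();
    for (doc_id, text) in docs.iter().enumerate() {
        let doc_id = doc_id as u32;
        for (pos, word) in text.split_whitespace().enumerate() {
            let postings = index.entry(word.to_lowercase()).or_default();
            // Documents are visited in order, so a term's current document
            // can only be the last posting in its list.
            let same_doc = postings.last().map_or(false, |p| p.doc_id == doc_id);
            if !same_doc {
                postings.push(Posting { doc_id, positions: Vec::new() });
            }
            if let Some(p) = postings.last_mut() {
                p.positions.push(pos as u32);
            }
        }
    }
    index
}
```

Because documents are processed in increasing ID order, both the document IDs in each list and the positions in each posting come out sorted.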
+
+
+Document IDs are stored with delta encoding, since they are strictly increasing; the same goes for the term positions within a document. All of these gaps are written with [Gamma coding](https://en.wikipedia.org/wiki/Elias_gamma_coding).
+Generic integers, such as list lengths, are written in [VByte encoding](https://nlp.stanford.edu/IR-book/html/htmledition/variable-byte-codes-1.html#:~:text=Variable%20byte%20(VB)%20encoding%20uses,gap%20and%20to%200%20otherwise.).
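
The three encodings can be sketched as follows. This is a minimal illustration, not the project's bit-level writer: the function names are hypothetical, the gamma code is returned as a string of `0`/`1` characters for readability, values passed to `gamma_encode` are assumed to be at least 1, and the VByte layout follows the convention of the linked IR book (the high bit is set on the last byte of each number).

```rust
/// Replace each value of a strictly increasing sequence with its gap from
/// the previous value (the first value is kept as-is).
fn delta_encode(sorted: &[u32]) -> Vec<u32> {
    let mut prev = 0;
    sorted
        .iter()
        .map(|&v| {
            let gap = v - prev;
            prev = v;
            gap
        })
        .collect()
}

/// Elias gamma code of `n` (n >= 1): (bits - 1) zeros, then `n` in binary.
fn gamma_encode(n: u32) -> String {
    assert!(n >= 1, "gamma coding is defined for positive integers");
    let bits = 32 - n.leading_zeros(); // number of significant bits in n
    let mut out = "0".repeat((bits - 1) as usize); // unary length prefix
    out.push_str(&format!("{:b}", n)); // binary digits of n
    out
}

/// Variable-byte code: 7 payload bits per byte, most significant group first,
/// high bit set only on the final byte.
fn vbyte_encode(mut n: u32) -> Vec<u8> {
    let mut bytes = vec![(n & 0x7F) as u8 | 0x80]; // last byte carries the stop bit
    n >>= 7;
    while n > 0 {
        bytes.push((n & 0x7F) as u8);
        n >>= 7;
    }
    bytes.reverse();
    bytes
}
```

For instance, the document IDs `[3, 7, 8, 15]` become the gaps `[3, 4, 1, 7]`, and small gaps get short gamma codes (`1` is a single bit, `4` takes five).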
+
+### Vocabulary
+
+The vocabulary is written on disk using prefix compression. 
+The idea is to sort the terms and then write each one as the length of the prefix it shares with the previous term, followed by the remaining suffix.
+
+Here is an example with three words: 
+$$\text{waterfall}\\;\text{waterfront}\\;\text{watermelon}$$
+$$0\\;\text{waterfall}\\;6\\;\text{ront}\\;5\\;\text{melon}$$
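
A minimal sketch of this scheme, assuming the terms are already sorted and kept in memory as pairs of (shared prefix length in bytes, suffix). The function names are hypothetical and the on-disk layout used by the project is not reproduced here.

```rust
/// Length in bytes of the longest common prefix, always on a char boundary.
fn common_prefix_len(a: &str, b: &str) -> usize {
    a.chars()
        .zip(b.chars())
        .take_while(|(x, y)| x == y)
        .map(|(x, _)| x.len_utf8())
        .sum()
}

/// Encode each term as (length of prefix shared with the previous term, suffix).
fn prefix_compress(sorted_terms: &[&str]) -> Vec<(usize, String)> {
    let mut prev = "";
    let mut out = Vec::with_capacity(sorted_terms.len());
    for &term in sorted_terms {
        let k = common_prefix_len(prev, term);
        out.push((k, term[k..].to_string()));
        prev = term;
    }
    out
}

/// Rebuild the terms by reusing the first `k` bytes of the previous term.
fn prefix_decompress(entries: &[(usize, String)]) -> Vec<String> {
    let mut terms: Vec<String> = Vec::with_capacity(entries.len());
    for (k, suffix) in entries {
        let prefix = terms.last().map_or("", |p| &p[..*k]);
        let term = format!("{}{}", prefix, suffix);
        terms.push(term);
    }
    terms
}
```

With the sorted input `["waterfall", "waterfront", "watermelon"]`, the encoder produces `(0, "waterfall")`, `(6, "ront")`, `(5, "melon")`, and decoding restores the original list.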
+
+Spelling correction is applied before answering queries. Given a
+word $w$, we use a trigram index to find the terms in the vocabulary
+that share at least one trigram with it.
+We then select the one with the lowest [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) from $w$.
+
+$$
+\text{lev}(a, b) = \begin{cases}
+    |a| & \text{if}\\;|b| = 0, \\
+    |b| & \text{if}\\;|a| = 0, \\
+    \text{lev}(\text{tail}(a), \text{tail}(b)) & \text{if}\\;\text{head}(a) = \text{head}(b), \\
+    1 + \text{min} \begin{cases}
+        \text{lev}(\text{tail}(a), b) \\
+        \text{lev}(a, \text{tail}(b)) \\
+        \text{lev}(\text{tail}(a), \text{tail}(b)) \\
+    \end{cases} & \text{otherwise} \\
+\end{cases}
+$$
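
The correction step could look roughly like the sketch below. It is an illustration under several assumptions: the names `trigrams`, `levenshtein`, and `correct` are hypothetical, the trigram index is rebuilt on every call for brevity, ties are broken alphabetically, and the distance is computed with the iterative two-row dynamic programming form of standard Levenshtein distance (matching first characters cost nothing) rather than the recursive definition.

```rust
use std::collections::{HashMap, HashSet};

/// All distinct character trigrams of a word.
fn trigrams(word: &str) -> HashSet<String> {
    let chars: Vec<char> = word.chars().collect();
    chars
        .windows(3)
        .map(|w| w.iter().collect::<String>())
        .collect()
}

/// Levenshtein distance, keeping only the previous row of the DP table.
fn levenshtein(a: &str, b: &str) -> usize {
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut curr = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            let best = (prev[j] + cost) // substitute or match
                .min(prev[j + 1] + 1) // delete from a
                .min(curr[j] + 1); // insert into a
            curr.push(best);
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Closest vocabulary term among those sharing at least one trigram with `word`.
fn correct<'a>(word: &str, vocabulary: &[&'a str]) -> Option<&'a str> {
    // Trigram index: trigram -> terms containing it.
    let mut index: HashMap<String, Vec<&'a str>> = HashMap::new();
    for &term in vocabulary {
        for t in trigrams(term) {
            index.entry(t).or_default().push(term);
        }
    }
    let candidates: HashSet<&'a str> = trigrams(word)
        .iter()
        .filter_map(|t| index.get(t))
        .flatten()
        .copied()
        .collect();
    candidates
        .into_iter()
        .min_by_key(|term| (levenshtein(word, term), *term))
}
```

A misspelling such as `watrefall` still shares the trigrams `wat`, `fal`, and `all` with `waterfall`, so that term is retrieved as a candidate and wins with a distance of 2. A word sharing no trigram with any term yields `None`.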
+
+### Query processing
+
+You can query the index with boolean or free text queries. In the first case, you can use the usual boolean operators to compose a query, such as:
+$$\text{gun}\\;\text{AND}\\;\text{control}$$
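
One common way to evaluate an AND between two terms is to intersect their sorted lists of document IDs with a linear merge. The sketch below shows that idea only; the `intersect` function is hypothetical and the project may evaluate boolean queries differently.

```rust
use std::cmp::Ordering;

/// Intersect two sorted, duplicate-free lists of document IDs.
fn intersect(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,    // a[i] cannot be in b, skip it
            Ordering::Greater => j += 1, // b[j] cannot be in a, skip it
            Ordering::Equal => {
                out.push(a[i]);
                i += 1;
                j += 1;
            }
        }
    }
    out
}
```

If `gun` appears in documents `[1, 3, 5, 8]` and `control` in `[3, 4, 8, 9]`, the query matches documents `[3, 8]`.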
+
+In the second case, you just enter a phrase and receive a ranked collection of documents matching the query, ordered by [BM25 score](https://en.wikipedia.org/wiki/Okapi_BM25). 
+
+$$\text{BM25}(D, Q) = \sum_{i = 1}^{n} \\; \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \Big (1 - b + b \cdot \frac{|D|}{\text{avgdl}} \Big )}$$
+
+$$\text{IDF}(q_i) = \ln \Bigg ( \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1 \Bigg )$$
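
The two formulas translate almost directly into code. In the sketch below, $f(q_i, D)$ is the term frequency in the document, $|D|$ the document length, $\text{avgdl}$ the average document length, $N$ the number of documents, and $n(q_i)$ the number of documents containing the term. The function signatures are hypothetical, and the values $k_1 = 1.2$ and $b = 0.75$ used in the usage note are common defaults, not necessarily the ones the project uses.

```rust
/// Inverse document frequency of a term found in `doc_freq` of `n_docs` documents.
fn idf(n_docs: usize, doc_freq: usize) -> f64 {
    let n = n_docs as f64;
    let df = doc_freq as f64;
    ((n - df + 0.5) / (df + 0.5) + 1.0).ln()
}

/// BM25 score of one document for a query.
/// `terms` holds, for each query term, (frequency in the document, document frequency).
fn bm25(
    terms: &[(usize, usize)],
    doc_len: usize,
    avgdl: f64,
    n_docs: usize,
    k1: f64,
    b: f64,
) -> f64 {
    // Length normalization factor, shared by all query terms.
    let norm = 1.0 - b + b * doc_len as f64 / avgdl;
    terms
        .iter()
        .map(|&(tf, df)| {
            let tf = tf as f64;
            idf(n_docs, df) * tf * (k1 + 1.0) / (tf + k1 * norm)
        })
        .sum()
}
```

For example, `bm25(&[(1, 5)], 100, 100.0, 10, 1.2, 0.75)` scores a document of average length in which the single query term occurs once. The fraction then reduces to 1 and the score equals the term's IDF, $\ln 2$. A term that does not occur in the document contributes zero.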
+
+A window score is also computed as the cardinality of
+the user query divided by the size of the smallest window in which all query terms appear in the document. The window size is taken to be $+\infty$ if the terms do not all appear together, which makes the score zero.
+
+$$\text{window}(D, Q) = \frac{|Q|}{\text{min. window}(Q, D)}$$
+
+Finally, the two scores are combined as a weighted sum with coefficients $\alpha$ and $\beta$:
+
+$$\text{score}(D, Q) = \alpha \cdot \text{window}(D, Q) + \beta \cdot \text{BM25}(D, Q)$$
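
The window score and the final combination can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, each query term comes with its list of positions in the document, the window size counts the positions it spans inclusively, and a missing window is mapped to a score of zero (the limit of dividing by $+\infty$). The minimum window is found with a sliding window over all occurrences sorted by position.

```rust
/// Size of the smallest span of positions containing every query term,
/// or `None` if some term never occurs (or there are no terms).
fn min_window(positions: &[Vec<u32>]) -> Option<u32> {
    // Flatten into (position, term index) pairs sorted by position.
    let mut events: Vec<(u32, usize)> = positions
        .iter()
        .enumerate()
        .flat_map(|(t, ps)| ps.iter().map(move |&p| (p, t)))
        .collect();
    events.sort_unstable();

    let mut counts = vec![0usize; positions.len()];
    let mut covered = 0; // number of distinct terms inside the current window
    let mut left = 0;
    let mut best: Option<u32> = None;
    for right in 0..events.len() {
        let t = events[right].1;
        if counts[t] == 0 {
            covered += 1;
        }
        counts[t] += 1;
        // Shrink from the left while the window still covers every term.
        while covered == positions.len() {
            let size = events[right].0 - events[left].0 + 1;
            best = Some(best.map_or(size, |b| b.min(size)));
            let lt = events[left].1;
            counts[lt] -= 1;
            if counts[lt] == 0 {
                covered -= 1;
            }
            left += 1;
        }
    }
    best
}

/// |Q| divided by the minimum window size, or 0 when no window exists.
fn window_score(query_len: usize, positions: &[Vec<u32>]) -> f64 {
    match min_window(positions) {
        Some(w) => query_len as f64 / w as f64,
        None => 0.0,
    }
}

/// Weighted sum of the window and BM25 scores.
fn combined_score(window: f64, bm25: f64, alpha: f64, beta: f64) -> f64 {
    alpha * window + beta * bm25
}
```

With one term at positions `[1, 10]` and another at `[4, 11]`, the smallest window spans positions 10 to 11, so a two-term query gets a window score of `2 / 2 = 1.0`, the maximum under this definition.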
 
-Search engine written in Rust, based on an inverted index on disk.
 
 ## Commands
 
@@ -38,12 +111,10 @@ make web folder=path/to/folder
 
 You can then visit `http://0.0.0.0:3000` to find a web interface to enter free text and boolean queries.
 
-![web.png](misc%2Fweb.png)
 
 **Query Syntax**
 
-You can perform Google-like free test queries, results will 
-be ranked via [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) scoring.
+You can perform Google-like free text queries.
 
 You can also specify boolean queries with `"b: "` prefix such as: 
 ```

diff --git a/misc/web.png b/misc/web.png