Updated readme

tomfran · Feb 3, 2024 · 1411998 · 1411998
1 parent 29cd732
commit 1411998

Showing 7 changed files with 96 additions and 10 deletions.
diff --git a/.github/workflows/rust.yml b/.github/workflows/rust.yml
@@ -1,4 +1,4 @@
-name: Rust
+name: CI (main)
 
 on:
   push:

diff --git a/README.md b/README.md
@@ -1,6 +1,94 @@
-# Search <img alt="Rust's Crab" width="25px" src="https://rustacean.net/assets/rustacean-flat-noshadow.png"/>
+![web-l.png](misc/web-l.png#gh-light-mode-only)
+![web-d.png](misc/web-d.png#gh-dark-mode-only)
+
+# Search-rs
+An on-disk Search Engine with boolean and free text queries and spelling correction.
+
+[![CI (main)](https://github.com/tomfran/search-rs/actions/workflows/rust.yml/badge.svg)](https://github.com/tomfran/search-rs/actions/workflows/rust.yml)
+
+**Table of contents**
+- [Architecture](#architecture)
+  - [Inverted index](#inverted-index)
+  - [Vocabulary and Documents](#vocabulary-and-documents)
+  - [Query processing](#query-processing)
+- [Commands](#commands)
+- [References](#references)
+
+
+## Architecture
+
+Here is a high-level overview of the project architecture. You can 
+find a more detailed presentation in this [Medium article](https://medium.com/itnext/building-a-search-engine-in-rust-c945b6e638f8).
+
+### Inverted index
+
+The backbone of the engine is an inverted index. The main 
+idea is to have, for each word appearing in the documents, a list
+of document IDs. 
+This allows us to quickly find documents containing a given word.
+
+More specifically, for each term we save a postings list as follows: 
+$$\text{n}\\;|\\;(\text{id}_i, f_i, [p_0, \dots, p_m]), \dots$$
+
+Where $n$ is the number of documents the term appears in, $\text{id}_i$ is the 
+ID of the $i$-th such document, $f_i$ is the term frequency in it, and $p_j$ are the positions where 
+the term appears in that document.
+
+We also store offsets for each term, allowing us to jump to the beginning of the postings list for a given term. They are stored in a separate file.
+$$\text{n}\\;|\\;o_0, \dots, o_n$$
+
+Document IDs, term positions, and offsets are strictly increasing, so they are delta-encoded, and the resulting gaps are written with [Gamma coding](https://en.wikipedia.org/wiki/Elias_gamma_coding). 
+Generic integers, such as list lengths, are written in [VByte encoding](https://nlp.stanford.edu/IR-book/html/htmledition/variable-byte-codes-1.html#:~:text=Variable%20byte%20(VB)%20encoding%20uses,gap%20and%20to%200%20otherwise.).
+
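As a rough illustration of the three encodings named above, here is a minimal Rust sketch. The function names are hypothetical and are not taken from the search-rs code base. The gamma code is returned as a string of `0`/`1` characters for readability rather than packed into bytes, and it only accepts positive integers; how the project handles zero values is not stated here.

```rust
/// Delta-encode a strictly increasing list: keep the first value,
/// then store each following value as the gap from its predecessor.
fn delta_encode(values: &[u32]) -> Vec<u32> {
    let mut prev = 0;
    values
        .iter()
        .map(|&v| {
            let gap = v - prev;
            prev = v;
            gap
        })
        .collect()
}

/// Elias gamma code of a positive integer, shown as a bit string:
/// (number of significant bits - 1) zeros, then the binary digits of `n`.
fn gamma_encode(n: u32) -> String {
    assert!(n >= 1, "Elias gamma is defined for positive integers only");
    let bits = 32 - n.leading_zeros();
    let mut out = "0".repeat((bits - 1) as usize);
    out.push_str(&format!("{:b}", n));
    out
}

/// Variable-byte code: 7 payload bits per byte, most significant group
/// first, with the high bit set only on the last byte.
fn vbyte_encode(mut n: u32) -> Vec<u8> {
    let mut bytes = vec![(n % 128) as u8 | 0x80];
    n /= 128;
    while n > 0 {
        bytes.push((n % 128) as u8);
        n /= 128;
    }
    bytes.reverse();
    bytes
}
```

For example, the document IDs `[3, 7, 8, 15]` become the gaps `[3, 4, 1, 7]`, and small gaps get short gamma codes (`1` is a single bit).
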
+### Vocabulary and Documents
+
+The vocabulary is written on disk using prefix compression. 
+The idea is to sort the terms and then write each one as the length of the prefix it shares with the previous term, followed by the remaining suffix.
+
+Here is an example with three words, listed in sorted order: 
+$$\text{waterfall}\\;\text{waterfront}\\;\text{watermelon}$$
+$$0\\;\text{waterfall}\\;6\\;\text{ront}\\;5\\;\text{melon}$$
+
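The prefix compression scheme can be sketched in a few lines of Rust. This is only an illustration under assumptions: the function names and the in-memory `(prefix length, suffix)` representation are hypothetical, and the real on-disk layout may differ. Terms are sorted first, so each term is compared with its predecessor in sorted order.

```rust
/// Sort the terms, then represent each one as
/// (length of the prefix shared with the previous term, remaining suffix).
fn prefix_compress(terms: &[&str]) -> Vec<(usize, String)> {
    let mut sorted: Vec<&str> = terms.to_vec();
    sorted.sort_unstable();

    let mut prev = "";
    let mut out = Vec::with_capacity(sorted.len());
    for term in sorted {
        // Count the leading bytes shared with the previous term.
        let mut common = prev
            .bytes()
            .zip(term.bytes())
            .take_while(|(a, b)| a == b)
            .count();
        // Back off so we never split a multi-byte UTF-8 character.
        while !term.is_char_boundary(common) {
            common -= 1;
        }
        out.push((common, term[common..].to_string()));
        prev = term;
    }
    out
}

/// Rebuild the sorted terms from the compressed entries.
fn prefix_decompress(entries: &[(usize, String)]) -> Vec<String> {
    let mut out: Vec<String> = Vec::with_capacity(entries.len());
    for (len, suffix) in entries {
        let mut term = out.last().map_or(String::new(), |p| p[..*len].to_string());
        term.push_str(suffix);
        out.push(term);
    }
    out
}
```

Decompression has to walk the list from the start, since every entry depends on the previous term.
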
+Spelling correction is applied before answering queries. Given a 
+word $w$, we use a trigram index to find the terms in the vocabulary 
+that share at least one trigram with it. 
+We then select the one with the lowest [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) and the highest frequency. 
+
+$$
+\text{lev}(a, b) = \begin{cases}
+    |a| & \text{if}\\;|b| = 0, \\
+    |b| & \text{if}\\;|a| = 0, \\
+    \text{lev}(\text{tail}(a), \text{tail}(b)) & \text{if}\\;\text{head}(a) = \text{head}(b), \\
+    1 + \text{min} \begin{cases}
+        \text{lev}(\text{tail}(a), b) \\
+        \text{lev}(a, \text{tail}(b)) \\
+        \text{lev}(\text{tail}(a), \text{tail}(b)) \\
+    \end{cases} & \text{otherwise} \\
+\end{cases}
+$$
+
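The correction step can be sketched as follows. This is a simplified illustration, not the project's implementation: the names are hypothetical, the candidate search scans a small in-memory vocabulary instead of using a real trigram index, and breaking distance ties by highest frequency is an assumption about how the two criteria combine.

```rust
use std::cmp::Reverse;
use std::collections::HashSet;

/// Levenshtein distance, computed row by row with dynamic programming.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            let best = (prev[j] + cost).min(prev[j + 1] + 1).min(curr[j] + 1);
            curr.push(best);
        }
        prev = curr;
    }
    prev[b.len()]
}

/// All character trigrams of a word.
fn trigrams(word: &str) -> HashSet<String> {
    let chars: Vec<char> = word.chars().collect();
    chars
        .windows(3)
        .map(|w| w.iter().collect::<String>())
        .collect()
}

/// Pick the vocabulary term sharing at least one trigram with `word`,
/// preferring the lowest edit distance, then the highest frequency (assumed).
fn correct<'a>(word: &str, vocabulary: &[(&'a str, u32)]) -> Option<&'a str> {
    let grams = trigrams(word);
    vocabulary
        .iter()
        .filter(|(term, _)| !trigrams(term).is_disjoint(&grams))
        .min_by_key(|(term, freq)| (levenshtein(word, term), Reverse(*freq)))
        .map(|(term, _)| *term)
}
```

The trigram filter keeps the number of edit-distance computations small, since only terms that overlap with the misspelled word are scored.
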
+Finally, document paths and lengths are stored with a similar format.
+$$\text{n}\\;|\\;p_0, l_0, \dots, p_n, l_n$$
+
+### Query processing
+
+You can query the index with boolean or free text queries. In the first case you can use the usual boolean operators to compose a query, such as: 
+$$\text{gun}\\;\text{AND}\\;\text{control}$$
+
+In the second case, you just enter a phrase and receive a ranked collection of documents matching the query, ordered by [BM25 score](https://en.wikipedia.org/wiki/Okapi_BM25). 
+
+$$\text{BM25}(D, Q) = \sum_{i = 1}^{n} \\; \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \Big (1 - b + b \cdot \frac{|D|}{\text{avgdl}} \Big )}$$
+
+$$\text{IDF}(q_i) = \ln \Bigg ( \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1 \Bigg )$$
+
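The two formulas above translate almost directly into code. The sketch below is illustrative only: the function names and the `(term frequency, document frequency)` input shape are hypothetical, and the values of $k_1$ and $b$ are left as parameters because the values used by the project are not stated here.

```rust
/// IDF(q_i) = ln((N - n(q_i) + 0.5) / (n(q_i) + 0.5) + 1)
fn idf(n_docs: usize, doc_freq: usize) -> f64 {
    let (n, df) = (n_docs as f64, doc_freq as f64);
    ((n - df + 0.5) / (df + 0.5) + 1.0).ln()
}

/// BM25 score of one document for a query.
/// `term_stats` holds, for each query term, (f(q_i, D), n(q_i)):
/// its frequency in the document and the number of documents containing it.
fn bm25(
    term_stats: &[(f64, usize)],
    doc_len: f64,
    avgdl: f64,
    n_docs: usize,
    k1: f64,
    b: f64,
) -> f64 {
    // Length normalisation: 1 - b + b * |D| / avgdl
    let norm = 1.0 - b + b * doc_len / avgdl;
    term_stats
        .iter()
        .map(|&(tf, df)| idf(n_docs, df) * tf * (k1 + 1.0) / (tf + k1 * norm))
        .sum()
}
```

A term that does not occur in the document contributes zero, and for a document of average length with a single occurrence of the term the fraction reduces to 1, so the score equals the IDF.
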
+A window score is also computed as the cardinality of 
+the user query divided by the size of the smallest window in which all query terms appear in the document. If the terms never all appear together, the window size is taken to be plus infinity, which gives a score of zero.
+
+$$\text{window}(D, Q) = \frac{|Q|}{\text{min. window}(Q, D)}$$
+
+Finally, the two scores are combined with the following formula: 
+
+$$\text{score}(D, Q) = \alpha \cdot \text{window}(D, Q) + \beta \cdot \text{BM25}(D, Q)$$
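The window score and the final combination can be sketched as below. The names are hypothetical, and the sketch makes two assumptions that are not stated above: the window size is the inclusive span of token positions, and each query term is given as its own list of positions in the document. The weights $\alpha$ and $\beta$ are left as parameters.

```rust
/// Size of the smallest span of positions containing at least one
/// occurrence of every term, or `None` if some term never occurs.
/// `positions[t]` lists the positions of query term `t` in the document.
fn min_window(positions: &[Vec<u32>]) -> Option<u32> {
    if positions.is_empty() || positions.iter().any(|p| p.is_empty()) {
        return None;
    }
    // Merge all occurrences into one list of (position, term index).
    let mut events: Vec<(u32, usize)> = positions
        .iter()
        .enumerate()
        .flat_map(|(term, ps)| ps.iter().map(move |&p| (p, term)))
        .collect();
    events.sort_unstable();

    let mut counts = vec![0usize; positions.len()];
    let (mut covered, mut left, mut best) = (0, 0, None);
    for right in 0..events.len() {
        let t = events[right].1;
        if counts[t] == 0 {
            covered += 1;
        }
        counts[t] += 1;
        // Shrink from the left while every term is still covered.
        while covered == positions.len() {
            let size = events[right].0 - events[left].0 + 1;
            best = Some(best.map_or(size, |b: u32| b.min(size)));
            let lt = events[left].1;
            counts[lt] -= 1;
            if counts[lt] == 0 {
                covered -= 1;
            }
            left += 1;
        }
    }
    best
}

/// window(D, Q) = |Q| / min window, or 0 when no window exists.
fn window_score(query_len: usize, positions: &[Vec<u32>]) -> f64 {
    match min_window(positions) {
        Some(w) => query_len as f64 / w as f64,
        None => 0.0,
    }
}

/// score(D, Q) = alpha * window(D, Q) + beta * BM25(D, Q)
fn combined_score(window: f64, bm25: f64, alpha: f64, beta: f64) -> f64 {
    alpha * window + beta * bm25
}
```

Under these assumptions the window score is 1 when the query terms appear next to each other and shrinks as they spread apart.
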
 
-Search engine written in Rust, based on an inverted index on disk.
 
 ## Commands
 
@@ -38,12 +126,10 @@ make web folder=path/to/folder
 
 You can then visit `http://0.0.0.0:3000` to find a web interface to enter free text and boolean queries.
 
-![web.png](misc%2Fweb.png)
 
 **Query Syntax**
 
-You can perform Google-like free test queries, results will 
-be ranked via [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) scoring.
+You can perform Google-like free text queries.
 
 You can also specify boolean queries with `"b: "` prefix such as: 
 ```

diff --git a/misc/web-d.png b/misc/web-d.png
diff --git a/misc/web-l.png b/misc/web-l.png
diff --git a/misc/web.png b/misc/web.png
diff --git a/server/templates/index.html b/server/templates/index.html
@@ -62,7 +62,7 @@
     <title>search-rs</title>
 </head>
 
-<body class=" dark white dark:bg-neutral-900 text-neutral-900 dark:text-white ">
+<body class=" dark white dark:bg-zinc-900 text-zinc-900 dark:text-white ">
 
     <!-- Main Content -->
     <div class="container mx-auto mt-8 max-w-6xl">
@@ -82,7 +82,7 @@
         <div class="mb-6">
             <h1 class="text-3xl font-medium mb-10">Index on {{index_path}}</h1>
             <input type="text"
-                class="outline-neutral-300 dark:outline-neutral-900 w-full p-4 rounded-md bg-neutral-100 dark:bg-neutral-800"
+                class="outline-zinc-300 dark:outline-zinc-900 w-full p-4 rounded-md bg-zinc-100 dark:bg-zinc-800"
                 placeholder="Enter your search query..." autofocus name="query" hx-post="/query"
                 hx-target=".search-results" hx-trigger="keyup[keyCode==13]">
         </div>
@@ -94,7 +94,7 @@ <h1 class="text-3xl font-medium mb-10">Index on {{index_path}}</h1>
     </div>
 
     <div id="to-top"
-        class="hidden fixed flex bottom-16 right-16 h-16 w-16 rounded-full justify-center items-center hover:cursor-pointer bg-neutral-100 hover:bg-neutral-200 dark:bg-neutral-800 dark:hover:bg-neutral-700 ">
+        class="hidden fixed flex bottom-16 right-16 h-16 w-16 rounded-full justify-center items-center hover:cursor-pointer bg-zinc-100 hover:bg-zinc-200 dark:bg-zinc-800 dark:hover:bg-zinc-700 ">
         <i class="fas fa-caret-up text-2xl"></i>
     </div>
 

diff --git a/server/templates/query.html b/server/templates/query.html
@@ -24,7 +24,7 @@ <h1 class=" font-light text-md mb-6">
 
 
     <div id="{{doc.path}}"
-        class="toggle-container hover:cursor-pointer bg-neutral-100 dark:bg-neutral-800 hover:bg-neutral-200 hover:dark:bg-neutral-700 p-6 rounded-md mb-6">
+        class="toggle-container hover:cursor-pointer bg-zinc-100 dark:bg-zinc-800 hover:bg-zinc-200 hover:dark:bg-zinc-700 p-6 rounded-md mb-6">
         <div id="{{doc.path}}_closed">
             <h2 class="text-xl font-semibold mb-4">
                 {{ doc.path }}