diff --git a/.github/workflows/rust.yml b/.github/workflows/rust.yml
index b1a8444..ee5f38e 100644
--- a/.github/workflows/rust.yml
+++ b/.github/workflows/rust.yml
@@ -1,4 +1,4 @@
-name: Rust
+name: CI (main)
 
 on:
   push:
diff --git a/README.md b/README.md
index 333c87c..d3f0010 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,79 @@
-# Search Rust's Crab
+![web.png](misc%2Fweb.png)
+
+# Search-rs
+An on-disk search engine with boolean and free text queries and spelling correction.
+
+[![CI (main)](https://github.com/tomfran/search-rs/actions/workflows/rust.yml/badge.svg)](https://github.com/tomfran/search-rs/actions/workflows/rust.yml)
+
+## Architecture
+
+Here is a high-level overview of the project architecture; you can
+find a more detailed presentation in the following [Medium article](https://medium.com/itnext/building-a-search-engine-in-rust-c945b6e638f8).
+
+### Inverted index
+
+The backbone of the engine is an inverted index. The main
+idea is to have, for each word appearing in the documents, a list
+of document IDs.
+This allows us to quickly find documents containing a given word.
+
+More specifically, for each term we save a postings list as follows:
+$$n\\;|\\;(\text{id}_i, f_i, [p_0, \dots, p_m]), \dots$$
+
+Where $n$ is the number of documents the term appears in, $\text{id}_i$ is the
+ID of document $i$, $f_i$ is the term frequency in it, and $p_j$ are the positions where
+the term appears in document $i$.
+
+
+Delta encoding is used to represent document IDs, as they are strictly increasing, and likewise for the term positions. These delta-encoded integers are written with [Gamma coding](https://en.wikipedia.org/wiki/Elias_gamma_coding).
+Generic integers, such as list lengths, are written in [VByte encoding](https://nlp.stanford.edu/IR-book/html/htmledition/variable-byte-codes-1.html#:~:text=Variable%20byte%20(VB)%20encoding%20uses,gap%20and%20to%200%20otherwise.).
+
+### Vocabulary
+
+The vocabulary is written on disk using prefix compression.
+The idea is to sort the terms and then write each one as the length of the prefix it shares with the previous term, followed by the remaining suffix.
+
+Here is an example with three words:
+$$\text{waterfall}\\;\text{waterfront}\\;\text{watermelon}$$
+$$0\\;\text{waterfall}\\;6\\;\text{ront}\\;5\\;\text{melon}$$
+
+Spelling correction is used before answering queries. Given a
+word $w$, we use a trigram index to find terms in the vocabulary
+that share at least one trigram with it.
+We then select the one with the lowest [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance), defined below, where $[\cdot]$ is 1 if the condition holds and 0 otherwise.
+
+$$
+\text{lev}(a, b) = \begin{cases}
+    |a| & \text{if}\\;|b| = 0, \\
+    |b| & \text{if}\\;|a| = 0, \\
+    \text{min} \begin{cases}
+        \text{lev}(\text{tail}(a), b) + 1 \\
+        \text{lev}(a, \text{tail}(b)) + 1 \\
+        \text{lev}(\text{tail}(a), \text{tail}(b)) + [\text{head}(a) \neq \text{head}(b)] \\
+    \end{cases} & \text{otherwise} \\
+\end{cases}
+$$
+
+### Query processing
+
+You can query the index with boolean or free text queries. In the first case, you can use the usual boolean operators to compose a query, such as:
+$$\text{gun}\\;\text{AND}\\;\text{control}$$
+
+In the second case, you just enter a phrase and receive a ranked collection of documents matching the query, ordered by [BM25 score](https://en.wikipedia.org/wiki/Okapi_BM25).
+
+$$\text{BM25}(D, Q) = \sum_{i = 1}^{n} \\; \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \Big (1 - b + b \cdot \frac{|D|}{\text{avgdl}} \Big )}$$
+
+$$\text{IDF}(q_i) = \ln \Bigg ( \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1 \Bigg )$$
+
+A window score is also computed as the cardinality of the user query divided by
+the size of the smallest window in the document that contains all query terms, or by plus infinity if they do not all appear together.
+
+$$\text{window}(D, Q) = \frac{|Q|}{\text{min. window}(Q, D)}$$
+
+Finally, the two scores are combined with the following formula:
+
+$$\text{score}(D, Q) = \alpha \cdot \text{window}(D, Q) + \beta \cdot \text{BM25}(D, Q)$$
 
-Search engine written in Rust, based on an inverted index on disk.
 
 ## Commands
 
@@ -38,12 +111,10 @@ make web folder=path/to/folder
 
 You can then visit `http://0.0.0.0:3000` to find a web interface to enter free text and boolean queries.
 
-![web.png](misc%2Fweb.png)
 
 **Query Syntax**
 
-You can perform Google-like free test queries, results will
-be ranked via [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) scoring.
+You can perform Google-like free text queries.
 You can also specify boolean queries with `"b: "` prefix such as:
 
 ```
diff --git a/misc/web.png b/misc/web.png
index cfaa888..1eb1796 100644
Binary files a/misc/web.png and b/misc/web.png differ
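
The prefix compression described in the README's Vocabulary section can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this repository: the function names `compress_vocabulary` and `decompress_vocabulary` are hypothetical, and the encoded form is kept in memory as `(prefix length, suffix)` pairs rather than written to disk.

```rust
/// Front-code a vocabulary: sort the terms, then store each one as
/// (length of the prefix shared with the previous term, remaining suffix).
/// Hypothetical helper for illustration, not taken from the search-rs sources.
fn compress_vocabulary(terms: &[&str]) -> Vec<(usize, String)> {
    let mut sorted: Vec<&str> = terms.to_vec();
    sorted.sort_unstable();
    sorted.dedup();

    let mut encoded = Vec::with_capacity(sorted.len());
    let mut previous: &str = "";
    for term in sorted {
        // Count the leading characters shared with the previous term.
        let shared = previous
            .chars()
            .zip(term.chars())
            .take_while(|(a, b)| a == b)
            .count();
        // Skip characters, not bytes, so non-ASCII terms are split safely.
        let suffix: String = term.chars().skip(shared).collect();
        encoded.push((shared, suffix));
        previous = term;
    }
    encoded
}

/// Rebuild the sorted vocabulary from its front-coded form.
fn decompress_vocabulary(encoded: &[(usize, String)]) -> Vec<String> {
    let mut terms: Vec<String> = Vec::with_capacity(encoded.len());
    for (shared, suffix) in encoded {
        // Reuse the first `shared` characters of the previously decoded term.
        let mut term: String = match terms.last() {
            Some(prev) => prev.chars().take(*shared).collect(),
            None => String::new(),
        };
        term.push_str(suffix);
        terms.push(term);
    }
    terms
}
```

Applied to `watermelon`, `waterfall` and `waterfront`, the sketch sorts the terms first and yields `(0, "waterfall")`, `(6, "ront")`, `(5, "melon")`; decoding those pairs returns the sorted vocabulary.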