From 9ffb0a4ecd65464bbfd1a31e353f022d0563580b Mon Sep 17 00:00:00 2001
From: Teo Orthlieb
Date: Thu, 15 Sep 2022 14:37:27 +0200
Subject: [PATCH] Update README.md

---
 README.md | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 750654b..68899d3 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,26 @@
 # Fast-BM25
-a fast implementation of BM25
+A fast implementation of [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) in Python.
+BM25 is a simple and fast ranking function for search engines operating on words (tokens).
+It does not handle misspellings well, so use it only in contexts where that is not a problem.
+
+The base BM25 implementation is from [dorianbrown/rank_bm25](https://github.com/dorianbrown/rank_bm25/blob/master/rank_bm25.py).
+
+## How to use
+Initialize BM25 by passing it a corpus, i.e. an iterable of tokenized documents (each document is a list of strings).
+```py
+from fast_bm25 import BM25
+
+# Load your corpus
+corpus = ...
+
+bm25 = BM25(corpus)
+results = bm25.get_top_n(["largest", "city", "in", "Japan"], corpus)
+```
+*This is not a Python package; copy the file into your project if you want to use it.*
+
+## Principle
+In a text corpus, the most common words (the, a, an, ...) are often the least informative.
+By cutting them from the query and only searching documents that contain at least one word of the query,
+BM25 gains a lot of speed while losing very little precision.
+This trade-off is controlled by the parameter `alpha`: a higher alpha means more speed and more words cut off.
+At $\alpha = -\infty$ the algorithm is equivalent to regular BM25.
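
The idea described under "Principle" in the patched README can be pictured with a minimal sketch. This is not the code from `fast_bm25`: the class name `TinyFastBM25`, the inverted index, and the reading of `alpha` as a threshold on a word's IDF are assumptions made for illustration. Only the two ideas themselves come from the README: drop uninformative query words, and score only documents that share at least one word with the query.

```python
import math
from collections import Counter, defaultdict


class TinyFastBM25:
    """Illustrative sketch only; NOT the fast_bm25 implementation."""

    def __init__(self, corpus, k1=1.5, b=0.75, alpha=0.0):
        self.corpus = [list(doc) for doc in corpus]
        self.k1, self.b, self.alpha = k1, b, alpha
        self.doc_len = [len(doc) for doc in self.corpus]
        self.avgdl = sum(self.doc_len) / max(len(self.corpus), 1)

        # Inverted index: word -> {doc_id: term frequency}.
        self.index = defaultdict(dict)
        for doc_id, doc in enumerate(self.corpus):
            for word, freq in Counter(doc).items():
                self.index[word][doc_id] = freq

        # Okapi-style IDF: low (even negative) for very common words.
        n = len(self.corpus)
        self.idf = {
            word: math.log((n - len(postings) + 0.5) / (len(postings) + 0.5))
            for word, postings in self.index.items()
        }

    def get_top_n(self, query, n=5):
        scores = defaultdict(float)
        for word in query:
            idf = self.idf.get(word)
            # Assumption: alpha is an IDF cut-off, so common words are skipped.
            if idf is None or idf < self.alpha:
                continue
            # Only documents that contain the word are ever touched.
            for doc_id, freq in self.index[word].items():
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / self.avgdl
                scores[doc_id] += idf * freq * (self.k1 + 1) / (freq + self.k1 * norm)
        best = sorted(scores, key=scores.get, reverse=True)[:n]
        return [self.corpus[doc_id] for doc_id in best]


corpus = [
    ["tokyo", "is", "the", "largest", "city", "in", "japan"],
    ["paris", "is", "the", "capital", "of", "france"],
    ["the", "river", "is", "long"],
    ["osaka", "is", "a", "city", "in", "japan"],
]
fast = TinyFastBM25(corpus, alpha=0.0)
print(fast.get_top_n(["the", "largest", "city", "in", "japan"], n=1))
```

Under this reading, setting `alpha=float("-inf")` disables the cut-off, so every query word is scored, which mirrors the README's remark that the algorithm then matches regular BM25.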