1. Project Introduction

A simple search system that mostly written in python. Typically, search/recommend system would including several key components (1) Web Crawler (2) Indexing (3) Recall model/mechanism (4) Rank model (5) Query Web-Interface.

1.1 Web Crawler

Using selenium and beautifulsoup4 to fetch the data from https://stackoverflow.com that mostly related to database. In addition, currently only crawling the title as document for storage concerns. This reop use sqlite3 as forward-database for simplicity.

1.2 Indexing

Using nltk and gensim to do the text normalization, stemming, lemmatization and doc2vec.
Using a customized object as inverted (index) database. For real applications, we can chose leveldb, dynamodb or rockdb. The customized object is basically a k-value object with the optimization that applies tire-tree on inverted index.

index (key word) Value

sql [ (doc_id, term_frequency), ...]

database [ (doc_id, term_frequency), ...]
There are many advanced efficient index way, and for vector-database the indexing method often consisted of clustering and quantization methods.

1.3 Recall Model / Mechanism

Here using traditional bool-query. It could be further enhance with term-frequency or even a model to recall the candidates.

1.4 Ranking Model / Mechanism

Here using Doc2Vec from gensim. The pretrained weight is from https://github.com/jhlau/doc2vec.
The idea is to recommend the item base on vector similarity after recall-model.

1.5 Query Interface

Using Flask and Jinja template with bootstrap UI framework.

2. Usage Flow

install the requirement packages
install chrome driver for selenium in data/web_driver
- It should align with the config.py
python run_web_crawler.py
- It would create a forward database and storing blob in local machine.
python run_model_indexer.py
- It would run text normalization and then build the inverted-index object.
python run_query_interface.py
- It would run a flask server on localhost:5000.
Config File
- forward_database = r'data/sqldb/example.db'
- inverted_database = r'data/sqldb/inverted_key_word_db.pkl'
- Doc2Vector300D_model_weight = r"data/model/doc2vec.bin"

3. Result Comparison

Current result just using traditional bool query, and not counting lots features like voting, user persona, content of question, and content of reply answers.

This Repo	Original Website

4. Further Improvement

There are lots of things we could improve on and play with:
- Model-based recall mechanism: Personally, I also build the vectors database of title-text, however not index it yet for fast-read. There are some ways to index the dense-vector database via some clustering or quantization methods. I would like to try some of them after more understanding of them. Also, I am not sure how much improvement would it bring to the overall performance.
- Stronger Ranking model: There are many rooms to improve the ranking model, like taking extra-features into the model or using a better model.
- Distributed Storage: we definitely can shading the database, both on forward-database and inverted database.
- Crawling for more content: Due to storage limitations, I treat the title of the document as entire document. However, we could also crawling and parsing content as well.
- Advanced Indexing method: if the data is large, we may need to do some compression on index. In addition, for dense vector-based search/recommend system, we may need to using clustering/product quantization plus multi-index method to index the database. There are several famous method like Non-Orthogonal Inverted Multi-Index or IVFADC.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
doc		doc
search_eng		search_eng
.gitignore		.gitignore
LISENCE.txt		LISENCE.txt
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

1. Project Introduction

1.1 Web Crawler

1.2 Indexing

1.3 Recall Model / Mechanism

1.4 Ranking Model / Mechanism

1.5 Query Interface

2. Usage Flow

3. Result Comparison

4. Further Improvement

About

Releases

Packages

Languages

index (key word)	Value
sql	[ (doc_id, term_frequency), ...]
database	[ (doc_id, term_frequency), ...]

KentChun33333/search_eng

Folders and files

Latest commit

History

Repository files navigation

1. Project Introduction

1.1 Web Crawler

1.2 Indexing

1.3 Recall Model / Mechanism

1.4 Ranking Model / Mechanism

1.5 Query Interface

2. Usage Flow

3. Result Comparison

4. Further Improvement

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages