This library provides tools for finding differing opinions in the scientific literature on a user query.
The Kaggle notebook can be found here.
Briefly:
- It loads all articles into an SQLite DB.
- Sentences are pre-processed.
- Word2vec and TF-IDF are trained.
- Sentences are vectorised.
- The query is pre-processed and vectorised.
- The distance between query and sentences is computed.
- The top-k sentences are kept.
- Clustering is applied to these sentences.
- Sentences are ranked by their proximity to the cluster centroid and by the authors of the papers.
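The retrieval and ranking steps above can be sketched in a few lines. This is a minimal, standard-library-only illustration and not the code of the `c19` package: all function names (`tokenize`, `tfidf_vectors`, `cosine`, `top_k`, `rank_by_centroid`) are hypothetical, plain TF-IDF stands in for the combined Word2vec and TF-IDF embedding, and the clustering step is skipped by treating the retained sentences as a single cluster. Ranking by authors is not covered.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens only (toy pre-processing)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(token_lists):
    """Return one sparse TF-IDF vector (dict) per sentence, plus the IDF table."""
    n = len(token_lists)
    doc_freq = Counter(tok for tokens in token_lists for tok in set(tokens))
    # Smoothed IDF so that no weight is zero or negative.
    idf = {tok: math.log((1 + n) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}
    vectors = [
        {tok: tf * idf[tok] for tok, tf in Counter(tokens).items()}
        for tokens in token_lists
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, sentences, k=3):
    """Vectorise the query and keep the k sentences most similar to it."""
    vectors, idf = tfidf_vectors([tokenize(s) for s in sentences])
    # Query tokens unseen in the corpus carry no weight and are dropped.
    query_counts = Counter(t for t in tokenize(query) if t in idf)
    query_vec = {t: tf * idf[t] for t, tf in query_counts.items()}
    scored = [(cosine(query_vec, v), s) for v, s in zip(vectors, sentences)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def rank_by_centroid(sentences):
    """Rank sentences of one cluster by similarity to the cluster centroid."""
    vectors, _ = tfidf_vectors([tokenize(s) for s in sentences])
    total = Counter()
    for vec in vectors:
        for tok, weight in vec.items():
            total[tok] += weight
    centroid = {tok: weight / len(vectors) for tok, weight in total.items()}
    ranked = [(cosine(v, centroid), s) for v, s in zip(vectors, sentences)]
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked


# Toy corpus, invented for illustration only.
corpus = [
    "Masks reduce the transmission of the virus.",
    "Masks do not reduce transmission in open air.",
    "The vaccine trial enrolled adult volunteers.",
    "Bats are a natural reservoir of coronaviruses.",
]
kept = top_k("Do masks reduce transmission?", corpus, k=2)
for score, sentence in rank_by_centroid([s for _, s in kept]):
    print(f"{score:.3f}  {sentence}")
```

With real data, the top-k sentences would first be split into several clusters, and the centroid ranking would then be applied within each cluster so that each one can surface a distinct opinion.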
To install the library, simply use:
pip install -q git+https://github.com/MrMimic/covid-19-kaggle
The library can then be imported with:
from c19 import parameters, database_utilities, text_preprocessing, embedding, query_matching, clusterise_sentences, plot_clusters, display_output
Please use this script to create the local database.
Please use this one to query the trained DB.
This script re-trains the W2V and TF-IDF models in order to re-generate the parquet file.
All queries from the Kaggle challenge have been reformulated here. They have then been processed with the tool presented here.
Results are visible on Kaggle.