A Question Answering (QA) system is an automated computer system that answers queries posed by humans in natural language.
In this project, various available ranking-based Question Answering systems are reviewed, and a technique is proposed that selects the best answer from the available QA models using cosine similarity and NLP, and also answers some domain-specific questions that cannot be answered by the above systems.
Data about private organizations is available only to those organizations and is often not available in the public domain.
- To build a context-based Question-Answering System for a specific organization that finds relevant answers to queries using a corpus of information held only by the organization.
- To understand and implement how search engines like Google work: aspects of web crawling, web scraping, ranking, and finding relevant answers through a huge web of information.
- Society & Culture - e.g. what are the social views of people from different cultures on abstinence?
- Science & Mathematics - e.g. What is fringe in optics?
- Health - e.g. What the hell is restless leg syndrome?
- Education & Reference - e.g. what is the past tense of tell?
- Computers & Internet - e.g. Do You have the code for the Luxor game?
- Sports - e.g. Who is going to win the FIFA World CUP 2006 in Germany?
- Business & Finance - e.g. What is secretary as corporation director?
- Entertainment & Music - e.g. where can I download mp3 songs?
- Family & Relationships - e.g. who's ready for Christmas?
- Politics & Government - e.g. Isn't civil war and oxymoron?
We used the Yahoo! Answers topic selection dataset:
- Human-labelled dataset constructed from the 10 largest main categories
- Each class contains 140,000 training and 6,000 testing samples
From all the answers and other meta-information, only the best-answer content and the main category information were used.
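For illustration, here is a minimal loading sketch. It assumes the commonly distributed CSV layout of this dataset (class index, question title, question content, best answer); the file path and column order are assumptions, not confirmed by the project.

```python
# Hedged sketch: load the Yahoo! Answers topic dataset, keeping only the
# best-answer text and the main category. The CSV layout assumed here
# may differ from the copy used in the project.
import pandas as pd

cols = ["class_index", "question_title", "question_content", "best_answer"]
train = pd.read_csv("yahoo_answers_csv/train.csv", names=cols)

# Keep only the two fields the project uses.
data = train[["class_index", "best_answer"]].dropna()
print(data.shape)
```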
The architecture of the system comprises four modules:
- Question Classification Model
- Question Answering System
- Question Selection Web Service
- Chrome/Firefox Extension
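To make the module boundaries concrete, the following is a hypothetical sketch of the Question Selection Web Service as a minimal Flask endpoint; the route, payload format, and stub models are illustrative assumptions, not the project's actual API.

```python
# Hypothetical sketch of the Question Selection Web Service: an endpoint
# that forwards a question to each QA model and returns the best-scoring
# answer. The stub models stand in for the trained systems.
from flask import Flask, jsonify, request

app = Flask(__name__)

def model_a(question):  # stand-in for one QA model
    return "answer from model A", 0.42

def model_b(question):  # stand-in for another QA model
    return "answer from model B", 0.87

QA_MODELS = [model_a, model_b]

@app.route("/answer", methods=["POST"])
def answer():
    question = request.json["question"]
    candidates = [model(question) for model in QA_MODELS]
    best_answer, score = max(candidates, key=lambda c: c[1])
    return jsonify({"answer": best_answer, "score": score})

if __name__ == "__main__":
    app.run(port=5000)
```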
- Tokenization
- Stop-word removal
- Lemmatization with NLTK
- Measuring the Cosine Similarity
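A minimal sketch of these steps in Python with NLTK, assuming a simple bag-of-lemmas representation for the cosine similarity; the function names and toy data below are illustrative, not the project's actual code.

```python
# Minimal sketch of the pipeline above: tokenize, remove stop words,
# lemmatize with NLTK, and measure cosine similarity between the
# question and each candidate answer (bag-of-lemmas assumption).
import math
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOP = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text):
    """Tokenize, drop stop words, and lemmatize."""
    tokens = nltk.word_tokenize(text.lower())
    return [LEMMATIZER.lemmatize(t) for t in tokens
            if t.isalnum() and t not in STOP]

def cosine_similarity(a, b):
    """Cosine similarity between two texts as bags of lemmas."""
    va, vb = Counter(preprocess(a)), Counter(preprocess(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

# Illustrative use: pick the candidate answer closest to the question.
question = "What is fringe in optics?"
answers = [
    "A fringe is a band of light produced by interference in optics.",
    "The past tense of tell is told.",
]
print(max(answers, key=lambda a: cosine_similarity(question, a)))
```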
- Text Exploration
- Text Cleaning
- Obtaining POS tags, named entities, lemmas, syntactic dependency relations, and orthographic features.
- Using the obtained properties as features.
- Training a Linear SVM model on the engineered features.
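One way to realize this pipeline, sketched here under the assumption that spaCy supplies the linguistic features and scikit-learn the linear SVM; the two-question training set is a toy stand-in for the Yahoo! Answers corpus, and the exact feature engineering in the project may differ.

```python
# Hedged sketch: represent each question by its lemmas, POS tags, and
# named-entity labels (via spaCy), then classify with a linear SVM.
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

nlp = spacy.load("en_core_web_sm")

def to_features(text):
    """Flatten lemma/POS/entity-type tokens into one feature string."""
    doc = nlp(text)
    feats = []
    for tok in doc:
        feats.append(tok.lemma_)
        feats.append(tok.pos_)
        if tok.ent_type_:
            feats.append(tok.ent_type_)
    return " ".join(feats)

# Toy stand-in for the Yahoo! Answers training data.
questions = ["What is fringe in optics?",
             "Who is going to win the FIFA World Cup?"]
labels = ["Science & Mathematics", "Sports"]

clf = make_pipeline(CountVectorizer(), LinearSVC())
clf.fit([to_features(q) for q in questions], labels)
print(clf.predict([to_features("What is the past tense of tell?")]))
```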
Linear Support Vector Machine Classifier
Features used:
- Named Entity Recognition
- Lemmas
- POS Tags
Accuracy: 66.316%
BERT: Bidirectional Encoder Representations from Transformers
Transformers: Models that process words in relation to all the other words in a sentence, rather than one-by-one in order. BERT models can therefore consider the full context of a word by looking at the words that come before and after it—particularly useful for understanding the intent behind search queries.
As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once.
Therefore it is considered bidirectional, though it would be more accurate to say that it is non-directional. This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word).
Before feeding word sequences into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked words in the sequence.
In technical terms, the prediction of the output words requires three steps (sketched in code after this list):
- Adding a classification layer on top of the encoder output.
- Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension.
- Calculating the probability of each word in the vocabulary with softmax.
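A minimal PyTorch sketch of these three steps, with illustrative BERT-base dimensions; real BERT's masked-language-model head also includes a LayerNorm and an output bias, which are omitted here.

```python
# Simplified masked-word prediction head, following the three steps above.
import torch
import torch.nn as nn

hidden_size, vocab_size, seq_len = 768, 30522, 128  # BERT-base defaults

encoder_output = torch.randn(1, seq_len, hidden_size)    # [batch, seq, hidden]
embedding_matrix = torch.randn(vocab_size, hidden_size)  # tied word embeddings

transform = nn.Linear(hidden_size, hidden_size)  # 1. classification layer

h = torch.tanh(transform(encoder_output))        # 1. transform encoder output
logits = h @ embedding_matrix.T                  # 2. project to vocabulary dim
probs = torch.softmax(logits, dim=-1)            # 3. softmax over vocabulary
print(probs.shape)  # torch.Size([1, 128, 30522])
```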
In Question Answering tasks, the software receives a question regarding a text sequence and is required to mark the answer within the sequence. Using BERT, a Q&A model can be trained by learning two extra vectors that mark the beginning and the end of the answer.
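As an example, a publicly available BERT checkpoint fine-tuned on SQuAD can be queried through the Hugging Face transformers pipeline; the checkpoint named below is one such model, not necessarily the one used in this project.

```python
from transformers import pipeline

# Load a BERT model fine-tuned for extractive (span-marking) QA on SQuAD.
qa = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad",
)

result = qa(
    question="What is fringe in optics?",
    context="In optics, a fringe is a bright or dark band caused by the "
            "interference or diffraction of light.",
)
print(result["answer"], result["score"])  # the marked answer span and its score
```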
- Clone this repo so you have a copy in a folder locally.
- Open `chrome://extensions` in the location bar, or go to `Tools > Extensions`.
- Enable `Developer mode` by checking the checkbox in the upper-right corner.
- Click on the button labelled `Load unpacked extension...`.
- Select the directory where you cloned this repo to.