
Ranking-Based-Question-Answering-System

A Question Answering (QA) system is an automated computer system that answers questions posed by humans in natural language.

Introduction

In this project, several available ranking-based Question Answering systems are reviewed, and a technique is proposed that selects the best answer from the available QA models using cosine similarity and NLP. The system also answers some domain-specific questions that those systems cannot answer.

Objectives - A Hybrid Search System

Data about private organizations is available only to those organizations and is often not in the public domain.

  1. To build a context-based Question Answering system for a specific organization that finds relevant answers to queries using a corpus of information available only within the organization.

  2. To understand and implement how search engines such as Google work: web crawling, web scraping, ranking, and finding relevant answers across a huge web of information.

Architecture

Architecture

Classes for classification

  1. Society & Culture - e.g. what are the social views of people from different cultures on abstinence?
  2. Science & Mathematics - e.g. What is fringe in optics?
  3. Health - e.g. What the hell is restless leg syndrome?
  4. Education & Reference - e.g. what is the past tense of tell?
  5. Computers & Internet - e.g. Do You have the code for the Luxor game?
  6. Sports - e.g. Who is going to win the FIFA World CUP 2006 in Germany?
  7. Business & Finance - e.g. What is secretary as corporation director?
  8. Entertainment & Music - e.g. where can I download mp3 songs?
  9. Family & Relationships - e.g. who's ready for Christmas?
  10. Politics & Government - e.g. Isn't civil war and oxymoron?

Dataset

We used the Yahoo! Answers topic classification dataset:

  • Human-labelled dataset constructed from the 10 largest main categories
  • Each class contains 140,000 training samples and 6,000 test samples

From all answers and other meta information, only the best answer content and the main category information were used.
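As a rough illustration, the records can be loaded with pandas. The sketch below assumes a CSV layout of (class index, question title, question content, best answer); the file path and column names are placeholders, not the project's actual layout.

```python
# Minimal loading sketch, assuming train.csv with columns
# (class index, question title, question content, best answer).
import pandas as pd

cols = ["class_index", "question_title", "question_content", "best_answer"]
train = pd.read_csv("yahoo_answers_csv/train.csv", names=cols)

# Keep only the best answer content and the main category, as described above.
data = train[["best_answer", "class_index"]].dropna()
```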

Proposed Technique

The architecture of the system comprises four modules:

  1. Question Classification Model
  2. Question Answering System
  3. Question Selection Web Service
  4. Chrome/Firefox Extension

Implementation Details

  1. Tokenization
  2. Stop words removal
  3. Lemmatizing with NLTK
  4. Measuring the Cosine Similarity
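A minimal sketch of these four steps, assuming NLTK (with the punkt, stopwords and wordnet data downloaded) and scikit-learn; the function names and the TF-IDF representation are illustrative choices, not the project's exact implementation.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    # 1. Tokenization, 2. stop-word removal, 3. lemmatization with NLTK
    tokens = nltk.word_tokenize(text.lower())
    return " ".join(lemmatizer.lemmatize(t) for t in tokens
                    if t.isalnum() and t not in stop_words)

def rank_answers(question, candidate_answers):
    # 4. Cosine similarity between the question and each candidate answer
    docs = [preprocess(question)] + [preprocess(a) for a in candidate_answers]
    tfidf = TfidfVectorizer().fit_transform(docs)
    scores = cosine_similarity(tfidf[0:1], tfidf[1:]).ravel()
    return sorted(zip(candidate_answers, scores), key=lambda p: -p[1])
```

The highest-scoring candidate is then taken as the best answer among the outputs of the available QA models.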

Question Classification - Approach

  1. Text Exploration
  2. Text Cleaning
  3. Obtaining POS tags, named entities, lemmas, syntactic dependency relations and orthographic features.
  4. Using the obtained properties as features.
  5. Training a Linear SVM model on the engineered features (a sketch follows the Model section below).

Model

Linear Support Vector Machine Classifier

Features used:

  • Named Entity Recognition
  • Lemmas
  • POS Tags

Accuracy: 66.316%
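A minimal sketch of such a classifier, assuming spaCy (en_core_web_sm) and scikit-learn. Representing each question as a bag of lemma, POS-tag and entity-label tokens fed through TF-IDF is one illustrative way to use the listed features, not the project's exact feature engineering.

```python
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

nlp = spacy.load("en_core_web_sm")

def to_features(question):
    # Turn a raw question into lemma, POS-tag and named-entity-label tokens.
    doc = nlp(question)
    feats = [tok.lemma_.lower() for tok in doc if not tok.is_stop]
    feats += [tok.pos_ for tok in doc]
    feats += [ent.label_ for ent in doc.ents]
    return " ".join(feats)

def train_classifier(questions, labels):
    # questions: raw question strings; labels: their category indices (0-9)
    clf = make_pipeline(TfidfVectorizer(), LinearSVC())
    clf.fit([to_features(q) for q in questions], labels)
    return clf
```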

Implementation

Question_Search

Context-Based Classification - BERT

BERT: Bidirectional Encoder Representations from Transformers

Transformers: Models that process words in relation to all the other words in a sentence, rather than one-by-one in order. BERT models can therefore consider the full context of a word by looking at the words that come before and after it—particularly useful for understanding the intent behind search queries.

How does BERT work?

As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once.

Therefore it is considered bidirectional, though it would be more accurate to say that it is non-directional. This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word).

Masked LM (MLM)

Before feeding word sequences into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked, words in the sequence.

In technical terms, the prediction of the output words requires:

  1. Adding a classification layer on top of the encoder output.
  2. Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension.
  3. Calculating the probability of each word in the vocabulary with softmax.

MLM
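The three steps above can be sketched in PyTorch as follows; the dimensions, the GELU transform and the tying with the input embedding matrix follow BERT-base conventions, but this is an illustration, not the actual BERT implementation.

```python
import torch
import torch.nn as nn

HIDDEN_SIZE, VOCAB_SIZE = 768, 30522  # typical BERT-base dimensions

class MLMHead(nn.Module):
    def __init__(self, embedding_matrix):
        super().__init__()
        # 1. Classification (transform) layer on top of the encoder output
        self.transform = nn.Sequential(nn.Linear(HIDDEN_SIZE, HIDDEN_SIZE), nn.GELU())
        # 2. Output vectors are multiplied by the (tied) embedding matrix
        self.embedding_matrix = embedding_matrix  # shape: (VOCAB_SIZE, HIDDEN_SIZE)

    def forward(self, encoder_output):             # (batch, seq_len, HIDDEN_SIZE)
        h = self.transform(encoder_output)
        logits = h @ self.embedding_matrix.T        # (batch, seq_len, VOCAB_SIZE)
        # 3. Softmax turns the logits into a probability over the vocabulary
        return torch.softmax(logits, dim=-1)
```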

Fine-tuning BERT for Q&A Task

In Question Answering tasks, the system receives a question about a text sequence and is required to mark the answer within that sequence. Using BERT, a Q&A model can be trained by learning two extra vectors that mark the beginning and the end of the answer.

BERT Input Format

BERT_Input_Format

Start Token Classifier

Start_Token_Classifier

End Token Classifier

End_Token_Classifier
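A minimal end-to-end sketch of span prediction, assuming the Hugging Face transformers library and a publicly available SQuAD-fine-tuned BERT checkpoint; the model name below is an assumption, not necessarily the checkpoint used in this project.

```python
import torch
from transformers import BertTokenizer, BertForQuestionAnswering

NAME = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = BertTokenizer.from_pretrained(NAME)
model = BertForQuestionAnswering.from_pretrained(NAME)

def answer(question, context):
    inputs = tokenizer(question, context, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # The two extra learned vectors yield per-token start and end scores.
    start = int(torch.argmax(outputs.start_logits))
    end = int(torch.argmax(outputs.end_logits)) + 1
    return tokenizer.decode(inputs["input_ids"][0][start:end],
                            skip_special_tokens=True)
```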

Multilingual Application

Multilingual_App

Installing the Chrome Extension from Source

  1. Clone this repo so you have a copy in a folder locally.
  2. Open chrome://extensions in the location bar, or go to Tools > Extensions.
  3. Enable Developer mode by checking the checkbox in the upper-right corner.
  4. Click on the button labelled Load unpacked extension....
  5. Select the directory where you cloned this repo to.
