Concept Search for Exploratory Data Analysis

This repository contains a Python package, a web server, and a web front-end that suggest terms related to the words being queried in a document store.

Planning/Subtasks

Make list of test queries (single term vs multiple terms/phrase queries)
- Papers etc. about Enron data: http://enrondata.org/content/research/
Implement query expansion algorithms
Merge/cluster results from query expansion algorithms
- Use word2vec word vectors to cluster/map term suggestions
Visualize merged query expansion results
- Normalize scores of suggested words, so that aggregating scores makes more sense
- Focus on how many methods suggested a word instead of the scores assigned
Build simple user interface/demo
- Make maximum number of terms suggested a parameter
Do the same for terms that do not correlate/co-occur with query terms (does that make sense?)
How to evaluate/validate results?
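
The two ideas under "Visualize merged query expansion results" (normalizing scores and counting how many methods suggested a word) can be sketched as follows. This is only an illustration under assumptions: the function names `normalize` and `count_votes` are hypothetical and not part of the package, and scaling by the maximum score is one possible normalization among several.

```python
def normalize(suggestions):
    """Scale one method's scores so that its largest score becomes 1.0.

    `suggestions` is a {term: score} dictionary. Max-scaling is an
    assumption made for this sketch, not the package's method.
    """
    top = max(suggestions.values(), default=0.0)
    if top <= 0:
        # Nothing to scale (empty set or no positive scores).
        return dict(suggestions)
    return {term: score / top for term, score in suggestions.items()}


def count_votes(suggestion_sets):
    """Count how many methods suggested each term, ignoring the scores."""
    votes = {}
    for suggestions in suggestion_sets:
        for term in suggestions:
            votes[term] = votes.get(term, 0) + 1
    return votes
```

Normalizing each method's output before aggregation keeps a method with large raw scores from dominating the result, while the vote count sidesteps score scales entirely.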

Installation

Several Python packages have to be installed for the various components of this repository.

pandas Flask Flask-cors nltk vincent elasticsearch sklearn scipy gensim

termsuggester

The termsuggester Python package provides a pipeline that combines several term-search methods to find suggestions for a term. The user configures and instantiates the term-search methods, and the pipeline aggregates their suggestions using an aggregation method that the user selects.

Current term-search methods:

ELSearch: Find suggestions using ElasticSearch significant terms aggregation from a Document Corpus.
WNSearch: Use WordNet to find suggestions for a term
PrecomputedClusterSuggester: Finds suggestions using a precomputed term clustering data set stored in ElasticSearch. The data set is computed with the non-negative matrix factorization (NMF) clustering method.
Word2VecSuggester: Use word2vec to find most similar terms
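
To illustrate the idea behind ELSearch, here is a minimal sketch of a request body for the ElasticSearch significant terms aggregation, plus a helper that turns the response buckets into a suggestion dictionary. The field name `text`, the aggregation name `suggestions`, and both function names are assumptions made for this example; the actual ELSearch class may build its query differently.

```python
def significant_terms_query(query_word, field="text", size=10):
    """Build a search body asking for terms that are significant
    in the documents matching `query_word`.

    `field` is a hypothetical name for the field holding the text.
    """
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "query": {"match": {field: query_word}},
        "aggregations": {
            "suggestions": {
                "significant_terms": {"field": field, "size": size}
            }
        },
    }


def buckets_to_suggestions(response, agg_name="suggestions"):
    """Convert the aggregation buckets into a {term: score} dictionary."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: float(bucket["score"]) for bucket in buckets}
```

The body would be sent with the search call of the Elasticsearch client against the index holding the corpus. Depending on the Elasticsearch version and the index mapping, the chosen field may need to be configured so that it can be aggregated on.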

Current methods for aggregation of results from various term-search methods:

Sum
Average
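
As a rough illustration of what these two aggregations could compute over suggestion sets, consider the sketch below. The function names are hypothetical, and the averaging rule (dividing by the number of methods that suggested the term, rather than by the total number of methods) is an assumption; the package's Aggregation implementation may differ.

```python
def aggregate_sum(suggestion_sets):
    """Add up the weights each term received across all methods."""
    totals = {}
    for suggestions in suggestion_sets:
        for term, weight in suggestions.items():
            totals[term] = totals.get(term, 0.0) + weight
    return totals


def aggregate_average(suggestion_sets):
    """Average each term's weight over the methods that suggested it.

    Assumption: methods that did not suggest a term are left out of
    the denominator instead of counting as a weight of zero.
    """
    totals = aggregate_sum(suggestion_sets)
    counts = {}
    for suggestions in suggestion_sets:
        for term in suggestions:
            counts[term] = counts.get(term, 0) + 1
    return {term: totals[term] / counts[term] for term in totals}
```

With the sum, terms proposed by many methods rise to the top; with this form of average, a term proposed by a single method keeps its original weight.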

To add a new term-search method, create a class that has a suggest_terms(query_word) method; this is the only requirement. The method must return a suggestion set, which is a Python dictionary of the form {str: float, str: float, ...}, where each str is a suggested term and each float is the weight of that suggestion (how relevant it is).
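
A minimal sketch of such a class is shown below. `PrefixSuggester`, its vocabulary argument, and its weighting rule are all invented for illustration; only the suggest_terms(query_word) method and the {str: float} return value come from the requirement described above.

```python
class PrefixSuggester(object):
    """Hypothetical term-search method: suggests vocabulary terms
    that start with the query word."""

    def __init__(self, vocabulary):
        # `vocabulary` is any iterable of candidate terms (an assumption
        # of this sketch; real methods typically query a data store).
        self.vocabulary = list(vocabulary)

    def suggest_terms(self, query_word):
        """Return a {term: weight} dictionary for `query_word`.

        The weight is the fraction of the suggested term that the
        query word covers, so closer matches score higher.
        """
        if not query_word:
            return {}
        return {
            term: float(len(query_word)) / len(term)
            for term in self.vocabulary
            if term != query_word and term.startswith(query_word)
        }
```

An instance of a class like this could then be included alongside the built-in methods in the collection of methods passed to the aggregator.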

The term-search methods may use other applications such as ElasticSearch. The package assumes that such applications have been properly set up, for example that the required ElasticSearch indexes have been created.

Method set up

The ELSearch method requires running get_dc.py and dc_to_es.py before using termsuggester.
The WNSearch method does not require setup.
The PrecomputedClusterSuggester method requires running fit_nmf.py and nmf_to_es.py before using termsuggester. To get NMF word clusters for suggestions, first install scikit-learn from its git repository:

pip install -U git+https://github.com/scikit-learn/scikit-learn.git

Then fit the clusters (try n_clusters=500 and alpha=1):

python fit_nmf.py <n_clusters> <alpha> nmf_output.json

Then store the result in Elasticsearch:

python nmf_to_es.py nmf_output.json

The resulting index can then be used by the PrecomputedClusterSuggester.
The Word2VecSuggester method requires running train_word2vec.py before using it.

Example of usage (after various methods setup)

from TermSuggestionsAggregator import TermSuggestionsAggregator, Aggregation
from elsearch import ELSearch
from wnsearch import WNSearch
from precomputed import PrecomputedClusterSuggester

methods = (WNSearch(), ELSearch(), PrecomputedClusterSuggester())
ts = TermSuggestionsAggregator()
d = ts.getSuggestions('car', methods, Aggregation.Average)
print(d)

webserver

webdemo

Have Elasticsearch running. Elasticsearch must have an enron index containing the enron emails (and optionally precomputed term suggestions).
Have the webserver running: python webserver/webTermSuggester.py.
Run the webdemo:

cd webdemo
gulp serve

Data

Get the data (IPython notebook)

More ideas (for next sprint)

Ontologies

A review of ontology based query expansion
IBM Watson Concept Expansion
- The Concept Expansion process is also known as Semantic Lexicon Induction or Semantic Set Expansion.
- Probably a more 'ontology based' approach
- Could not find a paper about the IBM Watson Concept Expansion Webservice
