nlesc-sherlock/concept-search

Concept Search for Exploratory Data Analysis

This repository contains a Python package, a web server and a web front-end that suggest terms related to the words being queried in a document store.

Planning/Subtasks

  • Make list of test queries (single term vs multiple terms/phrase queries)
  • Implement query expansion algorithms
  • Merge/cluster results from query expansion algorithms
    • Use word2vec word vectors to cluster/map term suggestions
  • Visualize merged query expansion results
    • Normalize the scores of suggested words, so that aggregating scores makes more sense
    • Focus on how many methods suggested a word instead of the scores assigned
  • Build simple user interface/demo
    • Make maximum number of terms suggested a parameter
  • Do the same for terms that do not correlate/co-occur with query terms (does that make sense?)
  • How to evaluate/validate results?
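
Two of the subtasks above (normalizing scores, and counting how many methods suggested a word) can be sketched as follows. This is only an illustration: the function names `normalize` and `count_votes` are hypothetical and not part of the package, and min-max scaling is one possible choice of normalization, not necessarily the one the project uses.

```python
def normalize(suggestions):
    """Min-max scale the scores of one method into the range [0, 1].

    `suggestions` is assumed to be a dict mapping term -> score.
    """
    if not suggestions:
        return {}
    lowest = min(suggestions.values())
    highest = max(suggestions.values())
    if highest == lowest:
        # All scores are equal; treat every term as equally relevant.
        return {term: 1.0 for term in suggestions}
    span = float(highest - lowest)
    return {term: (score - lowest) / span
            for term, score in suggestions.items()}


def count_votes(suggestion_sets):
    """Count how many methods suggested each term, ignoring the scores."""
    votes = {}
    for suggestions in suggestion_sets:
        for term in suggestions:
            votes[term] = votes.get(term, 0) + 1
    return votes
```

Normalizing each method's output before aggregation keeps a method with large raw scores from dominating the result, while vote counting sidesteps score scales entirely.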

Installation

Several Python packages have to be installed for the various components of this repository.

pandas Flask Flask-cors nltk vincent elasticsearch sklearn scipy gensim

termsuggester

The termsuggester Python package provides a pipeline that combines several term-search methods to find suggestions for a term. The user configures and instantiates the term-search methods; the pipeline then aggregates their suggestions using an aggregation method selected by the user.

Current term-search methods:

  • ELSearch: Finds suggestions using the Elasticsearch significant terms aggregation on a document corpus.
  • WNSearch: Uses WordNet to find suggestions for a term.
  • PrecomputedClusterSuggester: Finds suggestions using a pre-computed term clustering data set stored in Elasticsearch. The term clustering data set is computed with non-negative matrix factorization (NMF).
  • Word2VecSuggester: Uses word2vec to find the most similar terms.
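
To give an idea of what a significant terms lookup such as the one behind ELSearch might involve, here is a minimal sketch that builds an Elasticsearch request body and converts a response into term weights. The helper names, the field name `text` and the aggregation name `suggestions` are assumptions made for illustration; the actual ELSearch implementation may differ.

```python
def significant_terms_query(query_word, field="text", size=10):
    """Build a request body asking for terms that are unusually frequent
    in documents matching `query_word` (field name is an assumption)."""
    return {
        "size": 0,  # only the aggregation is needed, not the matching hits
        "query": {"match": {field: query_word}},
        "aggregations": {
            "suggestions": {
                "significant_terms": {"field": field, "size": size}
            }
        },
    }


def buckets_to_suggestions(response):
    """Turn the aggregation buckets of a response into {term: score}."""
    buckets = response["aggregations"]["suggestions"]["buckets"]
    return {bucket["key"]: bucket["score"] for bucket in buckets}
```

The body returned by `significant_terms_query` would be passed to the search call of an Elasticsearch client, and each bucket in the response carries a `key` (the term) and a `score`.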

Current methods for aggregation of results from various term-search methods:

  • Sum
  • Average
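
As a rough illustration of the two aggregation methods, the sketch below combines several suggestion sets, each assumed to be a dictionary mapping a term to a weight. The function names are hypothetical, and the averaging rule (dividing by the number of methods that actually suggested the term, rather than by the total number of methods) is an assumption; the package may define Average differently.

```python
def aggregate_sum(suggestion_sets):
    """Add up the weights each method assigned to a term."""
    totals = {}
    for suggestions in suggestion_sets:
        for term, weight in suggestions.items():
            totals[term] = totals.get(term, 0.0) + weight
    return totals


def aggregate_average(suggestion_sets):
    """Average the weights over the methods that suggested the term.

    Assumption: methods that did not suggest a term are left out of
    its average instead of counting as a weight of zero.
    """
    totals = aggregate_sum(suggestion_sets)
    counts = {}
    for suggestions in suggestion_sets:
        for term in suggestions:
            counts[term] = counts.get(term, 0) + 1
    return {term: totals[term] / counts[term] for term in totals}
```

With Sum, a term suggested by many methods accumulates a high score; with Average as sketched here, a term suggested by a single method can still rank highly if that method scored it well.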

To add a new term-search method, create a class whose only requirement is a suggest_terms(query_word) method. This method must return a suggestion set, which is a Python dictionary of the form {str : float, str : float, ...}, where str is a suggested term and float is the weight of the suggestion (how relevant it is).
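
A minimal sketch of such a class is shown below. `PrefixSuggester`, its vocabulary argument and its weighting rule are invented for illustration only; the one part taken from the description above is the `suggest_terms(query_word)` method returning a dictionary of term weights.

```python
class PrefixSuggester(object):
    """Toy term-search method: suggests vocabulary words that share a
    prefix with the query word (hypothetical example, not in the package)."""

    def __init__(self, vocabulary, prefix_length=3):
        self.vocabulary = vocabulary
        self.prefix_length = prefix_length

    def suggest_terms(self, query_word):
        """Return a suggestion set of the form {term: weight}."""
        prefix = query_word[:self.prefix_length]
        suggestions = {}
        for word in self.vocabulary:
            if word != query_word and word.startswith(prefix):
                # Weight: prefix length relative to the longer of the two words.
                longest = max(len(word), len(query_word))
                suggestions[word] = float(len(prefix)) / longest
        return suggestions


suggester = PrefixSuggester(["car", "cart", "carbon", "dog"])
print(suggester.suggest_terms("car"))
```

Because the pipeline only relies on `suggest_terms`, an instance like this should be usable alongside the built-in methods without inheriting from any base class.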

The term-search methods may use other applications such as Elasticsearch. The package assumes that such applications have been properly set up, for example that the related Elasticsearch indexes have been created.

Method set up

  • ELSearch requires running get_dc.py and dc_to_es.py before using termsuggester.

  • WNSearch method does not require setup.

  • PrecomputedClusterSuggester requires running fit_nmf.py and nmf_to_es.py before using termsuggester. To get NMF word clusters for suggestions, first install scikit-learn from its git repository:

        pip install -U git+https://github.com/scikit-learn/scikit-learn.git

    Then fit the clusters (try n_clusters=500 and alpha=1):

        python fit_nmf.py <n_clusters> <alpha> nmf_output.json

    Then store the result in Elasticsearch:

        python nmf_to_es.py nmf_output.json

    The index that is constructed can then be used by the PrecomputedClusterSuggester.

  • Word2VecSuggester requires running train_word2vec.py before using it.

Example usage (after setting up the methods)

from TermSuggestionsAggregator import TermSuggestionsAggregator, Aggregation
from elsearch import ELSearch
from wnsearch import WNSearch
from precomputed import PrecomputedClusterSuggester

# Instantiate the term-search methods to combine.
methods = (WNSearch(), ELSearch(), PrecomputedClusterSuggester())
ts = TermSuggestionsAggregator()
# Aggregate the suggestions for 'car' by averaging their weights.
d = ts.getSuggestions('car', methods, Aggregation.Average)
print(d)

webserver

webdemo

  1. Have Elasticsearch running. Elasticsearch must have an enron index containing the Enron emails (and optionally the precomputed term suggestions).

  2. Have the webserver running: python webserver/webTermSuggester.py.

  3. Run the webdemo

    cd webdemo
    gulp serve

Data

Related documentation

More ideas (for next sprint)

Ontologies
