This repository contains a Python package, a web server and a web front-end for finding suggestions for words that are queried in a document store.
- Make list of test queries (single term vs multiple terms/phrase queries)
- Papers etc. about Enron data: http://enrondata.org/content/research/
- Implement query expansion algorithms
- Merge/cluster results from query expansion algorithms
- Use word2vec word vectors to cluster/map term suggestions
- Visualize merged query expansion results
- Normalize scores of suggested words, so that aggregating scores makes more sense
- Focus on how many methods suggested a word instead of the scores assigned
- Build simple user interface/demo
- Make maximum number of terms suggested a parameter
- Do the same for terms that do not correlate/co-occur with query terms (does that make sense?)
- How to evaluate/validate results?
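Two of the items above, normalizing scores and counting how many methods suggested a word, can be sketched in a few lines. This is only an illustration, not code from the package: the function names normalize_scores and count_votes are hypothetical, and scaling by the maximum score is just one possible normalization.

```python
from collections import Counter


def normalize_scores(suggestions):
    """Scale scores so the highest-scoring term gets 1.0 (hypothetical helper)."""
    if not suggestions:
        return {}
    top = max(suggestions.values())
    if top <= 0:
        # No positive score to scale by; leave the scores untouched.
        return dict(suggestions)
    return {term: score / float(top) for term, score in suggestions.items()}


def count_votes(suggestion_sets):
    """Count how many methods suggested each term, ignoring the scores."""
    votes = Counter()
    for suggestions in suggestion_sets:
        votes.update(suggestions.keys())
    return dict(votes)
```

Normalizing each method's output before aggregation keeps a method with a large score range from dominating the others, while vote counting sidesteps score comparability entirely.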
Several Python packages have to be installed for the various components of this repository.
pandas Flask Flask-cors nltk vincent elasticsearch sklearn scipy gensim
The termsuggester Python package contains a pipeline that combines several term-search methods to find suggestions for a term. The user configures and instantiates the term-search methods. The suggestions from the various methods are then aggregated, using an aggregation method selected by the user.
Current term-search methods:
- ELSearch: Find suggestions using ElasticSearch significant terms aggregation from a Document Corpus.
- WNSearch: Use WordNet to find suggestions for a term
- PrecomputedClusterSuggester: Find suggestions using a pre-computed term clustering data set stored in ElasticSearch. The term clustering data set is computed with the non-negative matrix factorization (NMF) clustering method.
- Word2VecSuggester: Use word2vec to find most similar terms
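As an illustration of the significant terms approach that ELSearch is described as using, the sketch below builds an Elasticsearch search body with a significant_terms aggregation. It is an assumption about what such a query could look like, not the query ELSearch actually sends: the function name significant_terms_query and the default field name "text" are hypothetical.

```python
def significant_terms_query(query_word, field="text", size=10):
    """Build a search body asking for terms unusually frequent in matching docs."""
    return {
        "query": {"match": {field: query_word}},
        "size": 0,  # only the aggregation is needed, not the matching documents
        "aggregations": {
            "suggestions": {
                "significant_terms": {"field": field, "size": size}
            }
        },
    }
```

The resulting dictionary could be sent to a running Elasticsearch instance as the body of a search request against the index that holds the document corpus.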
Current methods for aggregation of results from various term-search methods:
- Sum
- Average
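To make the two aggregation options concrete, here is a minimal sketch of how per-method suggestion sets could be combined. It is an illustration under assumptions, not the package's code: aggregate is a hypothetical function, and it assumes that Average divides the summed score by the total number of methods, so a term that a method did not suggest counts as 0 for that method.

```python
def aggregate(suggestion_sets, average=False):
    """Combine {term: score} dicts by summing, optionally averaging over methods."""
    totals = {}
    for suggestions in suggestion_sets:
        for term, score in suggestions.items():
            totals[term] = totals.get(term, 0.0) + score
    if average and suggestion_sets:
        n = float(len(suggestion_sets))
        totals = {term: score / n for term, score in totals.items()}
    return totals
```

With this definition, averaging rewards terms suggested by several methods, because terms suggested by only one method are divided by the same method count.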
To add a new term-search method, create a class whose only requirement is a suggest_terms(query_word) method. This method must return a suggestion set, which is a Python dictionary of the form {str : float, str : float, ...}, where each str is a suggested term and each float is the weight of the suggestion (how relevant it is).
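A toy class that satisfies this interface might look like the sketch below. Everything except the suggest_terms(query_word) method name and the {str : float} return shape is made up for illustration: the PrefixSuggester class, its vocabulary argument and its prefix-based weighting are hypothetical and not part of the package.

```python
class PrefixSuggester(object):
    """Toy term-search method: suggests vocabulary words sharing a prefix."""

    def __init__(self, vocabulary, prefix_length=3):
        self.vocabulary = list(vocabulary)
        self.prefix_length = prefix_length

    def suggest_terms(self, query_word):
        """Return {term: weight}; weight is the shared-prefix share of the term."""
        prefix = query_word[:self.prefix_length]
        suggestions = {}
        for term in self.vocabulary:
            if term != query_word and term.startswith(prefix):
                suggestions[term] = len(prefix) / float(len(term))
        return suggestions
```

For example, PrefixSuggester(["car", "cart", "carbon", "bus"]).suggest_terms("car") returns a dictionary with weights for "cart" and "carbon" only. An instance of such a class should then be usable alongside the built-in methods, since the pipeline only relies on suggest_terms.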
The term-search methods may use other applications such as ElasticSearch. The package assumes that such applications have been properly set up, for example that the required ElasticSearch indices have been created.
- ELSearch requires running get_dc.py and dc_to_es.py before using termsuggester.
- WNSearch does not require setup.
- PrecomputedClusterSuggester requires running fit_nmf.py and nmf_to_es.py before using termsuggester. To get NMF word clusters for suggestions, run
  pip install -U git+https://github.com/scikit-learn/scikit-learn.git
  Then run
  python fit_nmf.py <n_clusters> <alpha> nmf_output.json
  (Try n_clusters=500 and alpha=1.) Then store the result in Elasticsearch:
  python nmf_to_es.py nmf_output.json
  The index that is constructed can then be used by the PrecomputedClusterSuggester.
- Word2VecSuggester requires running train_word2vec.py before using it.
from TermSuggestionsAggregator import TermSuggestionsAggregator, Aggregation
from elsearch import ELSearch
from wnsearch import WNSearch
from precomputed import PrecomputedClusterSuggester
methods = (WNSearch(), ELSearch(), PrecomputedClusterSuggester())
ts = TermSuggestionsAggregator()
d = ts.getSuggestions('car', methods, Aggregation.Average)
print(d)
- Have Elasticsearch running. Elasticsearch must have an enron index containing the Enron emails (and optionally precomputed term suggestions).
- Have the web server running:
  python webserver/webTermSuggester.py
- Run the web demo:
  cd webdemo
  gulp serve
- Get the data (IPython notebook)
- Text feature extraction using scikit-learn
- Given a term-document matrix A (where cell t, d contains the weight of term t in document d), the term-term correlation matrix is A*A.T (see Automatic Query Expansion in Information Retrieval, page 13, above equation 5).
- A review of ontology based query expansion
- IBM Watson Concept Expansion
- The Concept Expansion process is also known as Semantic Lexicon Induction or Semantic Set Expansion.
- Probably a more 'ontology based' approach
- Could not find a paper about the IBM Watson Concept Expansion Webservice
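Returning to the term-term correlation matrix A*A.T mentioned in the list above, the following sketch computes it for a tiny term-document matrix in plain Python. The function name term_term_correlation and the toy weights are assumptions chosen for illustration; for real data a sparse matrix library would be the more practical choice.

```python
def term_term_correlation(A):
    """Return A * A.T for a term-document matrix given as a list of rows."""
    return [
        [sum(a * b for a, b in zip(row_t, row_u)) for row_u in A]
        for row_t in A
    ]


# Rows are terms, columns are documents (toy weights).
A = [
    [1, 0, 2],  # "car"
    [1, 1, 0],  # "auto"
    [0, 3, 0],  # "bus"
]
C = term_term_correlation(A)
```

Cell (t, u) of C is the dot product of the document weights of terms t and u, so terms that never occur in the same document (here "car" and "bus") get a correlation of 0. The result is symmetric, and its diagonal holds each term's squared weight summed over all documents.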