For project Sherlock, our team aims to use NLP tools to analyse large collections of documents. The original description of the team's goals is in Sherlock's repo.
The following sections describe the process of going from a collection of plain-text documents (emails, in this case) to a visualization of the topics they contain.
During the earlier stages of the project we worked in this repository.
Optionally, create a Python virtual environment inside your clone of this repo. Then type
$ python setup.py install
This command may fail with missing-header messages, which means you need to install some additional development libraries. On Debian and Ubuntu systems, you can do
$ sudo apt-get install python-dev libyaml-dev libssl-dev libffi-dev
whereas on Fedora or Red Hat distributions type
$ sudo yum install python-devel libyaml-devel libssl-devel libffi-devel
You will also need to download some NLTK data:
$ python -m nltk.downloader stopwords
with your virtual environment activated, if you use one.
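The stopwords downloaded above are used to drop common function words before topic modelling. As a rough sketch of that filtering step (the real pipeline reads the list from NLTK's `stopwords` corpus; here a tiny hardcoded subset stands in, and `remove_stopwords` is a hypothetical helper, not a function from the repo):

```python
# A tiny stand-in for NLTK's English stopword list.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}

def remove_stopwords(tokens):
    """Drop common function words that carry little topical signal."""
    return [t for t in tokens if t.lower() not in STOPWORDS]
```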
Clean headers
$ python corpora/cleanHeaders.py cwl/enron_mail/ cwl/enron_mail_clean/
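Header cleaning discards the email metadata (From, Subject, routing lines) so that only the body text feeds into tokenization. A minimal sketch of what such a step might look like using Python's standard `email` module; `cleanHeaders.py` may differ in detail:

```python
from email import message_from_string

def strip_headers(raw):
    """Return only the body of a raw email message, discarding headers."""
    msg = message_from_string(raw)
    if msg.is_multipart():
        # Keep just the plain-text parts of multipart messages.
        parts = [p.get_payload() for p in msg.walk()
                 if p.get_content_type() == "text/plain"]
        return "\n".join(parts)
    return msg.get_payload()
```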
Create tokens
$ python corpora/tokenization.py cwl/enron_mail_clean/ cwl/enron_mail_clean_tokens/
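Tokenization turns each cleaned email into a list of normalized word tokens. A sketch of one common approach (lowercase, split on non-letters, drop very short tokens); the actual `tokenization.py` may use different rules:

```python
import re

def tokenize(text, min_len=2):
    """Lowercase, split on runs of non-letters, drop very short tokens."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if len(t) >= min_len]
```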
Merge tokens into dictionary
$ python corpora/buildDict.py cwl/enron_mail_clean_tokens/ cwl/enron_mail.dict
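The dictionary maps every unique token across the corpus to a stable integer id, in the spirit of gensim's `Dictionary`. A stdlib sketch of the idea (not the actual `buildDict.py`):

```python
def build_dict(token_docs):
    """Assign each unique token, in order of first appearance, an integer id."""
    ids = {}
    for doc in token_docs:
        for tok in doc:
            if tok not in ids:
                ids[tok] = len(ids)
    return ids
```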
Build document matrix
$ python corpora/buildDocumentMatrix.py cwl/enron_mail.dict cwl/enron_mail_clean_tokens/ cwl/enron_mail.mtx
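The document matrix is a sparse bag-of-words representation: each email becomes a list of (token id, count) pairs, which is what the MatrixMarket `.mtx` file stores row by row. A sketch of that mapping, assuming a dictionary like the one built in the previous step:

```python
from collections import Counter

def doc_matrix(token_docs, ids):
    """Map each document to sorted, sparse (token_id, count) pairs."""
    return [sorted((ids[t], c) for t, c in Counter(doc).items())
            for doc in token_docs]
```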
See this tutorial
Next, the LDA model is trained in Spark. Follow installation instructions from here: https://spark.apache.org/docs/latest/
Run:
$ SPARK_HOME=<path to your Spark installation, e.g. /home/johndoe/spark-2.0.1/bin/>
$ $SPARK_HOME/spark-submit corpora/trainModel.py cwl/enron_mail.mtx cwl/enron_mail.lda.model 5 10
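Spark's MLlib does the LDA training at scale; to make the model itself concrete, here is a minimal collapsed Gibbs sampler for LDA in pure Python. This is purely illustrative (Spark uses different inference under the hood, and `train_lda` is a hypothetical function, not part of the repo):

```python
import random

def train_lda(docs, num_topics, iters=50, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents.
    Returns the vocabulary and the per-topic word distributions."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    wid = {w: i for i, w in enumerate(vocab)}
    # Count tables: doc-topic, topic-word, and topic totals.
    ndk = [[0] * num_topics for _ in docs]
    nkw = [[0] * V for _ in range(num_topics)]
    nk = [0] * num_topics
    # Randomly initialize a topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            t = rng.randrange(num_topics)
            zs.append(t)
            ndk[d][t] += 1; nkw[t][wid[w]] += 1; nk[t] += 1
        z.append(zs)
    # Resample each token's topic from its conditional distribution.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                ndk[d][t] -= 1; nkw[t][wid[w]] -= 1; nk[t] -= 1
                weights = [(ndk[d][k] + alpha) * (nkw[k][wid[w]] + beta)
                           / (nk[k] + V * beta) for k in range(num_topics)]
                t = rng.choices(range(num_topics), weights)[0]
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][wid[w]] += 1; nk[t] += 1
    # Smoothed topic-word distributions (each row sums to 1).
    phi = [[(nkw[k][v] + beta) / (nk[k] + V * beta) for v in range(V)]
           for k in range(num_topics)]
    return vocab, phi
```

The two trailing numbers in the `spark-submit` call above play the same roles as `num_topics` and `iters` here: the number of topics to learn and how long to iterate.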
See this tutorial
...