For project Sherlock, our team aims to use NLP tools to analyse large collections of documents. The original description of the team's goals is in Sherlock's repo.
The following sections describe the process of going from a collection of plain-text documents (emails, in this case) to a visualization of the topics they contain.
During the earlier stages of the project we worked in this repository.
Optionally, create a Python virtual environment inside your clone of this repo. Then type
$ python setup.py install
This command may fail with missing-header messages, which means you need to install some additional development libraries. On Debian and Ubuntu systems, you can do
$ sudo apt-get install python-dev libyaml-dev libssl-dev libffi-dev
whereas on Fedora or Red Hat distributions type
$ sudo yum install python-devel libyaml-devel libssl-devel libffi-devel
You will also need to download some NLTK data:
$ python -m nltk.downloader stopwords
with your virtual environment activated, if you use one.
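The stopwords downloaded above are used to drop common function words before topic modelling. As a rough sketch of that filtering step (the real pipeline reads the list from NLTK's `stopwords` corpus; here a tiny hardcoded subset stands in, and `remove_stopwords` is a hypothetical helper, not a function from the repo):

```python
# A tiny stand-in for NLTK's English stopword list.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}

def remove_stopwords(tokens):
    """Drop common function words that carry little topical signal."""
    return [t for t in tokens if t.lower() not in STOPWORDS]
```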
Clean headers
$ python corpora/cleanHeaders.py cwl/enron_mail/ cwl/enron_mail_clean/
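Header cleaning discards the email metadata (From, Subject, routing lines) so that only the body text feeds into tokenization. A minimal sketch of what such a step might look like using Python's standard `email` module; `cleanHeaders.py` may differ in detail:

```python
from email import message_from_string

def strip_headers(raw):
    """Return only the body of a raw email message, discarding headers."""
    msg = message_from_string(raw)
    if msg.is_multipart():
        # Keep just the plain-text parts of multipart messages.
        parts = [p.get_payload() for p in msg.walk()
                 if p.get_content_type() == "text/plain"]
        return "\n".join(parts)
    return msg.get_payload()
```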
Create tokens
$ python corpora/tokenization.py cwl/enron_mail_clean/ cwl/enron_mail_clean_tokens/
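Tokenization turns each cleaned email into a list of normalized word tokens. A sketch of one common approach (lowercase, split on non-letters, drop very short tokens); the actual `tokenization.py` may use different rules:

```python
import re

def tokenize(text, min_len=2):
    """Lowercase, split on runs of non-letters, drop very short tokens."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if len(t) >= min_len]
```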
Merge tokens into dictionary
$ python corpora/buildDict.py cwl/enron_mail_clean_tokens/ cwl/enron_mail.dict
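The dictionary maps every unique token across the corpus to a stable integer id, in the spirit of gensim's `Dictionary`. A stdlib sketch of the idea (not the actual `buildDict.py`):

```python
def build_dict(token_docs):
    """Assign each unique token, in order of first appearance, an integer id."""
    ids = {}
    for doc in token_docs:
        for tok in doc:
            if tok not in ids:
                ids[tok] = len(ids)
    return ids
```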
Build document matrix
$ python corpora/buildDocumentMatrix.py cwl/enron_mail.dict cwl/enron_mail_clean_tokens/ cwl/enron_mail.mtx
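The document matrix is a sparse bag-of-words representation: each email becomes a list of (token id, count) pairs, which is what the MatrixMarket `.mtx` file stores row by row. A sketch of that mapping, assuming a dictionary like the one built in the previous step:

```python
from collections import Counter

def doc_matrix(token_docs, ids):
    """Map each document to sorted, sparse (token_id, count) pairs."""
    return [sorted((ids[t], c) for t, c in Counter(doc).items())
            for doc in token_docs]
```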
See this tutorial
Next, the LDA model is trained in Spark. Follow installation instructions from here: https://spark.apache.org/docs/latest/
Run:
$ SPARK_HOME=<path to your Spark installation, e.g. /home/johndoe/spark-2.0.1/bin/>
$ $SPARK_HOME/spark-submit corpora/trainModel.py cwl/enron_mail.mtx cwl/enron_mail.lda.model 5 10
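Spark's MLlib does the LDA training at scale; to make the model itself concrete, here is a minimal collapsed Gibbs sampler for LDA in pure Python. This is purely illustrative (Spark uses different inference under the hood, and `train_lda` is a hypothetical function, not part of the repo):

```python
import random

def train_lda(docs, num_topics, iters=50, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents.
    Returns the vocabulary and the per-topic word distributions."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    wid = {w: i for i, w in enumerate(vocab)}
    # Count tables: doc-topic, topic-word, and topic totals.
    ndk = [[0] * num_topics for _ in docs]
    nkw = [[0] * V for _ in range(num_topics)]
    nk = [0] * num_topics
    # Randomly initialize a topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            t = rng.randrange(num_topics)
            zs.append(t)
            ndk[d][t] += 1; nkw[t][wid[w]] += 1; nk[t] += 1
        z.append(zs)
    # Resample each token's topic from its conditional distribution.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                ndk[d][t] -= 1; nkw[t][wid[w]] -= 1; nk[t] -= 1
                weights = [(ndk[d][k] + alpha) * (nkw[k][wid[w]] + beta)
                           / (nk[k] + V * beta) for k in range(num_topics)]
                t = rng.choices(range(num_topics), weights)[0]
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][wid[w]] += 1; nk[t] += 1
    # Smoothed topic-word distributions (each row sums to 1).
    phi = [[(nkw[k][v] + beta) / (nk[k] + V * beta) for v in range(V)]
           for k in range(num_topics)]
    return vocab, phi
```

The two trailing numbers in the `spark-submit` call above play the same roles as `num_topics` and `iters` here: the number of topics to learn and how long to iterate.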
See this tutorial
...