NLTK Trainer
------------

NLTK Trainer exists to make training and evaluating NLTK objects as easy as possible.


Requirements
------------

You must have Python 2.6 with `argparse <http://pypi.python.org/pypi/argparse/>`_ and `NLTK <http://www.nltk.org/>`_ 2.0 installed. `NumPy <http://numpy.scipy.org/>`_, `SciPy <http://www.scipy.org/>`_, and `megam <http://www.cs.utah.edu/~hal/megam/>`_ are recommended for training Maxent classifiers.
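
A quick way to check which versions are installed (a minimal sketch; megam is an external binary, so it is not importable)::
	>>> import nltk, numpy, scipy
	>>> nltk.__version__, numpy.__version__, scipy.__version__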


Training Classifiers
--------------------

Example usage with the movie_reviews corpus can be found in `Training Binary Text Classifiers with NLTK Trainer <http://streamhacker.com/2010/10/25/training-binary-text-classifiers-nltk-trainer/>`_.

Train a binary NaiveBayes classifier on the movie_reviews corpus, using paragraphs as the training instances::
	``python train_classifier.py --instances paras --classifier NaiveBayes movie_reviews``

Include bigrams as features::
	``python train_classifier.py --instances paras --classifier NaiveBayes --ngrams 1 --ngrams 2 movie_reviews``

Minimum score threshold::
	``python train_classifier.py --instances paras --classifier NaiveBayes --ngrams 1 --ngrams 2 --min_score 3 movie_reviews``

Maximum number of features::
	``python train_classifier.py --instances paras --classifier NaiveBayes --ngrams 1 --ngrams 2 --max_feats 1000 movie_reviews``

Use the default Maxent algorithm::
	``python train_classifier.py --instances paras --classifier Maxent movie_reviews``

Use the MEGAM Maxent algorithm::
	``python train_classifier.py --instances paras --classifier MEGAM movie_reviews``

Train on files instead of paragraphs::
	``python train_classifier.py --instances files --classifier MEGAM movie_reviews``

Train on sentences::
	``python train_classifier.py --instances sents --classifier MEGAM movie_reviews``

Evaluate the classifier by training on 3/4 of the paragraphs and testing against the remaining 1/4, without pickling::
	``python train_classifier.py --instances paras --classifier NaiveBayes --fraction 0.75 --no-pickle movie_reviews``

For a complete list of usage options::
	``python train_classifier.py --help``


Using a Trained Classifier
--------------------------

You can use a trained classifier by loading the pickle file using `nltk.data.load <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.data-module.html#load>`_::
	>>> import nltk.data
	>>> classifier = nltk.data.load("classifiers/NAME_OF_CLASSIFIER.pickle")

Or if your classifier pickle file is not in a ``nltk_data`` subdirectory, you can load it with `pickle.load <http://docs.python.org/library/pickle.html#pickle.load>`_ (opening the file in binary mode)::
	>>> import pickle
	>>> classifier = pickle.load(open("/path/to/NAME_OF_CLASSIFIER.pickle", "rb"))

Either method will return an object that supports the `ClassifierI interface <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.classify.api.ClassifierI-class.html>`_. 
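
For example, every ``ClassifierI`` object provides a ``labels()`` method listing the labels it was trained on (the actual labels depend on your corpus)::
	>>> classifier.labels()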

Once you have a ``classifier`` object, you can use it to classify word features with the ``classifier.classify(feats)`` method, which returns a label::
	>>> words = ['some', 'words', 'in', 'a', 'sentence']
	>>> feats = dict([(word, True) for word in words])
	>>> classifier.classify(feats)

If you used the ``--ngrams`` option with values greater than 1, you should include these ngrams in the dictionary using `nltk.util.ngrams(words, n) <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.util-module.html#ngrams>`_, where ``n`` is each ngram size you trained with (for example, 2 if you passed ``--ngrams 2``)::
	>>> from nltk.util import ngrams
	>>> words = ['some', 'words', 'in', 'a', 'sentence']
	>>> feats = dict([(token, True) for token in words + list(ngrams(words, 2))])
	>>> classifier.classify(feats)

The list of words you use for creating the feature dictionary should be created by `tokenizing <http://text-processing.com/demo/tokenize/>`_ the appropriate text instances: sentences, paragraphs, or files depending on the ``--instances`` option.
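
For example, a minimal sketch that uses NLTK's default tokenizer to classify a raw sentence (assuming ``word_tokenize`` is appropriate for your text)::
	>>> from nltk import word_tokenize
	>>> words = word_tokenize("A sentence from the text you want to classify.")
	>>> feats = dict([(word, True) for word in words])
	>>> classifier.classify(feats)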


Training Part of Speech Taggers
-------------------------------

The ``train_tagger.py`` script can use any corpus included with NLTK that implements a ``tagged_sents()`` method. It can also train on the ``timit`` corpus, which includes tagged sentences that are not available through the ``TimitCorpusReader``.
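
A quick way to check whether a corpus will work is to look at its tagged sentences directly (a minimal sketch using the treebank corpus)::
	>>> from nltk.corpus import treebank
	>>> treebank.tagged_sents()[0]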

Example usage can be found in `Training Part of Speech Taggers with NLTK Trainer <http://streamhacker.com/2011/03/21/training-part-speech-taggers-nltk-trainer/>`_.

Train the default sequential backoff tagger on the treebank corpus::
	``python train_tagger.py treebank``

To use a Brill tagger with the default initial tagger::
	``python train_tagger.py treebank --brill``

To train a NaiveBayes classifier based tagger, without a sequential backoff tagger::
	``python train_tagger.py treebank --sequential '' --classifier NaiveBayes``

To train a unigram tagger::
	``python train_tagger.py treebank --sequential u``

To train on the switchboard corpus::
	``python train_tagger.py switchboard``

To train on a custom corpus, whose fileids end in ".pos", using a `TaggedCorpusReader <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.tagged.TaggedCorpusReader-class.html>`_::
	``python train_tagger.py /path/to/corpus --reader nltk.corpus.reader.tagged.TaggedCorpusReader --fileids '.+\.pos'``

The corpus path can be absolute, or relative to a nltk_data directory. For example, both ``corpora/treebank/tagged`` and ``/usr/share/nltk_data/corpora/treebank/tagged`` will work.
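
If you're not sure which nltk_data directories will be searched, you can list them (a minimal sketch using NLTK's data path)::
	>>> import nltk.data
	>>> nltk.data.path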

You can also restrict the files used with the ``--fileids`` option::
	``python train_tagger.py conll2000 --fileids train.txt``

For a complete list of usage options::
	``python train_tagger.py --help``


Using a Trained Tagger
----------------------

You can use a trained tagger by loading the pickle file using `nltk.data.load <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.data-module.html#load>`_::
	>>> import nltk.data
	>>> tagger = nltk.data.load("taggers/NAME_OF_TAGGER.pickle")

Or if your tagger pickle file is not in a ``nltk_data`` subdirectory, you can load it with `pickle.load <http://docs.python.org/library/pickle.html#pickle.load>`_ (opening the file in binary mode)::
	>>> import pickle
	>>> tagger = pickle.load(open("/path/to/NAME_OF_TAGGER.pickle", "rb"))

Either method will return an object that supports the `TaggerI interface <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tag.api.TaggerI-class.html>`_.

Once you have a ``tagger`` object, you can use it to tag a tokenized sentence (a list of words) with the ``tagger.tag(words)`` method::
	>>> tagger.tag(['some', 'words', 'in', 'a', 'sentence'])

``tagger.tag(words)`` will return a list of 2-tuples of the form ``[(word, tag)]``.
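
For raw text, a minimal sketch that tokenizes before tagging (assuming ``word_tokenize`` fits your text)::
	>>> from nltk import word_tokenize
	>>> tagger.tag(word_tokenize("A sentence to tag."))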


Analyzing Tagger Coverage
-------------------------

The ``analyze_tagger_coverage.py`` script will run a part-of-speech tagger on a corpus to determine how many times each tag is found. Example output can be found in `Analyzing Tagged Corpora and NLTK Part of Speech Taggers <http://streamhacker.com/2011/03/23/analyzing-tagged-corpora-nltk-part-speech-taggers/>`_.

Here's an example using the NLTK default tagger on the treebank corpus::
	``python analyze_tagger_coverage.py treebank``

To get detailed metrics on each tag, you can use the ``--metrics`` option. This requires using a tagged corpus in order to compare actual tags against tags found by the tagger. See `NLTK Default Tagger Treebank Tag Coverage <http://streamhacker.com/2011/01/24/nltk-default-tagger-treebank-tag-coverage/>`_ and `NLTK Default Tagger CoNLL2000 Tag Coverage <http://streamhacker.com/2011/01/25/nltk-default-tagger-conll2000-tag-coverage/>`_ for examples and statistics.
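
If you only need an overall accuracy figure outside of the script, NLTK taggers also provide an ``evaluate()`` method that compares the tagger's output against gold-standard tagged sentences (a minimal sketch, using a ``tagger`` loaded as shown above and part of the treebank corpus)::
	>>> from nltk.corpus import treebank
	>>> tagger.evaluate(treebank.tagged_sents()[:100])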

To analyze the coverage of a different tagger, use the ``--tagger`` option with a path to the pickled tagger::
	``python analyze_tagger_coverage.py treebank --tagger /path/to/tagger.pickle``

To analyze coverage on a custom corpus, whose fileids end in ".pos", using a `TaggedCorpusReader <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.tagged.TaggedCorpusReader-class.html>`_::
	``python analyze_tagger_coverage.py /path/to/corpus --reader nltk.corpus.reader.tagged.TaggedCorpusReader --fileids '.+\.pos'``

The corpus path can be absolute, or relative to a nltk_data directory. For example, both ``corpora/treebank/tagged`` and ``/usr/share/nltk_data/corpora/treebank/tagged`` will work.

For a complete list of usage options::
	``python analyze_tagger_coverage.py --help``


Analyzing a Tagged Corpus
-------------------------

The ``analyze_tagged_corpus.py`` script will show the following statistics about a tagged corpus:

 * total number of words
 * number of unique words
 * number of tags
 * the number of times each tag occurs

Example output can be found in `Analyzing Tagged Corpora and NLTK Part of Speech Taggers <http://streamhacker.com/2011/03/23/analyzing-tagged-corpora-nltk-part-speech-taggers/>`_.
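
For illustration, a minimal sketch that computes similar statistics by hand with ``nltk.FreqDist`` (shown here for the treebank corpus)::
	>>> from nltk.corpus import treebank
	>>> from nltk import FreqDist
	>>> tagged_words = treebank.tagged_words()
	>>> len(tagged_words)                             # total number of words
	>>> len(set(word for word, tag in tagged_words))  # number of unique words
	>>> tag_fd = FreqDist(tag for word, tag in tagged_words)
	>>> len(tag_fd)                                   # number of tags
	>>> tag_fd.items()                                # how often each tag occurs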

To analyze the treebank corpus::
	``python analyze_tagged_corpus.py treebank``

To sort the output by tag count from highest to lowest::
	``python analyze_tagged_corpus.py treebank --sort count --reverse``

To see simplified tags instead of standard tags::
	``python analyze_tagged_corpus.py treebank --simplify_tags``

To analyze a custom corpus, whose fileids end in ".pos", using a `TaggedCorpusReader <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.tagged.TaggedCorpusReader-class.html>`_::
	``python analyze_tagged_corpus.py /path/to/corpus --reader nltk.corpus.reader.tagged.TaggedCorpusReader --fileids '.+\.pos'``

The corpus path can be absolute, or relative to a nltk_data directory. For example, both ``corpora/treebank/tagged`` and ``/usr/share/nltk_data/corpora/treebank/tagged`` will work.

For a complete list of usage options::
	``python analyze_tagged_corpus.py --help``


Training IOB Chunkers
---------------------

The ``train_chunker.py`` script can use any corpus included with NLTK that implements a ``chunked_sents()`` method.
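
A quick way to check whether a corpus will work is to look at its chunked sentences directly (a minimal sketch using the treebank_chunk corpus)::
	>>> from nltk.corpus import treebank_chunk
	>>> treebank_chunk.chunked_sents()[0]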

Train the default sequential backoff tagger based chunker on the treebank_chunk corpus::
	``python train_chunker.py treebank_chunk``

To train a NaiveBayes classifier based chunker::
	``python train_chunker.py treebank_chunk --classifier NaiveBayes``

To train on the conll2000 corpus::
	``python train_chunker.py conll2000``

To train on a custom corpus, whose fileids end in ".pos", using a `ChunkedCorpusReader <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.chunked.ChunkedCorpusReader-class.html>`_::
	``python train_chunker.py /path/to/corpus --reader nltk.corpus.reader.chunked.ChunkedCorpusReader --fileids '.+\.pos'``

The corpus path can be absolute, or relative to a nltk_data directory. For example, both ``corpora/treebank/tagged`` and ``/usr/share/nltk_data/corpora/treebank/tagged`` will work.

You can also restrict the files used with the ``--fileids`` option::
	``python train_chunker.py conll2000 --fileids train.txt``

For a complete list of usage options::
	``python train_chunker.py --help``
