Commit faa2a65

doc updates for analyze scripts
1 parent ca62cf9 commit faa2a65

File tree

5 files changed: +40 -4 lines changed


docs/analyze_chunked_corpus.rst

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+Analyzing a Chunked Corpus
+--------------------------
+
+The ``analyze_chunked_corpus.py`` script will show the following statistics about a chunked corpus:
+
+* total number of words
+* number of unique words
+* number of tags
+* number of IOB tags
+* the number of times each tag and IOB tag occurs
+
+To analyze the treebank corpus::
+	``python analyze_chunked_corpus.py treebank_chunk``
+
+To sort the output by tag count from highest to lowest::
+	``python analyze_chunked_corpus.py treebank_chunk --sort count --reverse``
+
+To analyze a custom corpus using a ``ChunkedCorpusReader``::
+	``python analyze_chunked_corpus.py /path/to/corpus --reader nltk.corpus.reader.ChunkedCorpusReader``
+
+The corpus path can be absolute, or relative to a nltk_data directory.
+
+For a complete list of usage options::
+	``python analyze_chunked_corpus.py --help``
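The statistics this new doc lists are essentially frequency counts over (word, tag, IOB) triples. A minimal sketch of how such counts might be computed, using a tiny hand-made sentence rather than a real chunked corpus (the words, tags, and chunk labels are illustrative only, not output of the script):

```python
from collections import Counter

# Hand-made (word, tag, IOB) triples standing in for a chunked corpus.
iob_sent = [
    ("the", "DT", "B-NP"),
    ("quick", "JJ", "I-NP"),
    ("fox", "NN", "I-NP"),
    ("jumps", "VBZ", "O"),
    ("over", "IN", "O"),
    ("the", "DT", "B-NP"),
    ("dog", "NN", "I-NP"),
]

words = [word for word, tag, iob in iob_sent]
tag_counts = Counter(tag for word, tag, iob in iob_sent)
# "B-NP" and "I-NP" share the IOB prefix before the dash.
iob_counts = Counter(iob.split("-")[0] for word, tag, iob in iob_sent)

print(len(words))        # total number of words -> 7
print(len(set(words)))   # number of unique words -> 6
print(len(tag_counts))   # number of tags -> 5
print(len(iob_counts))   # number of IOB tags -> 3
print(tag_counts.most_common(2))  # most frequent tags and their counts
```

The real script additionally prints each tag with its count, which is what the `--sort count --reverse` option reorders.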

docs/analyze_tagger_coverage.rst

Lines changed: 3 additions & 3 deletions
@@ -1,17 +1,17 @@
 Analyzing Tagger Coverage
 -------------------------
 
-The ``analyze_tagger_coverage.py`` script will run a part-of-speech tagger on a corpus to determine how many times each tag is found. Example output can be found in `Analyzing Tagged Corpora and NLTK Part of Speech Taggers <http://streamhacker.com/2011/03/23/analyzing-tagged-corpora-nltk-part-speech-taggers/>`_.
+The ``analyze_tagger_coverage.py`` script will run a part-of-speech tagger over a corpus to determine how many times each tag is found. Example output can be found in `Analyzing Tagged Corpora and NLTK Part of Speech Taggers <http://streamhacker.com/2011/03/23/analyzing-tagged-corpora-nltk-part-speech-taggers/>`_.
 
 Here's an example using the NLTK default tagger on the treebank corpus::
 	``python analyze_tagger_coverage.py treebank``
 
 To get detailed metrics on each tag, you can use the ``--metrics`` option. This requires using a tagged corpus in order to compare actual tags against tags found by the tagger. See `NLTK Default Tagger Treebank Tag Coverage <http://streamhacker.com/2011/01/24/nltk-default-tagger-treebank-tag-coverage/>`_ and `NLTK Default Tagger CoNLL2000 Tag Coverage <http://streamhacker.com/2011/01/25/nltk-default-tagger-conll2000-tag-coverage/>`_ for examples and statistics.
 
-To analyze the coverage of a different tagger, use the ``--tagger`` option with a path to the pickled tagger::
+The default tagger used is NLTK's default tagger. To analyze the coverage using a different tagger, use the ``--tagger`` option with a path to the pickled tagger, as in::
 	``python analyze_tagger_coverage.py treebank --tagger /path/to/tagger.pickle``
 
-To analyze coverage on a custom corpus, whose fileids end in ".pos", using a `TaggedCorpusReader <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.tagged.TaggedCorpusReader-class.html>`_::
+You can also analyze tagger coverage over a custom corpus. For example, with a corpus whose fileids end in ".pos", you can use a `TaggedCorpusReader <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.tagged.TaggedCorpusReader-class.html>`_::
 	``python analyze_tagger_coverage.py /path/to/corpus --reader nltk.corpus.reader.tagged.TaggedCorpusReader --fileids '.+\.pos'``
 
 The corpus path can be absolute, or relative to a nltk_data directory. For example, both ``corpora/treebank/tagged`` and ``/usr/share/nltk_data/corpora/treebank/tagged`` will work.
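Tag-coverage analysis boils down to tagging every word in a corpus and counting the resulting tags. A minimal sketch of that idea, where `LEXICON` and `tag_word` are toy stand-ins (not the script's actual tagger, which is a pickled NLTK tagger):

```python
from collections import Counter

# Toy lookup tagger -- purely illustrative; the real script loads a
# pickled NLTK tagger instead.
LEXICON = {"the": "DT", "dog": "NN", "barks": "VBZ"}

def tag_word(word):
    # Fall back to a placeholder tag for words the toy lexicon doesn't know.
    return LEXICON.get(word.lower(), "-NONE-")

corpus = ["The", "dog", "barks", "at", "the", "dog"]
coverage = Counter(tag_word(w) for w in corpus)

# Print each tag with its count, highest first, like --sort count --reverse.
for tag, count in sorted(coverage.items(), key=lambda kv: kv[1], reverse=True):
    print(tag, count)
```

The `--metrics` option goes one step further: for a tagged corpus it can compare these found tags against the gold-standard tags.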

tests/analyze_chunked_corpus.sh

Lines changed: 6 additions & 0 deletions
@@ -37,4 +37,10 @@ it_anayzes_treebank_tagged() {
 11994 unique words
 47 tags
 1 IOBs"
+}
+
+it_analyzes_treebank_chunk_sort_count_reverse() {
+	two_lines=$(./analyze_chunked_corpus.py treebank_chunk --sort count --reverse 2>&1 | head -n 10 | tail -n 2)
+	test "$two_lines" "=" "NN 13181 12832
+IN 9970 26"
 }
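Both new test functions use the same line-picking trick: `head -n 10 | tail -n 2` keeps the first ten lines of the script's output and then the last two of those, i.e. lines 9 and 10. A sketch of the equivalent slice in Python, run on hypothetical numbered output rather than the real script's:

```python
# Hypothetical 20-line output standing in for the analyze script's report.
output = "\n".join("line %d" % i for i in range(1, 21))

# head -n 10 | tail -n 2  ==  first ten lines, then the last two of those.
two_lines = "\n".join(output.splitlines()[:10][-2:])
print(two_lines)  # line 9
                  # line 10
```

Picking lines by position keeps the tests cheap, but it does make them sensitive to any change in the script's header lines.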

tests/analyze_tagged_corpus.sh

Lines changed: 6 additions & 0 deletions
@@ -43,4 +43,10 @@ it_anayzes_treebank_simplified_tags() {
 100676 total words
 12408 unique words
 31 tags"
+}
+
+it_analyzes_treebank_sort_count_reverse() {
+	two_lines=$(./analyze_tagged_corpus.py treebank --sort count --reverse 2>&1 | head -n 9 | tail -n 2)
+	test "$two_lines" "=" "NN 13166
+IN 9857"
 }

train_classifier.py

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@
 ~/nltk_data/classifiers''')
 parser.add_argument('--no-pickle', action='store_true', default=False,
 	help="don't pickle and save the classifier")
-parser.add_argument('--classifier', '--algorithm', default='NaiveBayes', nargs='+',
+parser.add_argument('--classifier', '--algorithm', default=['NaiveBayes'], nargs='+',
 	choices=nltk_trainer.classification.args.classifier_choices,
 	help='''Classifier algorithm to use, defaults to %(default)s. Maxent uses the
 	default Maxent training algorithm, either CG or iis.''')
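The one-character-pair fix above matters because argparse returns a list whenever an option declared with `nargs='+'` is given on the command line, but it passes a `default` through unchanged. With the old plain-string default, `args.classifier` would be the string `'NaiveBayes'` in the default case, so code looping over it would see individual characters. A minimal reproduction (the `choices` and help text from the real script are omitted):

```python
import argparse

# With nargs='+', parsed values are lists, so the default must be a list too.
parser = argparse.ArgumentParser()
parser.add_argument('--classifier', '--algorithm', default=['NaiveBayes'], nargs='+')

print(parser.parse_args([]).classifier)                          # ['NaiveBayes']
print(parser.parse_args(['--classifier', 'Maxent']).classifier)  # ['Maxent']
```

Either spelling of the option (`--classifier` or `--algorithm`) stores into `args.classifier`, and multiple algorithms can be listed after one flag.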
