Commit faa2a65

doc updates for analyze scripts
1 parent ca62cf9 commit faa2a65

File tree

5 files changed: +40 -4 lines changed


docs/analyze_chunked_corpus.rst

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+Analyzing a Chunked Corpus
+--------------------------
+
+The ``analyze_chunked_corpus.py`` script will show the following statistics about a chunked corpus:
+
+* total number of words
+* number of unique words
+* number of tags
+* number of IOB tags
+* the number of times each tag and IOB tag occurs
+
+To analyze the treebank corpus::
+	``python analyze_chunked_corpus.py treebank_chunk``
+
+To sort the output by tag count from highest to lowest::
+	``python analyze_chunked_corpus.py treebank_chunk --sort count --reverse``
+
+To analyze a custom corpus using a ``ChunkedCorpusReader``::
+	``python analyze_chunked_corpus.py /path/to/corpus --reader nltk.corpus.reader.ChunkedCorpusReader``
+
+The corpus path can be absolute, or relative to a nltk_data directory.
+
+For a complete list of usage options::
+	``python analyze_chunked_corpus.py --help``
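The statistics this new doc lists are essentially frequency counts over (word, tag, IOB) triples. A minimal sketch of how such counts might be computed, using a tiny hand-made sentence rather than a real chunked corpus (the words, tags, and chunk labels are illustrative only, not output of the script):

```python
from collections import Counter

# Hand-made (word, tag, IOB) triples standing in for a chunked corpus.
iob_sent = [
    ("the", "DT", "B-NP"),
    ("quick", "JJ", "I-NP"),
    ("fox", "NN", "I-NP"),
    ("jumps", "VBZ", "O"),
    ("over", "IN", "O"),
    ("the", "DT", "B-NP"),
    ("dog", "NN", "I-NP"),
]

words = [word for word, tag, iob in iob_sent]
tag_counts = Counter(tag for word, tag, iob in iob_sent)
# "B-NP" and "I-NP" share the IOB prefix before the dash.
iob_counts = Counter(iob.split("-")[0] for word, tag, iob in iob_sent)

print(len(words))        # total number of words -> 7
print(len(set(words)))   # number of unique words -> 6
print(len(tag_counts))   # number of tags -> 5
print(len(iob_counts))   # number of IOB tags -> 3
print(tag_counts.most_common(2))  # most frequent tags and their counts
```

The real script additionally prints each tag with its count, which is what the `--sort count --reverse` option reorders.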

docs/analyze_tagger_coverage.rst

Lines changed: 3 additions & 3 deletions
@@ -1,17 +1,17 @@
 Analyzing Tagger Coverage
 -------------------------
 
-The ``analyze_tagger_coverage.py`` script will run a part-of-speech tagger on a corpus to determine how many times each tag is found. Example output can be found in `Analyzing Tagged Corpora and NLTK Part of Speech Taggers <http://streamhacker.com/2011/03/23/analyzing-tagged-corpora-nltk-part-speech-taggers/>`_.
+The ``analyze_tagger_coverage.py`` script will run a part-of-speech tagger over a corpus to determine how many times each tag is found. Example output can be found in `Analyzing Tagged Corpora and NLTK Part of Speech Taggers <http://streamhacker.com/2011/03/23/analyzing-tagged-corpora-nltk-part-speech-taggers/>`_.
 
 Here's an example using the NLTK default tagger on the treebank corpus::
 	``python analyze_tagger_coverage.py treebank``
 
 To get detailed metrics on each tag, you can use the ``--metrics`` option. This requires using a tagged corpus in order to compare actual tags against tags found by the tagger. See `NLTK Default Tagger Treebank Tag Coverage <http://streamhacker.com/2011/01/24/nltk-default-tagger-treebank-tag-coverage/>`_ and `NLTK Default Tagger CoNLL2000 Tag Coverage <http://streamhacker.com/2011/01/25/nltk-default-tagger-conll2000-tag-coverage/>`_ for examples and statistics.
 
-To analyze the coverage of a different tagger, use the ``--tagger`` option with a path to the pickled tagger::
+The default tagger used is NLTK's default tagger. To analyze the coverage using a different tagger, use the ``--tagger`` option with a path to the pickled tagger, as in::
 	``python analyze_tagger_coverage.py treebank --tagger /path/to/tagger.pickle``
 
-To analyze coverage on a custom corpus, whose fileids end in ".pos", using a `TaggedCorpusReader <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.tagged.TaggedCorpusReader-class.html>`_::
+You can also analyze tagger coverage over a custom corpus. For example, with a corpus whose fileids end in ".pos", you can use a `TaggedCorpusReader <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.tagged.TaggedCorpusReader-class.html>`_::
 	``python analyze_tagger_coverage.py /path/to/corpus --reader nltk.corpus.reader.tagged.TaggedCorpusReader --fileids '.+\.pos'``
 
 The corpus path can be absolute, or relative to a nltk_data directory. For example, both ``corpora/treebank/tagged`` and ``/usr/share/nltk_data/corpora/treebank/tagged`` will work.
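Tag-coverage analysis boils down to tagging every word in a corpus and counting the resulting tags. A minimal sketch of that idea, where `LEXICON` and `tag_word` are toy stand-ins (not the script's actual tagger, which is a pickled NLTK tagger):

```python
from collections import Counter

# Toy lookup tagger -- purely illustrative; the real script loads a
# pickled NLTK tagger instead.
LEXICON = {"the": "DT", "dog": "NN", "barks": "VBZ"}

def tag_word(word):
    # Fall back to a placeholder tag for words the toy lexicon doesn't know.
    return LEXICON.get(word.lower(), "-NONE-")

corpus = ["The", "dog", "barks", "at", "the", "dog"]
coverage = Counter(tag_word(w) for w in corpus)

# Print each tag with its count, highest first, like --sort count --reverse.
for tag, count in sorted(coverage.items(), key=lambda kv: kv[1], reverse=True):
    print(tag, count)
```

The `--metrics` option goes one step further: for a tagged corpus it can compare these found tags against the gold-standard tags.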

tests/analyze_chunked_corpus.sh

Lines changed: 6 additions & 0 deletions
@@ -37,4 +37,10 @@ it_anayzes_treebank_tagged() {
 11994 unique words
 47 tags
 1 IOBs"
+}
+
+it_analyzes_treebank_chunk_sort_count_reverse() {
+	two_lines=$(./analyze_chunked_corpus.py treebank_chunk --sort count --reverse 2>&1 | head -n 10 | tail -n 2)
+	test "$two_lines" "=" "NN 13181 12832
+IN 9970 26"
 }
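Both new test functions use the same line-picking trick: `head -n 10 | tail -n 2` keeps the first ten lines of the script's output and then the last two of those, i.e. lines 9 and 10. A sketch of the equivalent slice in Python, run on hypothetical numbered output rather than the real script's:

```python
# Hypothetical 20-line output standing in for the analyze script's report.
output = "\n".join("line %d" % i for i in range(1, 21))

# head -n 10 | tail -n 2  ==  first ten lines, then the last two of those.
two_lines = "\n".join(output.splitlines()[:10][-2:])
print(two_lines)  # line 9
                  # line 10
```

Picking lines by position keeps the tests cheap, but it does make them sensitive to any change in the script's header lines.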

tests/analyze_tagged_corpus.sh

Lines changed: 6 additions & 0 deletions
@@ -43,4 +43,10 @@ it_anayzes_treebank_simplified_tags() {
 100676 total words
 12408 unique words
 31 tags"
+}
+
+it_analyzes_treebank_sort_count_reverse() {
+	two_lines=$(./analyze_tagged_corpus.py treebank --sort count --reverse 2>&1 | head -n 9 | tail -n 2)
+	test "$two_lines" "=" "NN 13166
+IN 9857"
 }

train_classifier.py

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@
 ~/nltk_data/classifiers''')
 parser.add_argument('--no-pickle', action='store_true', default=False,
 	help="don't pickle and save the classifier")
-parser.add_argument('--classifier', '--algorithm', default='NaiveBayes', nargs='+',
+parser.add_argument('--classifier', '--algorithm', default=['NaiveBayes'], nargs='+',
 	choices=nltk_trainer.classification.args.classifier_choices,
 	help='''Classifier algorithm to use, defaults to %(default)s. Maxent uses the
 	default Maxent training algorithm, either CG or iis.''')
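The one-character-pair fix above matters because argparse returns a list whenever an option declared with `nargs='+'` is given on the command line, but it passes a `default` through unchanged. With the old plain-string default, `args.classifier` would be the string `'NaiveBayes'` in the default case, so code looping over it would see individual characters. A minimal reproduction (the `choices` and help text from the real script are omitted):

```python
import argparse

# With nargs='+', parsed values are lists, so the default must be a list too.
parser = argparse.ArgumentParser()
parser.add_argument('--classifier', '--algorithm', default=['NaiveBayes'], nargs='+')

print(parser.parse_args([]).classifier)                          # ['NaiveBayes']
print(parser.parse_args(['--classifier', 'Maxent']).classifier)  # ['Maxent']
```

Either spelling of the option (`--classifier` or `--algorithm`) stores into `args.classifier`, and multiple algorithms can be listed after one flag.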
