
Installing NLTK

  1. Install NLTK: run sudo pip install -U nltk
  2. Install Numpy (optional): run sudo pip install -U numpy
  3. Test the installation: run python, then type import nltk
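
A quick way to verify the installation from the Python prompt:

```python
import nltk
print(nltk.__version__)  # prints the installed version if the import works
```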

For more information, check here.

NLTK dataset format

The input dataset for training the NER model must contain one token per line, each followed by its POS tag and its NER tag in the IOB tagging scheme. A blank line separates sentences. Example from the CoNLL 2002 Dutch corpus:

"Eddy Bonte is woordvoerder van Hogeschool."

Eddy N B-PER

Bonte N I-PER

is V O

woordvoerder N O

van Prep O

diezelfde Pron O

Hogeschool N B-ORG

. Punc O
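
As a sanity check, NLTK can parse a block in this format directly; a minimal sketch using nltk.chunk.conllstr2tree (chunk_types selects which entity tags to keep):

```python
import nltk

iob_block = """Eddy N B-PER
Bonte N I-PER
is V O
woordvoerder N O
van Prep O
diezelfde Pron O
Hogeschool N B-ORG
. Punc O"""

# Parse the IOB block into an nltk.Tree with PER and ORG chunks
tree = nltk.chunk.conllstr2tree(iob_block, chunk_types=("PER", "ORG"))
print(tree)
```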

Convert the HAREM dataset to the NLTK input format

In order to train a NER model with NLTK, the HAREM dataset had to be converted to the format described above.

Steps:

  1. Tokenize and POS-tag
    1. Remove all tags from the HAREM dataset so the file can be tokenized
    2. Tokenize the dataset using nltk.tokenize.word_tokenize (script)
    3. Using the resulting tokenized text, perform POS tagging (a tagger sketch follows this list)
      1. Train a POS model using the floresta corpus from nltk.corpus (script)
      2. Tag the file resulting from step 2 (script)
  2. Tokenize while keeping the entities
    1. Tokenize the dataset using nltk.tokenize.word_tokenize (script)
    2. Join the tokenized entity tags
    3. Transform the dataset, matching the entity tags with regular expressions and assigning a tag to each token
      1. Give the token following an entity tag the B- tag
      2. Give a token following the first entity token (previous step), or following an Inside (I) token, the I- tag
      3. Give all other tokens the O tag
  3. Join the POS-tagged file with the entity-tagged file (script)
    1. Iterate through both files simultaneously
    2. For each line, write token POS-tag IOB-entity-tag
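
A minimal sketch of step 1.3.1, training a POS tagger on the floresta corpus with NLTK's backoff taggers (the tag-simplification rule and the sample sentence are illustrative, not the exact script used):

```python
import nltk
from nltk.corpus import floresta  # requires: nltk.download("floresta")

def simplify_tag(tag):
    # Floresta tags look like "H+art" or "P+v-fin"; keep the part after "+"
    return tag.split("+")[-1] if "+" in tag else tag

tagged_sents = [
    [(word.lower(), simplify_tag(tag)) for (word, tag) in sent]
    for sent in floresta.tagged_sents() if sent
]

# Unigram tagger over a default-tag backoff, with a bigram tagger on top
t0 = nltk.DefaultTagger("n")  # fall back to "n" (noun) for unseen words
t1 = nltk.UnigramTagger(tagged_sents, backoff=t0)
t2 = nltk.BigramTagger(tagged_sents, backoff=t1)

print(t2.tag("o reitor falou na universidade .".split()))
```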

Tokenization with and without categories was performed simultaneously. The golden test data was also created in the CoNLL format (script).

Note: to obtain sentence segmentation, sentence segmentation was first performed with NLTK in the OpenNLP format. A special tag (--SENTENCE--) was then added to the file and later replaced with a blank line.
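
A sketch of that step, assuming the raw text lives in harem_raw.txt (file names and the exact tag handling are illustrative); NLTK ships a pre-trained Portuguese Punkt sentence model:

```python
import nltk

# Pre-trained Portuguese sentence tokenizer (requires the "punkt" data package)
sent_tok = nltk.data.load("tokenizers/punkt/portuguese.pickle")

with open("harem_raw.txt", encoding="utf-8") as f:
    text = f.read()

# Mark each sentence boundary with the special tag...
marked = "\n--SENTENCE--\n".join(sent_tok.tokenize(text))

# ...then replace the tag with a blank line, giving one sentence per block
with open("harem_sentences.txt", "w", encoding="utf-8") as f:
    f.write(marked.replace("--SENTENCE--", ""))
```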

Train NER model with NLTK

To train a NER chunking model, nltk-trainer was used.

Steps:

  1. Download nltk-trainer
  2. (Optional) Put the training data in the nltk-data folder
  3. Run the training command using train_chunker.py (command script):
    1. python train_chunker.py <path-to-training-file> [--fileids <fileids>] [--reader <reader>] [--classifier <classifier>]
      1. path-to-training-file: the path to the training file (or files), relative to the nltk-data folder or the current path (the file must be UTF-8 encoded)
      2. fileids: a regular expression matching the files inside path-to-training-file (if no expression is given, all files are used)
      3. reader: the corpus reader to use. In my case, since the corpus was in the CoNLL 2002 IOB format, I chose nltk.corpus.reader.conll.ConllChunkCorpusReader (a reader sketch follows this list). Note: for this reader I had to specify the categories used in nltk-trainer's __init__.py, using a script
      4. classifier: the classifier to use; options: Maxent, DecisionTree, NaiveBayes
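
A sketch of how this reader loads the converted corpus (the path, the fileids regex, and the category list are illustrative; HAREM defines more categories than the three shown):

```python
from nltk.corpus.reader.conll import ConllChunkCorpusReader

# chunk_types must list the entity categories used in the IOB tags,
# e.g. B-PESSOA/I-PESSOA -> "PESSOA"
reader = ConllChunkCorpusReader(
    "corpora/harem",     # hypothetical path under the nltk-data folder
    r".*\.conll",        # fileids regex
    ("PESSOA", "LOCAL", "ORGANIZACAO"),
    encoding="utf-8",
)
print(reader.chunked_sents()[0])  # first sentence as an nltk.Tree
```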

Note: conversion from 'ISO-8859-1' to 'UTF-8' encoding: iconv -f ISO-8859-1 -t UTF-8 <input> > <output>

Check the folder for more information.

Perform NER

Steps:

  1. Load the chunker model using pickle: pickle.load(open(model_path, 'rb'))
  2. Load the input dataset (already tokenized and POS-tagged, as in the training step)
  3. Perform NER: chunker.parse(tagged)
  4. The parser returns the result as a tree, which was converted to the CoNLL format using nltk.chunk.util.tree2conlltags(ner_result)
  5. Output to file (a combined sketch follows this list)
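
A combined sketch of these steps (the model path and the sample sentence are illustrative):

```python
import pickle
import nltk

# 1. Load the trained chunker model
with open("harem_chunker.pickle", "rb") as f:
    chunker = pickle.load(f)

# 2. One already-tokenized, POS-tagged sentence
tagged = [("Eddy", "N"), ("Bonte", "N"), ("is", "V"),
          ("woordvoerder", "N"), (".", "Punc")]

# 3. Perform NER; parse() returns an nltk.Tree
tree = chunker.parse(tagged)

# 4.-5. Convert the tree back to CoNLL IOB triples and write them out
with open("ner_output.conll", "w", encoding="utf-8") as out:
    for token, pos, iob in nltk.chunk.util.tree2conlltags(tree):
        out.write(f"{token} {pos} {iob}\n")
```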

Check the script here.

Average results:

Check all results here.

Results after 4 repeats:

NaiveBayes

| Level | Precision | Recall | F-measure |
| --- | --- | --- | --- |
| Categories | 30.58% | 31.38% | 30.97% |
| Types | 29.66% | 28.01% | 28.82% |
| Subtypes | 21.15% | 22.72% | 21.91% |
| Filtered | 29.55% | 35.17% | 32.12% |

MaxEnt

| Level | Precision | Recall | F-measure |
| --- | --- | --- | --- |
| Categories | 18.19% | 0.58% | 1.13% |
| Types | 9.84% | 1.07% | 1.93% |
| Subtypes | 0.19% | 0.28% | 0.23% |
| Filtered | 13.36% | 0.30% | 0.60% |

DecisionTree

| Level | Precision | Recall | F-measure |
| --- | --- | --- | --- |
| Categories | 21.84% | 25.72% | 23.62% |
| Types | 25.37% | 24.34% | 24.84% |
| Subtypes | 27.71% | 35.81% | 31.25% |
| Filtered | 21.27% | 31.49% | 25.39% |

Note: to ensure correct evaluation results, I used a script to check for tokenization differences between the output and the golden data. Where differences existed, I changed the files manually.
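
A minimal version of that check, assuming output and gold files with one token per line (file names illustrative):

```python
# Report tokenization mismatches between system output and gold data
with open("ner_output.conll", encoding="utf-8") as out_f, \
     open("golden.conll", encoding="utf-8") as gold_f:
    for i, (o, g) in enumerate(zip(out_f, gold_f), start=1):
        o_tok = o.split()[0] if o.split() else ""
        g_tok = g.split()[0] if g.split() else ""
        if o_tok != g_tok:
            print(f"line {i}: output {o_tok!r} != gold {g_tok!r}")
```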

Training time

| Classifier | Categories | Types | Subtypes | Filtered | All (without filtered) |
| --- | --- | --- | --- | --- | --- |
| Naive Bayes | 2s | 2s | 2s | 2s | 4m19s |
| Maximum Entropy | 1m56s | 5m23s | 4m25s | 1m12s | 7h50m |
| Decision Tree | 5m55s | 5m54s | 5m52s | 5m58s | 11h47m |

All (with filtered): 24h30m

Hyperparameter study

For this tool, I decided to check the influence of several hyperparameters: max_iter and min_lldelta for MaxEnt (it also supports min_ll), and entropy_cutoff, depth_cutoff, and support_cutoff for DecisionTree. The results are the following:
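
These names match the command-line flags that nltk-trainer passes through to the underlying NLTK classifiers; an illustrative invocation (verify the exact flag names with python train_chunker.py --help) is python train_chunker.py <corpus> --classifier Maxent --max_iter 100 --min_lldelta 0.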

MaxEnt

max_iter (default: 10)

| Value | Categories | Types | Subtypes | Filtered |
| --- | --- | --- | --- | --- |
| 10 | 1.11% | 1.68% | 0.29% | 0.56% |
| All (10-120) | 1.11% | 1.68% | 0.29% | 0.56% |

min_lldelta (default: 0.1)

| Value | Categories | Types | Subtypes | Filtered |
| --- | --- | --- | --- | --- |
| 0 | 22.28% | 1.68% | 0.29% | 23.97% |
| 0.0000001 | 22.28% | 1.68% | 0.29% | 23.97% |
| 0.000001 | 22.28% | 1.68% | 0.29% | 23.97% |
| 0.00001 | 22.28% | 1.68% | 0.29% | 23.97% |
| 0.0001 | 22.28% | 1.68% | 0.29% | 23.97% |
| 0.001 | 22.28% | 1.68% | 0.29% | 23.97% |
| 0.01 | 22.28% | 1.68% | 0.29% | 23.55% |
| 0.05 | 1.11% | 1.68% | 0.29% | 0.56% |
| 0.1 | 1.11% | 1.68% | 0.29% | 0.56% |
| 0.15 | 1.11% | 1.68% | 0.29% | 0.56% |
| 0.2 | 1.11% | 1.68% | 0.29% | 0.56% |

min_lldelta with max_iter = 100 (default: 0.1)

| Value | Categories | Types | Subtypes | Filtered |
| --- | --- | --- | --- | --- |
| 0 | 35.24% | 1.68% | 0.29% | 38.30% |
| 0.0000001 | 35.24% | 1.68% | 0.29% | 38.30% |
| 0.000001 | 35.24% | 1.68% | 0.29% | 38.30% |
| 0.00001 | 35.24% | 1.68% | 0.29% | 38.30% |
| 0.0001 | 35.24% | 1.68% | 0.29% | 38.30% |
| 0.001 | 32.69% | 1.68% | 0.29% | 35.30% |
| 0.01 | 24.40% | 1.68% | 0.29% | 23.55% |
| 0.05 | 1.11% | 1.68% | 0.29% | 0.56% |
| 0.1 | 1.11% | 1.68% | 0.29% | 0.56% |
| 0.15 | 1.11% | 1.68% | 0.29% | 0.56% |
| 0.2 | 1.11% | 1.68% | 0.29% | 0.56% |

DecisionTree

support_cutoff (default: 10)

| Value | Categories | Types | Subtypes | Filtered |
| --- | --- | --- | --- | --- |
| 3 | 26.12% | 24.25% | 32.59% | 28.87% |
| 7 | 26.14% | 24.25% | 32.62% | 28.85% |
| 8 | 26.14% | 24.25% | 32.61% | 28.85% |
| 9 | 26.14% | 24.24% | 32.61% | 28.85% |
| 10 | 26.14% | 24.24% | 32.61% | 28.85% |
| 11 | 26.14% | 24.25% | 32.63% | 28.85% |
| 12 | 26.14% | 24.28% | 32.60% | 28.83% |
| 13 | 26.13% | 24.30% | 32.63% | 28.84% |
| 14 | 26.13% | 24.31% | 32.63% | 28.84% |
| 15 | 26.17% | 24.28% | 32.50% | 28.86% |
| 16 | 26.18% | 24.27% | 32.50% | 28.87% |
| 17 | 26.18% | 24.27% | 32.46% | 28.86% |
| 18 | 26.16% | 24.29% | 32.46% | 28.84% |
| 19 | 26.16% | 24.27% | 32.47% | 28.84% |
| 20 | 26.14% | 24.28% | 32.47% | 28.84% |

depth_cutoff (default: 100)

| Value | Categories | Types | Subtypes | Filtered |
| --- | --- | --- | --- | --- |
| 2 | 26.02% | 24.15% | 32.54% | 28.65% |
| 100 | 26.14% | 24.24% | 32.61% | 28.85% |
| 5, 10-120 (all) | 26.14% | 24.24% | 32.61% | 28.85% |

entropy_cutoff (default: 0.05)

| Value | Categories | Types | Subtypes | Filtered |
| --- | --- | --- | --- | --- |
| 0.03 | 26.14% | 24.24% | 32.60% | 28.82% |
| 0.04 | 26.14% | 24.24% | 32.61% | 28.85% |
| 0.05 | 26.14% | 24.24% | 32.61% | 28.85% |
| 0.06 | 26.19% | 24.24% | 32.61% | 28.85% |
| 0.07 | 26.19% | 24.25% | 32.62% | 28.85% |
| 0.08 | 26.36% | 24.29% | 32.69% | 28.85% |
| 0.09 | 26.36% | 24.29% | 32.70% | 28.85% |
| 0.10 | 26.36% | 24.29% | 32.70% | 28.83% |
| 0.11 | 26.36% | 24.28% | 32.70% | 28.77% |
| 0.12 | 26.36% | 24.28% | 32.65% | 28.58% |
| 0.13 | 26.36% | 24.28% | 32.65% | 28.59% |

Results for SIGARRA News Corpus

Repeated holdout

| Classifier | Precision | Recall | F-measure | Params |
| --- | --- | --- | --- | --- |
| NaiveBayes | 52.88% | 60.75% | 56.54% | - |
| DecisionTree | 60.37% | 69.44% | 64.59% | entropy_cutoff=0.08, support_cutoff=16 |
| DecisionTree | 60.50% | 69.53% | 64.70% | default |
| MaxEnt | 64.75% | 52.95% | 58.26% | max_iter=100, min_lldelta=0 |
| MaxEnt | 14.91% | 2.65% | 4.51% | default |

Repeated 10-fold cross validation

| Classifier | Precision | Recall | F-measure | Params |
| --- | --- | --- | --- | --- |
| NaiveBayes | 54.47% | 62.86% | 58.36% | - |
| DecisionTree | 55.93% | 70.21% | 62.26% | entropy_cutoff=0.08, support_cutoff=16 |
| DecisionTree | 56.03% | 70.32% | 62.37% | default |
| MaxEnt | 45.30% | 33.47% | 38.49% | max_iter=100, min_lldelta=0 |
| MaxEnt | 16.29% | 3.03% | 5.11% | default |

Note: MaxEnt could do better (close to 75%), but it overflowed in repeat 1.

Resources

Get the generated models on the Resources page.