NLTK
- Install NLTK: run
`sudo pip install -U nltk`
- Install NumPy (optional): run
`sudo pip install -U numpy`
- Test the installation: run
`python`
then type `import nltk`
For more information check here.
The input dataset for training the NER model must contain one token per line, together with its POS tag and its NER tag in the IOB tagging scheme. Sentences are separated by a blank line. Example from the CoNLL 2002 Dutch corpus:
"Eddy Bonte is woordvoerder van diezelfde Hogeschool."
Eddy N B-PER
Bonte N I-PER
is V O
woordvoerder N O
van Prep O
diezelfde Pron O
Hogeschool N B-ORG
. Punc O
In order to train an NER model in NLTK, the HAREM dataset had to be converted to this format.
Steps:
- Tokenize and POS tagging
- Tokenize keeping the entities
- Tokenize dataset using nltk.tokenize.word_tokenize (script)
- Join tokenized entity tags
- Transform dataset, matching the entity tags using regex, giving tags to each token
- Give the B- tag to the token right after an opening entity tag
- Give the I- tag to tokens following a B-tagged token (previous step) or an I-tagged token
- Give the O tag to all other tokens
- Join POS tagged file with entity tagged file (script)
- Iterate through both files simultaneously
- For each line set token POS-tag IOB-entity-tag
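The B/I/O assignment described above can be sketched as follows. This is a simplified illustration, not the project's actual regex-based script: here the entities are assumed to be already available as token sequences with their categories, and `iob_tag` is a hypothetical helper name.

```python
# Sketch: assign IOB tags to a sentence's tokens, given the entity token
# sequences and their categories (stand-in for the regex matching step).

def iob_tag(tokens, entities):
    """entities: list of (token_list, label) pairs occurring in the sentence."""
    tags = ["O"] * len(tokens)
    for ent_tokens, label in entities:
        n = len(ent_tokens)
        for i in range(len(tokens) - n + 1):
            # match the entity span against still-untagged tokens
            if tokens[i:i + n] == ent_tokens and all(t == "O" for t in tags[i:i + n]):
                tags[i] = "B-" + label            # first entity token
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label        # inside tokens
                break
    return list(zip(tokens, tags))

tokens = ["Eddy", "Bonte", "is", "woordvoerder", "van", "diezelfde", "Hogeschool", "."]
entities = [(["Eddy", "Bonte"], "PER"), (["Hogeschool"], "ORG")]
print(iob_tag(tokens, entities))
```

Joining the result with the POS-tagged file then only requires iterating over both line by line, as in the last step above.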
Tokenization with and without categories was done in a single pass. The golden test data was also created in the CoNLL format (script)
Note: to obtain sentence segmentation, sentence segmentation was first performed with NLTK in the OpenNLP format. A special tag (--SENTENCE--) was then added to that file and later replaced with a blank line.
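The marker trick above can be sketched as below. To keep the snippet self-contained (NLTK's `sent_tokenize` needs the punkt model downloaded), a naive regex splitter stands in for NLTK's sentence segmenter; the marker handling is the same either way.

```python
import re

# Stand-in for nltk.sent_tokenize so this sketch needs no punkt model;
# the project used NLTK's segmenter here.
def segment(text):
    return re.split(r"(?<=[.!?])\s+", text.strip())

text = "Eddy Bonte is woordvoerder. Hij werkt bij diezelfde Hogeschool."

# Join sentences with the special --SENTENCE-- marker...
marked = " --SENTENCE-- ".join(segment(text))

# ...and later replace each marker with the blank line that separates
# sentences in the CoNLL format:
conll_text = marked.replace(" --SENTENCE-- ", "\n\n")
print(conll_text)
```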
To train a NER chunking model, NLTK trainer was used.
Steps:
- Download NLTK trainer
- (Optional) Put the training data in the nltk-data folder
- Run command for training using train_chunker.py (command script):
python train_chunker.py <path-to-training-file> [--fileids <fileids>] [--reader <reader>] [--classifier <classifier>]
- path-to-training-file: path to the training file (or files), relative to the nltk-data folder or the current path (the file must be UTF-8 encoded)
- fileids: regular expression matching the files inside path-to-training-file (if no expression is given, all files are used)
- reader: the corpus reader to use. In my case, since the corpus was in the CoNLL 2002 IOB format, I chose nltk.corpus.reader.conll.ConllChunkCorpusReader. Note: for this reader I had to specify the categories used in the __init__.py file of nltk-trainer, using a script
- classifier: the classifier to use; options: Maxent, DecisionTree, NaiveBayes
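For reference, this is roughly how the reader mentioned above consumes a training file. The file name, temporary directory, and the single `PER` chunk type are illustrative; the real HAREM categories differ.

```python
# Sketch: load an IOB-formatted training file with ConllChunkCorpusReader.
import os
import tempfile
from nltk.corpus.reader.conll import ConllChunkCorpusReader

sample = """Eddy N B-PER
Bonte N I-PER
is V O
woordvoerder N O
. Punc O
"""

root = tempfile.mkdtemp()
with open(os.path.join(root, "train.iob"), "w", encoding="utf-8") as f:
    f.write(sample)

# chunk_types must list every entity category used in the file
reader = ConllChunkCorpusReader(root, r".*\.iob", chunk_types=("PER",))
print(reader.iob_sents()[0][:2])
```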
Note: to convert from ISO-8859-1 to UTF-8 encoding: `iconv -f ISO-8859-1 -t UTF-8 <input> > <output>`
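The same conversion can be done in Python when iconv is not available; this sketch uses temporary files as stand-ins for the real input and output paths.

```python
# Sketch: ISO-8859-1 -> UTF-8 conversion, equivalent to the iconv one-liner.
import os
import tempfile

root = tempfile.mkdtemp()
latin_path = os.path.join(root, "input.txt")
utf8_path = os.path.join(root, "output.txt")

# Create a Latin-1 file with accented characters (HAREM is Portuguese)
with open(latin_path, "w", encoding="ISO-8859-1") as f:
    f.write("lição")

# Read with the source encoding, write with the target encoding
with open(latin_path, encoding="ISO-8859-1") as src, \
     open(utf8_path, "w", encoding="UTF-8") as dst:
    dst.write(src.read())
```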
Check folder for more information.
To perform NER with the trained model, the steps are:
- Load the chunker model using pickle:
pickle.load(open(model_path, 'rb'))
- Load the input dataset (already tokenized and POS-tagged in the training step)
- Perform NER:
chunker.parse(tagged)
- The parser returns the result in a tree format, which was converted to the CoNLL format using
nltk.chunk.util.tree2conlltags(ner_result)
- Output to file
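The steps above can be sketched end to end as follows. A toy `nltk.RegexpParser` stands in for the real trained chunker so the snippet is self-contained; the pickle round trip, `parse` call, and `tree2conlltags` conversion are the same with the real model.

```python
# Sketch of the application pipeline: load model, chunk, convert to CoNLL.
import pickle
import nltk
from nltk.chunk.util import tree2conlltags

# Stand-in model: in the real pipeline this pickle holds the trained chunker.
blob = pickle.dumps(nltk.RegexpParser("PER: {<N><N>}"))

chunker = pickle.loads(blob)                        # step 1: load the model
tagged = [("Eddy", "N"), ("Bonte", "N"), ("is", "V")]  # tokenized + POS-tagged
tree = chunker.parse(tagged)                        # step 2: perform NER
iob = tree2conlltags(tree)                          # step 3: tree -> CoNLL triples

# step 4: output one "token POS IOB" line per token
lines = "\n".join(" ".join(triple) for triple in iob)
print(lines)
```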
Check script here.
Check all results here.
Results after 4 repeats:
Level | Precision | Recall | F-measure |
---|---|---|---|
Categories | 30.58% | 31.38% | 30.97% |
Types | 29.66% | 28.01% | 28.82% |
Subtypes | 21.15% | 22.72% | 21.91% |
Filtered | 29.55% | 35.17% | 32.12% |
Level | Precision | Recall | F-measure |
---|---|---|---|
Categories | 18.19% | 0.58% | 1.13% |
Types | 9.84% | 1.07% | 1.93% |
Subtypes | 0.19% | 0.28% | 0.23% |
Filtered | 13.36% | 0.30% | 0.60% |
Level | Precision | Recall | F-measure |
---|---|---|---|
Categories | 21.84% | 25.72% | 23.62% |
Types | 25.37% | 24.34% | 24.84% |
Subtypes | 27.71% | 35.81% | 31.25% |
Filtered | 21.27% | 31.49% | 25.39% |
Note: to ensure correct evaluation results, I used a script to check for tokenization differences between the output and the golden data. Where differences existed, I fixed the files manually.
Classifier | Categories | Types | Subtypes | Filtered | All (without filtered) |
---|---|---|---|---|---|
Naive Bayes | 2s | 2s | 2s | 2s | 4m19s |
Maximum Entropy | 1m56s | 5m23s | 4m25s | 1m12s | 7h50m |
Decision Tree | 5m55s | 5m54s | 5m52s | 5m58s | 11h47m |
All with filtered: 24h30m
For this tool, I checked the influence of several hyperparameters: max_iter and min_lldelta for MaxEnt (it also supports min_ll); entropy_cutoff, depth_cutoff and support_cutoff for DecisionTree. The results are as follows:
Max_iter (default: 10)
Value | Categories | Types | Subtypes | Filtered |
---|---|---|---|---|
10 | 1.11% | 1.68% | 0.29% | 0.56% |
All (10-120) | 1.11% | 1.68% | 0.29% | 0.56% |
min_lldelta (default: 0.1)
Value | Categories | Types | Subtypes | Filtered |
---|---|---|---|---|
0 | 22.28% | 1.68% | 0.29% | 23.97% |
0.0000001 | 22.28% | 1.68% | 0.29% | 23.97% |
0.000001 | 22.28% | 1.68% | 0.29% | 23.97% |
0.00001 | 22.28% | 1.68% | 0.29% | 23.97% |
0.0001 | 22.28% | 1.68% | 0.29% | 23.97% |
0.001 | 22.28% | 1.68% | 0.29% | 23.97% |
0.01 | 22.28% | 1.68% | 0.29% | 23.55% |
0.05 | 1.11% | 1.68% | 0.29% | 0.56% |
0.1 | 1.11% | 1.68% | 0.29% | 0.56% |
0.15 | 1.11% | 1.68% | 0.29% | 0.56% |
0.2 | 1.11% | 1.68% | 0.29% | 0.56% |
min_lldelta - with iterations = 100 (default: 0.1)
Value | Categories | Types | Subtypes | Filtered |
---|---|---|---|---|
0 | 35.24% | 1.68% | 0.29% | 38.30% |
0.0000001 | 35.24% | 1.68% | 0.29% | 38.30% |
0.000001 | 35.24% | 1.68% | 0.29% | 38.30% |
0.00001 | 35.24% | 1.68% | 0.29% | 38.30% |
0.0001 | 35.24% | 1.68% | 0.29% | 38.30% |
0.001 | 32.69% | 1.68% | 0.29% | 35.30% |
0.01 | 24.40% | 1.68% | 0.29% | 23.55% |
0.05 | 1.11% | 1.68% | 0.29% | 0.56% |
0.1 | 1.11% | 1.68% | 0.29% | 0.56% |
0.15 | 1.11% | 1.68% | 0.29% | 0.56% |
0.2 | 1.11% | 1.68% | 0.29% | 0.56% |
support_cutoff (default: 10)
Value | Categories | Types | Subtypes | Filtered |
---|---|---|---|---|
3 | 26.12% | 24.25% | 32.59% | 28.87% |
7 | 26.14% | 24.25% | 32.62% | 28.85% |
8 | 26.14% | 24.25% | 32.61% | 28.85% |
9 | 26.14% | 24.24% | 32.61% | 28.85% |
10 | 26.14% | 24.24% | 32.61% | 28.85% |
11 | 26.14% | 24.25% | 32.63% | 28.85% |
12 | 26.14% | 24.28% | 32.60% | 28.83% |
13 | 26.13% | 24.30% | 32.63% | 28.84% |
14 | 26.13% | 24.31% | 32.63% | 28.84% |
15 | 26.17% | 24.28% | 32.50% | 28.86% |
16 | 26.18% | 24.27% | 32.50% | 28.87% |
17 | 26.18% | 24.27% | 32.46% | 28.86% |
18 | 26.16% | 24.29% | 32.46% | 28.84% |
19 | 26.16% | 24.27% | 32.47% | 28.84% |
20 | 26.14% | 24.28% | 32.47% | 28.84% |
depth_cutoff (default: 100)
Value | Categories | Types | Subtypes | Filtered |
---|---|---|---|---|
2 | 26.02% | 24.15% | 32.54% | 28.65% |
100 | 26.14% | 24.24% | 32.61% | 28.85% |
5, 10-120 (All) | 26.14% | 24.24% | 32.61% | 28.85% |
entropy_cutoff (default: 0.05)
Value | Categories | Types | Subtypes | Filtered |
---|---|---|---|---|
0.03 | 26.14% | 24.24% | 32.60% | 28.82% |
0.04 | 26.14% | 24.24% | 32.61% | 28.85% |
0.05 | 26.14% | 24.24% | 32.61% | 28.85% |
0.06 | 26.19% | 24.24% | 32.61% | 28.85% |
0.07 | 26.19% | 24.25% | 32.62% | 28.85% |
0.08 | 26.36% | 24.29% | 32.69% | 28.85% |
0.09 | 26.36% | 24.29% | 32.70% | 28.85% |
0.10 | 26.36% | 24.29% | 32.70% | 28.83% |
0.11 | 26.36% | 24.28% | 32.70% | 28.77% |
0.12 | 26.36% | 24.28% | 32.65% | 28.58% |
0.13 | 26.36% | 24.28% | 32.65% | 28.59% |
Repeated holdout
Classifier | Precision | Recall | F-measure | Params |
---|---|---|---|---|
NaiveBayes | 52.88% | 60.75% | 56.54% | - |
DecisionTree | 60.37% | 69.44% | 64.59% | Entropy_cutoff=0.08, Support_cutoff=16 |
DecisionTree | 60.50% | 69.53% | 64.70% | default |
MaxEnt | 64.75% | 52.95% | 58.26% | Iterations=100, min_lldelta=0 |
MaxEnt | 14.91% | 2.65% | 4.51% | default |
Repeated 10-fold cross validation
Classifier | Precision | Recall | F-measure | Params |
---|---|---|---|---|
NaiveBayes | 54.47% | 62.86% | 58.36% | - |
DecisionTree | 55.93% | 70.21% | 62.26% | Entropy_cutoff=0.08, Support_cutoff=16 |
DecisionTree | 56.03% | 70.32% | 62.37% | default |
MaxEnt | 45.30% | 33.47% | 38.49% | Iterations=100, min_lldelta=0 |
MaxEnt | 16.29% | 3.03% | 5.11% | default |
Note: MaxEnt could do better (close to 75%), but it overflowed in repeat 1.
Get the generated models in the Resources page.