NLTK
- Install NLTK: run
`sudo pip install -U nltk`
- Install NumPy (optional): run
`sudo pip install -U numpy`
- Test the installation: run
`python`
then type `import nltk`
For more information check here.
The input dataset for training the NER model must contain one token per line, together with its POS tag and its NER tag in the IOB tagging scheme. Sentences are separated by a blank line. Example from the CoNLL 2002 Dutch corpus:
"Eddy Bonte is woordvoerder van diezelfde Hogeschool."
Eddy N B-PER
Bonte N I-PER
is V O
woordvoerder N O
van Prep O
diezelfde Pron O
Hogeschool N B-ORG
. Punc O
In order to train an NER model in NLTK, the HAREM dataset had to be converted to this format.
Steps:
- Tokenize and POS tagging
- Tokenize keeping the entities
- Tokenize dataset using nltk.tokenize.word_tokenize (script)
- Join tokenized entity tags
- Transform dataset, matching the entity tags using regex, giving tags to each token
- Give the B- tag to the token right after an opening entity tag
- Give the I- tag to tokens following a B-tagged token (previous step) or an I-tagged token
- Give the O tag to all other tokens
- Join POS tagged file with entity tagged file (script)
- Iterate through both files simultaneously
- For each line set token POS-tag IOB-entity-tag
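The B/I/O assignment described above can be sketched as follows. This is a simplified illustration, not the project's actual regex-based script: here the entities are assumed to be already available as token sequences with their categories, and `iob_tag` is a hypothetical helper name.

```python
# Sketch: assign IOB tags to a sentence's tokens, given the entity token
# sequences and their categories (stand-in for the regex matching step).

def iob_tag(tokens, entities):
    """entities: list of (token_list, label) pairs occurring in the sentence."""
    tags = ["O"] * len(tokens)
    for ent_tokens, label in entities:
        n = len(ent_tokens)
        for i in range(len(tokens) - n + 1):
            # match the entity span against still-untagged tokens
            if tokens[i:i + n] == ent_tokens and all(t == "O" for t in tags[i:i + n]):
                tags[i] = "B-" + label            # first entity token
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label        # inside tokens
                break
    return list(zip(tokens, tags))

tokens = ["Eddy", "Bonte", "is", "woordvoerder", "van", "diezelfde", "Hogeschool", "."]
entities = [(["Eddy", "Bonte"], "PER"), (["Hogeschool"], "ORG")]
print(iob_tag(tokens, entities))
```

Joining the result with the POS-tagged file then only requires iterating over both line by line, as in the last step above.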
Tokenization with and without categories was done in a single pass. The golden test data was also created in the CoNLL format (script)
Note: to obtain sentence segmentation, sentence segmentation was first performed with NLTK in the OpenNLP format. A special tag (--SENTENCE--) was then added to that file and later replaced with a blank line.
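The marker trick above can be sketched as below. To keep the snippet self-contained (NLTK's `sent_tokenize` needs the punkt model downloaded), a naive regex splitter stands in for NLTK's sentence segmenter; the marker handling is the same either way.

```python
import re

# Stand-in for nltk.sent_tokenize so this sketch needs no punkt model;
# the project used NLTK's segmenter here.
def segment(text):
    return re.split(r"(?<=[.!?])\s+", text.strip())

text = "Eddy Bonte is woordvoerder. Hij werkt bij diezelfde Hogeschool."

# Join sentences with the special --SENTENCE-- marker...
marked = " --SENTENCE-- ".join(segment(text))

# ...and later replace each marker with the blank line that separates
# sentences in the CoNLL format:
conll_text = marked.replace(" --SENTENCE-- ", "\n\n")
print(conll_text)
```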
To train a NER chunking model, NLTK trainer was used.
Steps:
- Download NLTK trainer
- (Optional) Put the training data in the nltk-data folder
- Run command for training using train_chunker.py (command script):
python train_chunker.py <path-to-training-file> [--fileids <fileids>] [--reader <reader>] [--classifier <classifier>]
- path-to-training-file: path to the training file (or files), relative to the nltk-data folder or the current path (the file must be UTF-8 encoded)
- fileids: regular expression matching the files inside path-to-training-file (if no expression is given, all files are used)
- reader: the corpus reader to use. In my case, since the corpus was in the CoNLL 2002 IOB format, I chose nltk.corpus.reader.conll.ConllChunkCorpusReader. Note: for this reader I had to specify the categories used in the __init__.py file of nltk-trainer, using a script
- classifier: the classifier to use; options: Maxent, DecisionTree, NaiveBayes
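For reference, this is roughly how the reader mentioned above consumes a training file. The file name, temporary directory, and the single `PER` chunk type are illustrative; the real HAREM categories differ.

```python
# Sketch: load an IOB-formatted training file with ConllChunkCorpusReader.
import os
import tempfile
from nltk.corpus.reader.conll import ConllChunkCorpusReader

sample = """Eddy N B-PER
Bonte N I-PER
is V O
woordvoerder N O
. Punc O
"""

root = tempfile.mkdtemp()
with open(os.path.join(root, "train.iob"), "w", encoding="utf-8") as f:
    f.write(sample)

# chunk_types must list every entity category used in the file
reader = ConllChunkCorpusReader(root, r".*\.iob", chunk_types=("PER",))
print(reader.iob_sents()[0][:2])
```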
Note: to convert from ISO-8859-1 to UTF-8 encoding: `iconv -f ISO-8859-1 -t UTF-8 <input> > <output>`
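The same conversion can be done in Python when iconv is not available; this sketch uses temporary files as stand-ins for the real input and output paths.

```python
# Sketch: ISO-8859-1 -> UTF-8 conversion, equivalent to the iconv one-liner.
import os
import tempfile

root = tempfile.mkdtemp()
latin_path = os.path.join(root, "input.txt")
utf8_path = os.path.join(root, "output.txt")

# Create a Latin-1 file with accented characters (HAREM is Portuguese)
with open(latin_path, "w", encoding="ISO-8859-1") as f:
    f.write("lição")

# Read with the source encoding, write with the target encoding
with open(latin_path, encoding="ISO-8859-1") as src, \
     open(utf8_path, "w", encoding="UTF-8") as dst:
    dst.write(src.read())
```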
Check folder for more information.
To perform NER with the trained model, the steps are:
- Load the chunker model using pickle:
pickle.load(open(model_path, 'rb'))
- Load the input dataset (already tokenized and POS-tagged in the training step)
- Perform NER:
chunker.parse(tagged)
- The parser returns the result in a tree format, which was converted to the CoNLL format using
nltk.chunk.util.tree2conlltags(ner_result)
- Output to file
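The steps above can be sketched end to end as follows. A toy `nltk.RegexpParser` stands in for the real trained chunker so the snippet is self-contained; the pickle round trip, `parse` call, and `tree2conlltags` conversion are the same with the real model.

```python
# Sketch of the application pipeline: load model, chunk, convert to CoNLL.
import pickle
import nltk
from nltk.chunk.util import tree2conlltags

# Stand-in model: in the real pipeline this pickle holds the trained chunker.
blob = pickle.dumps(nltk.RegexpParser("PER: {<N><N>}"))

chunker = pickle.loads(blob)                        # step 1: load the model
tagged = [("Eddy", "N"), ("Bonte", "N"), ("is", "V")]  # tokenized + POS-tagged
tree = chunker.parse(tagged)                        # step 2: perform NER
iob = tree2conlltags(tree)                          # step 3: tree -> CoNLL triples

# step 4: output one "token POS IOB" line per token
lines = "\n".join(" ".join(triple) for triple in iob)
print(lines)
```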
Check script here.
Check all results here.
Results after 4 repeats:
Level | Precision | Recall | F-measure |
---|---|---|---|
Categories | 30.58% | 31.38% | 30.97% |
Types | 29.66% | 28.01% | 28.82% |
Subtypes | 21.15% | 22.72% | 21.91% |
Filtered | 29.55% | 35.17% | 32.12% |
Level | Precision | Recall | F-measure |
---|---|---|---|
Categories | 18.19% | 0.58% | 1.13% |
Types | 9.84% | 1.07% | 1.93% |
Subtypes | 0.19% | 0.28% | 0.23% |
Filtered | 13.36% | 0.30% | 0.60% |
Level | Precision | Recall | F-measure |
---|---|---|---|
Categories | 21.84% | 25.72% | 23.62% |
Types | 25.37% | 24.34% | 24.84% |
Subtypes | 27.71% | 35.81% | 31.25% |
Filtered | 21.27% | 31.49% | 25.39% |
Note: to ensure correct evaluation results, I used a script to check for tokenization differences between the output and the golden data. Where differences existed, I fixed the files manually.
Classifier | Categories | Types | Subtypes | Filtered | All (without filtered) |
---|---|---|---|---|---|
Naive Bayes | 2s | 2s | 2s | 2s | 4m19s |
Maximum Entropy | 1m56s | 5m23s | 4m25s | 1m12s | 7h50m |
Decision Tree | 5m55s | 5m54s | 5m52s | 5m58s | 11h47m |
All with filtered: 24h30m
For this tool, I checked the influence of several hyperparameters: max_iter and min_lldelta for MaxEnt (it also supports min_ll); entropy_cutoff, depth_cutoff and support_cutoff for DecisionTree. The results are as follows:
Max_iter (default: 10)
Value | Categories | Types | Subtypes | Filtered |
---|---|---|---|---|
10 | 1.11% | 1.68% | 0.29% | 0.56% |
All (10-120) | 1.11% | 1.68% | 0.29% | 0.56% |
min_lldelta (default: 0.1)
Value | Categories | Types | Subtypes | Filtered |
---|---|---|---|---|
0 | 22.28% | 1.68% | 0.29% | 23.97% |
0.0000001 | 22.28% | 1.68% | 0.29% | 23.97% |
0.000001 | 22.28% | 1.68% | 0.29% | 23.97% |
0.00001 | 22.28% | 1.68% | 0.29% | 23.97% |
0.0001 | 22.28% | 1.68% | 0.29% | 23.97% |
0.001 | 22.28% | 1.68% | 0.29% | 23.97% |
0.01 | 22.28% | 1.68% | 0.29% | 23.55% |
0.05 | 1.11% | 1.68% | 0.29% | 0.56% |
0.1 | 1.11% | 1.68% | 0.29% | 0.56% |
0.15 | 1.11% | 1.68% | 0.29% | 0.56% |
0.2 | 1.11% | 1.68% | 0.29% | 0.56% |
min_lldelta - with iterations = 100 (default: 0.1)
Value | Categories | Types | Subtypes | Filtered |
---|---|---|---|---|
0 | 35.24% | 1.68% | 0.29% | 38.30% |
0.0000001 | 35.24% | 1.68% | 0.29% | 38.30% |
0.000001 | 35.24% | 1.68% | 0.29% | 38.30% |
0.00001 | 35.24% | 1.68% | 0.29% | 38.30% |
0.0001 | 35.24% | 1.68% | 0.29% | 38.30% |
0.001 | 32.69% | 1.68% | 0.29% | 35.30% |
0.01 | 24.40% | 1.68% | 0.29% | 23.55% |
0.05 | 1.11% | 1.68% | 0.29% | 0.56% |
0.1 | 1.11% | 1.68% | 0.29% | 0.56% |
0.15 | 1.11% | 1.68% | 0.29% | 0.56% |
0.2 | 1.11% | 1.68% | 0.29% | 0.56% |
support_cutoff (default: 10)
Value | Categories | Types | Subtypes | Filtered |
---|---|---|---|---|
3 | 26.12% | 24.25% | 32.59% | 28.87% |
7 | 26.14% | 24.25% | 32.62% | 28.85% |
8 | 26.14% | 24.25% | 32.61% | 28.85% |
9 | 26.14% | 24.24% | 32.61% | 28.85% |
10 | 26.14% | 24.24% | 32.61% | 28.85% |
11 | 26.14% | 24.25% | 32.63% | 28.85% |
12 | 26.14% | 24.28% | 32.60% | 28.83% |
13 | 26.13% | 24.30% | 32.63% | 28.84% |
14 | 26.13% | 24.31% | 32.63% | 28.84% |
15 | 26.17% | 24.28% | 32.50% | 28.86% |
16 | 26.18% | 24.27% | 32.50% | 28.87% |
17 | 26.18% | 24.27% | 32.46% | 28.86% |
18 | 26.16% | 24.29% | 32.46% | 28.84% |
19 | 26.16% | 24.27% | 32.47% | 28.84% |
20 | 26.14% | 24.28% | 32.47% | 28.84% |
depth_cutoff (default: 100)
Value | Categories | Types | Subtypes | Filtered |
---|---|---|---|---|
2 | 26.02% | 24.15% | 32.54% | 28.65% |
100 | 26.14% | 24.24% | 32.61% | 28.85% |
5, 10-120 (All) | 26.14% | 24.24% | 32.61% | 28.85% |
entropy_cutoff (default: 0.05)
Value | Categories | Types | Subtypes | Filtered |
---|---|---|---|---|
0.03 | 26.14% | 24.24% | 32.60% | 28.82% |
0.04 | 26.14% | 24.24% | 32.61% | 28.85% |
0.05 | 26.14% | 24.24% | 32.61% | 28.85% |
0.06 | 26.19% | 24.24% | 32.61% | 28.85% |
0.07 | 26.19% | 24.25% | 32.62% | 28.85% |
0.08 | 26.36% | 24.29% | 32.69% | 28.85% |
0.09 | 26.36% | 24.29% | 32.70% | 28.85% |
0.10 | 26.36% | 24.29% | 32.70% | 28.83% |
0.11 | 26.36% | 24.28% | 32.70% | 28.77% |
0.12 | 26.36% | 24.28% | 32.65% | 28.58% |
0.13 | 26.36% | 24.28% | 32.65% | 28.59% |
Repeated holdout
Classifier | Precision | Recall | F-measure | Params |
---|---|---|---|---|
NaiveBayes | 52.88% | 60.75% | 56.54% | - |
DecisionTree | 60.37% | 69.44% | 64.59% | Entropy_cutoff=0.08, Support_cutoff=16 |
DecisionTree | 60.50% | 69.53% | 64.70% | default |
MaxEnt | 64.75% | 52.95% | 58.26% | Iterations=100, min_lldelta=0 |
MaxEnt | 14.91% | 2.65% | 4.51% | default |
Repeated 10-fold cross validation
Classifier | Precision | Recall | F-measure | Params |
---|---|---|---|---|
NaiveBayes | 54.47% | 62.86% | 58.36% | - |
DecisionTree | 55.93% | 70.21% | 62.26% | Entropy_cutoff=0.08, Support_cutoff=16 |
DecisionTree | 56.03% | 70.32% | 62.37% | default |
MaxEnt | 45.30% | 33.47% | 38.49% | Iterations=100, min_lldelta=0 |
MaxEnt | 16.29% | 3.03% | 5.11% | default |
Note: MaxEnt could do better (close to 75%), but it overflowed in repeat 1.
Get the generated models in the Resources page.