Evaluation

André Pires edited this page Apr 21, 2017 · 5 revisions

To obtain a stable, comparable evaluation of the NER output across all the tools, each tool's output was converted to the CoNLL format and evaluated with the conlleval script. In addition, repeated 10-fold cross-validation was performed, and the results were averaged.

All results can be accessed here.

Steps

  1. Get output from tools
  2. Get gold-standard data for each fold
  3. Join both (script)
  4. Evaluate each fold (scripts)
  5. Compute average for each repeat (script)
  6. Compute global average (script)
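Step 3 above can be sketched as follows. This is a minimal illustration, not the project's actual script: it assumes gold data in two-column CoNLL format (`token gold-tag`) and a parallel list of predicted tags, and joins them into the three-column `token gold predicted` layout that conlleval expects. The function name `join_gold_and_predicted` is hypothetical.

```python
def join_gold_and_predicted(gold_lines, predicted_tags):
    """Join gold CoNLL lines ('token gold-tag') with predicted tags,
    producing 'token gold predicted' lines suitable for conlleval.
    Empty lines (sentence boundaries) are preserved as-is."""
    joined = []
    pred_iter = iter(predicted_tags)
    for line in gold_lines:
        if not line.strip():
            # keep sentence boundary
            joined.append("")
            continue
        token, gold = line.split()
        joined.append(f"{token} {gold} {next(pred_iter)}")
    return joined
```

For example, `join_gold_and_predicted(["John B-PER", "lives O"], ["B-PER", "O"])` yields `["John B-PER B-PER", "lives O O"]`.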

Averaging algorithm

  • Get the result files for each fold
  • Parse each result into a list containing the overall accuracy and the per-category scores
  • Create a dictionary for each category
    • Using category names as keys, and a list with precision, recall and FB1 as values
  • For each result
    • Accumulate the overall measures
    • Accumulate the measures for each category
  • Calculate the average of each measure
  • Calculate the macro-average for FB1
  • Print the results to a file
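The averaging steps above can be sketched as below. This is an illustrative reimplementation, not the linked script: it assumes each fold's parsed conlleval result is a dict with an `"accuracy"` float and a `"categories"` mapping from category name to a `(precision, recall, fb1)` tuple; all names are assumptions.

```python
from collections import defaultdict

def average_results(results):
    """Average parsed conlleval results across folds.

    `results`: list of per-fold dicts, each with an 'accuracy' float and a
    'categories' dict mapping category name -> (precision, recall, fb1).
    Returns (mean accuracy, per-category mean scores, macro-averaged FB1).
    """
    per_cat = defaultdict(lambda: [0.0, 0.0, 0.0])
    acc_total = 0.0
    n = len(results)
    for res in results:
        acc_total += res["accuracy"]
        # accumulate precision, recall and FB1 for each category
        for cat, (p, r, f) in res["categories"].items():
            per_cat[cat][0] += p
            per_cat[cat][1] += r
            per_cat[cat][2] += f
    averages = {cat: tuple(v / n for v in vals) for cat, vals in per_cat.items()}
    # macro-average: unweighted mean of the per-category FB1 averages
    macro_fb1 = sum(v[2] for v in averages.values()) / len(averages)
    return acc_total / n, averages, macro_fb1
```

Note the macro-average weights every category equally, regardless of how many entities of each type occur in the folds.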

Check script here.
