nlp.evaluate for NER gives different results from span-offset-based evaluation #7103
timothyjlaurent started this conversation in Help: Best practices
Replies: 1 comment 3 replies
-
Ok, I borrowed a bit for the following function:

```python
from typing import Iterable, Set, Tuple

from spacy.training import Example

# (label, doc index, start token, end token) in the gold tokenization
EntityTuple = Tuple[str, int, int, int]


def make_aligned_entity_tuples(
    examples: Iterable[Example],
) -> Tuple[Set[EntityTuple], Set[EntityTuple]]:
    true_entity_tuples = set()
    pred_entity_tuples = set()
    for doc_i, eg in enumerate(examples):
        if not eg.y.has_annotation("ENT_IOB"):
            continue
        true_entity_tuples.update((e.label_, doc_i, e.start, e.end) for e in eg.y.ents)
        align_x2y = eg.alignment.x2y
        for pred_ent in eg.x.ents:
            # Map the predicted span's token indices onto the gold tokenization.
            indices = align_x2y[pred_ent.start : pred_ent.end].dataXd.ravel()
            if len(indices):
                g_span = eg.y[indices[0] : indices[-1] + 1]
                # Check we aren't missing annotation on this span. If so,
                # our prediction is neither right nor wrong, we just
                # ignore it.
                if all(token.ent_iob != 0 for token in g_span):
                    pred_entity_tuples.add(
                        (pred_ent.label_, doc_i, indices[0], indices[-1] + 1)
                    )
    return true_entity_tuples, pred_entity_tuples
```

This gives the "correct" metrics:

```
{'labels': {'RELATIVE': {'precision': 0.7666666666666667,
                         'recall': 0.732484076433121,
                         'f1-score': 0.749185667752443,
                         'f0.5-score': 0.7595772787318362,
                         'f2-score': 0.7390745501285346,
                         'support': 314},
            'ANATOMY': {'precision': 0.9279661016949152,
                        'recall': 0.9087136929460581,
                        'f1-score': 0.9182389937106918,
                        'f0.5-score': 0.9240506329113924,
                        'f2-score': 0.9124999999999999,
                        'support': 241},
            'AGE': {'precision': 0.9485981308411215,
                    'recall': 0.8903508771929824,
                    'f1-score': 0.9185520361990951,
                    'f0.5-score': 0.9363468634686348,
                    'f2-score': 0.9014209591474245,
                    'support': 228},
            ...
```
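For completeness, here is a minimal sketch of how per-label precision/recall/F-beta rows like the ones above can be derived from the two tuple sets; the helper name and the choice of beta values are mine, not part of spaCy:

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

# Same alias as in the snippet above.
EntityTuple = Tuple[str, int, int, int]


def per_label_scores(
    true_tuples: Set[EntityTuple],
    pred_tuples: Set[EntityTuple],
    betas=(0.5, 1.0, 2.0),
) -> Dict[str, Dict[str, float]]:
    # Group gold and predicted tuples by label, then score each label by exact
    # set intersection: a prediction is a true positive only if its
    # (label, doc index, start, end) tuple matches a gold tuple.
    gold_by_label, pred_by_label = defaultdict(set), defaultdict(set)
    for t in true_tuples:
        gold_by_label[t[0]].add(t)
    for t in pred_tuples:
        pred_by_label[t[0]].add(t)

    scores = {}
    for label in gold_by_label.keys() | pred_by_label.keys():
        gold, pred = gold_by_label[label], pred_by_label[label]
        tp = len(gold & pred)
        p = tp / len(pred) if pred else 0.0
        r = tp / len(gold) if gold else 0.0
        row = {"precision": p, "recall": r, "support": len(gold)}
        for beta in betas:
            denom = beta ** 2 * p + r
            row[f"f{beta:g}-score"] = (1 + beta ** 2) * p * r / denom if denom else 0.0
        scores[label] = row
    return scores
```

Called as `per_label_scores(*make_aligned_entity_tuples(examples))`, it produces rows in the same shape as the report above.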
-
I'm trying to compare the spaCy 3 NER model performance with an NER evaluation script that uses a span-offset-based method.
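For reference, the built-in side of that comparison looks roughly like this; the model path and `gold_docs` are placeholders, and the score keys are the standard NER entries spaCy reports when an `ner` component is in the pipeline:

```python
import spacy
from spacy.training import Example

nlp = spacy.load("path/to/my_ner_model")  # placeholder model path

# Pair an unannotated copy of each gold doc with its reference annotations;
# nlp.evaluate() runs the pipeline on the predicted side and scores it against
# the reference side. `gold_docs` is assumed to be an iterable of annotated Docs.
ds = [Example(nlp.make_doc(gold.text), gold) for gold in gold_docs]

scores = nlp.evaluate(ds)
print(scores["ents_p"], scores["ents_r"], scores["ents_f"])
print(scores["ents_per_type"])  # per-label precision/recall/F1
```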
When I run `nlp.evaluate(ds)` on my evaluation corpus, the results generally look good. However, if I use a span-offset-based method to extract the entities and then calculate classification metrics, the results are much worse.
I suspect that there is some alignment issue that is causing the lower metric values.
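One quick way to check that suspicion is to look for documents where the predicted and reference tokenizations differ, since token offsets taken from one doc won't line up with the other in those cases (a sketch, assuming `examples` is the same list of `Example` objects passed to `nlp.evaluate`):

```python
# Report documents whose predicted (eg.x) and reference (eg.y) tokenizations
# differ; these are the documents where naive offset comparisons break.
for i, eg in enumerate(examples):
    pred_tokens = [t.text for t in eg.x]
    gold_tokens = [t.text for t in eg.y]
    if pred_tokens != gold_tokens:
        print(f"doc {i}: {len(pred_tokens)} predicted vs {len(gold_tokens)} gold tokens")
```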
Ideally, I'd like a way to:
With that, I will be able to plug it into our classification report with additional metrics (f0.5, f2) and average metrics (weighted, macro, micro).
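As a sketch of that last step (my own helper, not a spaCy API): macro and weighted averages can be derived directly from the per-label rows and their supports, while a micro average would need the pooled tuple sets instead:

```python
def averaged_scores(per_label: dict) -> dict:
    # per_label maps label -> {"precision": ..., "recall": ..., "f1-score": ...,
    # "f0.5-score": ..., "f2-score": ..., "support": ...}, as in the per-label
    # report shown earlier in this thread.
    if not per_label:
        return {}
    labels = list(per_label)
    metrics = [k for k in per_label[labels[0]] if k != "support"]
    total_support = sum(per_label[l]["support"] for l in labels) or 1
    # Macro: unweighted mean over labels; weighted: mean weighted by gold support.
    macro = {m: sum(per_label[l][m] for l in labels) / len(labels) for m in metrics}
    weighted = {
        m: sum(per_label[l][m] * per_label[l]["support"] for l in labels) / total_support
        for m in metrics
    }
    return {"macro avg": macro, "weighted avg": weighted}
```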