nlp.evaluate for NER gives different results from span-offset-based evaluation #7103
timothyjlaurent started this conversation in Help: Best practices
Replies: 1 comment 3 replies
-
Ok, I borrowed a bit for the following function:

```python
from typing import Iterable, Set, Tuple

from spacy.training import Example

# (label, doc index, start token, end token) in the gold tokenization
EntityTuple = Tuple[str, int, int, int]


def make_aligned_entity_tuples(
    examples: Iterable[Example],
) -> Tuple[Set[EntityTuple], Set[EntityTuple]]:
    true_entity_tuples = set()
    pred_entity_tuples = set()
    for doc_i, eg in enumerate(examples):
        if not eg.y.has_annotation("ENT_IOB"):
            continue
        true_entity_tuples.update((e.label_, doc_i, e.start, e.end) for e in eg.y.ents)
        align_x2y = eg.alignment.x2y
        for pred_ent in eg.x.ents:
            # Map the predicted span's token indices onto the gold tokenization.
            indices = align_x2y[pred_ent.start : pred_ent.end].dataXd.ravel()
            if len(indices):
                g_span = eg.y[indices[0] : indices[-1] + 1]
                # Check we aren't missing annotation on this span. If so,
                # our prediction is neither right nor wrong, we just
                # ignore it.
                if all(token.ent_iob != 0 for token in g_span):
                    pred_entity_tuples.add(
                        (pred_ent.label_, doc_i, indices[0], indices[-1] + 1)
                    )
    return true_entity_tuples, pred_entity_tuples
```

This gives the "correct" metrics:

```
{'labels': {'RELATIVE': {'precision': 0.7666666666666667,
                         'recall': 0.732484076433121,
                         'f1-score': 0.749185667752443,
                         'f0.5-score': 0.7595772787318362,
                         'f2-score': 0.7390745501285346,
                         'support': 314},
            'ANATOMY': {'precision': 0.9279661016949152,
                        'recall': 0.9087136929460581,
                        'f1-score': 0.9182389937106918,
                        'f0.5-score': 0.9240506329113924,
                        'f2-score': 0.9124999999999999,
                        'support': 241},
            'AGE': {'precision': 0.9485981308411215,
                    'recall': 0.8903508771929824,
                    'f1-score': 0.9185520361990951,
                    'f0.5-score': 0.9363468634686348,
                    'f2-score': 0.9014209591474245,
                    'support': 228},
            ...
```
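For completeness, here is a minimal sketch of how per-label precision/recall/F-beta rows like the ones above can be derived from the two tuple sets; the helper name and the choice of beta values are mine, not part of spaCy:

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

# Same alias as in the snippet above.
EntityTuple = Tuple[str, int, int, int]


def per_label_scores(
    true_tuples: Set[EntityTuple],
    pred_tuples: Set[EntityTuple],
    betas=(0.5, 1.0, 2.0),
) -> Dict[str, Dict[str, float]]:
    # Group gold and predicted tuples by label, then score each label by exact
    # set intersection: a prediction is a true positive only if its
    # (label, doc index, start, end) tuple matches a gold tuple.
    gold_by_label, pred_by_label = defaultdict(set), defaultdict(set)
    for t in true_tuples:
        gold_by_label[t[0]].add(t)
    for t in pred_tuples:
        pred_by_label[t[0]].add(t)

    scores = {}
    for label in gold_by_label.keys() | pred_by_label.keys():
        gold, pred = gold_by_label[label], pred_by_label[label]
        tp = len(gold & pred)
        p = tp / len(pred) if pred else 0.0
        r = tp / len(gold) if gold else 0.0
        row = {"precision": p, "recall": r, "support": len(gold)}
        for beta in betas:
            denom = beta ** 2 * p + r
            row[f"f{beta:g}-score"] = (1 + beta ** 2) * p * r / denom if denom else 0.0
        scores[label] = row
    return scores
```

Called as `per_label_scores(*make_aligned_entity_tuples(examples))`, it produces rows in the same shape as the report above.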
-
I'm trying to compare the spaCy 3 NER model performance with an NER evaluation script that uses a span-offset-based method.
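For reference, the built-in side of that comparison looks roughly like this; the model path and `gold_docs` are placeholders, and the score keys are the standard NER entries spaCy reports when an `ner` component is in the pipeline:

```python
import spacy
from spacy.training import Example

nlp = spacy.load("path/to/my_ner_model")  # placeholder model path

# Pair an unannotated copy of each gold doc with its reference annotations;
# nlp.evaluate() runs the pipeline on the predicted side and scores it against
# the reference side. `gold_docs` is assumed to be an iterable of annotated Docs.
ds = [Example(nlp.make_doc(gold.text), gold) for gold in gold_docs]

scores = nlp.evaluate(ds)
print(scores["ents_p"], scores["ents_r"], scores["ents_f"])
print(scores["ents_per_type"])  # per-label precision/recall/F1
```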
When I run `nlp.evaluate(ds)` on my evaluation corpus, the results generally look good. However, if I use a span-offset-based method to extract the entities and then calculate classification metrics, the results are much worse.
I suspect that there is some alignment issue that is causing the lower metric values.
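One quick way to check that suspicion is to look for documents where the predicted and reference tokenizations differ, since token offsets taken from one doc won't line up with the other in those cases (a sketch, assuming `examples` is the same list of `Example` objects passed to `nlp.evaluate`):

```python
# Report documents whose predicted (eg.x) and reference (eg.y) tokenizations
# differ; these are the documents where naive offset comparisons break.
for i, eg in enumerate(examples):
    pred_tokens = [t.text for t in eg.x]
    gold_tokens = [t.text for t in eg.y]
    if pred_tokens != gold_tokens:
        print(f"doc {i}: {len(pred_tokens)} predicted vs {len(gold_tokens)} gold tokens")
```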
Ideally, I'd like a way to:
With that, I will be able to plug it into our classification report with additional metrics (f0.5, f2) and average metrics (weighted, macro, micro).
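As a sketch of that last step (my own helper, not a spaCy API): macro and weighted averages can be derived directly from the per-label rows and their supports, while a micro average would need the pooled tuple sets instead:

```python
def averaged_scores(per_label: dict) -> dict:
    # per_label maps label -> {"precision": ..., "recall": ..., "f1-score": ...,
    # "f0.5-score": ..., "f2-score": ..., "support": ...}, as in the per-label
    # report shown earlier in this thread.
    if not per_label:
        return {}
    labels = list(per_label)
    metrics = [k for k in per_label[labels[0]] if k != "support"]
    total_support = sum(per_label[l]["support"] for l in labels) or 1
    # Macro: unweighted mean over labels; weighted: mean weighted by gold support.
    macro = {m: sum(per_label[l][m] for l in labels) / len(labels) for m in metrics}
    weighted = {
        m: sum(per_label[l][m] * per_label[l]["support"] for l in labels) / total_support
        for m in metrics
    }
    return {"macro avg": macro, "weighted avg": weighted}
```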