Custom Sentencer causes poor ner training performance #12873
-
Seeing some odd behavior with NER and a custom sentencer. The sentencer basically marks the token following `\r` or `\n` (or any combination such as `\r\n`, `\n`, `\n\n`, etc.) as a sentence start.
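For reference, here is a minimal sketch of a component along those lines (not the exact code, and the name `custom_boundaries` is simply the one used in the config below):

```python
from spacy.language import Language

@Language.component("custom_boundaries")
def custom_boundaries(doc):
    # Sketch: treat any token consisting only of \r / \n characters as a
    # sentence break and mark the following token as a sentence start.
    for i, token in enumerate(doc[:-1]):
        if not token.text.strip("\r\n"):
            doc[i + 1].is_sent_start = True
    return doc
```

The component is registered with `@Language.component` and supplied to `spacy train` via `--code`, so the `[components.custom_boundaries]` block in the config can reference it by its factory name.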
I'm using a very basic pipeline and a config generated from the quickstart widget, with `pipeline = ["tok2vec","custom_boundaries","ner"]` and a `[components.custom_boundaries]` block. Without `custom_boundaries`, the NER F1 score is 34.55.
With the `custom_boundaries` component, the F1 score is dramatically worse at 0.45.
Here is an example of what our sentence structure should look like compared to the default behavior.
I find this quite confusing, as our entities span multiple sentences/lines. I would not have expected NER to perform so much better when the sentence boundaries are wrong.
-
The `ner` component has been developed for traditional named entities, which are typically short noun phrases that never cross sentence boundaries. There's a hard-coded constraint in the `ner` component to not predict any entities across sentence boundaries.

Your spans don't sound like named entities, so `ner` might not be the best choice, but if it's working fine otherwise, then a simple solution is to reorder the pipeline components so that the sentence boundaries are set after `ner`. But you might also want to consider testing other components like `spancat` that are more flexible in terms of handling longer spans that don't look like short noun phrases.
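For illustration, a rough sketch of the reordering suggestion, assuming the hypothetical `custom_boundaries` registration from the sketch above: placing the sentencer after `ner` means the entity recognizer is no longer constrained by the custom boundaries.

```python
import spacy

# Sketch: the custom sentencer runs after "ner", so NER predictions are not
# limited by the custom sentence boundaries. In a training config this
# corresponds to pipeline = ["tok2vec","ner","custom_boundaries"] under [nlp].
# Requires the @Language.component("custom_boundaries") registration above.
nlp = spacy.blank("en")
nlp.add_pipe("tok2vec")
nlp.add_pipe("ner")
nlp.add_pipe("custom_boundaries")
print(nlp.pipe_names)  # ['tok2vec', 'ner', 'custom_boundaries']
```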