Custom Sentencer causes poor ner training performance #12873
-
Seeing some odd behavior with NER and a custom sentencer. The sentencer basically marks the token following `\r` or `\n` (or any combination such as `\r\n`, `\n`, `\n\n`, etc.) as a sentence start.
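For reference, here is a minimal sketch of a component along those lines (not the exact code, and the name `custom_boundaries` is simply the one used in the config below):

```python
from spacy.language import Language

@Language.component("custom_boundaries")
def custom_boundaries(doc):
    # Sketch: treat any token consisting only of \r / \n characters as a
    # sentence break and mark the following token as a sentence start.
    for i, token in enumerate(doc[:-1]):
        if not token.text.strip("\r\n"):
            doc[i + 1].is_sent_start = True
    return doc
```

The component is registered with `@Language.component` and supplied to `spacy train` via `--code`, so the `[components.custom_boundaries]` block in the config can reference it by its factory name.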
I'm using a very basic pipeline and a config generated from the quickstart widget, with `pipeline = ["tok2vec","custom_boundaries","ner"]` and a `[components.custom_boundaries]` block. Without `custom_boundaries`, the NER F1 score is 34.55.
With the `custom_boundaries` component, the F1 score is dramatically worse at 0.45.
Here is an example of what our sentence structure should look like compared to the default behavior.
I find this quite confusing, as our entities span multiple sentences/lines. I would not have expected NER to perform so much better when the sentence boundaries are wrong.
-
The `ner` component has been developed for traditional named entities, which are typically short noun phrases that never cross sentence boundaries. There's a hard-coded constraint in the `ner` component to not predict any entities across sentence boundaries.

Your spans don't sound like named entities, so `ner` might not be the best choice, but if it's working fine otherwise, then a simple solution is to reorder the pipeline components so that the sentence boundaries are set after `ner`. But you might also want to consider testing other components like `spancat` that are more flexible in terms of handling longer spans that don't look like short noun phrases.
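For illustration, a rough sketch of the reordering suggestion, assuming the hypothetical `custom_boundaries` registration from the sketch above: placing the sentencer after `ner` means the entity recognizer is no longer constrained by the custom boundaries.

```python
import spacy

# Sketch: the custom sentencer runs after "ner", so NER predictions are not
# limited by the custom sentence boundaries. In a training config this
# corresponds to pipeline = ["tok2vec","ner","custom_boundaries"] under [nlp].
# Requires the @Language.component("custom_boundaries") registration above.
nlp = spacy.blank("en")
nlp.add_pipe("tok2vec")
nlp.add_pipe("ner")
nlp.add_pipe("custom_boundaries")
print(nlp.pipe_names)  # ['tok2vec', 'ner', 'custom_boundaries']
```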