Custom spaCy NER model not making expected predictions #12739
Issue

A custom NER model (trained to identify certain code numbers) does not produce predictions in certain documents. I have tried multiple variations of these texts from row #3 onwards, in an attempt to pinpoint which piece of text is causing the differences in predictions. The issue here is: if the model recognizes a given code as the correct inference in one document, why is it not able to identify another similar-looking code number as the correct inference in another, similar document?

Model Inputs and Corresponding Outputs

Environment
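A minimal sketch of the kind of comparison the report describes, assuming a hypothetical local model and two look-alike documents (the model path, texts, and label below are invented for illustration, not the actual inputs):

```python
import spacy

# Hypothetical model path and documents, purely for illustration.
nlp = spacy.load("./code_ner_model")

texts = [
    "Invoice issued under code ABC-12345 on 2023-01-05.",  # code gets recognized
    "Invoice issued under code XYZ-67890 on 2023-01-06.",  # similar code gets missed
]

for text in texts:
    doc = nlp(text)
    print(text, "->", [(ent.text, ent.label_) for ent in doc.ents])
```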
Unfortunately there's no satisfying answer to this. The model relies on contextual representations that incorporate information from up to four words of context on either side of the target token. The entity recognizer then goes through the words of the document as a state machine, and makes decisions about how to construct the entities based on the prior state and the contextual tokens.

When I'm trying to debug the entity recogniser, I basically step through the decisions in different ways to try to look at what the state is at a particular decision. The utilities for that aren't documented, and I wouldn't suggest it's the best approach for you to try to understand the behaviour of your system. Instead, a good mental model to have is that the classifier is sensitive to the context as well as to the phrase itself.

If you want to be sure that particular phrases are tagged consistently, you could build rule-based matchers, perhaps by running your existing model over a bunch of text and extracting lists of phrases which have been tagged at least once (see the sketch after this reply). Optionally you could include a manual review step here before you add them to the matcher rules, using an annotation tool like Prodigy.

Other ways you could try to address this are to look at the training and parameterisation of your custom model. If you update to a more recent version of spaCy, you might find some improvement in accuracy (although this isn't guaranteed). Other general advice includes using word vector representations that are well suited to your domain, and pretraining the contextual representations. If you're using the CPU model, this can be done with the `spacy pretrain` command.

Finally, it's worth noting that your input texts aren't really sentential content. It's not uncommon to use NLP tooling on items like yours, and just because it's not regular sentences doesn't mean it's somehow trivial to process. But it's worth keeping in mind that your data isn't like normal text, and so some standard recommendations, like always preferring statistical approaches to rule-based approaches, might not apply to you. The same can be said for choosing word vectors or transformer models: performance on your task might be quite different from performance on standard benchmarks.
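As a sketch of the matcher suggestion above: run the existing model over a corpus, harvest the phrases it has tagged at least once, and pin them down with spaCy's `EntityRuler`. The label `CODE`, the model path, and the corpus are assumptions for illustration; this uses the spaCy v3 `add_pipe`/`add_patterns` API.

```python
import spacy

# Hypothetical model path, label, and corpus; substitute your own.
nlp = spacy.load("./code_ner_model")

corpus_texts = [
    "Shipment booked under code ABC-12345.",
    "Shipment booked under code DEF-54321.",
]

# Harvest every phrase the statistical model has tagged at least once.
seen_phrases = set()
for doc in nlp.pipe(corpus_texts):
    for ent in doc.ents:
        if ent.label_ == "CODE":
            seen_phrases.add(ent.text)

# (Optional) review seen_phrases manually before trusting them,
# e.g. with an annotation tool like Prodigy.

# The ruler runs before the statistical NER, and the NER component
# respects pre-set entities, so these phrases are tagged consistently
# regardless of their surrounding context.
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([{"label": "CODE", "pattern": p} for p in sorted(seen_phrases)])

nlp.to_disk("./code_ner_model_with_rules")
```

Note that string patterns match exact token sequences, so this trades recall on unseen codes for consistency on known ones; if the codes follow a predictable format, token patterns (e.g. a `REGEX` condition on the token text) may generalise better.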