Querying the Knowledge Base and Use Cases for wiki_entity_linker #5138
-
I have already created an issue thread for a problem I was having with one of the wiki_entity_linker scripts taking a long time to train an NEL model. This thread is more so that I can better understand the general use cases of spaCy's NEL feature.

I have been able to use NER models to extract named entities from a variety of text documents. A problem I have repeatedly encountered, though, is that these models don't seem to recognize entities with different aliases as the same. Researching this problem is how I discovered spaCy's in-development NEL feature.

I have been testing a simple NEL model which I trained using the tools spaCy provides, and I have noticed that different aliases referring to the same entity are given different knowledge-base IDs. For example, 'Bernie Sanders' and 'Sanders', despite referring to the same person, have different knowledge-base IDs. Interestingly, 'Mark Zuckerberg' and 'Zuckerberg' have the same knowledge-base ID. This is more in line with the kind of results I expect, but not as universal as I hoped. I am not sure if there is something I am missing about how this feature works, or if this is simply a result of the limit I have set on the training and testing data. The model I am testing was trained on 40,000 articles for 3 epochs. I am training a new model now on 165,000 articles, but I wanted to make sure I am not misunderstanding how this feature works.

I also wanted to know if there is a way I can query the knowledge base I created and see what entities are listed under the same knowledge-base ID.
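For context, this is roughly how I am reading the IDs off the entities (a minimal sketch; the model path is a placeholder for wherever my trained pipeline is saved):

```python
import spacy

# Placeholder path: wherever the trained NEL pipeline was saved to disk
nlp = spacy.load("path/to/my_nel_model")

doc = nlp("Bernie Sanders spoke first. Later, Sanders took questions.")
for ent in doc.ents:
    # kb_id_ is the knowledge-base ID assigned by the entity_linker
    # (the NIL placeholder if no candidate was selected)
    print(ent.text, ent.label_, ent.kb_id_)
```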
Replies: 1 comment
-
There are a few different aspects to this.

First, whether or not a certain alias gets recognized depends on the training data. If the model always saw "Bernie Sanders" and never just "Sanders" for that specific politician, it wouldn't know to disambiguate it correctly.

Next, even if it does know that both are synonyms for the same person, they will have different prior probabilities. You can imagine that, in general, "Bernie Sanders" refers to that politician 95% of the time, while of all the mentions that are just "Sanders", maybe only 40% refer to the politician. So depending on the exact alias, the ambiguity may be higher or lower.

Additionally, of course, the context matters. If you have a long article which mentions both "Bernie Sanders" and just "Sanders", then depending on the actual sentence around each instance, the disambiguation may be different. The context is important.

Now, what can we do about all that? I definitely recommend you train on a larger dataset: that will help to collect more instances in the KB. Note that you can set the limits for building the KB and the limits for training the NEL individually (I think you know this), so you can have a large pre-built KB that you only need to create once.

And finally, here are the ways in which you can query the knowledge base:
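A minimal sketch, assuming a KB built and dumped with the spaCy v2 `KnowledgeBase` class that the wiki_entity_linker scripts use; the paths and the entity vector length are placeholders and must match how your KB was actually created:

```python
import spacy
from spacy.kb import KnowledgeBase

# Placeholder paths: the pipeline directory and KB file written when the KB was built.
# The entity_vector_length must match the value used at creation time (64 here as an example).
nlp = spacy.load("output/nlp")
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=64)
kb.load_bulk("output/kb")

# Overall size of the KB
print("Entities in the KB:", kb.get_size_entities())
print("Aliases in the KB:", kb.get_size_aliases())

# All entity IDs and all alias strings stored in the KB (first 10 of each)
print(kb.get_entity_strings()[:10])
print(kb.get_alias_strings()[:10])

# Candidate entities for one specific alias, with their prior probabilities
for cand in kb.get_candidates("Sanders"):
    print(cand.alias_, "->", cand.entity_, "prior prob:", cand.prior_prob)
```

`get_candidates` is also a quick way to inspect the prior probabilities discussed above: each candidate carries the prior probability of its entity given that particular alias.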