Querying the Knowledge Base and Use Cases for wiki_entity_linker #5138
-
I have already created an issue thread for a problem I was having with one of the wiki_entity_linker scripts taking a long time to train an NEL model. This thread is more so that I can better understand the general use cases of spaCy's NEL feature.

I have been able to use NER models to extract named entities from a variety of text documents. A problem I have repeatedly encountered, though, is that these models don't seem to recognize entities with different aliases as the same. Researching this problem is how I discovered spaCy's in-development NEL feature.

I have been testing a simple NEL model which I trained using the tools spaCy provides, and I have noticed that different aliases referring to the same entity are given different knowledge-base IDs. For example, 'Bernie Sanders' and 'Sanders', despite referring to the same person, have different knowledge-base IDs. Interestingly, 'Mark Zuckerberg' and 'Zuckerberg' have the same knowledge-base ID. This is more in line with the kind of results I expect, but not as universal as I hoped. I am not sure if there is something I am missing about how this feature works, or if this is simply a result of the limit I have set on the training and testing data. The model I am testing was trained on 40,000 articles for 3 epochs. I am training a new model now on 165,000 articles, but I wanted to make sure I am not misunderstanding how this feature works.

I also wanted to know if there is a way I can query the knowledge base I created and see what entities are listed under the same knowledge-base ID.
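For context, this is roughly how I am reading the IDs off the entities (a minimal sketch; the model path is a placeholder for wherever my trained pipeline is saved):

```python
import spacy

# Placeholder path: wherever the trained NEL pipeline was saved to disk
nlp = spacy.load("path/to/my_nel_model")

doc = nlp("Bernie Sanders spoke first. Later, Sanders took questions.")
for ent in doc.ents:
    # kb_id_ is the knowledge-base ID assigned by the entity_linker
    # (the NIL placeholder if no candidate was selected)
    print(ent.text, ent.label_, ent.kb_id_)
```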
Replies: 1 comment
-
There are a few different aspects to this.

First, whether or not a certain alias gets recognized depends on the training data. If the model always saw "Bernie Sanders" and never just "Sanders" for that specific politician, it wouldn't know to disambiguate it correctly.

Next, even if it does know that both are synonyms for the same person, they will have different prior probabilities. You can imagine that, in general, "Bernie Sanders" refers to that politician 95% of the time, while of all the mentions that are just "Sanders", maybe only 40% refer to the politician. So depending on the exact alias, the ambiguity may be higher or lower.

Additionally, of course, the context matters. If you have a long article which mentions both "Bernie Sanders" and just "Sanders", then depending on the actual sentence around each instance, the disambiguation may be different. The context is important.

Now, what can we do about all that? I definitely recommend you train on a larger dataset: that will help to collect more instances in the KB. Note that you can set the limits for building the KB and the limits for training the NEL individually (I think you know this), so you can have a large pre-built KB that you only need to create once.

And finally, here are the ways in which you can query the knowledge base:
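A minimal sketch, assuming a KB built and dumped with the spaCy v2 `KnowledgeBase` class that the wiki_entity_linker scripts use; the paths and the entity vector length are placeholders and must match how your KB was actually created:

```python
import spacy
from spacy.kb import KnowledgeBase

# Placeholder paths: the pipeline directory and KB file written when the KB was built.
# The entity_vector_length must match the value used at creation time (64 here as an example).
nlp = spacy.load("output/nlp")
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=64)
kb.load_bulk("output/kb")

# Overall size of the KB
print("Entities in the KB:", kb.get_size_entities())
print("Aliases in the KB:", kb.get_size_aliases())

# All entity IDs and all alias strings stored in the KB (first 10 of each)
print(kb.get_entity_strings()[:10])
print(kb.get_alias_strings()[:10])

# Candidate entities for one specific alias, with their prior probabilities
for cand in kb.get_candidates("Sanders"):
    print(cand.alias_, "->", cand.entity_, "prior prob:", cand.prior_prob)
```

`get_candidates` is also a quick way to inspect the prior probabilities discussed above: each candidate carries the prior probability of its entity given that particular alias.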