Unsupervised named entity clustering using transitivity #12953

lukasgrahl · 2023-09-04T06:08:21Z

In my own work I developed a local named entity clustering method that identifies local synonyms without an external knowledge base. As a measure of relatedness I use vector embeddings. I was wondering whether this might be useful as an extension to the spaCy EntityLinker, or otherwise of interest?

The problem I was facing:

I developed an unsupervised topic recognition model that aims to recognise new, often unknown events (e.g. covid-19) early on. For this purpose, I used spaCy's proper noun tag (PROPN) on top of NER to build mention density time series by entity, because spaCy's NER alone turned out to be too narrow for new topics.
To make the density time series more accurate, I needed to link expressions that can be considered synonyms at the article level. Newspapers sometimes refer to “European Central Bank” and “European Institution” interchangeably to make texts more readable. In this case linking expressions at the article level was important, as the two are not global synonyms. Another example is names such as “Donald Trump” and “President Trump”, which also had to be linked. I was therefore looking for a local unsupervised clustering technique that does not rely on an external database.

My solution

I used spaCy vector embeddings as the measure of similarity, only analysing term pairs whose similarity exceeds a certain threshold (e.g. 0.8). In a next step I used transitivity as the clustering criterion. Transitivity requires that every pair of expressions in a cluster has a similarity at or above that threshold: if A is linked to B and B is linked to C, then A must also be linked to C. This method outputs few but meaningful clusters for each article. Moreover, these clusters are non-overlapping by construction.
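To illustrate the thresholding step, here is a minimal sketch in plain Python. It is not the code from the linked repository: the helper names `cosine` and `similar_pairs` and the toy vectors are made up for illustration, and in practice the vectors would come from spaCy rather than being typed in by hand.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_pairs(vectors, threshold=0.8):
    """Return (term_a, term_b, score) for every pair whose score exceeds the threshold."""
    pairs = []
    for (a, u), (b, v) in combinations(vectors.items(), 2):
        score = cosine(u, v)
        if score > threshold:
            pairs.append((a, b, score))
    return pairs


# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "Donald Trump": [1.0, 0.0, 0.0],
    "President Trump": [0.9, 0.1, 0.0],
    "February": [0.0, 1.0, 0.0],
}

pairs = similar_pairs(toy_vectors)
for a, b, score in pairs:
    print(f"{a} <-> {b}: {score:.3f}")
```

With these toy vectors only the two "Trump" mentions form a pair; "February" is too dissimilar to either and is left out.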
My code performs clustering in three steps. First, all combinations of pairs and their similarity scores are gathered in a list. Second, in order to check for transitivity, potential clusters need to be identified. This is a finite recursive problem, as one word in a pair is potentially linked to another pair, and so on. I start from a pool of pairs; for a given pair I check which other pairs are associated with it and take them out of the pool. Once all related pairs are gathered, the cluster candidate is complete. This procedure is then applied to the pairs remaining in the pool until the pool is empty. In a third and last step, the cluster candidates are checked for transitivity using matrix multiplication.
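Steps two and three could be sketched roughly as follows. This is an illustration under my own assumptions, not the code from the repository: the function names are hypothetical, the pool is drained with a loop instead of recursion, the terms "A" to "Z" are placeholders, and the matrix product is written out in plain Python. The check relies on the fact that entry (i, j) of the squared adjacency matrix counts two-step paths from i to j, so a candidate is transitive if every such path is matched by a direct link. Dropping candidates that fail the check is also an assumption.

```python
def candidate_clusters(pairs):
    """Group linked pairs into cluster candidates by draining a pool of pairs."""
    pool = list(pairs)
    clusters = []
    while pool:
        cluster = set(pool.pop())
        grew = True
        while grew:  # keep scanning until no remaining pair touches the cluster
            grew = False
            for pair in list(pool):
                if cluster & set(pair):
                    cluster |= set(pair)
                    pool.remove(pair)
                    grew = True
        clusters.append(cluster)
    return clusters


def is_transitive(cluster, pairs):
    """True if every two-step path inside the cluster is also a direct link."""
    terms = sorted(cluster)
    index = {term: i for i, term in enumerate(terms)}
    n = len(terms)
    adj = [[0] * n for _ in range(n)]
    for a, b in pairs:
        if a in index and b in index:
            adj[index[a]][index[b]] = adj[index[b]][index[a]] = 1
    # squared[i][j] counts the two-step paths i -> k -> j
    squared = [
        [sum(adj[i][k] * adj[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]
    return all(
        adj[i][j]
        for i in range(n)
        for j in range(n)
        if i != j and squared[i][j]
    )


# Placeholder pairs that already passed the similarity threshold.
linked = [("A", "B"), ("B", "C"), ("A", "C"), ("X", "Y"), ("Y", "Z")]

clusters = candidate_clusters(linked)
kept = [c for c in clusters if is_transitive(c, linked)]
print(kept)
```

Here the candidate made of A, B and C passes because all three pairs are linked, while the candidate made of X, Y and Z fails because X and Z are only connected through Y.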

The code can be found here: https://github.com/lukasgrahl/miscellaneous/blob/main/src/transitivity.py
An example can be found here: https://github.com/lukasgrahl/miscellaneous/blob/main/notebooks/transitivity.ipynb

An example

Using this article from the Irish Times: https://www.irishtimes.com/world/europe/2023/08/31/ukraine-war-latest/
For this article I obtained these clusters:

February 2022, February
Volodymyr Zelenskiy, Zelenskiy, Volodymyr
3,000, 15,000

svlandeg · 2023-09-05T13:23:12Z

svlandeg
Sep 5, 2023
Maintainer

Thanks for sharing, this looks nice! I could imagine it being useful for coreference resolution as well, where ideally you'd like to cluster noun phrases together with the pronouns that refer to them. All of that could then indeed be used as input for an entity linker, which could resolve the cluster in one go instead of making individual decisions.

2 replies

lukasgrahl Sep 8, 2023
Author

Hi, thanks for your answer. This sounds interesting and I would be happy to try an implementation if that is of interest. Would you have any resources on what features are used to calculate the relationship scores for coreference resolution? I am an economist by training, so that way I could do some reading up.

svlandeg Sep 8, 2023
Maintainer

Our current coref solution is part of spacy-experimental, you can find it here. There's also an example project and a blog post that could interest you.
