Unsupervised named entity clustering using transitivity #12953
lukasgrahl
started this conversation in
New Features & Project Ideas
Thanks for sharing, this looks nice! I could imagine it being useful for coreference resolution as well, where ideally you'd like to cluster noun phrases together with the pronouns that refer to them. All of that could then be used as input for an entity linker, which could resolve the cluster in one go instead of making individual decisions.
In my own work I developed a local named entity clustering method that aims to identify local synonyms without an external knowledge base. As a measure of relatedness I use vector embeddings. I was wondering whether this might be an extension to the spaCy EntityLinker or otherwise of interest?
The problem I was facing:
I developed an unsupervised topic recognition model that aims to recognise new, often unknown events (e.g. covid-19) early on. For this purpose, I used spaCy's proper noun tag PROPN (on top of NER) to build mention density time series by entity. The spaCy NER turned out to be too narrow for new topics.
To make the density time series more accurate, I needed to link expressions that can be considered synonyms at the article level. Newspapers would sometimes refer to “European Central Bank” and “European Institution” interchangeably to make texts more readable. In this case linking expressions at the article level was important, as the two are not global synonyms. Another example was names such as “Donald Trump” and “President Trump”, which also had to be linked. I was therefore looking for a local unsupervised clustering technique that does not rely on an external database.
My solution
I used spaCy vector embeddings as a measure of similarity, only analysing term pairs whose similarity exceeds a certain threshold (e.g. 0.8). In the next step I used transitivity as a clustering criterion. Transitivity requires that every pair of expressions within a cluster has a similarity at or above that threshold, so a chain of pairwise links alone is not enough to form a cluster. This method outputs few but meaningful clusters for each article. Moreover, these clusters are non-overlapping by construction.
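To make the thresholding step concrete, here is a minimal sketch in plain Python. It is not taken from the linked code: the helper names `cosine` and `similar_pairs` and the three-dimensional toy vectors are invented for illustration. With spaCy, the vectors would presumably come from something like `Span.vector`, or the score from `Span.similarity`.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similar_pairs(vectors, threshold=0.8):
    """Return {(term_a, term_b): score} for all term pairs at or above the threshold."""
    pairs = {}
    for a, b in combinations(sorted(vectors), 2):
        score = cosine(vectors[a], vectors[b])
        if score >= threshold:
            pairs[(a, b)] = score
    return pairs


# Toy vectors, made up for this sketch; real embeddings have hundreds of dimensions.
toy_vectors = {
    "Donald Trump": [0.9, 0.1, 0.0],
    "President Trump": [0.8, 0.2, 0.1],
    "European Central Bank": [0.0, 0.9, 0.4],
    "Kyiv": [0.1, 0.0, 1.0],
}

pairs = similar_pairs(toy_vectors, threshold=0.8)
for (a, b), score in pairs.items():
    print(f"{a} <-> {b}: {score:.2f}")
```

Only the pairs that survive this filter are passed on to the clustering step, which keeps the number of candidates small.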
My code performs clustering in three steps. First, all combinations of pairs and their similarity scores are gathered in a list. Second, in order to check for transitivity, potential clusters need to be identified. This is a finite recursive problem, as one word in a pair is potentially linked to another pair, and so on. I consider a pool of pairs: for a given pair I check which other pairs are associated with it and take them out of the pool. Once all related pairs are gathered, the cluster candidate is complete. This procedure is then applied to the pairs remaining in the pool until the pool is empty. In the third and last step, the cluster candidates are checked for transitivity using matrix multiplication.
The code can be found here: https://github.com/lukasgrahl/miscellaneous/blob/main/src/transitivity.py
An example can be found here: https://github.com/lukasgrahl/miscellaneous/blob/main/notebooks/transitivity.ipynb
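For readers who do not want to open the repository, the second and third steps can be sketched roughly as follows. This is my own simplified reading of the description above, not the code in `transitivity.py`: the function names `cluster_candidates` and `is_transitive` and the placeholder terms are assumptions, and the matrix product is written out in plain Python rather than with a numerical library.

```python
def cluster_candidates(pairs):
    """Group pairs that share a term, directly or through a chain of other pairs."""
    pool = set(pairs)
    candidates = []
    while pool:
        cluster = set(pool.pop())  # seed the candidate with any remaining pair
        grew = True
        while grew:  # keep pulling in pairs that touch the growing candidate
            linked = {p for p in pool if cluster & set(p)}
            pool -= linked
            for p in linked:
                cluster.update(p)
            grew = bool(linked)
        candidates.append(sorted(cluster))
    return sorted(candidates)


def is_transitive(terms, pairs):
    """True if no two-step link connects terms that are not also linked directly."""
    linked = {frozenset(p) for p in pairs}
    n = len(terms)
    # Adjacency matrix of the thresholded pairs, with self-links on the diagonal.
    adj = [[int(i == j or frozenset((terms[i], terms[j])) in linked)
            for j in range(n)] for i in range(n)]
    # Matrix product adj @ adj: entry (i, j) counts the two-step paths from i to j.
    squared = [[sum(adj[i][k] * adj[k][j] for k in range(n))
                for j in range(n)] for i in range(n)]
    return all(adj[i][j] or not squared[i][j]
               for i in range(n) for j in range(n))


# Hypothetical pairs that passed the similarity threshold.
pairs = [("A", "B"), ("B", "C"), ("A", "C"), ("X", "Y"), ("Y", "Z")]

candidates = cluster_candidates(pairs)
clusters = [c for c in candidates if is_transitive(c, pairs)]
print(candidates)
print(clusters)
```

In this toy input the candidate A, B, C is kept because all three pairs are present, while X, Y, Z is dropped because X and Z are only connected through Y. Whether a rejected candidate should be discarded entirely or split into smaller clusters is a design choice this sketch leaves open.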
An example
Using this article from the Irish Times: https://www.irishtimes.com/world/europe/2023/08/31/ukraine-war-latest/
For this article I obtained these clusters: