Italian NER clarification #6489

LivingDeadCloud · 2020-12-03T09:11:14Z

Hello everyone

First of all, I'm fairly new to spaCy and NLP in general, so please be gentle if I'm asking trivial questions or not using the proper terms :)

I have done a little NER in English, and now I'm starting to use spaCy for NER on Italian medical records. At the moment I'm trying it on very simple sentences, but I noticed that "simple" entities, such as dates, are not recognized.
My sentence is "Claudio Crema, il nuovo CEO di Apple, ha deciso di comprare Amazon lo scorso Dicembre per 1 miliardo di dollari" (which, in English, means "Claudio Crema, Apple's new CEO, decided to buy Amazon last December for 1 billion dollars", just so you know the general meaning of the sentence).
The recognized entities are:
Claudio Crema = PER
Apple = ORG
Amazon = ORG
Dicembre = MISC (here I was expecting "scorso Dicembre" = DATE)
Moreover, I also noticed that "1 miliardo di dollari" is completely ignored, while I was expecting it to be classified as "MONEY".
I dug a bit into the documentation and found that the only NER labels for Italian are "LOC", "MISC", "ORG", "PER". So, if I understood correctly, anything that doesn't fit these four categories will not be recognized as an entity, or will be recognized as "MISC" if I'm lucky?
I saw that it is possible to add custom named entities. Could that be a way to extend the Italian NER labels?

Thanks a lot for your time and patience!

Your Environment

Operating System: Ubuntu
Python Version Used: 3.8.3
spaCy Version Used: 2.3.4
Environment Information: Linux-5.4.0-53-generic-x86_64-with-glibc2.10

Answered by adrianeboyd

adrianeboyd · 2020-12-03T13:47:07Z

Hi, the English and Italian models are trained on unrelated datasets (OntoNotes vs. WikiNER) that don't use the same label schemes.

If you have data with entity annotation, you can train a new model or extend an existing model with new types. Here are the basics for how to get started: https://spacy.io/usage/training#ner

The WikiNER corpus is available under a CC BY 4.0 license, so if it made sense for your task (if Wikipedia-style texts are similar to the texts you want to process), you could annotate your own additional entity types on this data and then train or extend a model using examples that contain both the old and new types. Otherwise, I'm not familiar with what's available for Italian; you may want to see whether other existing datasets could be useful.
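
To make "examples that contain both the old and new types" concrete, here is a minimal sketch in plain Python, assuming the spaCy 2.x simple training format of `(text, {"entities": [(start, end, label)]})` with character offsets. The `annotate` helper and the `TRAIN_DATA` name are hypothetical, and `DATE` and `MONEY` are just the labels the question hoped for, not labels the Italian model already has. The sentence is the one from the question.

```python
from typing import Dict, List, Tuple

# One training example: (text, {"entities": [(start_char, end_char, label), ...]})
Annotation = Tuple[str, Dict[str, List[Tuple[int, int, str]]]]


def annotate(text: str, spans: List[Tuple[str, str]]) -> Annotation:
    """Turn (surface string, label) pairs into character-offset annotations.

    Hypothetical helper: spans must be listed in the order they appear in the
    text, so repeated surface strings resolve to successive occurrences.
    """
    entities = []
    cursor = 0
    for surface, label in spans:
        start = text.find(surface, cursor)
        if start == -1:
            raise ValueError(f"{surface!r} not found after offset {cursor}")
        end = start + len(surface)
        entities.append((start, end, label))
        cursor = end
    return text, {"entities": entities}


TEXT = (
    "Claudio Crema, il nuovo CEO di Apple, ha deciso di comprare Amazon "
    "lo scorso Dicembre per 1 miliardo di dollari"
)

TRAIN_DATA = [
    annotate(
        TEXT,
        [
            ("Claudio Crema", "PER"),           # existing WikiNER-style label
            ("Apple", "ORG"),                   # existing label
            ("Amazon", "ORG"),                  # existing label
            ("scorso Dicembre", "DATE"),        # new label (assumption)
            ("1 miliardo di dollari", "MONEY"), # new label (assumption)
        ],
    )
]

# Print each label with the text it covers, as a sanity check on the offsets.
for start, end, label in TRAIN_DATA[0][1]["entities"]:
    print(label, repr(TEXT[start:end]))
```

Keeping the existing labels annotated alongside the new ones in every example is the point here: examples that mark only `DATE` and `MONEY` would implicitly tell the model that "Apple" and "Amazon" are not entities. A single sentence is only an illustration; real training needs many annotated examples.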
