SpaCy entity recognition improvement in Spanish/Catalan #6223
Asked by alexgg94 in Help: Other Questions (unanswered, 2 replies)
-
You will need to train a new NER component using training data annotated with the specific classes you are trying to predict. The pretrained Spanish models only know how to predict these classes: LOC, MISC, ORG, PER.
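As a rough illustration of what such training data can look like: in spaCy v2, a simple training example pairs a text with character-offset entity annotations. The sketch below builds one such pair from a fragment of the sample invoice, using only the standard library. The label names (DATE, AMOUNT, NAME, ID) and the helper make_example are hypothetical, chosen for illustration; they are not part of spaCy or of any pretrained model.

```python
# Hypothetical label names for the classes the question asks about;
# they are illustrative and not part of any pretrained spaCy model.
LABELS = ("DATE", "AMOUNT", "NAME", "ID")


def make_example(text, spans):
    """Build one (text, annotations) pair with character offsets.

    `spans` is a list of (substring, label) pairs. Each substring is
    located with str.find, searching from the end of the previous match,
    so spans must be given in the order they appear in the text.
    """
    entities = []
    cursor = 0
    for substring, label in spans:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        start = text.find(substring, cursor)
        if start == -1:
            raise ValueError(f"substring not found: {substring!r}")
        end = start + len(substring)
        entities.append((start, end, label))
        cursor = end
    return text, {"entities": entities}


example = make_example(
    "Data: 9/10/2020\nTOTAL 2,734.14 €",
    [("9/10/2020", "DATE"), ("2,734.14", "AMOUNT")],
)
print(example)
```

A real training set would need many such examples, ideally drawn from invoices similar to the ones you want to process.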
-
To add to that, you can find information on the pretrained Spanish models here (expand "label scheme"), and more information on (re)training an NER model here. Let us know if you run into specific issues!
-
Hi,
For a given text, I'm trying to extract as much information as possible about the words in it. More specifically, I'm trying to classify the words into the following classes: Date, Amount, Person Name / Company Name, and Person / Company Identifier (known in Spain as DNI/NIE/NIF).
A sample text to work with would be the following one:
'PROVEEDOR INMOVILIZADOS\n\nC/ del Mar, 101\n08000 — Barcecelona\n\nA55666777\nCLIENTEA\nc/ Muntanya, 300\n28080 — MADRIZ\nB77766655\nN° Factura: 01/20\nData: 9/10/2020\n\nConcepte Import\n- Articulo 1 1,250.00\n- Articulo 2 357.45\n- Articulo 3 652.17\n\nBase Imponible 2,259.62\nA la vista. 21% L.V.A. 474,52\n\nTOTAL 2,734.14 €'
As a first step, I load the text into a doc like this:
import spacy

nlp = spacy.load('es_core_news_lg')  # large pretrained Spanish model
doc = nlp(text)  # `text` holds the sample string above
The doc content looks correct:
PROVEEDOR INMOVILIZADOS
C/ del Mar, 101
08000 — Barcecelona
A55666777
CLIENTEA
c/ Muntanya, 300
28080 — MADRIZ
B77766655
N° Factura: 01/20
Data: 9/10/2020
Concepte Import
Base Imponible 2,259.62
A la vista. 21% L.V.A. 474,52
TOTAL 2,734.14 €
But then, when I try to get the doc's ents, this is the result:
PROVEEDOR MISC
C/ del Mar LOC
Barcecelona LOC
A55666777 MISC
CLIENTEA MISC
Muntanya LOC
MADRIZ MISC
B77766655 MISC
Data MISC
Concepte Import MISC
Articulo MISC
Articulo MISC
Articulo 3 652.17
Base Imponible MISC
A la vista MISC
Am I doing something wrong? Is there a way to improve the recognition?
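For the highly regular classes (dates, amounts, identifiers), one possible complement to a statistical model is plain rule-based matching. The sketch below is not the approach suggested in the replies; it is a minimal illustration using Python's re module. The patterns, the label names and the helper find_candidates are assumptions: the identifier pattern only checks the rough shape of a DNI/NIE/NIF-style code (no checksum validation), and the date and amount patterns cover only the formats seen in the sample text.

```python
import re

# Simplified, hypothetical patterns; they only cover the shapes that
# appear in the sample invoice and do not validate checksums.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "AMOUNT": re.compile(r"\b\d{1,3}(?:[.,]\d{3})*[.,]\d{2}\b"),
    "ID": re.compile(r"\b(?:[A-Z]\d{7}[0-9A-Z]|\d{8}[A-Z])\b"),
}


def find_candidates(text):
    """Return (start, end, label, matched_text) tuples sorted by position."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), label, match.group()))
    return sorted(found)


sample = "A55666777\nData: 9/10/2020\nTOTAL 2,734.14 €"
candidates = find_candidates(sample)
for start, end, label, value in candidates:
    print(label, value)
```

Within spaCy itself, the EntityRuler component offers a comparable rule-based mechanism that can be combined with the statistical NER.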
Your Environment
spaCy version: 2.3.2
Platform: Windows-10-10.0.18362-SP0
Python version: 3.7.1