Does spacy's NER models generalize - How can I make a NER model to detect correctly new words? #6594

echatzikyriakidis · 2020-12-18T10:10:32Z

I have created a German model and I test it with:

"Boris Johnson wurde in Google gearbeitet"

The result I get is two entities detected for Boris Johnson and Google and they are correct.

Both Boris Johnson and Google were in my dataset.

However, when I test the model and replace Google with something else, e.g. Yahoo, it doesn't work.

Why is that? I read that spaCy models generalize and learn from local features and surrounding context.

How can I make a model generalize and detect new pieces of text (new names, new companies, etc.), not only the ones that exist in the training dataset?

mehmetilker · 2020-12-18T14:00:48Z

I am having the same problem as above with spaCy 3rc02 as well.

I have used this project as a template: https://github.com/explosion/projects/tree/v3/pipelines/ner_wikiner
The only difference is that my data contains text and entities {"text": "...", "entities": [(5, 8, 'ORG'), ...]} (converted to the v3 binary format for training), while wikiner contains POS tags as well (Word/NN/B-ORG etc...)

Training ended with 97% accuracy. If the text contains entities from the training dataset, it works; otherwise there is no result, as in the similar cases above...

I have tried adding word vectors as well. Nothing changed.

6 replies

echatzikyriakidis Dec 18, 2020
Author

@mehmetilker Also, I would like to note that I also use the same data structure for annotated entities in text using start and end indexes.

mehmetilker Dec 20, 2020

I think the initial data structure is not important. In the end it has to be in the binary format to train in v3.
I have a ready-to-use script to export in JSONL format ({"text": "...", "entities": [(5, 8, 'ORG'), ...]})
But v3 does not have a function to convert from JSONL to the binary format, so I first converted to the JSON format with v2 and then to the binary format with v3.

I have around 10 labels and 130K examples. Mostly PERSON and a little WORK_OF_ART, but that shouldn't cause too much of a problem I guess, as long as the dataset is big enough.

I do not have a solution for now, still experimenting...

mehmetilker Dec 20, 2020

Here is my label distribution to give an idea...

[
('CAPITAL', 7734),
('CARDINAL', 12373),
('CITY', 39507),
('CODE_NAME', 227),
('COMPANY', 7996),
('COUNTRY', 103551),
('DATE', 7595),
('DAY', 1125),
('EVENT', 12804),
('FAC', 6257),
('HASHTAG', 18),
('LAW', 2809),
('LOC', 9255),
('MISC', 36754),
('MONEY', 7892),
('MONTH', 2758),
('NATIONALITY', 11780),
('NORP', 8464),
('NOUNPHRASE', 7135),
('ORG', 78927),
('PERCENT', 4833),
('PERSON', 20547),
('PLATE', 1),
('PRODUCT', 564),
('QUANTITY', 27057),
('TIME', 550),
('TITLE', 62353),
('URL', 185),
('USERNAME', 1),
('WORK_OF_ART', 459)]

echatzikyriakidis Dec 20, 2020
Author

What is the meaning of the number next to each label? For example, do you have 7734 capitals across all documents?

mehmetilker Dec 21, 2020

That's right. It is the number of CAPITAL, LOC etc. across all documents...

svlandeg · 2020-12-18T16:19:04Z

Maintainer

I read that spaCy models generalize and learn from local features and surrounding context.
How can I make a model to generalize and detect new pieces of text. New names, new companies etc. Not only the ones that exist in the training dataset.

The models should in fact already do that. How big is the training set you're training on? You want to inspect specifically the number of training instances per class, and the variety in them. The bigger and more varied your training data, the more general your model will be.
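A minimal sketch of that inspection, assuming the data is in the {"text": ..., "entities": [(start, end, label), ...]} shape mentioned earlier in this thread. The function name `label_stats` and the toy sentences are made up for illustration. It counts mentions per label and how many distinct surface forms each label has, which is one rough proxy for variety:

```python
from collections import Counter, defaultdict


def label_stats(examples):
    """Return {label: (mention count, distinct surface form count)}."""
    mentions = Counter()
    surface_forms = defaultdict(set)
    for example in examples:
        text = example["text"]
        for start, end, label in example["entities"]:
            mentions[label] += 1
            surface_forms[label].add(text[start:end])
    return {label: (mentions[label], len(surface_forms[label])) for label in mentions}


# Toy data, invented for illustration only.
examples = [
    {"text": "Boris Johnson arbeitet bei Google.",
     "entities": [(0, 13, "PERSON"), (27, 33, "ORG")]},
    {"text": "Angela Merkel besucht Google.",
     "entities": [(0, 13, "PERSON"), (22, 28, "ORG")]},
]

for label, (n_mentions, n_distinct) in sorted(label_stats(examples).items()):
    print(f"{label}: {n_mentions} mentions, {n_distinct} distinct surface forms")
```

A label with many mentions but very few distinct surface forms (like ORG in the toy data) is a hint that the model may be able to get by with memorizing names.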

To determine whether or not you're overfitting, it would be a good idea to get a dev dataset that is independent of your training dataset, and measure the performance (F-score, accuracy, ...) on that dev dataset while you're training. When your training loss keeps going down but your dev performance gets worse, that's the point where you're overfitting.
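As a rough illustration of that check, here is a small sketch that scans a list of per-epoch dev scores and reports where the best score was and where you would stop. The function name and the `patience` parameter are assumptions for this example, not spaCy API:

```python
def best_epoch_with_patience(dev_scores, patience=3):
    """Return (best_epoch, stop_epoch) for a list of per-epoch dev scores.

    Training is considered finished once the dev score has not improved
    for `patience` epochs in a row; otherwise it runs to the last epoch.
    """
    best_epoch = 0
    best_score = float("-inf")
    for epoch, score in enumerate(dev_scores):
        if score > best_score:
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(dev_scores) - 1


# Hypothetical dev F-scores: they peak at epoch 2 and then degrade.
dev_f_scores = [0.70, 0.80, 0.85, 0.84, 0.83, 0.82, 0.81]
best, stop = best_epoch_with_patience(dev_f_scores, patience=3)
print(f"best epoch: {best}, stop at epoch: {stop}")
```

The model you keep is the one from the best epoch, not the one from the last epoch.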

4 replies

echatzikyriakidis Dec 18, 2020
Author

@svlandeg Thank you!

I have done exactly that.

I have split my dataset into train/validation/test sets with ratios 75%, 15%, 15%.

Using these sets:

I train various models (with different hyperparams) only on training set.

I measure F-score metric on validation set.

I select the model that performed best on validation set and retrain it on all data.

At the end I just report the F-score metric on test set.

I have trained it for 40 epochs, the train loss decreases and validation F-score increases.

I have 98-99% validation F-score on all entities.

I have 20k documents and I train on whole documents (web articles).

I will try with more documents and also split them into paragraphs. Do you think it will perform better if I split them into paragraphs?

Also the following is interesting:

"You want to inspect specifically the number of training instances per class, and the variety in them. The bigger and more varied your training data"

because I know that my entity types are not distributed equally. I have more LOCATIONS than PERSONS, etc. Also, I need to ensure that all splits (train/validation/test sets) represent all entities.

Can you explain this more? Or even provide some statistical code that measures the variance of entities across articles?

I have the usual data structure with texts and "entities" with start end indexes.

echatzikyriakidis Dec 18, 2020
Author

@svlandeg We could also have a short call whenever you are available, so I can show you my notebooks and we can do some sanity checks on the model. Your feedback would be valuable!

svlandeg Dec 18, 2020
Maintainer

I'm afraid I don't quite have the bandwidth to provide 1-1 support! Also, we think it's really helpful to have the support in online threads for others to benefit from as well.

echatzikyriakidis Dec 18, 2020
Author

Yes. So how could we solve that? Do you want me to provide some code or some results?

bratao · 2020-12-21T23:52:36Z

@echatzikyriakidis I think the best way is to try external word embeddings trained on a huge corpus.
Maybe you can also have some luck with subword features and beam search.
The trained network must have some clue about what makes Yahoo a company if it has never seen it. Is it the capitalization? Is it the context?

7 replies

echatzikyriakidis Aug 10, 2021
Author

Hi @sudarshan-koirala,

No, unfortunately the project I was working on has stopped and I never looked into it further. It would be great if you could find a solution to this.

Best,
Efstathios

sudarshan-koirala Aug 10, 2021

@echatzikyriakidis thanks for the reply. I will post here if I find a solution. @polm can you provide some suggestions to tackle this problem?

polm Aug 11, 2021

@sudarshan-koirala Do not @ individual maintainers to get their attention, it's disruptive.

sudarshan-koirala Aug 11, 2021

Sorry Paul, I won't do it in the future. Despite that, any help from any maintainers would be appreciated, as this issue still exists. Thanks

polm · 2021-08-12T05:44:26Z

As someone said further upthread, the NER model should absolutely be learning
to handle unseen tokens by default. If you have proper validation data it
should be able to confirm this; if you have validation data, and your model
does well on the validation but not in practice, it's possible your total
training data doesn't have a lot of entity variation.
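
One way to probe this, sketched under the assumption that the data uses the {"text": ..., "entities": [(start, end, label), ...]} shape from earlier in the thread: measure how many dev entity mentions never occur in the training set. The function name is made up for this example. If almost nothing in the dev set is unseen, a high dev score says little about how the model handles new names:

```python
def unseen_entity_share(train_examples, dev_examples):
    """Fraction of dev mentions whose (lowercased text, label) is absent from train."""

    def mentions(examples):
        for example in examples:
            for start, end, label in example["entities"]:
                yield example["text"][start:end].lower(), label

    seen = set(mentions(train_examples))
    dev_mentions = list(mentions(dev_examples))
    if not dev_mentions:
        return 0.0
    unseen = sum(1 for mention in dev_mentions if mention not in seen)
    return unseen / len(dev_mentions)


# Toy data, invented for illustration only.
train = [{"text": "Google stellt ein.", "entities": [(0, 6, "ORG")]}]
dev = [
    {"text": "Google expandiert.", "entities": [(0, 6, "ORG")]},
    {"text": "Yahoo expandiert.", "entities": [(0, 5, "ORG")]},
]
print(f"unseen share in dev: {unseen_entity_share(train, dev):.2f}")
```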

That said, in general you can try to use data augmentation to improve this. For
example, suppose you're just working on recognizing company names. Make a list
of company names and given your training data, randomly swap out the existing
company names to create more training examples. To make even more examples,
apply random changes to company names, like deleting letters or using weird but
plausible capitalization. This can help make up for a lack of training data.
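
A minimal sketch of the name-swapping idea, assuming character-offset annotations in the {"text": ..., "entities": [(start, end, label), ...]} shape and non-overlapping entities. The function name and the replacement list are hypothetical. The important part is that offsets are recomputed after each swap, since the new name usually has a different length:

```python
import random


def swap_entities(example, label, replacements, rng):
    """Replace every entity of `label` with a random name and fix up offsets."""
    text = example["text"]
    parts = []
    entities = []
    cursor = 0  # position in the original text
    length = 0  # length of the new text built so far
    for start, end, ent_label in sorted(example["entities"]):
        before = text[cursor:start]
        surface = rng.choice(replacements) if ent_label == label else text[start:end]
        parts.extend([before, surface])
        new_start = length + len(before)
        entities.append((new_start, new_start + len(surface), ent_label))
        length = new_start + len(surface)
        cursor = end
    parts.append(text[cursor:])
    return {"text": "".join(parts), "entities": entities}


# Toy example and company list, invented for illustration only.
example = {"text": "Boris Johnson arbeitet bei Google.",
           "entities": [(0, 13, "PERSON"), (27, 33, "ORG")]}
companies = ["Yahoo", "Siemens", "Deutsche Bahn"]
rng = random.Random(0)
for _ in range(3):
    print(swap_entities(example, "ORG", companies, rng))
```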

When understanding why a particular entity is found or not, it's good to keep
in mind what spaCy uses as input:

context (text surrounding the token)
the literal token text
token shape, prefix, suffix

You can look at each of these for a given example and ask yourself which might
cause a problem. If all of them look fine then maybe your model doesn't have
enough data or hasn't seen a similar example. Let's walk through them briefly:

Is the context OK? If you replaced the entity with a blank in the sentence,
could you still predict the type? Would another type, or a non-entity noun,
also be acceptable?

John works at [BLANK]

This would not be a GPE, and would probably be an ORG. But it could also be a
generic noun, like "the library" or "night", or a LOC, like "the Blue Dog
Cafe". So it may look simple but can actually be a little ambiguous - if it
were "John was just hired by [BLANK]" it would be less ambiguous.

Is the literal token OK? This is usually easy to check. Unknown tokens are
tricky but shouldn't be a dealbreaker, depending on their shape (which we'll
cover next). Tokens that are common but usually not entities are harder though.
To give an extreme case, the model would have a hard time if we said this:

John works at The

Maybe John was hired by a company called "The", but the model will have a hard
time understanding that (just like I would honestly). You probably won't have
many things that bad, but "Alphabet" can be hard for similar reasons. Check if
your model has trouble with similar cases. If it does, this is relatively hard
to address, but you can intentionally augment data with capitalized common nouns as
company names, for example.

Is the shape what you would expect? To avoid just memorizing tokens, spaCy
internally uses a variety of "word shape features". The most basic ones are
prefixes and suffixes - the first few and last few characters of a word get
their own representation in the model. The full word shape reduces a word to
its outline, so "Google" is like "Xxxxxx" and "McCann" is like "XxXxxx" and
"O'Leary" is like "X'Xxxxx" or something. This lets spaCy learn about things
like capitalization, in-word punctuation, or the use of numbers. (The actual
details are a little more complicated.)
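
To make the idea concrete, here is a deliberately simplified shape function. It is not spaCy's implementation (whose details differ, as noted above), just a sketch of the mapping described here: uppercase letters become "X", lowercase letters "x", digits "d", and everything else is kept as-is.

```python
def simple_word_shape(token):
    """Simplified word shape: X for uppercase, x for lowercase, d for digits."""
    shape = []
    for ch in token:
        if ch.isalpha():
            shape.append("X" if ch.isupper() else "x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)  # keep punctuation such as apostrophes
    return "".join(shape)


for word in ["Google", "McCann", "O'Leary", "yahoo"]:
    print(word, simple_word_shape(word))
```

With a feature like this, "Yahoo" and "Google" look similar to the model even if "Yahoo" never appeared in training, while "yahoo" does not.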

The takeaway is that if your text is not formatted like the
training data - maybe it's from text messages and words aren't capitalized -
the model will not have seen uncapitalized entities and may assume such words
are not entities. This can be rather easily fixed by augmenting data at
training time to randomly vary case. (We already do some of this in the default
models.)
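
A minimal sketch of case augmentation on your own data, assuming the {"text": ..., "entities": [(start, end, label), ...]} shape used earlier in the thread. The function name and the `probability` parameter are made up for this example. Lowercasing normally keeps the text length the same, so the entity offsets can be reused; the rare inputs where the length changes are left untouched:

```python
import random


def randomly_lowercase(example, rng, probability=0.5):
    """With the given probability, return a lowercased copy of the example."""
    if rng.random() >= probability:
        return example
    lowered = example["text"].lower()
    if len(lowered) != len(example["text"]):
        # A few characters change length when lowercased, which would
        # shift the offsets, so such examples are returned unchanged.
        return example
    return {"text": lowered, "entities": list(example["entities"])}


# Toy example, invented for illustration only.
example = {"text": "Boris Johnson arbeitet bei Google.",
           "entities": [(0, 13, "PERSON"), (27, 33, "ORG")]}
rng = random.Random(0)
augmented = [randomly_lowercase(example, rng) for _ in range(4)]
for item in augmented:
    print(item["text"])
```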
