Distilling NER from transformer-based models into spaCy #4867
Replies: 3 comments
-
Hey,

This is definitely something I've been interested in, but haven't got around to working on yet. So very keen to support you!

As a first approximation, have you tried just doing the simple thing and treating the transformer output as gold-standard? If you use an ensemble of two transformer models and maybe some post-processing rules, it might be close enough. I think you might not need to change the …

When making the change to this, it's worth paying attention to what the objective is really doing... it's a bit subtle. The parser and NER models are trained with an imitation learning objective: they're trying to predict whether each possible action they could take next will introduce new errors. We softmax across the set of actually zero-cost actions in order to calculate the gradient of the loss. It's not that straightforward to see how to modify this objective to take into account label uncertainty from your supervising model, because spaCy isn't predicting a distribution of scores over tags --- we're predicting a partial ordering over transition actions.

If the above sounds strange to you, the main motivation is that we get to condition on structures built over arbitrarily far-back transitions. This is especially helpful for the parser, but it's good for the NER too: our features include tokens for the first and last word of the currently open entity, which is not something you could express easily with a limited-order CRF. In order to make these features meaningful, the parser has to visit a realistic sample of transition sequences during training, including sequences that result from previous transition errors.
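To make the "treat the transformer output as gold-standard" baseline concrete, here is a minimal sketch of what it could look like with the spaCy v2 training loop. `teacher_entities` is a hypothetical stand-in for the transformer ensemble (plus any post-processing rules); it only needs to return character-offset entity spans.

```python
import random
import spacy
from spacy.util import minibatch

# Hypothetical stand-in for the transformer teacher (e.g. an ensemble
# plus post-processing rules). In practice this would run a BERT-style
# NER model and return (start_char, end_char, label) spans.
def teacher_entities(text):
    return [(0, 7, "PERSON")]  # placeholder prediction

texts = ["Matthew founded a company in Berlin."]  # unlabelled corpus

# Treat the teacher's predictions as gold-standard annotations.
train_data = [(text, {"entities": teacher_entities(text)}) for text in texts]

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
for _, annots in train_data:
    for _, _, label in annots["entities"]:
        ner.add_label(label)

optimizer = nlp.begin_training()
for epoch in range(10):
    random.shuffle(train_data)
    losses = {}
    for batch in minibatch(train_data, size=8):
        batch_texts, batch_annots = zip(*batch)
        nlp.update(batch_texts, batch_annots, sgd=optimizer, losses=losses)
    print(epoch, losses)
```

No spaCy internals change under this approach: the teacher's uncertainty is simply discarded by committing to its argmax predictions, which is exactly the limitation the rest of this thread is about.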
-
Greetings @honnibal, and I apologise for adding my own question about distillation here; I've been Googling it and this is the only similar post. My question is about distilling a transformer text classification model into spaCy's TextCat. Specifically, would it be possible to customise the TextCat model (inside thinc, I would have thought) to take continuous labels (probabilities from a BERT model, for example) as the target? Effectively changing the classification into regression with RMSE as the loss, I suppose. The reason is similar: I would like to put the spaCy model in production for efficiency. I hope my description is clear. Thank you very much.
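One observation that may save the thinc surgery: spaCy v2's training annotations document the `cats` values as booleans *or* floats between 0 and 1, and as far as I can tell the `TextCategorizer` loss is already roughly a squared error against those values. So the teacher's probabilities might be usable as soft targets directly. A minimal sketch, where `teacher_probs` is a hypothetical stand-in for the BERT classifier:

```python
import spacy

# Hypothetical stand-in for the BERT teacher: one probability per category.
def teacher_probs(text):
    return {"POSITIVE": 0.82, "NEGATIVE": 0.18}  # placeholder output

texts = ["I really enjoyed this film."]  # unlabelled corpus

# spaCy v2's "cats" values may be floats in [0, 1], so the teacher's
# probabilities can be passed straight through as soft targets.
train_data = [(text, {"cats": teacher_probs(text)}) for text in texts]

nlp = spacy.blank("en")
textcat = nlp.create_pipe("textcat")
nlp.add_pipe(textcat)
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

# ...then train with the ordinary v2 loop, e.g.:
optimizer = nlp.begin_training()
for text, annots in train_data:
    nlp.update([text], [annots], sgd=optimizer)
```

If the built-in loss turns out not to match what is needed, swapping in a different objective would indeed mean customising the model in thinc.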
-
Hi Matt (@honnibal),

spacy/syntax/transition_system.set_costs assigns costs to each label based on rules (e.g. B-PERSON cannot follow a B-PERSON). We ideally look for zero-cost situations, and there can be more than one for a given token (that is, more than one possible label for a token). spacy/syntax/_parser_model.cpu_log_loss then does something like the attached. It's interesting to note that the loss for incorrect labels will be positive, and for correct, probable labels, negative. These losses are then used to back-propagate and learn. Hopefully this analysis is correct.

For our distillation situation: if we do not compute a per-token cost, and instead of calculating the softmax over 'b' (the zero-cost situations) we supply the label probabilities from the supervising model, then the loss scores will be calculated as softmax(score[i]) - label_probability[i] (note that label_probability[i] is already a softmax). This new score will then be used in the back-propagation logic. The fallout of this approach is that we ignore a very good piece of logic set out in set_costs, where prohibited labels are given large loss scores (d_score). Let me know if this sounds strange to you or if you have something else to suggest.

Looking into the logic implemented in spaCy, let me tell you how much I am in awe right now. Fantastic job! Thanks for developing spaCy.

Best regards,
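To make the comparison concrete, here is a rough numpy rendering of the gradient being described and of the proposed change. This is only a sketch: the real `cpu_log_loss` is Cython and operates per parser state, and every number below is invented.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# One parser state, four candidate transitions (all values invented).
scores = np.array([2.0, 1.0, 0.5, -1.0])  # model scores per transition
costs = np.array([0.0, 0.0, 3.0, 9.0])    # from transition_system.set_costs

# Roughly what cpu_log_loss does today: the target is the model's own
# distribution, renormalised over the zero-cost ("gold") transitions.
is_gold = costs <= costs.min()
p = softmax(scores)
target = np.where(is_gold, p, 0.0)
target /= target.sum()
d_scores = p - target  # positive for costed actions, negative for gold ones

# The proposed distillation variant: replace that target with the
# teacher's (already softmaxed) probabilities over the same transitions.
teacher = np.array([0.65, 0.25, 0.08, 0.02])  # hypothetical teacher output
d_scores_distill = p - teacher

# One way to keep set_costs' hard constraints despite the change: zero
# the teacher's mass on prohibited transitions and renormalise first.
allowed = costs < 9.0  # assume 9.0 marks a prohibited transition here
masked = np.where(allowed, teacher, 0.0)
masked /= masked.sum()
d_scores_constrained = p - masked
```

The masking step at the end is one possible answer to the "fallout" mentioned above: the teacher distribution is clipped to the transitions that set_costs would allow, so prohibited actions still receive a purely positive gradient.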
-
I am trying to distil NER knowledge from transformer-based models into spaCy (the original, non-transformer offering, v2.0+). I am doing this to improve accuracy while keeping spaCy's speed. The target is to use the distilled model in production applications. There are 2 changes to make ..
I don't see anything on the internet that tells me this has ever been done before, and I am wondering why! Are the above the right locations to make the changes, and has anybody tried to see whether this works?
Regards.