Distilling NER from transformer-based models into spaCy #4867
Replies: 3 comments
-
Hey,

This is definitely something I've been interested in, but haven't got around to working on yet. So very keen to support you!

As a first approximation, have you tried just doing the simple thing and treating the transformer output as gold-standard? If you use an ensemble of two transformer models and maybe some post-processing rules, it might be close enough. I think you might not need to change the …

When making the change to this, it's worth paying attention to what the objective is really doing... it's a bit subtle. The parser and NER models are trained with an imitation learning objective: they're trying to predict whether each possible action they could take next will introduce new errors. We softmax across the set of actually zero-cost actions in order to calculate the gradient of the loss. It's not that straightforward to see how to modify this objective to take into account label uncertainty from your supervising model, because spaCy isn't predicting a distribution of scores over tags --- we're predicting a partial ordering over transition actions.

If the above sounds strange to you, the main motivation is that we get to condition on structures built over arbitrarily far-back transitions. This is especially helpful for the parser, but it's good for the NER too: our features include tokens for the first and last word of the currently open entity, which is not something you could express easily with a limited-order CRF. In order to make these features meaningful, the parser has to visit a realistic sample of transition sequences during training, including sequences that result from previous transition errors.
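To make the "treat the transformer output as gold-standard" baseline concrete, here is a minimal sketch of what it could look like with the spaCy v2 training loop. `teacher_entities` is a hypothetical stand-in for the transformer ensemble (plus any post-processing rules); it only needs to return character-offset entity spans.

```python
import random
import spacy
from spacy.util import minibatch

# Hypothetical stand-in for the transformer teacher (e.g. an ensemble
# plus post-processing rules). In practice this would run a BERT-style
# NER model and return (start_char, end_char, label) spans.
def teacher_entities(text):
    return [(0, 7, "PERSON")]  # placeholder prediction

texts = ["Matthew founded a company in Berlin."]  # unlabelled corpus

# Treat the teacher's predictions as gold-standard annotations.
train_data = [(text, {"entities": teacher_entities(text)}) for text in texts]

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
for _, annots in train_data:
    for _, _, label in annots["entities"]:
        ner.add_label(label)

optimizer = nlp.begin_training()
for epoch in range(10):
    random.shuffle(train_data)
    losses = {}
    for batch in minibatch(train_data, size=8):
        batch_texts, batch_annots = zip(*batch)
        nlp.update(batch_texts, batch_annots, sgd=optimizer, losses=losses)
    print(epoch, losses)
```

No spaCy internals change under this approach: the teacher's uncertainty is simply discarded by committing to its argmax predictions, which is exactly the limitation the rest of this thread is about.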
-
Greetings @honnibal, and I apologise for adding my own question about distillation here; I've been Googling it and this is the only similar post. My question is about distilling a transformer text classification model into spaCy's TextCat. Specifically, would it be possible to customise the TextCat model (inside thinc, I would have thought) to take continuous labels (probabilities from a BERT model, for example) as the target? Effectively changing the classification into regression with RMSE as the loss, I suppose. The reason is similar: I would like to put the spaCy model in production for efficiency. I hope my description is clear. Thank you very much.
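One observation that may save the thinc surgery: spaCy v2's training annotations document the `cats` values as booleans *or* floats between 0 and 1, and as far as I can tell the `TextCategorizer` loss is already roughly a squared error against those values. So the teacher's probabilities might be usable as soft targets directly. A minimal sketch, where `teacher_probs` is a hypothetical stand-in for the BERT classifier:

```python
import spacy

# Hypothetical stand-in for the BERT teacher: one probability per category.
def teacher_probs(text):
    return {"POSITIVE": 0.82, "NEGATIVE": 0.18}  # placeholder output

texts = ["I really enjoyed this film."]  # unlabelled corpus

# spaCy v2's "cats" values may be floats in [0, 1], so the teacher's
# probabilities can be passed straight through as soft targets.
train_data = [(text, {"cats": teacher_probs(text)}) for text in texts]

nlp = spacy.blank("en")
textcat = nlp.create_pipe("textcat")
nlp.add_pipe(textcat)
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

# ...then train with the ordinary v2 loop, e.g.:
optimizer = nlp.begin_training()
for text, annots in train_data:
    nlp.update([text], [annots], sgd=optimizer)
```

If the built-in loss turns out not to match what is needed, swapping in a different objective would indeed mean customising the model in thinc.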
-
Hi Matt (@honnibal),

spacy/syntax/transition_system.set_costs assigns costs to each label based on rules (e.g. B-PERSON cannot follow a B-PERSON). We ideally look for zero-cost situations, and there can be more than one for a given token (that is, more than one possible label for a token). spacy/syntax/_parser_model.cpu_log_loss then does something like the attached. It's interesting to note that the loss for incorrect labels will be positive, and for correct, probable labels, negative. These losses are then used to back-propagate and learn. Hopefully this analysis is correct.

For our distillation situation: if we do not compute a per-token cost, and instead of calculating the softmax over 'b' (the zero-cost situations) we supply the label probabilities from the supervising model, then the loss scores will be calculated as softmax(score[i]) - label_probability[i] (note that label_probability[i] is already a softmax). This new score will then be used in the back-propagation logic. The fallout of this approach is that we ignore a very good piece of logic set out in set_costs, where prohibited labels are given large loss scores (d_score). Let me know if this sounds strange to you or if you have something else to suggest.

Looking into the logic implemented in spaCy, let me tell you how much I am in awe right now. Fantastic job! Thanks for developing spaCy.

Best regards,
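To make the comparison concrete, here is a rough numpy rendering of the gradient being described and of the proposed change. This is only a sketch: the real `cpu_log_loss` is Cython and operates per parser state, and every number below is invented.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# One parser state, four candidate transitions (all values invented).
scores = np.array([2.0, 1.0, 0.5, -1.0])  # model scores per transition
costs = np.array([0.0, 0.0, 3.0, 9.0])    # from transition_system.set_costs

# Roughly what cpu_log_loss does today: the target is the model's own
# distribution, renormalised over the zero-cost ("gold") transitions.
is_gold = costs <= costs.min()
p = softmax(scores)
target = np.where(is_gold, p, 0.0)
target /= target.sum()
d_scores = p - target  # positive for costed actions, negative for gold ones

# The proposed distillation variant: replace that target with the
# teacher's (already softmaxed) probabilities over the same transitions.
teacher = np.array([0.65, 0.25, 0.08, 0.02])  # hypothetical teacher output
d_scores_distill = p - teacher

# One way to keep set_costs' hard constraints despite the change: zero
# the teacher's mass on prohibited transitions and renormalise first.
allowed = costs < 9.0  # assume 9.0 marks a prohibited transition here
masked = np.where(allowed, teacher, 0.0)
masked /= masked.sum()
d_scores_constrained = p - masked
```

The masking step at the end is one possible answer to the "fallout" mentioned above: the teacher distribution is clipped to the transitions that set_costs would allow, so prohibited actions still receive a purely positive gradient.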
-
I am trying to distil NER knowledge from transformer-based models into spaCy (the original, non-transformer offering, v2.0+). I am doing this to improve accuracy while keeping spaCy's speed. The target is to use the distilled model in production applications. There are 2 changes to make ..
I don't see anything on the internet that tells me this has ever been done before, and I am wondering why! Are the above the right locations to make the changes, and has anybody tried to see whether this works?
Regards.