Papers on knowledge distillation for NLP and ASR (mainly focused on BERT-like models).
[Important] marks important papers.
[NLP] marks NLP papers.
[ASR] marks ASR papers.
- Response-Based Knowledge: Distilling the Knowledge in a Neural Network (a minimal loss sketch appears after this list)
- Feature-Based Knowledge: FitNets: Hints for Thin Deep Nets
- Relation-Based Knowledge: A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning
- [NLP] PKD: Patient Knowledge Distillation for BERT Model Compression
- [Important] [NLP] DistilBERT: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
- [Important] [NLP] TinyBERT: TinyBERT: Distilling BERT for Natural Language Understanding
- [NLP] IR-KD: Knowledge Distillation from Internal Representations
- [NLP] MobileBERT: MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices
- [NLP] CKD: Why Skip If You Can Combine: A Simple Knowledge Distillation Technique for Intermediate Layers
- [NLP] ALP-KD: ALP-KD: Attention-Based Layer Projection for Knowledge Distillation
- [NLP] Co-DIR: Contrastive Distillation on Intermediate Representations for Language Model Compression
- [Important] [ASR] DistilHuBERT: DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT
- [NLP] RAIL-KD: RAIL-KD: RAndom Intermediate Layer Mapping for Knowledge Distillation
- [NLP] CoFi: Structured Pruning Learns Compact and Accurate Models
- [ASR] FitHuBERT: FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised Learning
- [ASR] LightHuBERT: LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT
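
Response-based knowledge distillation, the first entry above, underlies most of the BERT- and HuBERT-style papers in this list. Below is a minimal PyTorch sketch of that objective: a KL term between temperature-softened teacher and student logits blended with the ordinary hard-label cross-entropy. The function name and the default `temperature`/`alpha` values are illustrative choices, not taken from any specific paper here.

```python
import torch
import torch.nn.functional as F

def response_kd_loss(student_logits, teacher_logits, labels,
                     temperature=2.0, alpha=0.5):
    """Hinton-style response-based KD: soft-target KL + hard-label CE.

    `temperature` and `alpha` are illustrative defaults, not values
    prescribed by any paper in this list.
    """
    # Softened distributions; the teacher is assumed frozen (no grad).
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL between softened outputs; the T^2 factor keeps the soft-target
    # gradients on the same scale as the hard-label loss.
    kd = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2

    # Ordinary cross-entropy against the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)

    return alpha * kd + (1.0 - alpha) * ce

# Toy usage: a 10-class task with a batch of 4 examples.
student_logits = torch.randn(4, 10)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = response_kd_loss(student_logits, teacher_logits, labels)
```

Feature-, relation-, and layer-wise distillation (FitNets, PKD, DistilHuBERT, etc.) add further losses on intermediate representations on top of an objective of this general shape.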