Distillation

Introduction

1.1 Distillation

1.2 Knowledge Distillation

1.3 Intermediate Layer Knowledge Distillation

Usage

2.1 PyTorch Script

2.2 Tensorflow Script

2.3 Create an Instance of Metric

2.4 Create an Instance of Criterion(Optional)

2.5 Create an Instance of DistillationConfig

2.6 Distill with Trainer

Introduction

Distillation

Distillation is a widely used approach to network compression that transfers knowledge from a large model to a smaller one without significant loss of validity. Because smaller models are less expensive to evaluate, they can be deployed on less powerful hardware (such as a mobile device). In the distillation workflow, the teacher model takes the same input that is fed to the student model and produces an output that carries the teacher's knowledge, which is then used to instruct the student model.

Knowledge Distillation

Knowledge distillation was proposed in Distilling the Knowledge in a Neural Network. It uses the logits (the inputs to the softmax in classification tasks) of the teacher and student models to minimize the difference between their predicted class distributions. This can be done by minimizing the loss function below.

$$L_{KD} = D(z_t, z_s)$$

Here $D$ is a distance measure, e.g. Euclidean distance or Kullback–Leibler divergence, and $z_t$ and $z_s$ are the logits of the teacher and student models, or the distributions obtained by applying softmax to the logits when the distance is measured between distributions.
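
As an illustration of the loss above, here is a minimal sketch in plain Python that takes $D$ to be the Kullback–Leibler divergence between temperature-softened distributions. The helper names (`softmax`, `kl_divergence`, `kd_loss`) and the sample logits are assumptions made for this example; they are not part of the library's API.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution, softened by temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """L_KD = D(z_t, z_s), with D chosen as KL divergence on softened outputs."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return kl_divergence(p_teacher, p_student)

# Hypothetical logits for a 3-class problem
z_t = [4.0, 1.0, 0.5]
z_s = [2.5, 1.5, 0.2]
print(kd_loss(z_t, z_s, temperature=2.0))
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two distributions diverge.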

Intermediate Layer Knowledge Distillation

The teacher model contains more information than its logits alone. For example, the output features of the teacher model's intermediate layers are often used to guide the student model, as in Patient Knowledge Distillation for BERT Model Compression and MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices. The general loss function for this approach can be summarized as follows.

$$L_{KD} = \sum\limits_i D(T_t^{n_i}(F_t^{n_i}), T_s^{m_i}(F_s^{m_i}))$$

Here $D$ is a distance measure as before, $F_t^{n_i}$ is the output feature of the $n_i$-th layer of the teacher model, and $F_s^{m_i}$ is the output feature of the $m_i$-th layer of the student model. Since the dimensions of $F_t^{n_i}$ and $F_s^{m_i}$ usually differ, the transformations $T_t^{n_i}$ and $T_s^{m_i}$ are needed to match the dimensions of the two features. The transformation can take forms such as the identity, a linear transformation, or a 1x1 convolution.
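
The sum above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions: $D$ is mean squared error, the teacher-side transformation is the identity, and the student-side transformation is a linear projection. All names and values (`linear`, `mse`, `intermediate_kd_loss`, the feature vectors, and the projection matrix) are hypothetical.

```python
def linear(feature, weight):
    """Project a feature vector with an (out_dim x in_dim) weight matrix."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weight]

def mse(a, b):
    """Mean squared error between two vectors of equal length."""
    assert len(a) == len(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def intermediate_kd_loss(teacher_feats, student_feats, layer_pairs, student_proj):
    """Sum D(T_t(F_t^{n_i}), T_s(F_s^{m_i})) over the paired layers.

    T_t is the identity; T_s is a per-layer linear projection.
    """
    total = 0.0
    for n_i, m_i in layer_pairs:
        projected = linear(student_feats[m_i], student_proj[m_i])
        total += mse(teacher_feats[n_i], projected)
    return total

# Teacher features have dimension 4, student features have dimension 2
teacher_feats = {2: [1.0, 0.0, 1.0, 0.0], 4: [1.0, 1.0, 0.0, 0.0]}
student_feats = {1: [1.0, 0.0], 2: [0.5, 0.5]}
# One hypothetical 4x2 projection, reused for both student layers
proj = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
student_proj = {1: proj, 2: proj}
layer_pairs = [(2, 1), (4, 2)]  # (n_i, m_i): teacher layer, student layer

loss = intermediate_kd_loss(teacher_feats, student_feats, layer_pairs, student_proj)
print(loss)
```

In practice the projection weights would be trained together with the student, so the student learns features that map onto the teacher's.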

Usage

PyTorch Script:
```
from intel_extension_for_transformers.transformers import metrics
from intel_extension_for_transformers.transformers.trainer import NLPTrainer
from neural_compressor.config import DistillationConfig

# Replace transformers.Trainer with NLPTrainer
# trainer = transformers.Trainer(......)
trainer = NLPTrainer(......)

# Metric used to measure the performance of the distilled model
metric = metrics.Metric(name="eval_accuracy")
trainer.metrics = metric

# teacher_model and criterion must already be defined
d_conf = DistillationConfig(teacher_model=teacher_model, criterion=criterion)
model = trainer.distill(distillation_config=d_conf)
```

Please refer to the example for details.

Create an Instance of Metric

The Metric defines which metric will be used to measure the performance of tuned models.

example:
```
metric = metrics.Metric(name="eval_accuracy")
```
Please refer to the metrics document for details.

Create an Instance of Criterion(Optional)

The criterion is the loss function used in the training phase.

KnowledgeDistillationLossConfig arguments:

Argument	Type	Description	Default value
temperature	Float	parameter for KnowledgeDistillationLoss	1.0
loss_types	List of string	Type of loss	['CE', 'CE']
loss_weight_ratio	List of float	weight ratio of loss	[0.5, 0.5]
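
To make the roles of `temperature`, `loss_types`, and `loss_weight_ratio` concrete, here is a minimal plain-Python sketch of one plausible way the defaults could combine. It assumes, and this document does not confirm, that the first entry of `loss_types` and `loss_weight_ratio` applies to the student-versus-label loss and the second to the student-versus-teacher loss, with `'CE'` meaning cross-entropy. The function names and sample logits are hypothetical, and the library's actual computation may differ.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution, softened by temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted):
    """H(target, predicted) = -sum_k target_k * log(predicted_k)."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

def combined_loss(student_logits, teacher_logits, label,
                  temperature=1.0, loss_weight_ratio=(0.5, 0.5)):
    """Weighted sum of a hard-label loss and a teacher-guided loss (both 'CE')."""
    one_hot = [1.0 if k == label else 0.0 for k in range(len(student_logits))]
    # Assumed first term: student predictions against the ground-truth label
    hard = cross_entropy(one_hot, softmax(student_logits))
    # Assumed second term: student against the teacher's softened distribution
    soft = cross_entropy(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    w_hard, w_soft = loss_weight_ratio
    return w_hard * hard + w_soft * soft

# Hypothetical logits for a 3-class problem whose true class is 0
student_logits = [2.0, 1.0, 0.1]
teacher_logits = [3.0, 0.5, 0.2]
print(combined_loss(student_logits, teacher_logits, label=0, temperature=2.0))
```

A higher temperature flattens both distributions, so the teacher term puts more emphasis on the relative ranking of the non-target classes.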

IntermediateLayersKnowledgeDistillationLossConfig arguments:

Argument	Type	Description	Default value
loss_types	List of string	Type of loss	['CE', 'CE']
loss_weight_ratio	List of float	weight ratio of loss	[0.5, 0.5]
layer_mappings	List	parameter for IntermediateLayersLoss	[]
add_origin_loss	bool	parameter for IntermediateLayersLoss	False
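
The exact format expected by `layer_mappings` is not specified in this document. Purely to illustrate what pairing teacher layer $n_i$ with student layer $m_i$ means, the hypothetical helper below pairs layers at a uniform stride. The function name and the index-pair output are assumptions for this sketch, not the library's format.

```python
def uniform_layer_pairs(num_teacher_layers, num_student_layers):
    """Pair each student layer m with teacher layer m * stride (1-indexed)."""
    if num_student_layers <= 0 or num_teacher_layers % num_student_layers != 0:
        raise ValueError("teacher depth must be a positive multiple of student depth")
    stride = num_teacher_layers // num_student_layers
    # Each tuple is (teacher layer n_i, student layer m_i)
    return [(m * stride, m) for m in range(1, num_student_layers + 1)]

# A 12-layer teacher distilled into a 6-layer student
pairs = uniform_layer_pairs(12, 6)
print(pairs)
```

With a 12-layer teacher and a 6-layer student, every second teacher layer guides one student layer.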

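example:
```
from neural_compressor.config import KnowledgeDistillationLossConfig

# Use the default temperature, loss types, and loss weights
criterion = KnowledgeDistillationLossConfig()
```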

Create an Instance of DistillationConfig

The DistillationConfig contains all the information related to the model distillation behavior. Once you have created the Metric and Criterion instances, you can create an instance of DistillationConfig. Metric and pruner_config are optional.

arguments:

Argument	Type	Description	Default value
teacher_model	torch.nn.Module	teacher model object	None
criterion	Criterion	criterion of training	KnowledgeLoss object

example:
```
d_conf = DistillationConfig(teacher_model=teacher_model, criterion=criterion)
```

Distill with Trainer

NLPTrainer inherits from transformers.Trainer, so you can create a trainer as in the Transformers examples. You can then distill the model with the trainer.distill function.
```
model = trainer.distill(distillation_config=d_conf)
```
