Knowledge Distillation


This article is based on Hinton's 2015 paper, "Distilling the Knowledge in a Neural Network" [1], and incorporates subsequent developments in knowledge distillation techniques.

1. Background and Motivation

In the machine learning lifecycle, the requirements during the training phase differ significantly from those during the deployment phase. Training involves extracting complex data structures from large datasets, a process that can be time-consuming and typically doesn’t require real-time operation.

In contrast, the deployment phase, especially when serving a large number of users, often demands high real-time performance and low latency.

Traditional approaches like training ensembles of models or deploying a single, very large model often fail to meet these stringent deployment requirements. Therefore, methods proposed by researchers like Hinton and Rich Caruana focus on model “compression”, effectively transferring knowledge from a large, cumbersome model to a smaller, more efficient one.

2. The Role of Temperature

For an introduction to temperature, please read my blog post Temperature in LLM.

As highlighted in Hinton's paper, the probability that a model classifies an image of a BMW as a garbage truck might be extremely low, but it's still significantly higher than the probability of classifying it as a carrot. Using a higher temperature $T$ during distillation helps to amplify and transfer these subtle relationships (the "dark knowledge") between less likely classes from the teacher to the student. This concept of temperature is also crucial in large language model (LLM) generation.
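
As a rough illustration, here is a minimal NumPy sketch of a temperature-scaled softmax. The class names and logit values are invented for illustration and are not taken from the paper:

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits divided by temperature T."""
    z = np.asarray(logits, dtype=np.float64) / T
    z -= z.max()                 # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Hypothetical logits for the classes [garbage truck, BMW, carrot]
logits = [2.0, 8.0, -3.0]

print(softmax_with_temperature(logits, T=1.0))  # ~[0.002, 0.998, 0.00002]: nearly one-hot
print(softmax_with_temperature(logits, T=5.0))  # ~[0.21, 0.71, 0.08]: same ranking, but softer
```

Raising $T$ never changes the ranking of the classes; it only makes the small probabilities large enough for the student to notice and match.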

3. Knowledge Distillation Method

The fundamental knowledge distillation approach utilizes a larger “teacher” network and a smaller “student” network. The goal is to enable the student network (with fewer parameters) to learn and replicate the knowledge embedded within the teacher network.

3.1 Knowledge Distillation Process

  1. Train a Teacher Model: First, train a complex “teacher” network. This could be a single large model or an ensemble of multiple models, optimized for high accuracy.
  2. Distill Knowledge to Student Model: Use the output of the trained teacher network to supervise the training of a simpler "student" network. Crucially, the teacher's outputs are "softened" by using a temperature $T > 1$ in its softmax layer. These soft labels (probability distributions) are used as targets for the student model. This allows the student to learn the teacher's "dark knowledge," including nuanced information about similarities between classes.

The distillation training process is typically guided by a combined loss function:

$$\mathcal{L} = \alpha \mathcal{L}_{\text{soft}} + \beta \mathcal{L}_{\text{hard}}$$

where $\mathcal{L}_{\text{soft}}$ measures how closely the student's temperature-softened distribution matches the teacher's soft targets, $\mathcal{L}_{\text{hard}}$ is the standard cross-entropy against the ground-truth labels, and $\alpha$, $\beta$ weight the two terms.

At inference time, the temperature is set back to $T = 1$, in accordance with the standard softmax procedure.
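
Below is a minimal PyTorch sketch of this combined loss. The function and variable names (`distillation_loss`, `student_logits`, `teacher_logits`) and the default values of `T`, `alpha`, and `beta` are illustrative assumptions, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7, beta=0.3):
    """Combined distillation loss: alpha * L_soft + beta * L_hard."""
    # Soft targets: teacher and student distributions computed at temperature T.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence between the softened distributions; the T**2 factor keeps its
    # gradient magnitude comparable to the hard loss, as noted in Hinton et al.
    l_soft = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T ** 2)
    # Hard targets: standard cross-entropy against the ground-truth labels (T = 1).
    l_hard = F.cross_entropy(student_logits, labels)
    return alpha * l_soft + beta * l_hard

# Example usage with random tensors standing in for a batch of 8 samples and 10 classes.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```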

In knowledge distillation, the student model learns not only the correct classification result but also the full probability distributions produced by the teacher model. This is why even a small student model can achieve good results.

The probability distribution carries information about the relationships between classes. In handwritten digit recognition, for example, the most probable prediction for an image of a digit 2 is class 2, but the predicted probability for class 3 may also be relatively high, indicating that the two digits look similar and are harder to distinguish.
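
For instance, a softened teacher output for an image of the digit 2 might look like the distribution below; the numbers are invented purely for illustration:

```python
# Hypothetical softened teacher probabilities for an image of the digit "2" (classes 0-9).
teacher_soft_probs = [0.01, 0.01, 0.83, 0.08, 0.01, 0.01, 0.01, 0.02, 0.01, 0.01]
# Class 2 dominates, but class 3 (and, to a lesser extent, 7) receives noticeably more
# probability than the other digits -- exactly the inter-class similarity the student learns.
```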

Footnotes

  1. Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.


