Typically, KD uses the Kullback-Leibler (KL) divergence loss between the softened probability distributions of the teacher and student models, with a temperature scaling hyperparameter τ.
The authors theoretically show that as the temperature scaling hyperparameter τ increases, the KL divergence loss focuses more on logit matching, whereas as τ approaches 0, it emphasizes label matching. Empirical results suggest that logit matching is positively correlated with performance improvement in general. Based on this observation, the authors propose an alternative KD loss function: the mean squared error (MSE) between the logit vectors, allowing the student model to learn directly from the teacher model's logits.
The paper shows that the MSE loss outperforms the KL divergence loss, primarily due to differences in the penultimate layer representations induced by the two loss functions. Furthermore, the authors demonstrate that sequential distillation can further improve performance, and that using KD with a small τ can help mitigate label noise.
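For concreteness, here is a minimal PyTorch-style sketch of the two objectives being compared: the temperature-scaled KL divergence between softened distributions and the MSE between raw logit vectors. The function names and the τ² rescaling of the KL term (the usual Hinton-style convention) are assumptions for illustration, not the paper's exact code.

```python
import torch
import torch.nn.functional as F

def kd_kl_loss(student_logits, teacher_logits, tau=4.0):
    """KL divergence between softened distributions; the tau**2 factor is the
    common Hinton-style rescaling (assumed convention)."""
    log_p_student = F.log_softmax(student_logits / tau, dim=1)
    p_teacher = F.softmax(teacher_logits / tau, dim=1)
    return (tau ** 2) * F.kl_div(log_p_student, p_teacher, reduction="batchmean")

def kd_mse_loss(student_logits, teacher_logits):
    """Direct logit matching: mean squared error between logit vectors."""
    return F.mse_loss(student_logits, teacher_logits)

# toy usage with random logits for a batch of 8 samples and 100 classes
s, t = torch.randn(8, 100), torch.randn(8, 100)
print(kd_kl_loss(s, t).item(), kd_mse_loss(s, t).item())
```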
Image classification on CIFAR-100 with a family of Wide-ResNet (WRN) models and on ImageNet with a family of ResNet (RN) models.
We compare the training and test accuracies according to the change in α in L and τ in L_KL.
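For reference, L here denotes the overall training objective that mixes the cross-entropy loss on the ground-truth labels with the distillation term, weighted by α; the weighting written below follows the standard KD convention and is an assumption rather than a quotation from the paper:

L(α, τ) = (1 − α) · L_CE(student outputs, labels) + α · L_KL(student, teacher; τ)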
First, we empirically observe that the generalization error of a student model decreases as α in L increases. This means that “soft” targets are more efficient than “hard” targets in training a student, provided the “soft” targets are extracted from a well-trained teacher.
This result is consistent with prior studies that addressed the efficacy of “soft” targets. Therefore, we focus on the situation where only “soft” targets are used to train a student model, that is, α = 1.0, in the remainder of this paper.
When α = 1.0, the generalization error of the student model decreases as τ in L_KL increases.
These consistent trends with respect to the two hyperparameters, α and τ, hold across various teacher-student pairs.
Specifically, a larger τ is linked to a larger L_KL, making the logit vector of the student similar to that of the teacher (i.e., logit matching). Hence, “soft” targets are fully exploited as τ increases.
On the other hand, when τ is close to 0, the gradient of L_KL does not consider the full logit distributions and only identifies whether the student and the teacher share the same output (i.e., label matching), which transfers limited information.
In addition, there is a scaling issue when τ approaches 0. As τ decreases, L_KL increasingly loses its magnitude and eventually becomes less involved in learning. This scaling problem can easily be fixed by multiplying L_KL by 1/τ when τ is close to zero.
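A small numeric sketch of this vanishing-scale effect, using toy logits whose argmax classes differ and the τ²-scaled KL term from the sketch above (both the values and the scaling convention are assumptions): the raw term shrinks toward zero as τ decreases, while the 1/τ rescaling keeps it at a usable magnitude.

```python
import torch
import torch.nn.functional as F

student_logits = torch.tensor([[2.0, 1.0, 0.5]])   # student predicts class 0
teacher_logits = torch.tensor([[0.5, 2.5, 1.0]])   # teacher predicts class 1

for tau in [4.0, 1.0, 0.5, 0.1]:
    log_p_s = F.log_softmax(student_logits / tau, dim=1)
    p_t = F.softmax(teacher_logits / tau, dim=1)
    l_kl = (tau ** 2) * F.kl_div(log_p_s, p_t, reduction="batchmean")
    # the raw term vanishes as tau -> 0; dividing by tau restores a usable scale
    print(f"tau={tau:4.1f}  L_KL={l_kl.item():.3f}  (1/tau)*L_KL={(l_kl / tau).item():.3f}")
```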
We empirically compared the objectives L_KL and L_MSE in terms of performance gains and measured the distance between the logit distributions.
Distillation with L_MSE is the best training scheme for various teacher-student pairs. We also found consistent improvements in ensemble distillation.
Moreover, the model trained with L_MSE achieves similar or better performance compared to existing KD methods.
The logit distribution of the student with a large τ is closer to that of the teacher than with a small τ when L_KL is used. Moreover, L_MSE is more efficient in transferring the teacher's information to a student than L_KL.
Optimizing L_MSE aligns the student's logits with the teacher's logits. However, when τ becomes significantly large, L_KL lets the mean of the student's logits deviate from the mean of the teacher's logits.
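This mean-shift behavior follows from the shift invariance of the softmax: adding a constant to every student logit leaves the softened distributions, and hence L_KL, unchanged, while L_MSE penalizes the shift. A tiny sketch with assumed toy values:

```python
import torch
import torch.nn.functional as F

tau = 4.0
teacher = torch.tensor([[1.0, 2.0, 0.5]])
student = torch.tensor([[0.8, 2.2, 0.4]])
shifted = student + 5.0                      # add a constant to every student logit

def kl_term(s, t, tau):
    return F.kl_div(F.log_softmax(s / tau, dim=1),
                    F.softmax(t / tau, dim=1), reduction="batchmean")

# KL is unchanged by the shift (up to floating point), so the logit mean is unconstrained
print(kl_term(student, teacher, tau).item(), kl_term(shifted, teacher, tau).item())
# MSE grows with the shift, pinning the student's logits (including their mean) to the teacher's
print(F.mse_loss(student, teacher).item(), F.mse_loss(shifted, teacher).item())
```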
When the student is trained with L_KL with infinite τ or with L_MSE, both representations attempt to follow the shape of the teacher's representations but differ in their degree of cohesion. Therefore, L_MSE can shrink the student's representations toward the teacher's more tightly than L_KL.
Effects of a Noisy Teacher
We investigate the effects of a noisy teacher (i.e., a model poorly fitted to the training dataset). It is believed that label matching (L_KL with a small τ) is more appropriate than logit matching (L_KL with a large τ, or L_MSE) under a noisy teacher, because label matching neglects the negative information in the outputs of a poorly trained teacher.
Sequential KD (large network → medium network → small network) is not conducive to generalization. In other words, the best approach is a direct distillation from the medium model to the small model.
When L_KL with τ = 3 is used to train the smaller network at each step, direct distillation from the intermediate network to the small network is better (i.e., WRN-16-4 → WRN-16-2, 74.84%) than both the sequential distillation (i.e., WRN-28-4 → WRN-16-4 → WRN-16-2, 74.52%) and direct distillation from the large network to the small network (i.e., WRN-28-4 → WRN-16-2, 74.24%). The same trend occurs when L_MSE is used at each step.
However, we find that a medium-sized teacher can improve the performance of a smaller-scale student when L_KL and L_MSE are used sequentially.
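A hedged sketch of how such a sequential setup could be wired together (the distill helper, its hyperparameters, and the commented usage with Wide-ResNet models and a CIFAR-100 loader are illustrative assumptions, not the paper's code):

```python
import torch
import torch.nn.functional as F

def distill(student, teacher, loader, loss_type="kl", tau=3.0, lr=0.1, epochs=1):
    """One distillation stage: train `student` against a frozen `teacher` (soft targets only, alpha = 1.0)."""
    opt = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
    teacher.eval()
    for _ in range(epochs):
        for x, _ in loader:                          # ground-truth labels are unused
            with torch.no_grad():
                t_logits = teacher(x)
            s_logits = student(x)
            if loss_type == "kl":                    # temperature-scaled KL (tau**2 rescaling assumed)
                loss = (tau ** 2) * F.kl_div(F.log_softmax(s_logits / tau, dim=1),
                                             F.softmax(t_logits / tau, dim=1),
                                             reduction="batchmean")
            else:                                    # "mse": direct logit matching
                loss = F.mse_loss(s_logits, t_logits)
            opt.zero_grad(); loss.backward(); opt.step()
    return student

# Hypothetical usage mirroring WRN-28-4 -> WRN-16-4 -> WRN-16-2 (models and loader defined elsewhere):
# medium = distill(wrn_16_4, wrn_28_4, cifar100_loader, loss_type="kl", tau=3.0)
# small  = distill(wrn_16_2, medium, cifar100_loader, loss_type="mse")
```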
Modern deep neural networks can even memorize samples perfectly; hence, the teacher might transfer corrupted knowledge to the student in this situation. Therefore, logit matching might not be the best strategy when the teacher is trained on a dataset with noisy labels.
The best generalization performance is achieved when we use L_KL with τ ≤ 1.0.
As expected, logit matching can transfer the teacher's overconfidence, even for incorrect predictions. However, the correct targets derived from a mix of logit matching and label matching enable an effect similar to label smoothing, as studied in prior work. Therefore, L_KL with τ = 0.5 appears to significantly mitigate the problem of noisy labels.
- As τ goes to 0, the trained student acquires the label matching property. In contrast, as τ goes to ∞, the trained student acquires the logit matching property.
- However, L_KL with a sufficiently large τ cannot achieve complete logit matching. To achieve this goal, we proposed a direct logit learning framework using L_MSE and improved performance based on this loss function.
- The model trained with L_MSE follows the teacher's penultimate layer representations more closely than the one trained with L_KL.
- Sequential distillation can be a better strategy when the capacity gap between the teacher and the student is large.
- In the noisy label setting, using L_KL with τ near 1 mitigates the performance degradation better than extreme logit matching, such as L_KL with τ = ∞ or L_MSE.
Comparing Kullback-Leibler Divergence and Mean Squared Error Loss in Knowledge Distillation (arXiv: 2105.08919)
Check out all the posts in this series here.