    Papers Explained 381: KL Divergence VS MSE for Knowledge Distillation | by Ritvik Rastogi | Jun, 2025



Typically, knowledge distillation (KD) uses the Kullback-Leibler (KL) divergence loss between the softened probability distributions of the teacher and student models, controlled by a temperature scaling hyperparameter τ.
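
As a concrete point of reference, here is a minimal PyTorch sketch of that softened-KL distillation term as it is conventionally implemented (my own illustration, not code from the paper; the τ² factor is the usual gradient-scale correction):

```python
import torch
import torch.nn.functional as F

def kd_kl_loss(student_logits: torch.Tensor,
               teacher_logits: torch.Tensor,
               tau: float = 4.0) -> torch.Tensor:
    """tau^2-scaled KL divergence between the temperature-softened teacher
    and student distributions, averaged over the batch."""
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    p_teacher = F.softmax(teacher_logits / tau, dim=-1)
    return (tau ** 2) * F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```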

The authors theoretically show that as the temperature scaling hyperparameter τ increases, the KL divergence loss focuses more on logit matching, whereas as τ approaches 0 it emphasizes label matching. Empirical results suggest that logit matching is positively correlated with performance improvement in general. Based on this observation, the authors propose an alternative KD loss function: the mean squared error (MSE) between the logit vectors, allowing the student model to learn directly from the teacher model's logits.
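
The proposed alternative is simply a mean squared error between the raw logit vectors, with no softmax and no temperature. A minimal sketch under the same caveat that this is my own illustration (the choice of reduction over classes and batch is an assumption, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def kd_mse_loss(student_logits: torch.Tensor,
                teacher_logits: torch.Tensor) -> torch.Tensor:
    """Mean squared error between raw logit vectors: the student regresses
    the teacher's logits directly, with no softmax and no temperature."""
    return F.mse_loss(student_logits, teacher_logits)
```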

The paper shows that the MSE loss outperforms the KL divergence loss, primarily due to differences in the penultimate-layer representations induced by the two loss functions. Additionally, the authors demonstrate that sequential distillation can further improve performance, and that using KD with a small τ can help mitigate label noise.

Experiments cover image classification on CIFAR-100 with a family of Wide-ResNet (WRN) models and on ImageNet with a family of ResNet (RN) models.

We compare the training and test accuracies according to the change of α in L and τ in L_KL.
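
For context, the overall objective L referred to here is the usual weighted combination of the cross-entropy on the ground-truth labels and the distillation term. Written in the conventional Hinton-style form, which I am assuming here rather than quoting from the paper:

L = (1 − α) · L_CE(z_s, y) + α · L_KL(z_s, z_t; τ), with L_KL = τ² · KL(p_t(τ) || p_s(τ)) and p_k(τ) = exp(z_k / τ) / Σ_j exp(z_j / τ),

where z_s and z_t are the student's and teacher's logit vectors, α ∈ [0, 1] trades off "hard" labels against the teacher's "soft" targets, and the τ² factor is the customary rescaling that keeps gradient magnitudes roughly comparable across temperatures.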

Grid maps of accuracies according to the change of α and τ on CIFAR-100.

First, we empirically observe that the generalization error of a student model decreases as α in L increases. This means that "soft" targets are more efficient than "hard" targets in training a student, provided the "soft" targets are extracted from a well-trained teacher.

    This result’s in line with prior research that addressed the efficacy of “smooth” targets. Subsequently, we give attention to the scenario the place “smooth” targets are used to coach a pupil mannequin solely, that’s, α = 1.0, within the the rest of this paper.

When α = 1.0, the generalization error of the student model decreases as τ in L_KL increases.

These consistent trends with respect to the two hyperparameters, α and τ, hold across various teacher-student pairs.

Specifically, a larger τ is linked to a larger L_KL, making the logit vector of the student similar to that of the teacher (i.e., logit matching). Hence, "soft" targets are fully used as τ increases.

On the other hand, when τ is close to 0, the gradient of L_KL does not consider the logit distributions and only identifies whether the student and the teacher share the same output (i.e., label matching), which transfers limited information.

    As well as, there’s a scaling problem when τ approaches 0. As τ decreases, L_KL more and more loses its high quality and finally turns into much less concerned in studying. The scaling drawback might be simply fastened by multiplying 1/τ by LKL when τ is near zero.

We empirically compared the objectives L_KL and L_MSE in terms of performance gains and measured the distance between the logit distributions.

Top-1 test accuracies on CIFAR-100. WRN-28-4 is used as the teacher.

Distillation with L_MSE is the best training scheme across various teacher-student pairs. We also found consistent improvements in ensemble distillation.

Moreover, the model trained with L_MSE has similar or better performance compared to existing KD methods.

Test accuracy of various KD methods on CIFAR-100. All student models share the same teacher model, WRN-28-4.

When L_KL is used, the logit distribution of the student with a large τ is closer to that of the teacher than with a small τ. Moreover, L_MSE is more efficient than L_KL at transferring the teacher's information to the student.
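
The distance in question is the per-sample Euclidean gap between logit vectors, ||z_s − z_t||_2; a small helper for computing it (my own illustration, not the paper's code):

```python
import torch

def logit_gap(student_logits: torch.Tensor,
              teacher_logits: torch.Tensor) -> torch.Tensor:
    """Per-sample Euclidean distance ||z_s - z_t||_2 between logit vectors."""
    return torch.linalg.vector_norm(student_logits - teacher_logits, dim=-1)
```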

Optimizing L_MSE aligns the student's logits with the teacher's logits. However, when τ becomes significantly large, L_KL makes the mean of the student's logits deviate from the mean of the teacher's logits.

(a) Probability density function (pdf) of ||z_s − z_t||_2 on the CIFAR-100 training dataset; (b) pdf of the 2-norm of the pre-logit (i.e., ||r_s||_2) on the CIFAR-100 training dataset. We use a (teacher, student) pair of (WRN-28-4, WRN-16-2).

When the student s is trained with L_KL with infinite τ or with L_MSE, both representations attempt to follow the shape of the teacher's representations but differ in their degree of cohesion. Therefore, L_MSE shrinks the representations toward the teacher's more than L_KL does.

Visualizations of pre-logits on CIFAR-100 according to the change of loss function. Here, we use the classes "apple," "aquarium fish," and "baby."

Effects of a Noisy Teacher

We investigate the effects of a noisy teacher (i.e., a model poorly fitted to the training dataset). It is believed that label matching (L_KL with a small τ) is more appropriate than logit matching (L_KL with a large τ, or L_MSE) under a noisy teacher, because label matching neglects the negative information in the outputs of an untrained teacher.

Top-1 test accuracies on CIFAR-100. WRN-28-4 is used as the teacher for L_KL and L_MSE. Here, the teacher (WRN-28-4) was not fully trained; the training accuracy of the teacher network is 53.77%.
Test accuracy on the ImageNet dataset. We used a (teacher, student) pair of (ResNet-152, ResNet-50). The training accuracy of the teacher network is 81.16%.

Sequential KD (large network → medium network → small network) is not conducive to generalization. In other words, the best approach is direct distillation from the medium model to the small model.

When L_KL with τ = 3 is used to train the small network iteratively, direct distillation from the intermediate network to the small network performs better (i.e., WRN-16-4 → WRN-16-2, 74.84%) than sequential distillation (i.e., WRN-28-4 → WRN-16-4 → WRN-16-2, 74.52%) and direct distillation from the large network to the small network (i.e., WRN-28-4 → WRN-16-2, 74.24%). The same trend occurs with L_MSE.

However, we find that the medium-sized teacher can improve the performance of a smaller-scale student when L_KL and L_MSE are used sequentially.

Test accuracies of sequential knowledge distillation. In each entry, we note the objective function used for training. 'X' indicates that distillation was not used in training.
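
To make the sequential recipe concrete, here is a schematic sketch of a pure-distillation training loop (α = 1.0) that can be chained large → medium → small. The helper names, optimizer settings, and model variables are placeholders of mine, not the paper's code:

```python
import torch

def distill(teacher: torch.nn.Module,
            student: torch.nn.Module,
            loader,
            loss_fn,
            epochs: int = 1,
            lr: float = 0.05) -> torch.nn.Module:
    """Train `student` to match `teacher` on `loader` using `loss_fn`
    applied to the two logit tensors (soft targets only, alpha = 1.0)."""
    optimizer = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
    teacher.eval()
    student.train()
    for _ in range(epochs):
        for x, _ in loader:  # ground-truth labels are unused when alpha = 1.0
            with torch.no_grad():
                t_logits = teacher(x)
            s_logits = student(x)
            loss = loss_fn(s_logits, t_logits)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student

# Sequential recipe from the table above (model variables are placeholders):
#   medium = distill(wrn_28_4, wrn_16_4, train_loader,
#                    lambda s, t: kd_kl_loss(s, t, tau=3.0))
#   small  = distill(medium, wrn_16_2, train_loader, kd_mse_loss)
```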

Modern deep neural networks can even memorize samples perfectly; hence, the teacher might transfer corrupted knowledge to the student in this situation. Therefore, it is thought that logit matching might not be the best strategy when the teacher is trained on a dataset with noisy labels.

The best generalization performance is achieved when we use L_KL with τ ≤ 1.0.

Test accuracy as τ changes on CIFAR-100. We use the (teacher, student) pair (WRN-28-4, WRN-16-2).

As expected, logit matching might transfer the teacher's overconfidence, even for incorrect predictions. However, the correct target derived from both logit matching and label matching enables effects similar to label smoothing, as studied in prior work. Therefore, L_KL with τ = 0.5 appears to significantly mitigate the problem of noisy labels.

• As τ goes to 0, the trained student has the label matching property. In contrast, as τ goes to ∞, the trained student has the logit matching property.
• However, L_KL with a sufficiently large τ cannot achieve complete logit matching. To achieve this goal, we proposed a direct logit learning framework using L_MSE and improved the performance based on this loss function.
• The model trained with L_MSE follows the teacher's penultimate-layer representations more closely than the one trained with L_KL.
• Sequential distillation can be a better strategy when the capacity gap between the teacher and the student is large.
• In the noisy label setting, using L_KL with τ near 1 mitigates the performance degradation better than extreme logit matching, such as L_KL with τ = ∞ or L_MSE.

Comparing Kullback-Leibler Divergence and Mean Squared Error Loss in Knowledge Distillation (arXiv: 2105.08919)

Check out all the threads in this series here


