    How to Measure Real Model Accuracy When Labels Are Noisy

By FinanceStarGate | April 11, 2025 | 5 Min Read


Ground truth is never perfect. From scientific measurements to human annotations used to train deep learning models, ground truth always has some amount of error. ImageNet, arguably the most well-curated image dataset, has 0.3% errors in its human annotations. So how can we evaluate predictive models using such imperfect labels?

In this article, we explore how to account for errors in test data labels and estimate a model's "true" accuracy.

Example: image classification

Let's say there are 100 images, each containing either a cat or a dog. The images are labeled by human annotators who are known to have 96% accuracy (Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ). If we train an image classifier on some of this data and find that it has 90% accuracy on a hold-out set (Aᵐᵒᵈᵉˡ), what is the "true" accuracy of the model (Aᵗʳᵘᵉ)? A couple of observations first:

1. Within the 90% of predictions that the model got "right," some examples may have been labeled incorrectly, meaning both the model and the ground truth are wrong. This artificially inflates the measured accuracy.
2. Conversely, within the 10% of "wrong" predictions, some may actually be cases where the model is right and the ground truth label is wrong. This artificially deflates the measured accuracy.

Given these possibilities, how much can the true accuracy vary?

Range of true accuracy

[Figure: True accuracy of the model for perfectly correlated and perfectly uncorrelated errors between model and labels. Figure by author.]

The true accuracy of our model depends on how its errors correlate with the errors in the ground truth labels. If our model's errors perfectly overlap with the ground truth errors (i.e., the model is wrong in exactly the same way as the human labelers), its true accuracy is:

Aᵗʳᵘᵉ = 0.90 − (1 − 0.96) = 86%

Alternatively, if our model is wrong in exactly the opposite way as the human labelers (perfect negative correlation), its true accuracy is:

Aᵗʳᵘᵉ = 0.90 + (1 − 0.96) = 94%

Or more generally:

Aᵗʳᵘᵉ = Aᵐᵒᵈᵉˡ ± (1 − Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)

It's important to note that the model's true accuracy can be either lower or higher than its reported accuracy, depending on the correlation between model errors and ground truth errors.
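To make the bounds concrete, here is a minimal Python sketch (the function name and structure are my own, not from the original article) that computes the worst- and best-case true accuracy from the two numbers in the example:

```python
def true_accuracy_bounds(a_model: float, a_ground_truth: float) -> tuple[float, float]:
    """Worst- and best-case true accuracy given the model's reported accuracy
    and the accuracy of the ground truth labels."""
    label_error = 1 - a_ground_truth
    lower = max(0.0, a_model - label_error)  # model errors fully overlap label errors
    upper = min(1.0, a_model + label_error)  # model errors perfectly anti-correlated
    return lower, upper

lower, upper = true_accuracy_bounds(a_model=0.90, a_ground_truth=0.96)
print(f"True accuracy lies between {lower:.0%} and {upper:.0%}")  # between 86% and 94%
```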

    Probabilistic estimate of true accuracy

In some cases, label inaccuracies are spread randomly across the examples rather than systematically biased toward certain labels or regions of the feature space. If the model's errors are independent of the errors in the labels, we can derive a more precise estimate of its true accuracy.

When we measure Aᵐᵒᵈᵉˡ (90%), we are counting the cases where the model's prediction matches the ground truth label. This can happen in two scenarios:

1. Both the model and the ground truth are correct. This happens with probability Aᵗʳᵘᵉ × Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ.
2. Both the model and the ground truth are wrong (in the same way). This happens with probability (1 − Aᵗʳᵘᵉ) × (1 − Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ).

Under independence, we can express this as:

Aᵐᵒᵈᵉˡ = Aᵗʳᵘᵉ × Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ + (1 − Aᵗʳᵘᵉ) × (1 − Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)

Rearranging the terms, we get:

Aᵗʳᵘᵉ = (Aᵐᵒᵈᵉˡ + Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ − 1) / (2 × Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ − 1)

In our example, that equals (0.90 + 0.96 − 1) / (2 × 0.96 − 1) = 93.5%, which is within the 86% to 94% range we derived above.
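A small sketch of the same calculation, assuming independent errors (again, a hypothetical helper rather than code from the article):

```python
def true_accuracy_independent(a_model: float, a_ground_truth: float) -> float:
    """Estimate true accuracy assuming model errors are independent of label errors.

    Solves a_model = a_true * a_gt + (1 - a_true) * (1 - a_gt) for a_true.
    """
    if a_ground_truth <= 0.5:
        raise ValueError("The formula requires ground truth accuracy above 0.5")
    return (a_model + a_ground_truth - 1) / (2 * a_ground_truth - 1)

print(f"{true_accuracy_independent(0.90, 0.96):.1%}")  # ~93.5%
```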

    The independence paradox

Plugging in Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ = 0.96 from our example, we get

Aᵗʳᵘᵉ = (Aᵐᵒᵈᵉˡ − 0.04) / 0.92. Let's plot this below.

[Figure: True accuracy as a function of the model's reported accuracy when ground truth accuracy = 96%. Figure by author.]

Strange, isn't it? If we assume that the model's errors are uncorrelated with the ground truth errors, then its true accuracy Aᵗʳᵘᵉ is always above the 1:1 line whenever the reported accuracy is > 0.5. This holds even when we vary Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ:

[Figure: The model's "true" accuracy as a function of its reported accuracy and the ground truth accuracy. Figure by author.]
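A minimal matplotlib sketch that reproduces curves like the two figures above (the specific set of ground truth accuracies and the axis labels are my own choices):

```python
import numpy as np
import matplotlib.pyplot as plt

a_model = np.linspace(0.5, 1.0, 200)  # reported model accuracy

fig, ax = plt.subplots()
for a_gt in (0.90, 0.96, 0.99):
    # Independence estimate, clipped to [0, 1] for readability
    a_true = np.clip((a_model + a_gt - 1) / (2 * a_gt - 1), 0.0, 1.0)
    ax.plot(a_model, a_true, label=f"ground truth accuracy = {a_gt:.0%}")

ax.plot(a_model, a_model, "k--", label="1:1 line (reported = true)")
ax.set_xlabel("Reported accuracy (A_model)")
ax.set_ylabel("Estimated true accuracy (A_true)")
ax.legend()
plt.show()
```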

Error correlation: why models often struggle where humans do

The independence assumption is crucial, but it often does not hold in practice. If some images of cats are very blurry, or some small dogs look like cats, then the ground truth errors and the model errors are likely to be correlated. This pushes Aᵗʳᵘᵉ closer to the lower bound (Aᵐᵒᵈᵉˡ − (1 − Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)) than to the upper bound.

More generally, model errors tend to be correlated with ground truth errors when:

1. Both humans and models struggle with the same "difficult" examples (e.g., ambiguous images, edge cases)
2. The model has learned the same biases present in the human labeling process
3. Certain classes or examples are inherently ambiguous or challenging for any classifier, human or machine
4. The labels themselves are generated by another model
5. There are too many classes (and thus too many different ways of being wrong)

Best practices

The true accuracy of a model can differ significantly from its measured accuracy. Understanding this difference is crucial for proper model evaluation, especially in domains where obtaining perfect ground truth is impossible or prohibitively expensive.

When evaluating model performance with imperfect ground truth:

1. Conduct targeted error analysis: Examine examples where the model disagrees with the ground truth to identify potential ground truth errors.
2. Consider the correlation between errors: If you suspect correlation between model and ground truth errors, the true accuracy is likely closer to the lower bound (Aᵐᵒᵈᵉˡ − (1 − Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)).
3. Obtain multiple independent annotations: Having several annotators can help estimate ground truth accuracy more reliably (one simple way to do this is sketched after this list).
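For that last point, here is a rough sketch of one way to back out annotator accuracy from the agreement rate of two annotators. The binary-task and independent-error assumptions (and the helper name) are mine, not from the article:

```python
import math

def annotator_accuracy_from_agreement(agreement_rate: float) -> float:
    """Estimate per-annotator accuracy from the observed agreement rate of two
    annotators, assuming a binary task, equal accuracy, and independent errors:
    agreement = a**2 + (1 - a)**2, solved for a > 0.5.
    """
    if agreement_rate < 0.5:
        raise ValueError("Agreement below 0.5 is inconsistent with these assumptions")
    return (1 + math.sqrt(2 * agreement_rate - 1)) / 2

# Example: two annotators agree on 92.32% of the images
print(f"{annotator_accuracy_from_agreement(0.9232):.0%}")  # ~96%
```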

    Conclusion

In summary, we learned that:

1. The range of possible true accuracy depends on the error rate in the ground truth
2. When errors are independent, the true accuracy is often higher than measured for models better than random chance
3. In real-world scenarios, errors are rarely independent, and the true accuracy is likely closer to the lower bound



