    Log Link vs Log Transformation in R — The Difference that Misleads Your Entire Data Analysis

    By FinanceStarGate · May 10, 2025


    While normal distributions are probably the most commonly used, a lot of real-world data unfortunately is not normal. When faced with extremely skewed data, it is tempting to apply a log transformation to normalize the distribution and stabilize the variance. I recently worked on a project analyzing the energy consumption of training AI models, using data from Epoch AI [1]. There is no official data on the energy usage of each model, so I calculated it by multiplying each model's power draw by its training time. The new variable, Energy (in kWh), was highly right-skewed, with some extreme and overdispersed outliers (Fig. 1).
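
    A minimal sketch of that derived variable; the power-draw column name Power_kW is an assumption, not from the original dataset:

    # Energy (kWh) = power draw (kW) x training time (hours)
    # (Power_kW is a hypothetical column name)
    df$Energy_kWh <- df$Power_kW * df$Training_time_hour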

    Figure 1. Histogram of Energy Consumption (kWh)

    To address the skewness and heteroskedasticity, my first instinct was to apply a log transformation to the Energy variable. The distribution of log(Energy) looked much more normal (Fig. 2), and a Shapiro-Wilk test was consistent with normality (p ≈ 0.5).
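
    This check is straightforward in base R; a quick sketch, assuming the data frame df from above:

    # Visualize and test the log-transformed response
    log_energy <- log(df$Energy_kWh)
    hist(log_energy, breaks = 30,
         main = "Histogram of log(Energy)", xlab = "log(Energy_kWh)")
    shapiro.test(log_energy)  # here p was about 0.5: no evidence against normality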

    Figure 2. Histogram of the log of Energy Consumption (kWh)

    Modeling Dilemma: Log Transformation vs Log Link

    The visualization looked good, but when I moved on to modeling, I faced a dilemma: should I model the log-transformed response variable (log(Y) ~ X), or should I model the original response variable using a log link function (Y ~ X, link = "log")? I also considered two distributions, Gaussian (normal) and Gamma, and combined each of them with both log approaches. This gave me the four models below, all fitted using R's Generalized Linear Models (GLM):

    # The four candidate fits (a reconstruction: the original code was truncated,
    # and "." stands in for the full predictor set used at this stage)
    all_gaussian_log_link      <- glm(Energy_kWh ~ .,      family = gaussian(link = "log"), data = df)
    all_gaussian_log_transform <- glm(log(Energy_kWh) ~ ., family = gaussian(),             data = df)
    all_gamma_log_link         <- glm(Energy_kWh ~ .,      family = Gamma(link = "log"),    data = df)
    all_gamma_log_transform    <- glm(log(Energy_kWh) ~ ., family = Gamma(),                data = df)

    Model Comparison: AIC and Diagnostic Plots

    I compared the four models using the Akaike Information Criterion (AIC), an estimator of prediction error. Generally, the lower the AIC, the better the model fits.

    AIC(all_gaussian_log_link, all_gaussian_log_transform, all_gamma_log_link, all_gamma_log_transform)
    
                               df       AIC
    all_gaussian_log_link      25 2005.8263
    all_gaussian_log_transform 25  311.5963
    all_gamma_log_link         25 1780.8524
    all_gamma_log_transform    25  352.5450

    Among the four models, the ones using log-transformed outcomes have much lower AIC values than those using log links. Since the difference in AIC between the log-transformed and log-link models was substantial (311 and 352 vs 1780 and 2005), I also examined the diagnostic plots to further validate that the log-transformed models fit better. (Strictly speaking, AIC is not directly comparable between models fitted to Y and to log(Y), because the likelihoods are computed on different response scales.)

    Figure 4. Diagnostic plots for the log-linked Gaussian model. The Residuals vs Fitted plot suggests linearity despite a few outliers. However, the Q-Q plot shows noticeable deviations from the theoretical line, suggesting non-normality.
    Figure 5. Diagnostic plots for the log-transformed Gaussian model. The Q-Q plot shows a much better fit, supporting normality. However, the Residuals vs Fitted plot has a dip to -2, which may suggest non-linearity.
    Figure 6. Diagnostic plots for the log-linked Gamma model. The Q-Q plot looks okay, but the Residuals vs Fitted plot shows clear signs of non-linearity.
    Figure 7. Diagnostic plots for the log-transformed Gamma model. The Residuals vs Fitted plot looks good, with a small dip to -0.25 at the beginning. However, the Q-Q plot shows some deviation at both tails.
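
    These panels are the standard diagnostics from R's plot() method for fitted models; a minimal sketch for one of the fits:

    # Standard 2x2 diagnostic panel (Residuals vs Fitted, Q-Q,
    # Scale-Location, Residuals vs Leverage)
    par(mfrow = c(2, 2))
    plot(all_gamma_log_transform)
    par(mfrow = c(1, 1))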

    Based on the AIC values and diagnostic plots, I decided to move forward with the log-transformed Gamma model, since it had the second-lowest AIC value and its Residuals vs Fitted plot looks better than that of the log-transformed Gaussian model.
    I then explored which explanatory variables were useful and which interactions were significant. The final model I selected was:

    glm(formula = log(Energy_kWh) ~ Training_time_hour * Hardware_quantity + 
        Training_hardware + 0, family = Gamma(), data = df)

    Interpreting Coefficients

    However, when I started interpreting the model's coefficients, something felt off. Since only the response variable was log-transformed, the effects of the predictors are multiplicative, and we need to exponentiate the coefficients to convert them back to the original scale. A one-unit increase in 𝓍 multiplies the outcome 𝓎 by exp(β); equivalently, each additional unit in 𝓍 leads to a (exp(β) − 1) × 100% change in 𝓎 [2].

    Looking at the model's results table below, Training_time_hour, Hardware_quantity, and their interaction term Training_time_hour:Hardware_quantity are continuous variables, so their coefficients represent slopes. Meanwhile, since I specified +0 in the model formula, all levels of the categorical Training_hardware act as intercepts, meaning that each hardware type serves as the intercept β₀ when its corresponding dummy variable is active.

    > glm(formula = log(Energy_kWh) ~ Training_time_hour * Hardware_quantity + 
        Training_hardware + 0, family = Gamma(), data = df)
    
    Coefficients:
                                                     Estimate Std. Error t value Pr(>|t|)    
    Training_time_hour                             -1.587e-05  3.112e-06  -5.098 5.76e-06 ***
    Hardware_quantity                              -5.121e-06  1.564e-06  -3.275  0.00196 ** 
    Training_hardwareGoogle TPU v2                  1.396e-01  2.297e-02   6.079 1.90e-07 ***
    Training_hardwareGoogle TPU v3                  1.106e-01  7.048e-03  15.696  

    When I converted the slopes to percent change in the response variable, the effect of each continuous variable was almost zero, even slightly negative:
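
    Concretely, applying the (exp(β) − 1) × 100 rule to the estimates printed above:

    # Percent change in Energy per one-unit increase, log-transformed model
    (exp(-1.587e-05) - 1) * 100  # Training_time_hour: about -0.0016% per hour
    (exp(-5.121e-06) - 1) * 100  # Hardware_quantity:  about -0.0005% per chip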

    All of the intercepts also converted back to only around 1 kWh on the original scale. These results didn't make sense, as at least one of the slopes should grow along with the huge energy consumption. I wondered whether a log-linked model with the same predictors would yield different results, so I fit the model again:

    glm(formula = Energy_kWh ~ Training_time_hour * Hardware_quantity + 
        Training_hardware + 0, family = Gamma(link = "log"), data = df)
    
    Coefficients:
                                                     Estimate Std. Error t value Pr(>|t|)    
    Training_time_hour                              1.818e-03  1.640e-04  11.088 7.74e-15 ***
    Hardware_quantity                               7.373e-04  1.008e-04   7.315 2.42e-09 ***
    Training_hardwareGoogle TPU v2                  7.136e+00  7.379e-01   9.670 7.51e-13 ***
    Training_hardwareGoogle TPU v3                  1.004e+01  3.156e-01  31.808  

    This time, Training_time and Hardware_quantity would increase the total energy consumption by 0.18% per additional hour and 0.07% per additional chip, respectively. Meanwhile, their interaction would decrease the energy use by about 2 × 10⁻⁵%. These results made much more sense, given that Training_time can reach up to 7,000 hours and Hardware_quantity up to 16,000 units.
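
    The same conversion rule applied to the log-link estimates reproduces these percentages:

    # Percent change in Energy per one-unit increase, log-link model
    (exp(1.818e-03) - 1) * 100  # Training_time_hour: about +0.18% per hour
    (exp(7.373e-04) - 1) * 100  # Hardware_quantity:  about +0.07% per chip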

    To visualize the differences, I created two plots comparing the predictions (shown as dashed lines) from both models. The left panel shows the log-transformed Gamma GLM, whose dashed lines are nearly flat and close to zero, nowhere near the fitted solid lines of the raw data. The right panel shows the log-linked Gamma GLM, whose dashed lines align much more closely with the actual fitted lines.

    library(dplyr)
    
    # Predictions from both models on the original (kWh) scale; glm3 is the
    # log-transformed Gamma model and glm3_alt the log-linked one
    test_data <- test_data %>%
      mutate(
        pred_energy1 = exp(predict(glm3, newdata = test_data)),
        pred_energy2 = predict(glm3_alt, newdata = test_data, type = "response")
      )
    
    # Shared y-axis limits for the two panels (the original assignment was
    # truncated; this is a plausible reconstruction)
    y_limits <- range(log(test_data$Energy_kWh), na.rm = TRUE)
    Figure 8. Relationship between hardware quantity and the log of energy consumption across training-time groups. In both panels, raw data is shown as points, solid lines represent fitted values from linear models, and dashed lines represent predicted values from the generalized linear models. The left panel uses the log-transformed Gamma GLM, while the right panel uses the log-linked Gamma GLM with the same predictors.
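
    For reference, a sketch of how one panel could be drawn with ggplot2; the grouping factor time_group for the training-time bins is a hypothetical column name, not from the original code:

    library(ggplot2)
    # Left panel: raw points, lm fits (solid), GLM predictions (dashed)
    ggplot(test_data, aes(x = Hardware_quantity, y = log(Energy_kWh),
                          colour = time_group)) +
      geom_point(alpha = 0.5) +
      geom_smooth(method = "lm", se = FALSE) +                     # solid fitted lines
      geom_line(aes(y = log(pred_energy1)), linetype = "dashed") + # dashed predictions
      coord_cartesian(ylim = y_limits) +
      labs(title = "Log-transformed Gamma GLM",
           x = "Hardware quantity", y = "log(Energy_kWh)")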

    Why Log Transformation Fails

    To understand why the log-transformed model cannot capture the underlying effects the way the log-linked one does, let's walk through what happens when we apply a log transformation to the response variable.

    Let's say Y equals some function of X plus an error term:
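
    Y = f(X) + ε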

    When we apply a log transformation to Y, we are actually compressing both f(X) and the error:
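
    log(Y) = log(f(X) + ε)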

    That means we are modeling an entirely new response variable, log(Y). When we plug in our own function g(X) (in my case, g(X) = Training_time_hour * Hardware_quantity + Training_hardware), it has to capture the combined effects of both the compressed f(X) and the error term:
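
    g(X) ≈ log(f(X) + ε)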

    In contrast, when we use a log link, we are still modeling the original Y, not a transformed version of it. The model exponentiates our function g(X) to predict Y:
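
    Ŷ = exp(g(X))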

    The model then minimizes the difference between the actual Y and the predicted Ŷ, so the error term stays intact on the original scale:
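
    Y = exp(g(X)) + ε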

    Conclusion

    Log-transforming a variable is not the same as using a log link, and it may not always yield reliable results. Under the hood, a log transformation alters the variable itself and distorts both the variation and the noise. Understanding this subtle mathematical difference behind your models is just as important as searching for the best-fitting one.


    References
    
    [1] Epoch AI. Data on Notable AI Models. Retrieved from https://epoch.ai/data/notable-ai-models
    
    [2] University of Virginia Library. Interpreting Log Transformations in a Linear Model. Retrieved from https://library.virginia.edu/data/articles/interpreting-log-transformations-in-a-linear-model


