Linear Regression in Time Series: Sources of Spurious Regression

1. Introduction

It’s pretty clear that most of our work will be automated by AI in the future. This will be possible because many researchers and professionals are working hard to make their work available online. These contributions not only help us understand fundamental concepts but also refine AI models, ultimately freeing up time to focus on other activities.

However, there is one concept that remains misunderstood, even among experts. It is spurious regression in time series analysis. This issue arises when regression models suggest strong relationships between variables, even when none exist. It is typically observed in time series regression equations that seem to have a high degree of fit — as indicated by a high R² (coefficient of multiple correlation) — but with an extremely low Durbin-Watson statistic (d), signaling strong autocorrelation in the error terms.

What is particularly surprising is that almost all econometric textbooks warn about the danger of autocorrelated errors, yet this issue persists in many published papers. Granger and Newbold (1974) identified several examples. For instance, they found published equations with R² = 0.997 and a Durbin-Watson statistic (d) equal to 0.53. The most extreme case they found is an equation with R² = 0.999 and d = 0.093.

It is particularly problematic in economics and finance, where many key variables exhibit autocorrelation, or serial correlation between adjacent values, particularly if the sampling interval is small, such as a week or a month. This leads to misleading conclusions if not handled correctly. For example, today’s GDP is strongly correlated with the GDP of the previous quarter. Our post provides a detailed explanation of the results from Granger and Newbold (1974) and a Python simulation (see section 7) replicating the key results presented in their article.

Whether you’re an economist, data scientist, or analyst working with time series data, understanding this issue is crucial to ensuring your models produce meaningful results.

To walk you through this paper, the next section will introduce the random walk and the ARIMA(0,1,1) process. In section 3, we will explain how Granger and Newbold (1974) describe the emergence of nonsense regressions, with examples illustrated in section 4. Finally, we will show how to avoid spurious regressions when working with time series data.

2. Simple presentation of a Random Walk and ARIMA(0,1,1) Process

2.1 Random Walk

Let 𝐗ₜ be a time series. We say that 𝐗ₜ follows a random walk if its representation is given by:

𝐗ₜ = 𝐗ₜ₋₁ + 𝜖ₜ. (1)

where 𝜖ₜ is white noise. 𝐗ₜ can be written as a cumulative sum of white noise terms, 𝐗ₜ = 𝐗₀ + 𝜖₁ + ⋯ + 𝜖ₜ, a useful form for simulation. It is a non-stationary time series because its variance depends on the time t (with 𝐗₀ fixed, 𝐕𝐚𝐫(𝐗ₜ) = t𝜎²).
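The cumulative-sum form of equation (1) can be sketched in a few lines. This is only an illustration: the seed, the number of paths and the series length are arbitrary choices, not values from the article. It simulates many random walks and compares the sample variance at a few dates with the theoretical value t𝜎² (here 𝜎² = 1).

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
n_paths, T = 5000, 100           # illustrative sizes, not taken from the article

# Each row is one random walk: X_t = eps_1 + ... + eps_t with eps ~ N(0, 1)
eps = rng.normal(0.0, 1.0, size=(n_paths, T))
X = np.cumsum(eps, axis=1)

# Variance across paths at each date t; theory gives Var(X_t) = t
var_by_t = X.var(axis=0)
for t in (10, 50, 100):
    print(f"t = {t:3d}  sample variance = {var_by_t[t - 1]:6.1f}  (theory: {t})")
```

The sample variance grows roughly linearly with t, which is exactly why the series is not stationary.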

2.2 ARIMA(0,1,1) Process

The ARIMA(0,1,1) process is given by:

𝐗ₜ = 𝐗ₜ₋₁ + 𝜖ₜ − 𝜃 𝜖ₜ₋₁. (2)

where 𝜖ₜ is white noise. The ARIMA(0,1,1) process is non-stationary. It can be written as the sum of a random walk and an independent white noise:

𝐗ₜ = 𝐗₀ + random walk + white noise. (3)

This form is useful for simulation.

These non-stationary series are often employed as benchmarks against which the forecasting performance of other models is judged.
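As a rough check of the decomposition in equation (3), the sketch below builds a series as a random walk plus independent white noise and looks at its first differences. With unit variances for both components (an assumption made here for simplicity), the differences form an MA(1) process whose lag-1 autocorrelation is −1/3 and whose lag-2 autocorrelation is 0. The helper `autocorr` is hypothetical, written only for this example.

```python
import numpy as np

rng = np.random.default_rng(1)   # arbitrary seed
T = 20000                        # long series so sample autocorrelations are stable

random_walk = np.cumsum(rng.normal(0.0, 1.0, T))
white_noise = rng.normal(0.0, 1.0, T)
X = random_walk + white_noise    # equation (3) with X_0 = 0

dX = np.diff(X)                  # first differences: e_t + w_t - w_{t-1}

def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag (hypothetical helper)."""
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

r1, r2 = autocorr(dX, 1), autocorr(dX, 2)
print(f"lag-1 autocorrelation of differences: {r1:.3f}  (theory: -1/3)")
print(f"lag-2 autocorrelation of differences: {r2:.3f}  (theory: 0)")
```

A non-zero autocorrelation at lag 1 only is the signature of an MA(1) process, so the differenced series is stationary but not white noise.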

3. Random walk can lead to Nonsense Regression

First, let’s recall the linear regression model, which is given by:

𝐘 = 𝐗𝛽 + 𝜖. (4)

where 𝐘 is a T × 1 vector of the dependent variable, 𝛽 is a K × 1 vector of coefficients, and 𝐗 is a T × K matrix of the independent variables, containing a column of ones and (K−1) columns with T observations on each of the (K−1) independent variables, which are stochastic but distributed independently of the T × 1 vector of errors 𝜖. It is usually assumed that:

𝐄(𝜖) = 0, (5)

and

𝐄(𝜖𝜖′) = 𝜎²𝐈. (6)

where 𝐈 is the identity matrix.

A test of the contribution of the independent variables to the explanation of the dependent variable is the F-test. The null hypothesis of the test is given by:

𝐇₀: 𝛽₁ = 𝛽₂ = ⋯ = 𝛽ₖ₋₁ = 0, (7)

and the test statistic is given by:

𝐅 = (𝐑² / (𝐊−1)) / ((1−𝐑²) / (𝐓−𝐊)). (8)

where 𝐑² is the coefficient of determination.

To construct the test statistic, assume that the null hypothesis is true and that one tries to fit a regression of the form (Equation 4) to the levels of an economic time series. Suppose next that these series are not stationary or are highly autocorrelated. In such a situation, the test procedure is invalid, since 𝐅 in (Equation 8) does not follow an F-distribution under the null hypothesis (Equation 7). In fact, under the null hypothesis, the errors or residuals from (Equation 4) are given by:

𝜖ₜ = 𝐘ₜ − 𝛽₀ ; t = 1, 2, …, T. (9)

and will have the same autocorrelation structure as the original series 𝐘.

Some idea of the distribution problem can be gained from the situation where:

𝐘ₜ = 𝛽₀ + 𝐗ₜ𝛽₁ + 𝜖ₜ. (10)

where 𝐘ₜ and 𝐗ₜ follow independent first-order autoregressive processes:

𝐘ₜ = 𝜌 𝐘ₜ₋₁ + 𝜂ₜ, and 𝐗ₜ = 𝜌* 𝐗ₜ₋₁ + 𝜈ₜ. (11)

where 𝜂ₜ and 𝜈ₜ are white noise.

We know that in this case, 𝐑² is the square of the correlation between 𝐘ₜ and 𝐗ₜ. Granger and Newbold use Kendall’s result, cited here from Knowles (1954), which gives the variance of 𝐑:

𝐕𝐚𝐫(𝐑) = (1/T)* (1 + 𝜌𝜌*) / (1 − 𝜌𝜌*). (12)

Since 𝐑 is constrained to lie between -1 and 1, if its variance is greater than 1/3 (the variance of a uniform distribution on that interval), the distribution of 𝐑 cannot have a single mode at 0. Setting (Equation 12) greater than 1/3 and solving gives the condition 𝜌𝜌* > (T−3) / (T+3).

Thus, for example, if T = 20 and 𝜌 = 𝜌*, a distribution that is not unimodal at 0 will be obtained if 𝜌 > 0.86, and if 𝜌 = 0.9, 𝐕𝐚𝐫(𝐑) = 0.47. Since 𝐑 has mean zero when the two series are independent, 𝐄(𝐑²) will also be close to 0.47.
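The numbers in this example can be checked with a few lines of arithmetic. The sketch below evaluates equation (12) and the bound obtained by setting it greater than 1/3 and solving for 𝜌𝜌*, which gives 𝜌𝜌* > (T−3)/(T+3). The function names are hypothetical and exist only for this illustration.

```python
import math

def var_R(T, rho, rho_star):
    """Approximate variance of R from equation (12)."""
    return (1.0 / T) * (1.0 + rho * rho_star) / (1.0 - rho * rho_star)

def min_rho_product(T):
    """Value of rho * rho_star at which var_R equals 1/3."""
    return (T - 3) / (T + 3)

T = 20
threshold = math.sqrt(min_rho_product(T))   # case rho = rho_star

print(f"rho must exceed      : {threshold:.2f}")                  # 0.86
print(f"Var(R) at rho = 0.9  : {var_R(T, 0.9, 0.9):.3f}")          # 0.476
print(f"Var(R) at the bound  : {var_R(T, threshold, threshold):.4f}")  # 0.3333
```

With T = 20 the bound is √(17/23) ≈ 0.86, and at 𝜌 = 0.9 the variance is about 0.476, consistent with the 0.47 quoted above.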

It has been shown that when 𝜌 is close to 1, 𝐑² can be very high, suggesting a strong relationship between 𝐘ₜ and 𝐗ₜ. However, in reality, the two series are completely independent. When 𝜌 is near 1, both series behave like random walks or near-random walks. On top of that, both series are highly autocorrelated, which causes the residuals from the regression to also be strongly autocorrelated. As a result, the Durbin-Watson statistic 𝐝 will be very low.

This is why a high 𝐑² in this context should never be taken as evidence of a true relationship between the two series.

To explore the possibility of obtaining a spurious regression when regressing two independent random walks, a series of simulations proposed by Granger and Newbold (1974) will be carried out in the next section.

4. Simulation results using Python.

In this section, we will show through simulations that regressing independent random walks on each other biases the estimation of the coefficients and invalidates the hypothesis tests on the coefficients. The Python code that produces the simulation results is presented in section 7.

The regression equation used by Granger and Newbold (1974) is given by:

𝐘ₜ = 𝛽₀ + 𝐗ₜ𝛽₁ + 𝜖ₜ

where 𝐘ₜ and 𝐗ₜ were generated as independent random walks, each of length 50. The values 𝐒 = |𝛽̂₁| / 𝐒𝐄̂(𝛽̂₁), representing the statistic for testing the significance of 𝛽₁, for 100 simulations are reported in the table below.

**Table 1: Regressing two independent random walks**

The null hypothesis of no relationship between 𝐘ₜ and 𝐗ₜ is rejected at the 5% level if 𝐒 > 2. This table shows that the null hypothesis (𝛽₁ = 0) is wrongly rejected in about three quarters (71 out of 100) of all cases. This is awkward because the two variables are independent random walks, meaning there is no actual relationship. Let’s break down why this happens.

If 𝛽̂₁ / 𝐒𝐄̂(𝛽̂₁) followed a 𝐍(0,1) distribution, the expected value of 𝐒, its absolute value, would be √(2/π) ≈ 0.8 (√(2/π) is the mean of the absolute value of a standard normal variable). However, the simulation results show an average of 4.59, meaning the standard error of 𝛽̂₁ is underestimated by a factor of:

4.59 / 0.8 = 5.7

In classical statistics, we usually use a t-test threshold of around 2 to check the significance of a coefficient. However, these results show that, in this case, you would need to use a threshold of about 11.4 to properly test for significance:

2 × (4.59 / 0.8) = 11.4
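The same arithmetic can be redone without rounding √(2/π) to 0.8. This short sketch takes the average of 4.59 reported above as given and recomputes the inflation factor and the adjusted threshold. The variable names are chosen for the example.

```python
import math

mean_abs_normal = math.sqrt(2.0 / math.pi)   # E|Z| for Z ~ N(0, 1)
mean_S = 4.59                                # average of S reported in the text

inflation = mean_S / mean_abs_normal         # how much the standard error is understated
adjusted_threshold = 2.0 * inflation         # threshold replacing the usual value of 2

print(f"E|Z|               = {mean_abs_normal:.4f}")
print(f"inflation factor   = {inflation:.2f}")
print(f"adjusted threshold = {adjusted_threshold:.1f}")
```

Using the unrounded value gives a factor of about 5.75 and a threshold of about 11.5, very close to the rounded figures 5.7 and 11.4.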

Interpretation: We’ve just shown that including variables that don’t belong in the model, especially random walks, can lead to completely invalid significance tests for the coefficients.

To make their simulations even clearer, Granger and Newbold (1974) ran a series of regressions using variables that follow either a random walk or an ARIMA(0,1,1) process.

Here is how they set up their simulations:

They regressed a dependent series 𝐘ₜ on m series 𝐗ⱼ,ₜ (with j = 1, 2, …, m), varying m from 1 to 5. The dependent series 𝐘ₜ and the independent series 𝐗ⱼ,ₜ follow the same types of processes, and they tested four cases:

Case 1 (Levels): 𝐘ₜ and 𝐗ⱼ,ₜ follow random walks.
Case 2 (Differences): They use the first differences of the random walks, which are stationary.
Case 3 (Levels): 𝐘ₜ and 𝐗ⱼ,ₜ follow ARIMA(0,1,1).
Case 4 (Differences): They use the first differences of the previous ARIMA(0,1,1) processes, which are stationary.

Each series has a length of 50 observations, and they ran 100 simulations for each case.

All error terms are distributed as 𝐍(0,1), and the ARIMA(0,1,1) series are derived as the sum of a random walk and independent white noise. The simulation results are summarized in the next table.

**Table 2: Regressions of a series on m independent ‘explanatory’ series.**

Interpretation of the results:

It is seen that the probability of not rejecting the null hypothesis of no relationship between 𝐘ₜ and 𝐗ⱼ,ₜ becomes very small for m ≥ 3 when the regressions are made with random walk series (rw-levels). The 𝐑² and the mean Durbin-Watson increase. Similar results are obtained when the regressions are made with ARIMA(0,1,1) series (arima-levels).
When white noise series (rw-diffs) are used, classical regression analysis is valid, since the error series will be white noise and least squares will be efficient.
However, when the regressions are made with the differences of ARIMA(0,1,1) series (arima-diffs), that is, first-order moving average MA(1) processes, the null hypothesis is rejected, on average:

(10 + 16 + 5 + 6 + 6) / 5 = 8.6

percent of the time, which is greater than the nominal 5% level.

If your variables are random walks or close to them, and you include unnecessary variables in your regression, you will often get wrong results. High 𝐑² and low Durbin-Watson values do not confirm a true relationship but instead indicate a potentially spurious one.

5. How to avoid spurious regression in time series

It is really hard to come up with a complete list of ways to avoid spurious regressions. However, there are a few good practices you can follow to reduce the risk as much as possible.

If one performs a regression analysis with time series data and finds that the residuals are strongly autocorrelated, there is a serious problem when it comes to interpreting the coefficients of the equation. To check for autocorrelation in the residuals, one can use the Durbin-Watson test or the Portmanteau test.

Based on the study above, we can conclude that if a regression analysis performed with economic variables produces strongly autocorrelated residuals, meaning a low Durbin-Watson statistic, then the results of the analysis are likely to be spurious, whatever the value of the coefficient of determination R² observed.

In such cases, it is important to understand where the misspecification comes from. According to the literature, misspecification usually falls into three categories: (i) the omission of a relevant variable, (ii) the inclusion of an irrelevant variable, or (iii) autocorrelation of the errors. Most of the time, misspecification comes from a combination of these three sources.

To avoid spurious regression in a time series, several recommendations can be made:

The first recommendation is to select the right macroeconomic variables that are likely to explain the dependent variable. This can be done by reviewing the literature or consulting experts in the field.
The second recommendation is to stationarize the series by taking first differences. In most cases, the first differences of macroeconomic variables are stationary and still easy to interpret. For macroeconomic data, it is strongly recommended to difference the series once to reduce the autocorrelation of the residuals, especially when the sample size is small. Strong serial correlation is indeed often observed in these variables. A simple calculation shows that the first differences will almost always have much smaller serial correlations than the original series.
The third recommendation is to use the Box-Jenkins methodology to model each macroeconomic variable individually and then search for relationships between the series by relating the residuals from each individual model. The idea here is that the Box-Jenkins process extracts the explained part of the series, leaving the residuals, which contain only what cannot be explained by the series’ own past behavior. This makes it easier to check whether these unexplained parts (residuals) are related across variables.

6. Conclusion

Many econometrics textbooks warn about specification errors in regression models, but the problem still shows up in many published papers. Granger and Newbold (1974) highlighted the risk of spurious regressions, where you get a high R² paired with a very low Durbin-Watson statistic.

Using Python simulations, we showed some of the main causes of these spurious regressions, especially including variables that do not belong in the model and are highly autocorrelated. We also demonstrated how these issues can completely distort hypothesis tests on the coefficients.

Hopefully, this post will help reduce the risk of spurious regressions in future econometric analyses.

7. Appendix: Python code for simulation.

#####################################################
# Simulation code for Table 1
#####################################################

import numpy as np
import pandas as pd
import statsmodels.api as sm

np.random.seed(123)
M = 100   # number of simulations
n = 50    # length of each series
S = np.zeros(M)
for i in range(M):
    #---------------------------------------------------------------
    # Generate the data: two independent random walks
    #---------------------------------------------------------------
    epsilon_y = np.random.normal(0, 1, n)
    epsilon_x = np.random.normal(0, 1, n)

    Y = np.cumsum(epsilon_y)
    X = np.cumsum(epsilon_x)
    #---------------------------------------------------------------
    # Fit the model
    #---------------------------------------------------------------
    X = sm.add_constant(X)
    model = sm.OLS(Y, X).fit()
    #---------------------------------------------------------------
    # Compute the statistic S = |beta_1_hat| / SE(beta_1_hat)
    #---------------------------------------------------------------
    S[i] = np.abs(model.params[1]) / model.bse[1]


#------------------------------------------------------
#              Maximum value of S
#------------------------------------------------------
S_max = int(np.ceil(max(S)))

#------------------------------------------------------
#                Create bins of width 1
#------------------------------------------------------
bins = np.arange(0, S_max + 2, 1)

#------------------------------------------------------
#    Compute the histogram
#------------------------------------------------------
frequency, bin_edges = np.histogram(S, bins=bins)

#------------------------------------------------------
#    Create a dataframe
#------------------------------------------------------

df = pd.DataFrame({
    "S Interval": [f"{int(bin_edges[i])}-{int(bin_edges[i+1])}" for i in range(len(bin_edges)-1)],
    "Frequency": frequency
})
print(df)
print(np.mean(S))

#####################################################
# Simulation code for Table 2
#####################################################

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from tabulate import tabulate

np.random.seed(1)  # make the results reproducible

#------------------------------------------------------
# Definition of functions
#------------------------------------------------------

def generate_random_walk(T):
    """
    Generate a series of length T following a random walk:
        Y_t = Y_{t-1} + e_t,
    where e_t ~ N(0,1).
    """
    e = np.random.normal(0, 1, size=T)
    return np.cumsum(e)

def generate_arima_0_1_1(T):
    """
    Generate an ARIMA(0,1,1) series using the method of Granger & Newbold:
    the series is the sum of a random walk and an independent white noise.
    """
    rw = generate_random_walk(T)
    wn = np.random.normal(0, 1, size=T)
    return rw + wn

def difference(series):
    """
    Compute the first difference of a one-dimensional series.
    Returns a series of length T-1.
    """
    return np.diff(series)

#------------------------------------------------------
# Parameters
#------------------------------------------------------

T = 50           # length of each series
n_sims = 100     # number of Monte Carlo simulations
alpha = 0.05     # significance level

#------------------------------------------------------
# Definition of the simulation function
#------------------------------------------------------

def run_simulation_case(case_name, m_values=(1, 2, 3, 4, 5)):
    """
    case_name : identifier of the data-generating process:
        - 'rw-levels' : random walk (levels)
        - 'rw-diffs'  : differences of a random walk (white noise)
        - 'arima-levels' : ARIMA(0,1,1) in levels
        - 'arima-diffs'  : differences of an ARIMA(0,1,1) => MA(1)

    m_values : list of the numbers of regressors.

    Returns a DataFrame with, for each m:
        - % of rejections of H0
        - mean Durbin-Watson
        - mean adjusted R^2
        - % of adjusted R^2 > 0.7
    """
    results = []

    for m in m_values:
        count_reject = 0
        dw_list = []
        r2_adjusted_list = []

        for _ in range(n_sims):
            # 1) Generate independent Y_t and X_{j,t}
            if case_name == 'rw-levels':
                Y = generate_random_walk(T)
                Xs = [generate_random_walk(T) for __ in range(m)]

            elif case_name == 'rw-diffs':
                # Y and X are differences of a random walk, i.e. white noise
                Y = difference(generate_random_walk(T))
                Xs = [difference(generate_random_walk(T)) for __ in range(m)]
                # Note: Y and Xs now have length T-1

            elif case_name == 'arima-levels':
                Y = generate_arima_0_1_1(T)
                Xs = [generate_arima_0_1_1(T) for __ in range(m)]

            elif case_name == 'arima-diffs':
                # Differences of an ARIMA(0,1,1) => MA(1), length T-1
                Y = difference(generate_arima_0_1_1(T))
                Xs = [difference(generate_arima_0_1_1(T)) for __ in range(m)]

            else:
                raise ValueError(f"Unknown case: {case_name}")

            # 2) Prepare the data for the regression
            #    (length T for levels, T-1 for differences)
            Y_reg = Y
            X_reg = np.column_stack(Xs)

            # 3) OLS regression
            X_with_const = sm.add_constant(X_reg)  # add the intercept
            model = sm.OLS(Y_reg, X_with_const).fit()

            # 4) Global F-test: H0: all beta_j = 0
            #    Count a rejection when the p-value is below alpha
            #    (steps 4-5 and the summary lines are reconstructed from
            #    the names used in results.append below)
            if model.f_pvalue < alpha:
                count_reject += 1

            # 5) Store the Durbin-Watson statistic and the adjusted R^2
            dw_list.append(durbin_watson(model.resid))
            r2_adjusted_list.append(model.rsquared_adj)

        # Summary statistics over the n_sims replications
        reject_percent = 100 * count_reject / n_sims
        dw_mean = np.mean(dw_list)
        r2_mean = np.mean(r2_adjusted_list)
        r2_above_0_7_percent = 100 * np.mean(np.array(r2_adjusted_list) > 0.7)

        results.append({
            'm': m,
            'Reject %': reject_percent,
            'Mean DW': dw_mean,
            'Mean R^2': r2_mean,
            '% R^2_adj>0.7': r2_above_0_7_percent
        })

    return pd.DataFrame(results)
    
#------------------------------------------------------
# Run the simulation
#------------------------------------------------------

cases = ['rw-levels', 'rw-diffs', 'arima-levels', 'arima-diffs']
all_results = {}

for c in cases:
    df_res = run_simulation_case(c, m_values=[1,2,3,4,5])
    all_results[c] = df_res

#------------------------------------------------------
# Display the results as tables
#------------------------------------------------------

for case, df_res in all_results.items():
    print(f"\n\n{case}")
    print(tabulate(df_res, headers='keys', tablefmt='fancy_grid'))

References

Granger, Clive WJ, and Paul Newbold. 1974. “Spurious Regressions in Econometrics.” Journal of Econometrics 2 (2): 111–20.
Knowles, EAG. 1954. “Exercises in Theoretical Statistics.” Oxford University Press.
