Forecasting involves predicting future values based on historical values. It is one of the most common applications of time series analysis, with uses spanning retail, weather forecasting, financial markets, and more (Aggarwal, 2015). Forecasting under normal market conditions is one thing; forecasting during periods of significant change, such as after 2020, is a far greater challenge. This article shares my journey trying to forecast hybrid vehicle sales from 2010 to 2025, and the key lessons I learned about forecasting in a world of uncertainty.
Hybrid vehicles have been steadily growing in popularity, but after 2020 the market changed dramatically. The COVID-19 pandemic, changing consumer preferences, new regulations, and government incentives all contributed to dramatic changes in hybrid vehicle sales (Grand View Research, 2024). The goal of this project was to build a model that could predict future monthly hybrid vehicle sales based on historical patterns and the changing social and political landscape.
To that end, I used:
- Monthly hybrid vehicle sales data from December 2010 through March 2025
- Economic features such as gasoline prices and unemployment rates
- Count of hybrid-related policies
- Infrastructure growth metrics from EV charging station data
And I tried the following modeling techniques:
- A time series model, Seasonal AutoRegressive Integrated Moving Average (SARIMA)
- A machine learning model, LightGBM, with added features
- Two-regime modeling and a Chow test to analyze pre- and post-2020 trends
- An adjusted train-test split method and added features to better capture the post-2020 climate
The libraries used throughout were installed and imported in a Jupyter notebook.
Monthly hybrid vehicle sales from Argonne National Laboratory covering 2010 to 2025 were downloaded. Hybrid vehicles were chosen over electric vehicles because they offered a longer historical record, which is crucial for training time series models. The only format available was PDF. Luckily, pdfplumber can be used to parse this and create a dataframe.
We’ll grab the rows that start with a month, since those hold our data.
Then we’ll split the lines, define our columns, remove commas, and create a dataframe.
Then we convert to datetime, set the index, and keep our column of focus. This will serve as our hybrid sales data for all subsequent models.
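The parsing steps above can be sketched roughly as follows. This is a minimal illustration, not the original notebook code: the row layout (`<Mon> <Year> <hybrid sales> ...`), the column name `hybrid_sales`, the helper names, and the sample lines are all assumptions made for the example, since the actual PDF layout is not shown here.

```python
import pandas as pd

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

def extract_lines(pdf_path):
    """Read every text line from a PDF (requires pdfplumber)."""
    import pdfplumber  # imported lazily so the parsing below runs without it
    lines = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""
            lines.extend(text.splitlines())
    return lines

def lines_to_sales(lines):
    """Keep rows that start with a month name and build a monthly dataframe."""
    rows = []
    for line in lines:
        parts = line.split()
        # Assumed (hypothetical) layout: "<Mon> <Year> <hybrid sales> ..."
        if len(parts) >= 3 and parts[0][:3] in MONTHS and parts[1].isdigit():
            sales_value = int(parts[2].replace(",", ""))  # strip thousands separators
            rows.append((f"{parts[0][:3]} {parts[1]}", sales_value))
    df = pd.DataFrame(rows, columns=["date", "hybrid_sales"])
    df["date"] = pd.to_datetime(df["date"], format="%b %Y")
    return df.set_index("date")[["hybrid_sales"]]

# Made-up lines standing in for the text extracted from the PDF
sample = ["Monthly sales report", "Dec 2010 30,000 1,200", "Jan 2011 21,500 900"]
sales = lines_to_sales(sample)
```

With a real file, `lines_to_sales(extract_lines(path))` would replace the sample list, and the index positions in `parts` would need to match the actual table columns.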
After preparing the hybrid sales data, I began with a traditional time series model: SARIMA. SARIMA models are designed to capture time series with seasonal components. They include autoregressive relationships, where past values influence current ones, and moving averages, where past forecast errors influence current values. SARIMA was a logical first choice since hybrid vehicle sales had trend and seasonal components (higher sales at certain times of year).
The train-test split was set at 2020: data before 2020 was used for training and data from 2020 onward for testing.
Then a Dickey-Fuller test for stationarity was run. If the p-value is greater than 0.05, we cannot reject the null hypothesis of a unit root, so the series is treated as non-stationary and differencing (d=1) is applied.
I chose relatively simple SARIMA parameters to start: order = (1,1,1) and seasonal_order = (1,1,1,12), allowing for trend differencing and annual seasonality.
Monthly forecasts were produced across the test period and evaluated with Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Mean Absolute Percentage Error (MAPE). MAE is the average size of the errors in a set of predictions, RMSE is the square root of the average squared difference between predicted and actual values, and MAPE is the average absolute percentage error between predicted and actual values. Lower values are better for all three metrics.
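The three metrics follow directly from their definitions. The helper below is an illustrative sketch (the function name and the toy numbers are mine, not taken from the project) that computes them with NumPy.

```python
import numpy as np

def forecast_metrics(actual, predicted):
    """Return MAE, RMSE, and MAPE (in percent) for two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    errors = actual - predicted
    mae = np.mean(np.abs(errors))                  # average error size
    rmse = np.sqrt(np.mean(errors ** 2))           # penalizes large misses more
    mape = np.mean(np.abs(errors / actual)) * 100  # assumes no zero actual values
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape}

# Toy values: errors are -10, 20, and 0
metrics = forecast_metrics([100, 200, 400], [110, 180, 400])
```

Because RMSE squares each error before averaging, a handful of badly missed peaks can move RMSE much more than MAE, which is worth keeping in mind when the two metrics disagree.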
Unfortunately, our SARIMA model does not perform well, as can be seen in the visual below.
While SARIMA did capture some seasonal and trend components pre-2020, it missed the mark on the post-2020 boom in hybrid vehicle sales. An important lesson is that SARIMA, like other time series models, assumes future behavior is statistically similar to past behavior, an assumption that breaks down when the real world sees dramatic change.
Given the limits of SARIMA, I turned to a more flexible, feature-based machine learning model: LightGBM. LightGBM (Light Gradient Boosting Machine) is a decision tree-based algorithm designed for efficiency and scalability. It can handle nonlinear relationships and is robust to different feature types.
The model includes lagged hybrid vehicle sales for 1, 3, 6, and 12 months to capture momentum in hybrid sales. It also includes gasoline prices from the Energy Information Administration (EIA) to capture economic pressure to switch to hybrids when gasoline prices rise.
First, a copy of the hybrid vehicle sales dataframe was made and lagged features were created using the shift() function.
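A short sketch of the lag step, assuming a dataframe with a monthly datetime index and a single `hybrid_sales` column (a hypothetical name). The values are placeholders, and dropping the rows with incomplete lags is my assumption about how the missing values were handled.

```python
import pandas as pd

# Placeholder monthly data: 24 months starting December 2010
idx = pd.date_range("2010-12-01", periods=24, freq="MS")
sales = pd.DataFrame({"hybrid_sales": range(100, 124)}, index=idx)

# Work on a copy so the original sales dataframe stays untouched
features = sales.copy()
for lag in (1, 3, 6, 12):
    # shift(lag) moves values down, so each row sees the value from `lag` months earlier
    features[f"lag_{lag}"] = features["hybrid_sales"].shift(lag)

# The first 12 rows lack a 12-month lag, so they are dropped
features = features.dropna()
```

With data starting in December 2010, the first row that has all four lags is December 2011, which is why the usable timeframe begins in 2011.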
Then monthly gasoline data was downloaded from the EIA. Relevant columns were kept and renamed, missing values were dropped, dates were converted to datetime and set to the first of the month, the index was set, and the data was trimmed to the appropriate timeframe.
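The cleaning steps for the gasoline data might look like the sketch below. The raw column names, the prices, and the date window are invented for illustration; the real EIA download will have its own headers.

```python
import pandas as pd

# Made-up rows standing in for the downloaded EIA file
raw = pd.DataFrame({
    "Date": ["2010-12-15", "2011-01-15", "2011-02-15", "2011-03-15"],
    "Retail Price (Dollars per Gallon)": [2.99, 3.09, None, 3.56],
    "Unused": ["a", "b", "c", "d"],
})

# Keep and rename the relevant columns, then drop missing prices
gas = (
    raw[["Date", "Retail Price (Dollars per Gallon)"]]
    .rename(columns={"Date": "date", "Retail Price (Dollars per Gallon)": "gas_price"})
    .dropna()
)

# Convert to datetime and snap every date to the first day of its month
gas["date"] = pd.to_datetime(gas["date"]).dt.to_period("M").dt.to_timestamp()

# Index by date and trim to the modeling window (assumed here to start in 2011)
gas = gas.set_index("date").loc["2011-01-01":"2025-03-01"]
```

Snapping dates to the first of the month matters because the sales data is indexed that way; without it, a join on the date index would match nothing.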
Then the gasoline data was merged with the lagged hybrid sales. Because of the lagged features, the data’s timeframe was adjusted to begin in 2011 during the merge.
A simple month counter variable, representing the number of months since the start of the dataset, was added to help capture long-term adoption trends. Then pandas-datareader was used to pull unemployment rates from Federal Reserve Economic Data (FRED).
Feature variables and the target variable were set, the pre- and post-2020 train-test split was defined, and the model was initialized and fitted.
Then predictions were made on the test set, the model was assessed, and forecasted versus actual sales were plotted.
While LightGBM showed some improvement over SARIMA, it still struggled with the explosive growth in sales post-2022. Given how dramatically different the market was in our test data, more significant modeling changes were needed.
In time series forecasting, two-regime modeling refers to splitting the data into two distinct periods (regimes) with different underlying conditions. Instead of assuming the entire historical period is statistically similar, two-regime models acknowledge that conditions such as markets can change dramatically over time.
We define our two regimes, pre- and post-2020, along with our features and the target variable.
The model is initialized, fitted, and tested on the post-2020 split. This was because the main purpose of this model was to fit it on the post-2020 data to show statistically how different this time period was. Then the results are plotted.
We see dramatically improved results by simply fitting the model on the entire post-2020 timeframe. Because the model is evaluated on the same period it was fitted on, this is an in-sample result. Rather than serving as a forecasting model, the two-regime approach provides strong statistical evidence that hybrid sales fundamentally changed compared to pre-2020 trends.
I also ran a Chow test to confirm that there was a structural break in the data. The Chow test determines whether a break takes place at a given point in an otherwise stable time series (Sun & Wang, 2021). Not surprisingly, the results confirmed the dramatic change post-2020.
Recognizing this shift motivated a significant change to the next steps of the modeling process.
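For readers unfamiliar with the Chow test, the sketch below computes its F statistic with NumPy for a single known break point. It compares the residual sum of squares (RSS) of one pooled regression against the combined RSS of separate regressions on each side of the break. The regression used here (intercept plus linear trend) and the synthetic data are assumptions for illustration; the regression actually used in the project is not specified.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares from an ordinary least squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def chow_f_statistic(X, y, break_index):
    """Chow F statistic for a single known break at row `break_index`.

    Returns the statistic with its numerator and denominator degrees of freedom.
    """
    n, k = X.shape
    rss_pooled = rss(X, y)
    rss_split = rss(X[:break_index], y[:break_index]) + rss(X[break_index:], y[break_index:])
    f_stat = ((rss_pooled - rss_split) / k) / (rss_split / (n - 2 * k))
    return f_stat, k, n - 2 * k

# Synthetic series whose slope changes at t = 60 (a stand-in for a 2020 break)
rng = np.random.default_rng(0)
t = np.arange(120, dtype=float)
y = np.where(t < 60, 100 + 1.0 * t, 160 + 5.0 * (t - 60)) + rng.normal(0, 2, 120)
X = np.column_stack([np.ones_like(t), t])  # intercept + linear trend

f_stat, df_num, df_den = chow_f_statistic(X, y, break_index=60)
```

A p-value can then be obtained from the F distribution, for example with `scipy.stats.f.sf(f_stat, df_num, df_den)`. A large statistic indicates that the two periods are poorly described by one shared set of coefficients.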
The Alternative Fuels Data Center (AFDC) contains data on enacted federal and state laws and incentives for alternative fuels and vehicles. The goal of adding this feature was to better capture the post-2020 regulatory environment, which could contribute to the dramatic rise in hybrid vehicle sales. The data was downloaded directly from the AFDC website.
Then the data was loaded, filtered for hybrid-related laws that were enacted, converted to datetime, and saved as monthly counts in a new dataframe to be merged with the dataframe from the previous, full LightGBM model.
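One way the policy counts could be built is sketched below. The column names (`Title`, `Status`, `Enacted Date`), the status values, and the rows are all hypothetical; the real AFDC export will differ. Filling months without new laws with zero is also my assumption.

```python
import pandas as pd

# Made-up rows standing in for the AFDC laws and incentives download
laws = pd.DataFrame({
    "Title": ["Hybrid Vehicle Tax Credit", "EV Charging Grant",
              "Plug-in Hybrid Rebate", "Hybrid HOV Lane Exemption"],
    "Status": ["enacted", "enacted", "enacted", "repealed"],
    "Enacted Date": ["2021-03-10", "2021-03-22", "2021-03-30", "2022-01-05"],
})

# Keep only hybrid-related laws that were enacted
is_hybrid = laws["Title"].str.contains("hybrid", case=False)
is_enacted = laws["Status"].eq("enacted")
hybrid_laws = laws[is_hybrid & is_enacted].copy()

# Snap each enactment date to the first of its month, then count laws per month
hybrid_laws["month"] = (
    pd.to_datetime(hybrid_laws["Enacted Date"]).dt.to_period("M").dt.to_timestamp()
)
policy_counts = hybrid_laws.groupby("month").size().rename("policy_count").to_frame()

# Align to a full monthly index so months without new laws get a count of 0
months = pd.date_range("2021-02-01", "2021-04-01", freq="MS")
monthly = policy_counts.reindex(months, fill_value=0)
```

The resulting frame shares the first-of-month index used by the sales data, so it can be joined onto the existing feature dataframe directly.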
Now we’ll move our train-test split to the end of 2023. This way, the model can learn more of the post-2020 sales trends, which were dramatically different from those before 2020. The LightGBM model is initialized as before.
Then we make our predictions with the new model, compute the same metrics, and plot the actual versus predicted sales.
With these changes we saw a slightly higher MAE, but a roughly 5,000-unit drop in RMSE and a 16% reduction in MAPE. The later train-test split allowed the model to learn post-2020 trends better, and it captured some surges related to post-Inflation Reduction Act adoption. However, the model still struggles with the high peaks seen in 2024–2025. While much work still needs to be done, adjusting the training timeframe and incorporating policy-related features improved forecasting performance.
Modeling Journey
This project set out to forecast U.S. hybrid sales from 2010 to 2025, with an emphasis on navigating the complexities of post-2020 market changes. We started with a traditional time series model, SARIMA, then progressed to a machine learning model, LightGBM, with lag features and external economic features. Ultimately, feature sets were modified, with policy and economic indicators introduced, and the training timeframe was adjusted to better capture regime shifts in the data. While building robust forecasts was a challenge, this iterative approach taught valuable lessons about forecasting during periods of great change.
Lessons Learned
- Markets Change Faster Than Models: Traditional models, which assume that future behavior mirrors the past, struggle to predict accurately when real-world conditions change dramatically, as they did post-2020 due to policy, consumer, and economic changes.
- Feature Engineering Only Helps So Much: External variables like gasoline prices and unemployment rates helped improve forecasts, but could not fully capture the scale of hybrid sales growth seen after 2021. Real-world dynamics are multifactorial and often demand real-time data.
- Choosing the Right Train-Test Split is Crucial: Moving the train-test split to include some post-2020 trends significantly improved forecast accuracy. This emphasized the importance of aligning training data with the context in which the model is expected to forecast.
- Imperfect Models are Still Useful: Even when we do not get the forecast accuracy we want, imperfect models can reveal important dynamics, drive strategy, and provide valuable lessons to researchers.
Future Enhancements
- Incorporate Real-Time Data: This may include Google Trends search data for hybrid vehicles in order to better capture rapid market changes.
- Adopt More Dynamic Models: Techniques like Markov-switching models could allow for greater adaptability when the underlying data regime changes dramatically.
- Economic Forecasts: Forward-looking variables like policy rollout schedules may better inform future hybrid sales.
- Frequent Re-Training: Regularly updating the model with the latest data would allow it to adapt to ongoing changes.