For rugby fans the long wait is nearly over: like Christmas, the Six Nations comes around annually to lift our spirits in the cold winter months. If you’re not very familiar with rugby, the Six Nations is an annual tournament where the top national sides in Europe (England, France, Ireland, Italy, Scotland, Wales) each play five fixtures, alternating who plays at home or away each year. All teams compete to win, but the most coveted prize is a ‘Grandslam’, where a team wins all five of its fixtures. Given how competitive the tournament is, a Grandslam is fairly rare, and since the tournament was expanded to six sides in 2000 there have only been 13 Grandslams out of a possible 25.
This year, in the 2025 tournament, Ireland come into the competition chasing a third consecutive title, with stiff competition from France, whose domestic league (the Top 14) has been electric this season in the European Champions Cup.
With that in mind, and given that roughly half of tournaments have ended in a Grandslam, how likely is a Grandslam in 2025? In this short article we’ll explore how we can use previous fixture results and other information to make a best guess at how likely a Grandslam is. We’ll be focusing on linear models, and we’ll explore this from both the Frequentist and the Bayesian perspective. The models are built using scikit-learn and the Bayesian modelling library Bambi (which is built on top of the excellent PyMC framework).
Read on to understand how and why I estimate the probability of a Six Nations Grandslam to be around 30–40% in 2025.
In the age of AI people are increasingly used to mapping inputs to outputs with highly accurate predictions. Whether this is using LLMs to generate natural language responses, Computer Vision models to tag images or even AutoML to make predictions on tabular datasets, it’s increasingly taken for granted that these models just work.
Despite this, the relationship between inputs and outputs naturally involves a degree of uncertainty, and when you are working with small or noisy datasets, as you often are in sports, it is important to attach an estimate of uncertainty to your predictions. For example, in the opening fixture of the 2025 Six Nations France host Wales: we may predict that France will win, but how confident are we about this?
The dataset used for this analysis is drawn from publicly available sources, such as Wikipedia. The challenge with predicting 2025 fixture results is that the out-of-sample predictions are based on panel data, and team form often fluctuates across the years as squads and coaches change.
In our publicly sourced data we gather stats from 2020–2024 including:
- The age profile of squads
- The experience of squads (i.e. number of international caps)
- The number of distinct club sides that make up a national squad
- Previous table position
- Previous fixture result
- Whether there has been a change of coach since the previous tournament
The data preparation here is done using Pandas. Figure 1 shows how we merge the data on a fixture-level basis, incorporating information about the squad for each year of the tournament. Looking at this we can see that in 2025:
- Ireland have the oldest squad with a proportionally high number of caps on average. This tells us that the squad is highly established and, since Irish rugby is provincial, the squad is drawn from only four sides. Given the age profile of the side and that they have a new coach for this tournament, there may be uncertainty over whether they are at or near their ‘peak’ as a squad
- France have one of the youngest squads on average and, on average, the lowest number of caps. Despite this they have been performing exceptionally well, and came second in the 2024 tournament, suggesting their squad is on the rise
- England have the second-youngest squad, but proportionally more caps on average, suggesting they’re trying to balance youth with experience in the 2025 tournament
- Scotland have the second-oldest and one of the most capped squads in the tournament. They have an established side and, arguably, underperformed in 2024 when they came in fourth place. Their side may be nearing its peak before they go through a period of rebuilding
- Italy are in a similar position to Scotland in terms of average number of caps, but with a slightly younger age profile. There have been a number of changes in management over the years, but they come into the competition this year with an established squad and the same coach. They could surprise people this year
- Wales are in a period of rebuilding and have a young and inexperienced squad; they underperformed in the 2024 tournament, where they came in last place
Since we’re using linear methods to predict outcomes, I created a binary flag for whether or not the home side won the fixture, and for each fixture we’ll predict the probability of the home side winning (i.e. yes/no). The probability of not winning at home is, implicitly, the probability that the away side wins.
Before building a predictive model, it is important to do some exploratory analysis. Figure 2 shows the correlation plot for the features.
As you might expect, where you finished last year is highly correlated with winning this year. Likewise, your squad profile is highly correlated with winning. Having a change of coach is correlated, but not as strongly, though this may be because there are proportionally fewer instances where this happens between tournaments.
An important consideration here is whether there is correlation amongst the inputs (features) of the model, since multicollinearity can negatively impact model reliability. We can see here that there is a strong correlation between age and number of caps; this is intuitive, since older players will (on average) have more caps. To accommodate this we replace these inputs with a composite feature which represents the ratio of caps to age. We also remove a few of the less correlated inputs from the model, since less is often more when fitting a model and fewer inputs help to avoid overfitting.
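A minimal sketch of the composite feature idea, using Pandas. The column names (`avg_age`, `avg_caps`, `caps_per_year_of_age`) and the numbers are hypothetical placeholders, not the values from the article’s dataset.

```python
import pandas as pd

# Hypothetical squad-level data; team labels, column names and values
# are illustrative only.
squads = pd.DataFrame({
    "team": ["A", "B", "C", "D"],
    "avg_age": [28.5, 26.0, 27.0, 25.5],
    "avg_caps": [40.0, 20.0, 30.0, 15.0],
})

# Age and caps move together, so the two columns carry overlapping information.
age_caps_corr = squads["avg_age"].corr(squads["avg_caps"])

# Replace the two correlated inputs with a single ratio feature.
squads["caps_per_year_of_age"] = squads["avg_caps"] / squads["avg_age"]
features = squads.drop(columns=["avg_age", "avg_caps"])
```

The ratio keeps some of the information in both inputs while giving the linear model a single column to fit, so the coefficient is not split between two strongly correlated features.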
Once we have identified the features of our model we can prepare the data for training. Since this is a panel data problem we split the data as below.
Model Validation: We start by validating the model and getting an estimate of out-of-sample accuracy. To do this we back-test on previous tournaments
- Train dataset: fixture results 2020–2023
- Test dataset: fixture results in the 2024 tournament
Model Predictions: We can then create our predictive model for 2025 for out-of-sample predictions as
- Train dataset: fixture results from 2020–2024
- Prediction dataset: upcoming fixtures for 2025
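The two splits above can be sketched as a single year-based helper. The function `split_by_year` and the toy fixture table are assumptions made for illustration; the only thing taken from the article is the choice of years.

```python
import pandas as pd

# Hypothetical fixture table: one row per fixture, with placeholder team labels.
fixtures = pd.DataFrame({
    "year": [2020, 2021, 2022, 2023, 2024, 2025],
    "home": ["A", "B", "C", "A", "B", "C"],
    "away": ["B", "C", "A", "C", "A", "B"],
})

def split_by_year(df, last_train_year):
    """Train on every year up to last_train_year, hold out the following year."""
    train = df[df["year"] <= last_train_year]
    holdout = df[df["year"] == last_train_year + 1]
    return train, holdout

# Back-test: train on 2020-2023, test on 2024.
backtest_train, backtest_test = split_by_year(fixtures, 2023)

# Final model: train on 2020-2024, predict the 2025 fixtures.
final_train, to_predict = split_by_year(fixtures, 2024)
```

Splitting on the tournament year rather than at random keeps later tournaments out of the training data, which is what makes the 2024 back-test a fair proxy for predicting 2025.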
We prepare the dataset for modelling using:
- One-hot encoding for fixtures
- MinMax scaling for numeric options
You will need to apply the scaling on every dataset individually to mitigate the chance of data leakage.
We can create our Frequentist model using scikit-learn’s Logistic Regression classifier. Figure 3 shows the Confusion Matrix for the back-testing on 2020–2024 fixtures.
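As a rough sketch of this back-testing step, the snippet below fits scikit-learn’s `LogisticRegression` on synthetic stand-in data and builds a confusion matrix. The features, sample sizes and random data are assumptions for illustration only, and the default hyperparameters are not necessarily those used in the article.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(0)

# Synthetic stand-in data: two scaled features per fixture and a home-win flag.
X_train = rng.random((120, 2))
y_train = (X_train[:, 0] - X_train[:, 1] + rng.normal(0, 0.3, 120) > 0).astype(int)
X_test = rng.random((30, 2))
y_test = (X_test[:, 0] - X_test[:, 1] + rng.normal(0, 0.3, 30) > 0).astype(int)

# Fit on the training tournaments, then score the held-out tournament.
model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])  # rows: actual, cols: predicted
accuracy = accuracy_score(y_test, y_pred)

# Probability of the positive class (home win) for each held-out fixture.
home_win_prob = model.predict_proba(X_test)[:, 1]
```

The `predict_proba` column for class 1 is the home-win probability that can later be fed into a tournament simulation.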
In Figure 3 we can see that the accuracy of the model is around 73%. You may be wondering why there is a total of 30 fixtures for the 2024 predictions when there are only 15 fixtures in each tournament. The reason is that, in order to improve model accuracy, we stack the data so that we get a Home and an Away result for each fixture. This is because sides only play each other once per year and swap home and away each tournament. We, as humans, understand that France v Wales is the same as Wales v France, but the model cannot directly understand this. To do this we swap home and away, and then flip the binary flag for home win, preserving the integrity of the data.
For instance:
- 2024 Wales v France → HomeWin = 0 [original]
- 2024 France v Wales → HomeWin = 1 [inverted]
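The stacking step can be sketched in Pandas as follows. The helper name `stack_home_away` and the column names are hypothetical; the single row is the Wales v France example above.

```python
import pandas as pd

# The example fixture from the text: Wales hosted France in 2024 and lost.
fixtures = pd.DataFrame({
    "year": [2024],
    "home": ["Wales"],
    "away": ["France"],
    "home_win": [0],
})

def stack_home_away(df):
    """Append a mirrored copy of each fixture with home/away swapped and the flag flipped."""
    swapped = df.rename(columns={"home": "away", "away": "home"})
    swapped = swapped.assign(home_win=1 - swapped["home_win"])
    # Reorder the mirrored columns to match the original before stacking.
    return pd.concat([df, swapped[df.columns]], ignore_index=True)

stacked = stack_home_away(fixtures)
```

Each original fixture now appears twice, which is why 15 fixtures become 30 rows. Any squad-level features attached to the home and away sides would need to be swapped in the same way.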
Using our out-of-sample predictions for 2025 we get the below win probabilities for the upcoming 2025 tournament.
In Table 1 we see that:
- Ireland are expected to do well based on previous form and have a chance to get a ‘three-peat’ (third consecutive title)
- France are expected to do very well, particularly at home
- England have a fairly strong chance, but in all likelihood will finish mid-table
- Scotland are expected to have the slight edge in the Calcutta Cup again this year, but it will be tight
- Italy and Wales will be expected to compete to avoid the wooden spoon, with Italy expected to be slight favourites
Once we’ve estimated the probabilities for the fixtures, we can use Monte Carlo methods to simulate the tournament and estimate the probability of a Six Nations Grandslam. Monte Carlo methods use random sampling to estimate probabilities and quantify uncertainty.
To do this we run 10,000 tournament simulations, making a random choice weighted by our win probabilities. We use NumPy’s random choice method for our set of home and away fixtures with the corresponding win probabilities. Figure 4 shows a violin plot of the simulated number of wins per tournament per side.
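A minimal sketch of this kind of simulation, using NumPy’s `Generator.choice`. The three-team fixture list and its probabilities are made up for illustration, and `simulate_tournaments` is a hypothetical helper rather than the article’s own code.

```python
import numpy as np

# Hypothetical fixtures: (home, away, probability of a home win).
fixtures = [
    ("A", "B", 0.70),
    ("B", "C", 0.55),
    ("C", "A", 0.40),
]
teams = sorted({team for home, away, _ in fixtures for team in (home, away)})
team_index = {team: i for i, team in enumerate(teams)}

def simulate_tournaments(fixtures, n_sims=10_000, seed=42):
    """Return an (n_sims, n_teams) array of win counts per simulated tournament."""
    rng = np.random.default_rng(seed)
    wins = np.zeros((n_sims, len(teams)), dtype=int)
    for home, away, p_home in fixtures:
        # 1 = home win, 0 = away win, drawn with the fixture's win probability.
        home_wins = rng.choice([1, 0], size=n_sims, p=[p_home, 1 - p_home])
        wins[:, team_index[home]] += home_wins
        wins[:, team_index[away]] += 1 - home_wins
    return wins

wins = simulate_tournaments(fixtures)
mean_wins = dict(zip(teams, wins.mean(axis=0)))
```

Each fixture is drawn independently and draws are ignored, so every simulated fixture produces exactly one win. The full tournament would use all 15 fixtures and six sides.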
It’s worth noting that these points are jittered to improve the aesthetics of the plot, but overall, we can see from Figure 4 that:
- France and Ireland are clear favourites to win, though based on past form Ireland might be expected to be more likely to win a Grandslam
- It’s important to note that past form doesn’t always predict current form: for example, Ireland have a new head coach, the oldest team and are looking at a rebuild phase following the retirement of their key playmaker, Johnny Sexton
- England and Scotland could cause some upsets, but are likely to be battling it out for the upper-mid table position. Based on recent form Scotland are more likely to get 3 wins and England 2 wins, but there is more uncertainty about how England might do in the competition
- Wales and Italy are likely to be scrapping it out for the bottom of the table, with both teams fairly likely to pick up at least one win in the tournament, though this may be the Italy-Wales fixture, for which Italy are possible favourites given home advantage in 2025
Overall, this model appears in line with what many pundits have said about their expectations for the tournament. One limitation of this approach is that we’re making the assumption that the win probabilities of the fixtures are normally distributed around the point estimates from the Logistic Regression model. This may be a strong assumption.
Another assumption of the model is that the outcome of one fixture does not affect the win probabilities in other fixtures, i.e. that fixtures are independent. Personally, I don’t think this is entirely unreasonable since this is professional sport, sides are coached to have a winning mindset in each fixture, and sides are often inconsistent between fixtures. For example, Scotland performed very well against England in 2024 but went on to lose subsequent fixtures, and England went on to beat Ireland, who ultimately won the tournament.
We can avoid making strong assumptions on the distribution of win probabilities across the tournament by instead sampling these directly. To do this we can use Markov Chain Monte Carlo (MCMC) methods, which provide a Bayesian approach to estimating the distribution of model parameters through random sampling. Essentially, the models work by updating their prior beliefs on the distribution of model parameters as the sampler observes real data. Once the sampler converges around the ‘true’ distributions it samples directly from the posterior distribution of the model parameters. In the case of a Logistic Regression model, we model the target variable as a Bernoulli distribution.
There are potential drawbacks to using Bayesian Logistic Regression models: for example, they can be sensitive to the priors that the model assumes, the predicted probabilities may not be well calibrated (depending on the prior assumptions) and, in the case of a hierarchical model, there may be ‘shrinkage’. Shrinkage occurs where hierarchy levels are pulled towards the mean of the parent level; in sports modelling the impact of this is that teams that are at the top and bottom of the table may have their estimates pulled up or down towards the mean of the table.
Figure 5 shows the violin plot for the estimated distribution of wins taken directly from the posterior predictive distribution. The distributions look a little more spread out than those from our Logistic Regression, presumably reflecting the greater uncertainty in our model. Looking at the plot there may be some shrinkage, as both Wales and Italy are expected to do better than in the Logistic Regression model, and Ireland appear to have less chance of a Grandslam.
We can use our samples to directly estimate the probability of a Grandslam by simply taking the number of Grandslams over the number of simulated tournaments; this is shown in Figure 6.
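Counting Grandslams over simulated tournaments can be sketched as below. The function name `grandslam_probability` and the tiny hand-made array of win counts are assumptions for illustration; in practice the array would hold one row per simulated or posterior-sampled tournament.

```python
import numpy as np

def grandslam_probability(wins, fixtures_per_team=5):
    """Share of simulated tournaments in which any side wins all of its fixtures."""
    wins = np.asarray(wins)
    slam_in_sim = (wins == fixtures_per_team).any(axis=1)
    return slam_in_sim.mean()

# Hand-made example: 4 simulated tournaments (rows) and 6 sides (columns).
example_wins = np.array([
    [5, 3, 3, 2, 1, 1],
    [4, 4, 3, 2, 1, 1],
    [3, 5, 2, 2, 2, 1],
    [4, 3, 3, 3, 1, 1],
])

p_slam = grandslam_probability(example_wins)

# Per-side Grandslam frequency: the column-wise share of five-win tournaments.
p_slam_by_side = (example_wins == 5).mean(axis=0)
```

Because at most one side can win all five fixtures in a tournament, the per-side frequencies sum to the overall Grandslam frequency.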
We can then compare our model results to published odds. I found some odds published by a bookmaker on January 1st, which were as follows:
- No Winner 5/6 [this implies Any Winner odds of 6/5]
- Ireland 10/3
- France 9/2
- England 9/1
- Scotland 14/1
- Wales 500/1
- Italy 2000/1
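We can convert the published odds to approximate probabilities using the standard formula for fractional odds of a/b: probability = b / (a + b). For example, odds of 10/3 correspond to 3 / (10 + 3), or about 23%.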
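As a minimal sketch of this conversion, the snippet below uses the standard rule for fractional odds of a/b, where the implied probability is b / (a + b). The function name is hypothetical, and the odds are the ones listed above.

```python
from fractions import Fraction

def fractional_odds_to_probability(odds: str) -> float:
    """Convert fractional odds such as '10/3' to an implied probability b / (a + b)."""
    numerator, denominator = (int(part) for part in odds.split("/"))
    return float(Fraction(denominator, numerator + denominator))

# Grandslam odds as published in the list above.
published = {
    "No Winner": "5/6",
    "Ireland": "10/3",
    "France": "9/2",
    "England": "9/1",
    "Scotland": "14/1",
    "Wales": "500/1",
    "Italy": "2000/1",
}
implied = {name: fractional_odds_to_probability(odds) for name, odds in published.items()}

# The implied probabilities sum to more than 1; the excess is the bookmaker's margin.
total_implied = sum(implied.values())
```

These are implied rather than true probabilities, so they are best read as a rough benchmark for the model estimates.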
There are two things to consider here:
- Firstly, betting companies publish implied odds rather than true odds, since they factor in a profit margin for the odds they publish (i.e. the house always wins)
- Secondly, odds change as new information becomes available. Our analysis is relatively simple and doesn’t factor in injuries or other factors. This is important since there have been notable injuries and withdrawals ahead of the start of the tournament, so the odds will have changed. This is why I’m comparing the probabilities we’ve estimated to odds published at the start of the year, which recent injuries won’t have affected.
So how do our models compare to published odds? Our Frequentist model was surprisingly close, and our Bayesian model implied there was less certainty about the likelihood of a Six Nations Grandslam. In Table 2 you can see a comparison of the converted odds and our estimated probabilities.
Overall, our estimates don’t look unreasonable despite the relatively small and sparse dataset we were using.
Our analysis found that:
- In the 2025 Six Nations France are likely to end up punching above their weight given the relatively youthful side they’ve got
- Ireland look the most likely to get a Grandslam, but this is based on past performance. With a new coach, an ageing squad and a change of playmaker the outlook is less certain
- England’s True Odds are likely to be worse than their Implied Odds, and based on past performance they should aim for a strong mid-table position. They have one of the youngest squads but with more caps than other strong sides relative to their age profile. They have the potential to be disruptive in the tournament
- Scotland have a better chance of a Grandslam than England and are also likely to be competing for a strong mid-table position. They have the second-oldest and most experienced team after Ireland and may be at or near their peak as a squad. Could it be now or never for this squad?
- Wales and Italy are unlikely to be high performers in the 2025 Six Nations, and Italy will be vying to finish above Wales for the second year running
- There is a fairly strong chance of a Grandslam by any team, around a 30–40% chance
- This will be a very competitive tournament overall with many sides having a good chance of winning
In this article we’ve seen how we can leverage Frequentist and Bayesian methods to quantify uncertainty around the likely winners of the Six Nations in 2025. Whilst our models were relatively simple and constrained to using a small dataset, our probabilities were not too dissimilar from published odds, though these have since changed as events have developed (injuries, call-ups, etc.).
Thanks for reading this article, I hope it’s been interesting. If you’re interested in learning more about the analysis you can find the full code on my GitHub account.