Can Machine Learning Predict the World Cup?

With the FIFA World Cup set to kick off on Thursday, June 11, 2026, with the opening match at the Mexico City Stadium, I think it will be fun to build the best ML model we can to predict match outcomes. To do this, I've brought together several databases (49,000 matches) with data on Elo ratings, match results, and cup locations. From the World Cup to the Baltic Cup, with matches from 1872 to 2026, we will take a probabilistic approach to the sport.

We'll compare the performance of several ML models, including

multinomial regression

multinomial ridge / elastic-net model

LightGBM

We will also work to understand the strengths and weaknesses of our models to create a well-calibrated model that correctly predicts home wins 86% of the time. By weighing model performance, calibration, and complexity, we will find the best model for our data.

Soccer by the Numbers

Distribution of total goals per match in the training dataset, showing a strong concentration of matches with low goal totals and a long right tail of increasingly rare high-scoring games. Illustration by Author.

A lot of people say soccer is sleep-inducing. As a soccer fan, I disagree, but to be fair, this isn't without reason. The majority of matches end with fewer than 5 goals, and anything above 20 is an anomaly, if not impossible. In contrast, it's not uncommon for one player to score more than 50 points in an NBA game. But despite the pace, pubs from England to botecos in Rio remain full.

What critics don't understand is that the low score can make a game more interesting, as it makes it harder for teams to gain a substantial lead, keeping fans on edge until the end. Unfortunately, this also means matches end in a draw close to 22% of the time, which can be infuriating. Yet the sport remains as popular as ever.

Annual count of international matches in the pre-2018 training dataset, showing the long-term expansion of international football activity from sparse early records to consistently high match volumes after the late twentieth century. Illustration by Author.

The fact that so many matches end in a draw actually becomes a modeling problem later, but before we get to that, let's go over how we put this data together.

Stitching the data together

Oftentimes the best way to improve a model is to simply get more data. We will be working with international_results.csv, international_team_ratings.csv, and international_goalscorers.csv.

We want to match international_results.csv to international_team_ratings.csv so we can use Elo ratings. This could be simple, but as you might've guessed, the team names don't match up perfectly, so we need to turn to text processing unless we want to check 336 teams individually. We also need to be extremely careful about when the Elo rating was updated. We could take the Elo on the same day the match occurs, but that would be a source of data leakage, as Elo ratings are updated only after the match. Using it as a feature is tempting but problematic.

We must take the most recent Elo rating from before the match, and as an additional engineered feature we keep track of the time since the latest Elo update, positing that more recent ratings will be more informative than older ones. The code for joining these tables and the full project is available in the Appendix.
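The leakage-safe lookup described above can be sketched in base R. This is a minimal illustration, not the project's actual join: the `ratings` columns (`team`, `rating_date`, `rating`) and the helper `latest_rating_before()` are hypothetical names, and the strict "earlier than match day" rule is an assumption drawn from the reasoning above.

```r
# Hypothetical toy ratings table; column names are assumptions for illustration.
ratings <- data.frame(
  team = c("brazil", "brazil", "brazil", "chile"),
  rating_date = as.Date(c("2017-01-01", "2017-06-10", "2017-06-15", "2017-06-01")),
  rating = c(2000, 2010, 2025, 1800)
)

# Return the latest rating strictly before the match date, plus its age in days.
latest_rating_before <- function(team, match_date, ratings) {
  # A same-day rating may already include the match result, so exclude it.
  prior <- ratings[ratings$team == team & ratings$rating_date < match_date, ]
  if (nrow(prior) == 0) {
    return(data.frame(rating_pre_match = NA_real_, rating_age_days = NA_real_))
  }
  latest <- prior[which.max(as.numeric(prior$rating_date)), ]
  data.frame(
    rating_pre_match = latest$rating,
    rating_age_days = as.numeric(difftime(match_date, latest$rating_date, units = "days"))
  )
}

# The 2017-06-15 rating is skipped because it shares the match date.
res <- latest_rating_before("brazil", as.Date("2017-06-15"), ratings)
print(res)
```

Returning `NA` when no earlier rating exists keeps the gap visible to the model rather than silently borrowing a later rating.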

Top tournaments by match count in the training dataset, highlighting the dominance of friendlies and FIFA World Cup qualification matches relative to all other international competitions. Illustration by Author.

international_results.csv

Field type	Examples
Match identification	`source_match_id`, `date`, `season`, `competition`
Teams	`home_team`, `away_team`
Final result	`home_score`, `away_score`, `match_result`, `result_class`
Context	`neutral`, `tournament`, `city`, `country`

international_team_ratings.csv

Feature	Meaning
`home_rating_pre_match`	Home team Elo before kickoff
`away_rating_pre_match`	Away team Elo before kickoff
`rating_diff`	Home Elo minus away Elo
`rating_age_days_home`	How stale the home team rating is
`rating_age_days_away`	How stale the away team rating is

international_goalscorers.csv

Feature idea	Meaning
Unique scorers in recent matches	Whether a team depends on one scorer or many
Goals by top scorer	Concentration of scoring
Recent scoring form	Attacking output before this match

Comparison of match-result class distributions across the training and test splits, showing broadly similar outcome shares with home wins as the most frequent result, followed by away wins and draws. Illustration by Author.

Because we're doing a time-series prediction, we need to ensure our split respects the time order. We'll evaluate our model on all games from 2018 onward, which will be roughly 8,000 matches.

Effective split	Approximate date logic
model train	earlier part of pre-2018 data
validation	latest ~20% of the pre-2018 training pool
test	2018 onward
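A minimal sketch of the chronological split in the table above, using base R. The toy `matches` table (one fixture per month) is hypothetical, and rounding the validation share up with `ceiling()` is an assumption.

```r
# Hypothetical toy fixtures: one match per month from 2010 through 2020.
matches <- data.frame(
  date = seq(as.Date("2010-01-01"), by = "month", length.out = 132)
)
matches <- matches[order(matches$date), , drop = FALSE]

# Everything from 2018 onward is held out as the test set.
test_start <- as.Date("2018-01-01")
pool <- matches[matches$date < test_start, , drop = FALSE]
test <- matches[matches$date >= test_start, , drop = FALSE]

# The latest ~20% of the pre-2018 pool becomes validation; the rest is training.
n_val <- ceiling(0.20 * nrow(pool))
n_train <- nrow(pool) - n_val
train <- pool[seq_len(n_train), , drop = FALSE]
validation <- pool[seq(n_train + 1, nrow(pool)), , drop = FALSE]

print(c(train = nrow(train), validation = nrow(validation), test = nrow(test)))
```

Because the rows are sorted by date before slicing, no training match can occur after a validation or test match.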

Engineered Features

Overview of engineered feature distributions used for model training, showing prior match counts, recent draw rates, goal-difference measures, goals-for and goals-against rates, and points-per-match indicators across home and away team histories. Illustration by Author.

We want to move from basic match-level predictors towards richer pre-match features that capture team strength, attacking and defensive quality, home/away effects, matchup balance, goalkeeper strength, and historical performance trends.

1. Draw-modeling features

The most glaring failure of our baseline multinomial logistic regression model was its weak performance at classifying draws. While the model could calculate the probability of a draw because we defined the target variable as match_result ∈ {H, D, A} (Home win, Draw, Away win), Draw was simply never the most likely outcome. We can see this by the missing column for Draws in the confusion matrix.

Row-normalized test confusion matrix for the best baseline model, showing that the model predicts only home and away outcomes, with home wins most often classified correctly and draws never predicted as a separate class. Illustration by Author.

This poor draw performance is not specific to one model family. When we isolate high-confidence errors (cases where the model's predicted class was wrong and its maximum predicted probability was at least 0.60), the same pattern appears across models: they are systematically overconfident in home wins. Many matches that actually ended in draws were assigned a confident home-win prediction, suggesting that the models capture team-strength direction better than match-level uncertainty or draw likelihood.
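The high-confidence error filter described above can be sketched as follows. The probability matrix and outcomes are made-up toy values; only the 0.60 threshold comes from the text.

```r
# Toy predicted probabilities (rows = matches) and actual outcomes; values are illustrative only.
probs <- matrix(
  c(0.70, 0.20, 0.10,
    0.45, 0.30, 0.25,
    0.15, 0.20, 0.65,
    0.62, 0.25, 0.13),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("H", "D", "A"))
)
actual <- c("D", "H", "A", "H")

# Predicted class = column with the highest probability; confidence = that probability.
pred <- colnames(probs)[max.col(probs, ties.method = "first")]
confidence <- apply(probs, 1, max)

# A high-confidence error is a wrong prediction made with confidence >= 0.60.
high_conf_wrong <- pred != actual & confidence >= 0.60

# Cross-tabulate actual versus predicted class for those errors.
print(table(actual = actual[high_conf_wrong], predicted = pred[high_conf_wrong]))
```

In this toy data the single flagged row is an actual draw predicted as a home win, the same pattern the article reports.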

Counts of high-confidence wrong predictions on the test set, comparing three model families (glmnet multinomial ridge, LightGBM, and multinomial) and showing that most confident errors occur when actual draws are predicted as home wins. Illustration by Author.

To address this 'blindness' to the draw option, we can engineer features such as abs_rating_diff, home_draw_rate_last_5, and form_draw_rate_mean_last_5, along with binary context features like neutral, flag_is_world_cup, and flag_is_friendly, which indicate whether the match is on neutral ground, at the World Cup, or a friendly.

Feature group	Meaning	Examples
Elo closeness	Measures how evenly matched the teams are. Smaller rating gaps are especially relevant for draw probability.	`abs_rating_diff`
Recent draw tendency	Measures how often each team's prior matches ended in draws.	`home_draw_rate_last_5`, `away_draw_rate_last_10`
Combined draw tendency	Captures whether both teams have recently been draw-prone.	`form_draw_rate_mean_last_5`, `form_draw_rate_mean_last_10`
Match context	Tournament and venue indicators that may affect draw frequency.	`neutral`, `flag_is_world_cup`, `flag_is_friendly`
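A rough sketch of how the features in the table could be derived from columns that already exist before kickoff. Two things are assumptions here: that `form_draw_rate_mean_last_5` is the simple average of the two teams' draw rates, and that the `tournament` labels are "FIFA World Cup" and "Friendly".

```r
# Hypothetical fixtures with pre-match inputs already attached.
fixtures <- data.frame(
  home_rating_pre_match = c(1850, 1620),
  away_rating_pre_match = c(1700, 1640),
  home_draw_rate_last_5 = c(0.2, 0.4),
  away_draw_rate_last_5 = c(0.4, 0.6),
  tournament = c("FIFA World Cup", "Friendly"),
  neutral = c(TRUE, FALSE)
)

# Elo closeness: smaller absolute gaps mean more evenly matched teams.
fixtures$abs_rating_diff <- abs(
  fixtures$home_rating_pre_match - fixtures$away_rating_pre_match
)

# Combined draw tendency (assumed to be the mean of both teams' recent draw rates).
fixtures$form_draw_rate_mean_last_5 <- (
  fixtures$home_draw_rate_last_5 + fixtures$away_draw_rate_last_5
) / 2

# Binary context flags derived from the tournament label (labels are assumptions).
fixtures$flag_is_world_cup <- as.integer(fixtures$tournament == "FIFA World Cup")
fixtures$flag_is_friendly <- as.integer(fixtures$tournament == "Friendly")

print(fixtures[, c("abs_rating_diff", "form_draw_rate_mean_last_5",
                   "flag_is_world_cup", "flag_is_friendly")])
```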

Final LightGBM predicted probabilities by outcome class. Illustration by Author.

With these features, our model can now better discriminate between Home/Away wins and draws, as evidenced by a 3.3% increase in true-positive draw predictions. This is still low, given that ~20% of matches end in draws. So our features help, but not by much. This suggests that it could be worth building a model dedicated to draw modeling, with the target variable match_result ∈ {D, ¬D}, but for now we need to engineer more features.

¬D denotes "not D": the target variable is 1 if the match ends in a draw and 0 if it does not.
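For illustration, the binary draw target could be derived from the three-class label like this. The vector below is toy data, and `is_draw` is a hypothetical column name.

```r
# Toy three-class outcomes: H = home win, D = draw, A = away win.
match_result <- c("H", "D", "A", "D", "H")

# Binary target: 1 if the match ended in a draw, 0 otherwise.
is_draw <- as.integer(match_result == "D")

# The mean of the binary target is the draw base rate in the sample.
print(mean(is_draw))
```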

Test confusion matrix for the best LightGBM validation model. Illustration by Author.

2. Elo features

The average team has an Elo slightly above 1500; this is near Saudi Arabia, Iceland, and Haiti for FIFA 2026. When we graph the distributions of Home wins, Draws, and Away wins, we can see that as the rating difference decreases, Draws become increasingly likely. Our distributions are also slightly shifted to the left, indicating a small home advantage, as expected.

We'd be leaving LogLoss points on the table if we relied on pre-match Elo as our only feature. To get the most from the data, we also include the rating difference and the age of each team's rating:

Feature	Meaning
`home_rating_pre_match`	Home team Elo rating before kickoff.
`away_rating_pre_match`	Away team Elo rating before kickoff.
`rating_diff`	Home team Elo minus away team Elo before kickoff. Positive values favor the home team.
`rating_age_days_home`	Days since the home team's Elo rating was last updated.
`rating_age_days_away`	Days since the away team's Elo rating was last updated.

Multinomial probability curves by rating difference. Illustration by Author.

3. Rolling past-performance features

A critic could argue that using rolling past performance and Elo together is not a good idea, since they both model team strength, which would add redundant or highly correlated features to the model.

Rolling past performance does capture team strength, but it is specifically there to help model team momentum. Winning streaks are a very real thing in sports. In fact, the current top pick by supercomputers is Spain. One reason they are predicted first is their historic 31-match unbeaten streak coming into FIFA 2026.

Feature group	Meaning	Examples
Recent points per match	Average points earned over each team's previous 5 or 10 matches.	`home_points_per_match_last_5`, `away_points_per_match_last_10`
Recent goal difference	Average goals scored minus goals conceded over prior matches.	`home_goal_diff_per_match_last_5`, `away_goal_diff_per_match_last_10`
Recent draw rate	Share of prior matches that ended in a draw.	`home_draw_rate_last_5`, `away_draw_rate_last_10`
Home-away form differences	Difference between the home and away teams on the same rolling metric.	`form_points_diff_last_5`, `form_goal_diff_diff_last_10`
Prior match counts	Number of previous matches available before the fixture.	`home_prior_matches`, `away_prior_matches`
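One way to compute rolling features like those in the table without leaking the current match is to average only over earlier rows. The helper `lagged_rolling_mean()` and the toy points vector are hypothetical; the sketch assumes one team's matches are already sorted by date.

```r
# Mean of the previous k values, excluding the current row (NA when there is no history).
lagged_rolling_mean <- function(x, k) {
  vapply(seq_along(x), function(i) {
    if (i == 1) return(NA_real_)
    mean(x[max(1, i - k):(i - 1)])
  }, numeric(1))
}

# Toy points history for a single team in chronological order (3 = win, 1 = draw, 0 = loss).
points <- c(3, 1, 0, 3, 3, 1, 3)

# Points per match over the previous 5 matches, known before each kickoff.
points_per_match_last_5 <- lagged_rolling_mean(points, 5)

# Number of earlier matches available before each fixture.
prior_matches <- seq_along(points) - 1

print(data.frame(points, points_per_match_last_5, prior_matches))
```

Early rows average over fewer than five matches, which is why a prior-match count is useful alongside the rolling mean.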

4. Attack and defense form features

While our model tried to capture attacking and defending team strength through points, this is where it falls short of supercomputer approaches. Modern approaches often also incorporate player data, which is invaluable in computing a team's strengths. Because we're working only with game-level data, our attacking and defensive features are computed from previous match results: recent scoring rates, recent conceding rates, the scoring-rate difference, and the conceding-rate difference.

Feature group	Meaning	Examples
Recent scoring rate	Average goals scored per match over the previous 5 or 10 matches.	`home_goals_for_per_match_last_5`, `away_goals_for_per_match_last_10`
Recent conceding rate	Average goals conceded per match over the previous 5 or 10 matches.	`home_goals_against_per_match_last_5`, `away_goals_against_per_match_last_10`
Scoring-rate difference	Home team's recent scoring rate minus away team's recent scoring rate.	`form_goals_for_diff_last_5`, `form_goals_for_diff_last_10`
Conceding-rate difference	Home team's recent conceding rate minus away team's recent conceding rate. Lower values favor the home team defensively.	`form_goals_against_diff_last_5`, `form_goals_against_diff_last_10`

Correlation heatmap of numeric model features. Illustration by Author.

Grid Search

Because large search grids can overfit in cross-validation, and grid search scales multiplicatively, parameters are searched on a logarithmic scale (1e-5, 1e-4, 1e-3, 1e-2), except for parameters like alpha, which must lie between zero and one.

glmnet_alpha controls the elastic-net mix between ridge and lasso regression, where zero is pure ridge and one is pure lasso.

multinomial_decay penalizes large coefficients. Larger decay values can reduce overfitting, but excessive decay can lead to underfitting.

Grid search cost ≈ number of configurations tested × time to train one model
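To make the multiplicative growth concrete, here is a small sketch using base R's `expand.grid()`. The parameter values and the 30-second fit time are invented for illustration and are not the grid used in this project.

```r
# Hypothetical grid: three values each for two parameters, plus a log-spaced penalty.
grid <- expand.grid(
  learning_rate = c(0.01, 0.05, 0.10),
  num_leaves = c(15, 31, 63),
  lambda_l2 = 10^seq(-5, -2)  # 1e-5, 1e-4, 1e-3, 1e-2
)

# The number of configurations is the product of the grid sizes: 3 * 3 * 4.
n_configs <- nrow(grid)

# Rough cost estimate, assuming (hypothetically) 30 seconds per model fit.
seconds_per_fit <- 30
estimated_minutes <- n_configs * seconds_per_fit / 60

print(c(configurations = n_configs, minutes = estimated_minutes))
```

Adding a fourth parameter with five candidate values would multiply the count by five, which is why log-spaced grids with only a few points per parameter are attractive.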

Model family	Grid/configurations shown	What was tuned
Baselines	`majority_baseline`, `frequency_baseline`, `rating_diff_multinom`	Mostly not tuned; comparison baselines
glmnet	`alpha = 0, .25, .5, .75, 1`	Elastic-net mixing parameter
multinom	`decay = 0, 1e-5, 1e-4, 1e-3, 1e-2`	L2 weight decay / coefficient shrinkage
LightGBM	`less_regular`, `deeper`, `more_regular`, `current_final`, `l2_regularized`, `shallower`, `l1_l2_regularized`, `compact_robust`, `faster_small`, `slower_small`	Named bundles of tree-depth, learning-rate, boosting-round, and regularization settings

LightGBM was the most complex model family in the comparison. Unlike the baseline models, which used few or no tuning parameters, LightGBM required choices about tree complexity, learning rate, boosting rounds, and regularization. This made it more flexible, but also increased the risk of overfitting if the parameters were not tuned carefully. We also need to take care not to use a model that is more complicated than our data requires, as we could lose out on interpretability.

The GBM parameters were tuned by evaluating a compact grid of LightGBM configurations. These configurations varied tree complexity, learning speed, number of boosting rounds, and regularization strength, keeping the model with the best log loss. Below is a list of the LightGBM parameters.

Parameter	Meaning
`learning_rate`	How much each new tree is allowed to change the model. Lower values learn more slowly but can generalize better.
`num_iterations` / `nrounds`	Number of boosting rounds, meaning how many trees are added. More trees can improve performance but may overfit.
`num_leaves`	Controls how complex each tree can be. More leaves allow more detailed patterns but increase overfitting risk.
`max_depth`	Maximum depth of each tree. Deeper trees capture more complex interactions. Shallower trees are simpler and safer.
`min_data_in_leaf`	Minimum number of observations required in a leaf. Higher values make the model less sensitive to small noisy patterns.
`lambda_l1`	L1 regularization. Pushes some effects toward zero, making the model simpler.
`lambda_l2`	L2 regularization. Shrinks large effects and reduces overconfidence.
`feature_fraction`	Fraction of features used for each tree. Using fewer features can reduce overfitting.
`bagging_fraction`	Fraction of rows used for each tree. Using fewer rows can reduce overfitting.
`bagging_freq`	How often row subsampling is applied. If set to `0`, bagging is disabled.
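As a rough illustration of how the parameters in the table fit together, one configuration could be written as a named list in R. Every value below is a placeholder, not one of the tuned configurations from this project. The `objective`, `num_class`, and `metric` entries are the standard LightGBM settings for a three-class problem.

```r
# Hypothetical LightGBM configuration; all values are placeholders for illustration.
params <- list(
  objective = "multiclass",   # three outcomes: home win, draw, away win
  num_class = 3,
  metric = "multi_logloss",   # multiclass log loss, the selection metric in this article
  learning_rate = 0.05,
  num_leaves = 15,
  max_depth = 4,
  min_data_in_leaf = 50,
  lambda_l1 = 0,
  lambda_l2 = 1e-2,
  feature_fraction = 0.8,
  bagging_fraction = 0.8,
  bagging_freq = 1            # subsample rows every iteration; 0 would disable bagging
)

# A list like this is typically passed as the `params` argument when training,
# with the number of boosting rounds supplied separately.
str(params)
```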

Validation log loss by model configuration. Illustration by Author.

Best validation log loss by model family. Illustration by Author.

Final Model

The officially selected model was LightGBM with the safe_plus_form_compact feature set, using 20 pre-match features drawn from Elo ratings, tournament context, and lagged team summaries. It was selected based on the lowest validation-set multiclass log loss, with the test set reserved for final reporting.

The selected LightGBM model achieved a validation log loss of 0.893 and a test log loss of 0.873. Its validation result was the best within the model comparison, but the margin over regression was small: multinomial regression trailed by only about 0.002 log-loss points on validation. On the held-out test set, multinomial regression slightly outperformed LightGBM on both log loss and macro F1.
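For reference, multiclass log loss is the average negative log of the probability assigned to the outcome that actually happened. The helper below is a minimal sketch with toy inputs; `multiclass_log_loss()` is a hypothetical name, and clipping at `eps` is a common safeguard rather than something taken from the project code.

```r
# Average negative log-probability of the true class; probs needs class names as column names.
multiclass_log_loss <- function(probs, actual, eps = 1e-15) {
  # Pick, for each row, the probability in the column matching the actual outcome.
  idx <- cbind(seq_along(actual), match(actual, colnames(probs)))
  p <- pmin(pmax(probs[idx], eps), 1 - eps)  # clip to avoid log(0)
  -mean(log(p))
}

classes <- c("H", "D", "A")
actual <- c("H", "D", "A")

# A model that always predicts one-third for each class scores log(3).
uniform <- matrix(1 / 3, nrow = 3, ncol = 3, dimnames = list(NULL, classes))

# A toy model that leans toward the right answer scores lower (better).
sharper <- matrix(
  c(0.60, 0.25, 0.15,
    0.30, 0.40, 0.30,
    0.20, 0.25, 0.55),
  ncol = 3, byrow = TRUE, dimnames = list(NULL, classes)
)

print(c(uniform = multiclass_log_loss(uniform, actual),
        sharper = multiclass_log_loss(sharper, actual)))
```

The uniform score of log(3), about 1.099, gives a floor for comparison: the reported losses of 0.893 and 0.873 beat blind guessing, but not by a wide margin.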

Incremental log loss across feature tiers. Illustration by Author.

This means the result should be interpreted cautiously. LightGBM is the officially selected predictive model, but the evidence does not show that gradient boosting clearly dominates simpler regression models for the given data. Regression models remain highly important because they are easier to interpret and perform nearly as well as, and on some test metrics slightly better than, other methods.

Baseline model metrics across test and validation splits. Illustration by Author.

Feature engineering produced similarly modest gains. Compact lagged features improved validation log loss relative to baseline, but the test improvement was tiny. Goalscorer features did not meaningfully improve log loss in the model comparison.

Classwise LightGBM F1 by feature tier. Illustration by Author.

The clearest limitation was draw prediction. The selected model almost never predicted draw as the top class: on the test set, it correctly predicted only 2 draws out of 1,784 actual draws, for draw recall of 0.11%. This suggests that the model's probability estimates may still contain useful information, but argmax classification remains strongly biased toward home and away wins, making a separate model for draw modeling a reasonable next step. Elo and compact pre-match form provide a useful signal stack, but the gains over strong baselines are incremental.

The model is much better at predicting home wins than away wins on the test set:

It correctly identifies about 87% of actual home wins
It correctly identifies about 63% of actual away wins

The model is also capable of outputting a probability distribution over Home, Draw, and Away wins, which is often more useful than just a single hard prediction.

Calibration

Final model confidence by prediction correctness. Illustration by Author.

The baseline-plus models are broadly well calibrated on the test set. Across confidence bins, predicted confidence tracks observed accuracy: when the models are moderately confident, they are correct at roughly the corresponding rate, and when confidence rises, observed accuracy rises with it. The deviations from the ideal calibration line are modest, suggesting that the models' probability estimates are generally usable rather than just a rank-ordering of outcomes.

The plot below measures calibration of the top predicted class (the model's confidence in whichever outcome it chose), not calibration for home wins, draws, and away wins separately. A model can therefore look well calibrated overall while still misestimating one class, especially draws. The aggregate calibration plot supports the claim that the models' confidence scores are broadly trustworthy, but it does not, by itself, show that the draw probabilities are well calibrated.

Test calibration curves for baseline-plus models. Illustration by Author.
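A top-class calibration table of the kind plotted above can be built by binning each prediction's confidence and comparing the mean confidence in a bin with the share of correct predictions. This is a base-R sketch on toy data; `top_label_calibration()` and the 0.1-wide bins are assumptions, not the project's plotting code.

```r
# Bin predictions by top-class confidence and compare confidence with accuracy per bin.
top_label_calibration <- function(probs, actual, breaks = seq(0, 1, by = 0.1)) {
  pred <- colnames(probs)[max.col(probs, ties.method = "first")]
  confidence <- apply(probs, 1, max)
  bin <- cut(confidence, breaks = breaks, include.lowest = TRUE)
  # Sum per bin first, then divide by the bin size to get means.
  out <- aggregate(
    data.frame(mean_confidence = confidence,
               accuracy = as.numeric(pred == actual),
               n = 1),
    by = list(bin = bin),
    FUN = sum
  )
  out$mean_confidence <- out$mean_confidence / out$n
  out$accuracy <- out$accuracy / out$n
  out
}

# Toy probabilities and outcomes, for illustration only.
probs <- matrix(
  c(0.55, 0.25, 0.20,
    0.52, 0.28, 0.20,
    0.75, 0.15, 0.10,
    0.10, 0.18, 0.72),
  ncol = 3, byrow = TRUE, dimnames = list(NULL, c("H", "D", "A"))
)
actual <- c("H", "D", "H", "A")

cal <- top_label_calibration(probs, actual)
print(cal)
```

A well-calibrated model has `accuracy` close to `mean_confidence` in every bin. Because only the top class is scored, a class that is rarely chosen, such as draws, barely influences this table.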

The class-specific calibration plots show where that aggregate picture holds and where it becomes more complicated. Home-win and away-win probabilities follow the ideal calibration line closely across most bins: as the model assigns higher probability to either outcome, the observed frequency rises at roughly the same rate. In practical terms, the model's home and away probabilities behave like meaningful probabilities, not just scores.

Calibration bins for the best validation model. Illustration by Author.

Draws are different. The model's draw probabilities are reasonably calibrated within their range, but that range is narrow. It rarely assigns draw probabilities much above the low-to-middle range, even when the match is relatively balanced.

This is the central distinction: the model doesn't ignore draws; it usually treats them as risk factors rather than likely outcomes. Draw probabilities may still be useful for measuring draw risk, but draws seldom become the model's top prediction, which helps explain the persistent weakness in draw recall.

Test calibration by class for Model 33. Illustration by Author.

Rating Difference Analysis

The rating-difference analysis shows why draws are structurally difficult for the model. Observed draw rates are highest when the teams are closely matched and decline as the absolute Elo rating gap widens. All three model families learn this broad pattern: their predicted draw probabilities also fall as matches become more lopsided.

The failure is not directional but scalar. In the most evenly matched fixtures, the observed draw rate is roughly one-third, while the models assign draw probabilities closer to one-quarter. They correctly identify balanced matches as more draw-prone, but they do not raise the draw probability enough. As a result, the model can recognize draw risk without often selecting a draw as the most likely outcome. This reconciles the apparent contradiction between reasonable draw calibration and weak draw recall: the probabilities move in the right direction, but usually not far enough to win the argmax decision, that is, the rule of picking the class with the highest predicted probability.
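The argmax point can be made concrete with a little arithmetic. The home and away probabilities sum to 1 minus the draw probability, so the larger of the two is at least half of that. A draw can therefore only be the top class when its probability is at least one-third. The sketch below uses toy probabilities, and `draw_can_be_top()` is a hypothetical helper that checks this necessary condition.

```r
classes <- c("H", "D", "A")

# Toy balanced fixtures: draw at 0.25 loses the argmax; draw at 0.34 narrowly wins it.
probs <- matrix(
  c(0.38, 0.25, 0.37,
    0.33, 0.34, 0.33),
  ncol = 3, byrow = TRUE, dimnames = list(NULL, classes)
)

# Argmax rule: predict the class with the highest probability in each row.
predicted <- classes[max.col(probs, ties.method = "first")]

# Necessary condition for draw to be the top class: p_draw >= (1 - p_draw) / 2.
draw_can_be_top <- function(p_draw) p_draw >= (1 - p_draw) / 2

print(predicted)
print(draw_can_be_top(c(0.25, 0.34)))
```

With draw probabilities near one-quarter, as the article reports for evenly matched fixtures, the condition fails, so even a reasonable draw estimate never becomes the hard prediction.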

Feature Importance

As you might expect, the most important feature for our model is the rating difference, followed by whether the match was on neutral ground, a distant second. By checking the feature importance, we can see which of our engineered features provided meaningful signal.

Model 33 LightGBM feature importance by gain. Illustration by Author.

Conclusion

I think this is a good time to discuss dataset size and model choice. Generally, the larger and more complex the dataset, the more reason we have to choose a more complicated model. As we saw in this example, the gains from switching from regression to LightGBM were very small; this is a good sign that trying an even more complex model on this data will not yield better predictions. Soccer forecasting is less about finding a magic algorithm and more about building leakage-safe features, comparing interpretable baselines, and asking whether the model's confidence is deserved.

For now, one thing is clear: we're gonna need more data if we want to get a better prediction, particularly player-level data; knowing whether Neymar is sitting out is crucial. The granularity of the data is also important if we want to update our forecast as the game progresses.

Appendix

The code for the entire project can be found on my GitHub
The data source has a Creative Commons CC0-1.0 license

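# Normalize a team name into a join key (the native pipe |> needs R >= 4.1)
make_team_clean <- function(team_name) {
  team_name |>
    stringr::str_squish() |>
    stringi::stri_trans_general("Latin-ASCII") |>
    stringr::str_to_lower() |>
    stringr::str_replace_all("[^a-z0-9]+", "_")
}

stringr::str_squish()
- Trims leading and trailing whitespace and collapses repeated internal whitespace.
stringi::stri_trans_general("Latin-ASCII")
- Converts accented Latin characters to plain ASCII characters.
stringr::str_to_lower()
- Converts the name to lowercase.
stringr::str_replace_all("[^a-z0-9]+", "_")
- Replaces anything that is not a lowercase letter or number with an underscore.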

