A health-tech company shipped a readmission-prediction model in early 2024. This is a composite case drawn from patterns documented by Hernán & Robins in Nature Machine Intelligence, but each element maps to real deployment failures.
Accuracy on the held-out test set: 94%. The operations team used it to decide which patients to prioritize for follow-up calls. They expected readmission rates to drop.
Rates went up.
The model had captured every correlation in the data: older patients, certain zip codes, specific discharge diagnoses. It performed exactly as designed. The test metrics were clean. The confusion matrix looked textbook.
But when the team acted on these predictions (calling patients flagged as high-risk, rearranging discharge protocols), the relationships in the data shifted beneath them. Patients who received extra follow-up calls didn’t improve. The ones who kept getting readmitted shared a different profile entirely: they couldn’t afford their medications, lacked reliable transportation to follow-up appointments, or lived alone without support for post-discharge care. The variables that predicted readmission weren’t the same variables that caused it.
The model never learned that distinction, because it was never designed to. It saw correlations and assumed they were handles you could pull. They weren’t. They were shadows cast by deeper causes the model couldn’t see.
A model that predicts readmission with 94% accuracy told the team exactly who would come back. It told them nothing about why, or what to do about it.
If you’ve built a model that predicts well but fails when turned into a decision, you’ve already felt this problem. You just didn’t have a name for it.
The name is confounding. The solution is causal inference. And in 2026, the tools to do it properly are finally mature enough for any data scientist to use.
The Question Your Model Can’t Answer
Machine Learning (ML) is built for one job: find patterns in data and predict outcomes. This is associational reasoning. It works brilliantly for spam filters, image classifiers, and recommendation engines. Pattern in, pattern out.
But business stakeholders rarely ask “what will happen next?” They ask “what should we do?” Should we raise the price? Should we change the treatment protocol? Should we offer this customer a discount?
These are causal questions. And answering them with associational models is like using a thermometer to set the thermostat. The thermometer tells you the temperature. It doesn’t tell you what would happen if you changed the dial.
Judea Pearl, the computer scientist who won the 2011 Turing Award for his work on probabilistic and causal reasoning, organized this gap into what he calls the Ladder of Causation. The ladder has three rungs, and the distance between them explains why so many ML projects fail when they move from prediction to action.
Level 1: Association (“Seeing”). “Patients who take Drug X have better outcomes.” This is pure correlation. Every standard ML model operates here. It answers: what patterns exist in the data?
Level 2: Intervention (“Doing”). “If we give Drug X to this patient, will their outcome improve?” This requires understanding what happens when you change something. Pearl formalizes this with the do-operator: P(Y | do(X)). No amount of observational data, by itself, can answer this.
Level 3: Counterfactual (“Imagining”). “This patient took Drug X and recovered. Would they have recovered without it?” This requires reasoning about realities that never occurred. It’s the highest form of causal thinking.
Here’s what each level looks like in practice. A Level 1 model at an e-commerce company says: “Users who viewed product pages for trainers also bought protein bars.” Useful for recommendations. A Level 2 question from the same company: “If we send a 20% discount on protein bars to users who viewed trainers, will purchases increase?” That requires knowing whether the discount causes purchases or whether the same users would have bought anyway. A Level 3 question: “This user bought protein bars after receiving the discount. Would they have bought them without it?” That requires reasoning about a world that didn’t happen.
Most ML operates on Level 1. Most business decisions require Level 2 or 3. That gap is where wrong decisions are made at scale.
When Accuracy Lies
The gap between prediction and causation is not theoretical. It has a body count.
Consider the kidney stone study from 1986. Researchers compared two treatments for renal calculi. Treatment A outperformed Treatment B for small stones. Treatment A also outperformed Treatment B for large stones. But when the data was pooled across both groups, Treatment B appeared superior.
This is Simpson’s paradox. The lurking variable was stone severity. Doctors had prescribed Treatment A for harder cases. Pooling the data erased that context, flipping the apparent conclusion. A prediction model trained on the pooled data would confidently recommend Treatment B. It would be wrong.
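The reversal is easy to check by hand. The sketch below uses the success counts commonly quoted for the Charig et al. (1986) study (treat the exact figures as an assumption to verify against the paper) and compares three quantities: the pooled success rate, the per-stratum rate, and a rate reweighted by the overall share of small and large stones. The function names are illustrative only.

```python
# (successes, patients) by treatment and stone size.
# Assumption: commonly cited figures for Charig et al. (1986).
COUNTS = {
    ("A", "small"): (81, 87),
    ("A", "large"): (192, 263),
    ("B", "small"): (234, 270),
    ("B", "large"): (55, 80),
}
STRATA = ("small", "large")
TREATMENTS = ("A", "B")


def stratum_rate(treatment, stratum):
    """Success rate for one treatment within one stone-size group."""
    successes, patients = COUNTS[(treatment, stratum)]
    return successes / patients


def pooled_rate(treatment):
    """Success rate after pooling both stone sizes (severity ignored)."""
    successes = sum(COUNTS[(treatment, s)][0] for s in STRATA)
    patients = sum(COUNTS[(treatment, s)][1] for s in STRATA)
    return successes / patients


def adjusted_rate(treatment):
    """Per-stratum rates reweighted by each stratum's share of all patients."""
    total = sum(patients for _, patients in COUNTS.values())
    rate = 0.0
    for s in STRATA:
        stratum_size = sum(COUNTS[(t, s)][1] for t in TREATMENTS)
        rate += stratum_rate(treatment, s) * stratum_size / total
    return rate


for t in TREATMENTS:
    print(t, "pooled:", round(pooled_rate(t), 3), "adjusted:", round(adjusted_rate(t), 3))
```

With these counts, Treatment B looks better when pooled, while Treatment A is better within each stone size and after reweighting. The reweighted figure only deserves a causal reading if stone size is the only confounder, which is itself an assumption.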
That’s a statistics textbook example. The hormone therapy case drew blood.
For decades, observational studies suggested that postmenopausal Hormone Replacement Therapy (HRT) reduced the risk of coronary heart disease. The evidence looked solid. Millions of women were prescribed HRT based on these findings. Then the Women’s Health Initiative, a large-scale randomized controlled trial published in 2002, revealed the opposite: HRT actually increased cardiovascular risk.
The confound was wealth. Healthier, wealthier women were more likely to both choose HRT and have lower heart disease rates. The observational models captured this correlation and mistook it for a treatment effect. A 2019 paper by Miguel Hernán in CHANCE used this exact case to argue that data science needs “a second chance to get causal inference right.”
How common is this mistake? A 2021 scoping review in the European Journal of Epidemiology examined observational studies and found that 26% of them conflated prediction with causal claims. One in four published papers, in medical journals, where people make life-and-death decisions based on the results.
The core structure behind both cases is the confounding fork: a hidden common cause (Z) that influences both the treatment (X) and the outcome (Y), creating a spurious association between them. Stone severity drove both treatment choice and outcomes. Wealth drove both HRT adoption and heart health. In each case, the correlation between X and Y was real in the data. But acting on it as if X caused Y produced the wrong intervention.
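A small simulation makes the fork tangible. In the sketch below, Z drives both X and Y while X has no effect on Y at all; the coefficients, sample size, and function names are arbitrary choices for illustration. A naive regression of Y on X still finds a strong slope, and adding Z as a covariate drives the X coefficient back toward zero.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)


def simulate_fork(n=20000, seed=0):
    """Z -> X and Z -> Y, with no arrow from X to Y."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [zi + rng.gauss(0, 1) for zi in z]
    y = [2 * zi + rng.gauss(0, 1) for zi in z]  # X never appears here
    return z, x, y


def naive_slope(x, y):
    """Slope from regressing Y on X alone."""
    return cov(x, y) / cov(x, x)


def adjusted_slope(x, y, z):
    """Coefficient on X from an ordinary least squares fit of Y on (X, Z)."""
    sxx, szz, sxz = cov(x, x), cov(z, z), cov(x, z)
    sxy, szy = cov(x, y), cov(z, y)
    return (sxy * szz - sxz * szy) / (sxx * szz - sxz ** 2)


z, x, y = simulate_fork()
print("naive slope:", round(naive_slope(x, y), 2))
print("slope adjusted for Z:", round(adjusted_slope(x, y, z), 2))
```

The adjustment only works here because Z is observed. In the readmission and HRT stories the confounder was not in the model at all, which is why no amount of accuracy tuning could have surfaced it.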

The lesson is uncomfortable: a model can have high accuracy, pass every validation check, and still give recommendations that make outcomes worse. Accuracy measures how well a model captures existing patterns. It says nothing about whether those patterns survive when you intervene.
The Toolkit Caught Up
For years, causal inference lived behind a wall of econometrics textbooks, custom R scripts, and a small circle of specialists. That wall has come down.
Microsoft Research built DoWhy, a Python library that reduces causal analysis to four explicit steps: model your assumptions, identify the causal estimand, estimate the effect, and refute your own result. That fourth step is what separates causal inference from “I ran a regression and it was significant.” DoWhy forces you to try to break your conclusion before you trust it.
Alongside DoWhy sits EconML, another Microsoft Research library that provides the estimation algorithms: Double Machine Learning (DML), causal forests, instrumental variable methods, and doubly robust estimators. Together, they form the PyWhy project, which is quickly becoming the standard causal analysis stack in Python.
The market signals align. Fortune Business Insights valued the global Causal Artificial Intelligence (AI) market at $81.4 billion in 2025, projecting $116 billion for 2026 (a 42.5% Compound Annual Growth Rate, or CAGR). An additional 25% of organizations plan to adopt causal AI by 2026, which would bring total adoption among AI-driven organizations to nearly 70%.
Uber built CausalML for uplift modeling and treatment effect estimation. Netflix has published research on causal bandits for content recommendations. Amazon’s AWS team uses DoWhy for root cause analysis in microservice architectures, diagnosing why latency spikes happen rather than just predicting when they will. These aren’t academic experiments. They’re production systems operating at scale.
The practical barrier was expertise. You needed to understand structural causal models, the backdoor criterion, and how to derive estimands by hand. DoWhy automates the identification step. You draw the DAG (encoding your domain knowledge), and the library determines which statistical estimand answers your causal question. That’s the part that used to take a PhD-level methods course to do manually.
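For the simplest case, where a set of observed variables Z blocks every backdoor path from X to Y, the estimand that identification arrives at is the textbook backdoor adjustment formula. The two expressions below contrast it with ordinary conditioning. This is the standard result from Pearl's framework, shown for orientation; it is not a transcript of DoWhy's output, and it assumes Z is discrete and fully observed.

```latex
% Observational (seeing): Z is weighted by its distribution among units that happened to receive X = x
P(Y \mid X = x) = \sum_{z} P(Y \mid X = x, Z = z)\, P(Z = z \mid X = x)

% Interventional (doing), assuming Z satisfies the backdoor criterion:
% Z is weighted by its distribution in the whole population
P(Y \mid \mathrm{do}(X = x)) = \sum_{z} P(Y \mid X = x, Z = z)\, P(Z = z)
```

The only difference is the weight on each stratum of Z. When treatment choice depends on Z, the two weights diverge, and so do the answers.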
Where Causal Methods Break Down
A fair objection: most ML applications work fine without causal reasoning. Recommendation systems, image classification, fraud detection, search ranking. Pattern in, pattern out. These problems genuinely don’t need causal structure, and adding it would be over-engineering.
Causal inference also carries a cost that prediction doesn’t. It requires assumptions. You must specify a Directed Acyclic Graph (DAG), a diagram encoding which variables cause which. If your DAG is wrong (a missing confounder, a reversed arrow), your causal estimate can be worse than a naive correlation. The garbage-in-garbage-out problem doesn’t disappear; it moves from the data to the assumptions.
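Writing the DAG down as data is a cheap way to make those assumptions inspectable. The sketch below, using only the standard library, stores a graph as (cause, effect) pairs, checks that it is acyclic, and lists the variables that reach both treatment and outcome through paths that do not pass through the treatment. The HRT graph and the `physician_advice` node are hypothetical illustrations, and the common-cause search is a simplified heuristic, not the full backdoor criterion.

```python
from graphlib import CycleError, TopologicalSorter


def parents_map(edges):
    """Map each node to the set of its direct causes."""
    parents = {}
    for cause, effect in edges:
        parents.setdefault(cause, set())
        parents.setdefault(effect, set()).add(cause)
    return parents


def is_acyclic(edges):
    """True if the (cause, effect) pairs form a valid DAG."""
    try:
        list(TopologicalSorter(parents_map(edges)).static_order())
        return True
    except CycleError:
        return False


def ancestors(node, parents):
    """All direct and indirect causes of a node."""
    seen, stack = set(), list(parents.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def common_causes(treatment, outcome, edges):
    """Variables that influence both treatment and outcome, ignoring paths through the treatment."""
    pruned = [(c, e) for c, e in edges if c != treatment]
    parents = parents_map(pruned)
    return ancestors(treatment, parents) & ancestors(outcome, parents)


# Hypothetical DAG for the HRT story
hrt_edges = [
    ("wealth", "hrt"),
    ("wealth", "heart_disease"),
    ("hrt", "heart_disease"),
    ("physician_advice", "hrt"),  # affects treatment only, so not a confounder
]
print(is_acyclic(hrt_edges), common_causes("hrt", "heart_disease", hrt_edges))
```

A check like this catches structural slips such as cycles, but it cannot tell you that a confounder is missing from the list of edges. That part still rests on domain knowledge.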
The argument here is not that causal inference should replace prediction. It’s that causal inference must complement prediction whenever you move from pattern recognition to decision-making. The failure mode is not “ML doesn’t work.” The failure mode is “ML works for prediction, then gets misapplied to a causal question.” Knowing which question you’re answering is the skill that separates a model builder from a decision scientist.
Does Your Problem Need Causal Inference?
The 5-Question Diagnostic
Before you pick a method, run your problem through these five questions. If you answer “yes” to two or more, you need causal inference. If your answer to question 1 is “a decision,” that alone is enough: you need causal inference.
1. Are you making a decision or a prediction?
   Predicting who will churn = standard ML. Deciding which intervention prevents churn = causal inference.
2. Would acting on your model change the underlying relationships?
   If your intervention alters the very patterns the model learned, your correlations will shift post-deployment. This is a causal problem.
3. Could a confounding variable explain your result?
   If two variables (treatment and outcome) share a common cause, your observed association may vanish, reverse, or amplify once the confounder is controlled for. Think: the HRT case.
4. Do you need to answer “what if?” or “why?”
   “What if we doubled the price?” is a Level 2 (intervention) question. “Why did this customer leave?” is a Level 3 (counterfactual) question. Both require causal reasoning.
5. Is there selection bias in how treatments were assigned?
   If doctors prescribe Drug A to sicker patients, or if users self-select into a feature, comparing raw outcomes without adjustment is meaningless.
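The scoring rule for the diagnostic is simple enough to encode as a checklist helper. This is a minimal sketch of the rule as stated above (two or more “yes” answers, or question 1 on its own); the function and argument names are made up for illustration.

```python
def needs_causal_inference(
    making_a_decision,
    acting_changes_relationships,
    confounder_plausible,
    asks_what_if_or_why,
    selection_bias_in_treatment,
):
    """Apply the five-question diagnostic to a set of yes/no answers."""
    answers = [
        making_a_decision,
        acting_changes_relationships,
        confounder_plausible,
        asks_what_if_or_why,
        selection_bias_in_treatment,
    ]
    # Question 1 alone is decisive; otherwise two or more "yes" answers.
    return making_a_decision or sum(answers) >= 2


# Churn prediction used only for forecasting: no causal machinery needed.
print(needs_causal_inference(False, False, True, False, False))
# Choosing which retention offer to send: a decision, so causal inference applies.
print(needs_causal_inference(True, False, False, False, False))
```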

Which Causal Method Fits Your Problem?
Once you know you need causal inference, the next question is which method. This matrix maps common situations to the right tool.

If you’re unsure where to start: begin with a DAG. Draw the causal relationships you believe exist between your treatment, outcome, and potential confounders. Even a rough DAG makes your assumptions explicit, which is the single most important step. You can refine the estimation method afterward.
A DoWhy Workflow in Practice
Here’s a concrete example: measuring whether a customer loyalty program actually increases annual spending (versus loyal customers who would spend more anyway self-selecting into the program).
```python
# Install: pip install dowhy
import dowhy
from dowhy import CausalModel

# Assumes `df` is a pandas DataFrame with a binary loyalty_program column
# plus the outcome and confounder columns named below.

# Step 1: MODEL your causal assumptions as a DAG
# Income affects both loyalty signup AND spending (confounder)
model = CausalModel(
    data=df,
    treatment="loyalty_program",
    outcome="annual_spending",
    common_causes=["income", "prior_purchases", "age"],
)

# Step 2: IDENTIFY the causal estimand
# DoWhy determines what statistical quantity answers your question
identified = model.identify_effect()
# Target quantity: E[annual_spending | do(loyalty_program=1)]
#                - E[annual_spending | do(loyalty_program=0)]

# Step 3: ESTIMATE the causal effect
estimate = model.estimate_effect(
    identified,
    method_name="backdoor.propensity_score_matching",
)
print(f"Causal effect: ${estimate.value:.2f}/year")

# Step 4: REFUTE your own result
# Add a random variable that shouldn't affect the estimate
refutation = model.refute_estimate(
    identified, estimate,
    method_name="random_common_cause",
)
print(refutation)
# If the estimate barely moves under a random confounder, it passes this check
```
Four steps. Model your assumptions, identify the estimand, estimate the effect, then try to break your own result. The DoWhy documentation provides full tutorials on integrating EconML estimators for more advanced use cases (DML, causal forests, instrumental variables).
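The workflow above assumes a DataFrame named `df` already exists. To experiment without real data, one option is a synthetic dataset where the true effect is known in advance. The generator below is a hypothetical sketch: the column names match the example, but every coefficient, the $200 true effect, and the function name are invented for illustration.

```python
import numpy as np
import pandas as pd


def make_loyalty_data(n=5000, true_effect=200.0, seed=0):
    """Synthetic customers where income and prior purchases drive both signup and spending."""
    rng = np.random.default_rng(seed)
    income = rng.normal(60_000, 15_000, n)
    prior_purchases = rng.poisson(5, n)
    age = rng.integers(18, 80, n)

    # Self-selection: wealthier, more active customers are more likely to join
    logit = (income - 60_000) / 15_000 + 0.2 * (prior_purchases - 5)
    loyalty_program = rng.random(n) < 1 / (1 + np.exp(-logit))

    # Spending depends on the confounders plus a fixed program effect
    annual_spending = (
        1_000
        + 0.02 * income
        + 40 * prior_purchases
        + 2 * age
        + true_effect * loyalty_program
        + rng.normal(0, 100, n)
    )
    return pd.DataFrame(
        {
            "income": income,
            "prior_purchases": prior_purchases,
            "age": age,
            "loyalty_program": loyalty_program,
            "annual_spending": annual_spending,
        }
    )


df = make_loyalty_data()
members = df.loc[df["loyalty_program"], "annual_spending"].mean()
non_members = df.loc[~df["loyalty_program"], "annual_spending"].mean()
print(f"Naive difference in means: ${members - non_members:.2f}")
```

Because the true effect is fixed by construction, the naive difference in means overshoots it, and a causal estimate that adjusts for the listed confounders should land much closer to the chosen value.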
The refutation step deserves emphasis. In standard ML, you validate with held-out test sets. In causal inference, you validate by trying to destroy your own estimate: adding random confounders, using placebo treatments, running the analysis on data subsets. If the effect survives, you have something real. If it collapses, you’ve saved yourself from a costly wrong decision.
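The logic of a placebo test can be shown without any causal library. In the hand-rolled sketch below (not DoWhy's implementation), a confounder-adjusted regression recovers a known effect from simulated data; the treatment column is then shuffled so it can no longer cause anything, and the same estimator is rerun. All numbers and function names are assumptions chosen for illustration.

```python
import math
import random


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)


def adjusted_effect(t, y, z):
    """Coefficient on treatment t from a least squares fit of y on (t, z)."""
    stt, szz, stz = cov(t, t), cov(z, z), cov(t, z)
    sty, szy = cov(t, y), cov(z, y)
    return (sty * szz - stz * szy) / (stt * szz - stz ** 2)


def simulate(n=10000, true_effect=100.0, seed=1):
    """Confounder z raises both the chance of treatment and the outcome."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    t = [1.0 if rng.random() < 1 / (1 + math.exp(-zi)) else 0.0 for zi in z]
    y = [true_effect * ti + 50 * zi + rng.gauss(0, 20) for ti, zi in zip(t, z)]
    return t, y, z


def placebo_effect(t, y, z, seed=2):
    """Re-estimate after shuffling treatment, which breaks any real causal link."""
    fake_t = list(t)
    random.Random(seed).shuffle(fake_t)
    return adjusted_effect(fake_t, y, z)


t, y, z = simulate()
print("estimated effect:", round(adjusted_effect(t, y, z), 1))
print("placebo effect:", round(placebo_effect(t, y, z), 1))
```

A placebo estimate near zero says the estimator is not manufacturing effects out of noise. It does not prove the original estimate is right, since an unobserved confounder would pass this check untouched.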
If your model’s recommendations would change the relationships it learned from, you’ve left prediction territory. Welcome to causation.
What Changes Now
The convergence is already visible. Tech companies are hiring for causal reasoning: Microsoft built the entire PyWhy stack, Uber released CausalML, Netflix published research on causal inference in production. The skillset is no longer confined to economics PhD programs and epidemiology departments. It’s entering production ML teams.
Universities are adapting. Hernán’s classification of data science tasks into Description, Prediction, and Causal Inference (published by the Harvard School of Public Health) is becoming a standard pedagogical framework. The question is no longer “should data scientists learn causal inference?” It’s “how quickly can they?”
For the individual practitioner, the return on learning causal methods is asymmetric. The data scientist who can answer “what will happen?” is valuable. The one who can answer “what should we do?” (and demonstrate why the answer is robust) commands a different kind of trust in the room. That trust translates directly into influence over decisions, resource allocation, and strategy.
The learning curve is real but shorter than it looks. If you understand conditional probability and have built regression models, you already have 60% of the foundation. The remaining 40% is learning to think in graphs (DAGs), understanding the difference between conditioning and intervening, and knowing when to reach for which estimator. The PyWhy documentation, Brady Neal’s free online course on causal inference, and Pearl’s accessible The Book of Why cover that gap in weeks, not years.
Remember the health-tech company from the opening? After the readmission spike, they rebuilt their analysis using DoWhy. They drew a DAG, identified that socioeconomic factors were confounders (not causes) of readmission, and isolated the actual causal drivers: medication adherence and follow-up appointment access. They redesigned their intervention around those two levers.
Readmission rates dropped 18%.
The model’s accuracy didn’t change. What changed was the question it answered.
The next time a stakeholder asks “what should we do?”, you have two options: hand them a correlation and hope it survives contact with reality, or hand them a causal estimate with a refutation report showing exactly how hard you tried to break it. The tools exist. The math is settled. The code is four calls.
The only question left is whether you’ll keep predicting, or start causing.
References
- Pearl, J. & Mackenzie, D. (2018). The Book of Why: The New Science of Cause and Effect. Basic Books.
- Pearl, J. & Bareinboim, E. (2022). “On Pearl’s Hierarchy and the Foundations of Causal Inference.” Technical Report R-60, UCLA Cognitive Systems Laboratory.
- Hernán, M.A. (2019). “A Second Chance to Get Causal Inference Right: A Classification of Data Science Tasks.” CHANCE, 32(1), 42-49.
- Luijken, K. et al. (2021). “Prediction or causality? A scoping review of their conflation within current observational research.” European Journal of Epidemiology, 37, 35-46.
- Hernán, M.A. & Robins, J.M. (2020). “Causal inference and counterfactual prediction in machine learning for actionable healthcare.” Nature Machine Intelligence, 2, 369-375.
- Sharma, A. & Kiciman, E. (2020). DoWhy: An End-to-End Library for Causal Inference. Microsoft Research / PyWhy.
- Battocchi, K. et al. (2019). EconML: A Python Package for ML-Based Heterogeneous Treatment Effect Estimation. Microsoft Research / ALICE.
- Fortune Business Insights. (2025). “Causal AI Market Size, Industry Share | Forecast, 2026-2034.”
- Charig, C.R. et al. (1986). “Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy.” BMJ, 292(6524), 879-882.
- PyWhy Contributors. (2024). “Tutorial on Causal Inference and its Connections to Machine Learning (Using DoWhy+EconML).” PyWhy Documentation.

