Limitations of Machine Learning
As a data scientist in today’s digital age, you must be equipped to answer a variety of questions that go far beyond simple pattern recognition. Typical machine learning is built on association; it seeks to find patterns in existing data to predict future observations under the assumption that the underlying system remains constant. When you train a model to predict house prices, you are asking the algorithm to find the most likely price given a set of features.
However, causal analysis introduces a “what if” component. It goes beyond observation to ask how the system would react if we actively changed a variable. This is the difference between noticing that people who buy expensive lattes are also likely to buy sports cars and understanding whether lowering the price of that coffee will actually cause an increase in car sales. In the world of causal inference, we are essentially trying to learn the underlying laws of a business or social system, allowing us to predict the outcomes of actions we haven’t taken yet.
Causal analysis is essential in a variety of fields when we need to move beyond observing patterns to making decisions, particularly in areas like healthcare, marketing, and public policy. Consider a medical researcher evaluating a new blood pressure medication and its effect on heart attack severity. With historical data, you might see that patients taking the medication actually have more severe heart attacks. A typical machine learning (ML) model would suggest the drug is harmful. However, this is likely due to confounding: doctors only prescribe the medication to patients who already have poorer health. To find the truth, we must isolate the drug’s actual impact from the noise of the patients’ existing conditions.
In this article, I’ll introduce some of the important concepts and tools in causal ML in an accessible way. I’ll only use libraries that manage data, calculate probabilities, and estimate regression parameters. This article is not a tutorial, but a starting point for those interested in but intimidated by causal inference methods. I was inspired by the online book Causal Inference for the Brave and True by Matheus Facure Alves. Note: For those unfamiliar with probability, E[X] refers to the average value that a random variable (quantity) X takes.
The Potential Outcomes Framework
When we start a causal study, the questions we ask are much more specific than loss minimization or prediction accuracy. We typically start with the Average Treatment Effect (ATE), which tells us the mean impact of an intervention or action across an entire population.
In our medical example, we want to know the difference in heart attack severity if the entire population took the drug versus if the entire population did not. To define this mathematically, we use the Potential Outcomes Framework. First, let’s define a few variables:
Y: The Outcome (e.g., a heart attack severity score from 0 to 100).
T: The Treatment indicator. This is a binary “switch”: T = 1 means the patient took the drug, and T = 0 means the patient did not take the drug (the Control).
Y(1): The outcome we would see if the patient was treated.
Y(0): The outcome we would see if the patient was not treated.
The theoretical ATE is the expected difference between these two potential outcomes across the entire population:
ATE = E[Y(1) - Y(0)]
To deal with the dilemma of unobserved outcomes, researchers use the Potential Outcomes Framework as a conceptual guide. In this framework, we assume that for every individual, there exist two “potential” outcomes: Y(1) and Y(0). For any given individual, we only ever observe one of these two values, which is known as the Fundamental Problem of Causal Inference.
If a patient takes the medication (T = 1), we see their factual outcome, Y(1). Their outcome without the medication, Y(0), remains a counterfactual: a state of the world that could have existed but didn’t.
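As a minimal sketch of this idea, the snippet below simulates a handful of hypothetical patients for whom both potential outcomes are known, then hides one of them the way real data would. The severity range, the constant effect of -10, and all variable names are assumptions made for illustration only.

```python
import random

random.seed(0)

# Hypothetical patients: only in a simulation can we write down BOTH potential outcomes.
patients = []
for _ in range(5):
    y0 = random.randint(40, 70)   # severity without the drug, Y(0)
    y1 = y0 - 10                  # severity with the drug, Y(1); assumed effect of -10
    t = random.randint(0, 1)      # treatment actually received
    patients.append({"T": t, "Y0": y0, "Y1": y1})

for p in patients:
    # The factual outcome is observed; the other one is the counterfactual.
    p["Y_observed"] = p["Y1"] if p["T"] == 1 else p["Y0"]
    p["Y_counterfactual"] = p["Y0"] if p["T"] == 1 else p["Y1"]

# With both potential outcomes in hand, the true ATE is an average of differences.
true_ate = sum(p["Y1"] - p["Y0"] for p in patients) / len(patients)
print(true_ate)  # -10.0 by construction

# What a real dataset would contain: only T and the observed outcome.
observed_data = [(p["T"], p["Y_observed"]) for p in patients]
print(observed_data)
```

The individual differences Y(1) - Y(0) can be computed here only because the simulation stores both values; `observed_data` is all an analyst would actually receive.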
ATE in a Perfect World
Because the individual treatment effect is the difference between these two values, it remains hidden from us. This shifts the entire goal of causal estimation away from the individual and toward the group. Because we cannot subtract a counterfactual from a factual for one person, we must find clever ways to compare groups of people.
If the group receiving the treatment is statistically identical to the group that is not, we can use the average observed outcome of one group to stand in for the missing counterfactual of the other. This allows us to estimate the Average Treatment Effect by calculating the difference between the mean outcome of the treated group and the mean outcome of the control group:
ATE = E[Y|T=1] - E[Y|T=0]
Suppose that for those who took the drug, we observed a mean heart attack severity of 56/100, compared to 40/100 for those who did not. If we attempt to estimate the causal effect by taking a simple difference in means, the data suggests that taking the drug led to a 16-point increase in severity.
E[Y|T=1] = 56, E[Y|T=0] = 40 -> ATE_BIASED = 16
Unless this drug is among the most dangerous ever created, there is likely another mechanism at play. This discrepancy arises because we can only interpret a simple difference in means as the Average Treatment Effect if the treatment was assigned through a Randomized Controlled Trial (RCT), which ensures completely random assignment to the treatment groups. Without randomization, the treated and control groups are not exchangeable and differ in ways that make a direct comparison unreliable.
Randomization
The reason an RCT is the default method for calculating the ATE is that it helps eliminate Selection Bias. In our medical example, the 16-point harm we observed likely occurred because doctors gave the drug to the highest-risk patients. In this scenario, the treated group was already predisposed to higher severity scores before they ever took the pill. When we use an RCT, we remove the human element of choice. With this randomized selection, we ensure that high-risk and low-risk patients are, on average, distributed evenly between both groups.
Mathematically, randomization ensures that the treatment assignment is independent of the potential outcomes: (Y(0), Y(1)) is independent of T, so E[Y(1)|T=1] = E[Y(1)] and E[Y(0)|T=0] = E[Y(0)].
Now, we can treat the average outcome of the treated group as an unbiased proxy for what would have happened if the entire population had been treated, and likewise for the control group. Because the “Treated” and “Control” groups start as statistical clones of one another, any difference we see at the end of the study, beyond sampling noise, must be caused by the drug itself.
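To make the contrast concrete, here is a small simulation sketch comparing the same difference in means under two assignment rules: one where sicker patients are more likely to be treated, and one decided by a coin flip. The sample size, the risk model, and the true effect of -10 are assumptions chosen for illustration, not results from a real study.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
TRUE_EFFECT = -10.0  # assumed for this sketch

# Baseline risk: higher values mean a sicker patient and a more severe heart attack.
risk = rng.normal(0.0, 1.0, n)
y0 = 50 + 12 * risk + rng.normal(0, 5, n)  # potential outcome without the drug
y1 = y0 + TRUE_EFFECT                      # potential outcome with the drug

def difference_in_means(y, t):
    """Estimate E[Y | T=1] - E[Y | T=0] from observed data."""
    return y[t == 1].mean() - y[t == 0].mean()

# Observational assignment: sicker patients are more likely to get the drug.
p_treat = 1 / (1 + np.exp(-2 * risk))
t_obs = rng.binomial(1, p_treat)
y_obs = np.where(t_obs == 1, y1, y0)

# Randomized assignment: a coin flip that ignores risk entirely.
t_rct = rng.binomial(1, 0.5, n)
y_rct = np.where(t_rct == 1, y1, y0)

naive_estimate = difference_in_means(y_obs, t_obs)
rct_estimate = difference_in_means(y_rct, t_rct)
print(f"observational: {naive_estimate:.2f}, randomized: {rct_estimate:.2f}")
```

Under these assumptions the observational estimate comes out positive, making a beneficial drug look harmful, while the randomized estimate lands close to the assumed effect.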
Observational Data and Confounders
In the real world, we are often forced to work with observational data. In these situations, the simple difference in means fails us because of the presence of confounders. A confounder is a variable that influences both the treatment and the outcome, creating a “backdoor path” that allows a non-causal correlation to flow between them.
In order to visualize these hidden relationships, causal researchers use Directed Acyclic Graphs (DAGs). A DAG is a specialized graph where variables are represented as nodes and causal relationships are represented as arrows. Directed means that the arrows have a specific direction, indicating a one-way causal flow from a cause to an effect. Acyclic means the graph contains no cycles; you cannot follow a sequence of arrows and end up back at the first variable, primarily because moving from one node to the next should represent a lapse in time. A confounder reveals itself in a DAG through its directed connection to both the treatment and the outcome. In our example, initial health has arrows pointing into both the drug and heart attack severity.
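A DAG this small can also be written down directly in code. The sketch below stores the running example as an adjacency mapping, checks that it is acyclic, and lists the variables that point into both the treatment and the outcome. The node names and helper functions are hypothetical, and the last helper only finds direct common causes; it is not a full backdoor-path search.

```python
# A DAG as an adjacency mapping: each key points to the variables it directly causes.
# Node names are illustrative labels for the running example.
dag = {
    "initial_health": ["drug", "severity"],
    "drug": ["severity"],
    "severity": [],
}

def parents(graph, node):
    """Return the set of direct causes of `node`."""
    return {src for src, targets in graph.items() if node in targets}

def is_acyclic(graph):
    """Depth-first search; a cycle exists if we re-enter a node still being explored."""
    state = {}  # node -> "visiting" or "done"

    def visit(node):
        if state.get(node) == "done":
            return True
        if state.get(node) == "visiting":
            return False
        state[node] = "visiting"
        ok = all(visit(child) for child in graph.get(node, []))
        state[node] = "done"
        return ok

    return all(visit(node) for node in graph)

def direct_common_causes(graph, treatment, outcome):
    """Variables with an arrow into both the treatment and the outcome."""
    return parents(graph, treatment) & parents(graph, outcome)

print(is_acyclic(dag))                                # True
print(direct_common_causes(dag, "drug", "severity"))  # {'initial_health'}
```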
Once we have identified the confounders through our DAG, the next step is to mathematically account for them. If we want to isolate the true effect of the drug, we need to compare patients who are similar in every way except for whether they took the medicine. In causal analysis, the most important tool for this is Linear Regression. By including the confounder as an independent variable, the model calculates the relationship while holding the initial health of the patient constant. For our example, I generated a mock dataset where treatment assignment was dependent on initial health (I.H). In the data-generating process, both the probability of receiving the drug and the severity depend on the initial health score.
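A data-generating process of this kind can be sketched as follows. The sample size, coefficients, noise level, and variable names are assumptions chosen for illustration, so the numbers it prints will not match the figures quoted in this article; only the true effect of -10 is taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
TRUE_EFFECT = -10.0  # the true effect stated in the article

# Initial health on a 0-100 scale; lower means sicker (an assumed convention).
initial_health = rng.uniform(0, 100, n)

# Sicker patients are more likely to be prescribed the drug.
p_drug = 1 / (1 + np.exp((initial_health - 50) / 15))
drug = rng.binomial(1, p_drug)

# Severity falls with better health, shifts by the true effect if treated, plus noise.
severity = 70 - 0.4 * initial_health + TRUE_EFFECT * drug + rng.normal(0, 5, n)

# Raw comparison that ignores the confounder.
naive_diff = severity[drug == 1].mean() - severity[drug == 0].mean()
print(f"naive difference in means: {naive_diff:.2f}")
```

Because initial health drives both `drug` and `severity`, the naive difference is pulled far from -10 and, under these assumed coefficients, even changes sign.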

In the raw comparison, individuals who received the drug had an average severity 3.47 points higher than those who did not. To find the truth, we run an OLS (Ordinary Least Squares) multiple linear regression model to adjust for the participants’ initial health score.
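The adjustment step can be sketched with nothing more than NumPy's least-squares solver. The simulated data below uses assumed coefficients and a fixed seed, so it illustrates the mechanics rather than reproducing the article's dataset or its exact estimates.

```python
import numpy as np

# Simulate a small confounded dataset (all coefficients are assumptions).
rng = np.random.default_rng(0)
n = 5_000
initial_health = rng.uniform(0, 100, n)
drug = rng.binomial(1, 1 / (1 + np.exp((initial_health - 50) / 15)))
severity = 70 - 0.4 * initial_health - 10.0 * drug + rng.normal(0, 5, n)

# Design matrix: intercept, treatment indicator, and the confounder.
X = np.column_stack([np.ones(n), drug, initial_health])

# Ordinary least squares: minimise ||X @ beta - severity||^2.
beta, *_ = np.linalg.lstsq(X, severity, rcond=None)
intercept, drug_coef, health_coef = beta

print(f"drug coefficient (adjusted effect): {drug_coef:.2f}")
print(f"initial health coefficient:         {health_coef:.2f}")
```

Including `initial_health` as a column means the drug coefficient is estimated while comparing patients of equal initial health, which is what recovers a value near the assumed effect of -10.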

The most important finding here is the coefficient of the treatment variable (drug). While the raw data suggested the drug was harmful, our coefficient is roughly -9.89. This implies that when we control for the confounder of initial health, taking the drug actually decreases heart attack severity by nearly 10 points. This is very close to our true effect, which was a decrease of exactly 10 points!
This result is more in line with our expectations, and that is because we eliminated a large source of selection bias by controlling for confounders. The great thing about linear regression in this context is that the setup is similar to that of a typical regression problem. Transformations can be applied, diagnostic plots can be produced, and slopes can be interpreted as usual. However, because we are including confounders in our model, their effect on the outcome will not be absorbed into the treatment coefficient, something known as de-biasing or adjusting, as previously mentioned.
Matching and Propensity Scoring
While multiple linear regression is a powerful tool for de-biasing, it relies heavily on the assumption that the relationship between your confounders and the outcome is linear. In many real-world situations, your treated and control groups might be so fundamentally different that a regression model is forced to guess results in regions where it has no actual data.
To solve this, researchers often turn to Matching, a technique that shifts the focus from mathematical adjustment to data restructuring. Instead of using a formula to hold health constant, matching searches the control group for a “twin” for every treated individual. When we pair a patient who took the drug (T = 1) with a patient of nearly identical initial health who did not (T = 0), we effectively prune our dataset into a Synthetic RCT.
In this balanced subset, the groups are finally exchangeable, allowing us to compare their outcomes directly to reveal the true Average Treatment Effect (ATE). It is almost as if each pair allows us to observe both the factual and the counterfactual states for a single type of observation. When we think about how to match two entries in a dataset, consider that each entry is represented by a vector in an n-dimensional space, where n is the number of features or confounders.
At first glance, it seems we could simply calculate the distance between these vectors using Euclidean distance. However, the problem with this approach is that all covariates are weighted equally, regardless of their actual causal impact. In high dimensions, a problem known as the curse of dimensionality arises: even an entry’s closest match may still be fundamentally different in the ways that actually matter for the treatment.
In our mock dataset, looking at the participants with the lowest health scores, we see that treated participant 74 and untreated participant 668 have nearly identical initial health scores. Because we are only dealing with one confounder here, these two are ideal candidates to be matched together. However, as dimensionality increases, it becomes impossible to find these matches by just looking at the numbers, and simple Euclidean distance fails to prioritize the variables that actually drive the selection bias.

In practice, this process is most commonly executed as one-to-one matching, where each treated unit is paired with its single closest neighbor in the control group. To ensure these matches are high-quality, we use the Propensity Score: a single number representing the probability that a participant would receive the treatment given their characteristics, P(T = 1 | X). This score collapses our high-dimensional space into a single dimension that specifically reflects the likelihood of treatment given a set of covariates. We then use a k-Nearest Neighbors (k-NN) algorithm to perform a “fuzzy” search on this score.
To prevent poor matches, we can choose a threshold (often called a caliper) to serve as the maximum allowable distance for a match. We can calculate propensity in a number of ways, the most common being logistic regression, but other ML methods capable of outputting probabilities, such as XGBoost or Random Forest, work as well. For this example, I calculated propensities by setting up a logistic regression model that predicts drug participation from just initial health. In practice, you would have more confounders in your model.
As mentioned, the first step of propensity score matching is the calculation of the propensity score. In our example, we only have initial health as a confounder, so that will be the sole covariate in our simple logistic regression.
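One way to sketch this step without a modeling library is to fit the logistic regression by Newton's method in NumPy. The simulated data, the `fit_logistic` helper, and the choice to standardize the covariate are all assumptions made for this illustration; a packaged logistic regression would serve equally well.

```python
import numpy as np

# Simulated treatment assignment that depends on initial health (assumed process).
rng = np.random.default_rng(0)
n = 5_000
initial_health = rng.uniform(0, 100, n)
drug = rng.binomial(1, 1 / (1 + np.exp((initial_health - 50) / 15)))

def fit_logistic(X, y, n_iter=25):
    """Fit logistic regression by Newton's method; returns the coefficient vector."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))
        gradient = X.T @ (y - p)
        hessian = X.T @ (X * (p * (1 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

# Standardise the covariate so the Newton steps are well conditioned.
z = (initial_health - initial_health.mean()) / initial_health.std()
X = np.column_stack([np.ones(n), z])

beta = fit_logistic(X, drug)
propensity = 1 / (1 + np.exp(-X @ beta))  # estimated P(T = 1 | X) per participant

print(f"slope on standardised health: {beta[1]:.2f}")
print(f"mean propensity, health < 20: {propensity[initial_health < 20].mean():.2f}")
print(f"mean propensity, health > 80: {propensity[initial_health > 80].mean():.2f}")
```

With a single covariate the propensity score is just a monotone transformation of initial health, which is why two participants with nearly identical health receive nearly identical scores.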

As expected, participants 74 and 668 were assigned a very similar propensity and would likely be matched. It is also often helpful to generate what is known as a Common Support plot, which displays the density of calculated propensity scores separated by treated and control. Ideally, we want to see as much overlap and symmetry as possible, as that means matching units will be simpler. The plot for our dataset shows that selection bias is present. It is a good exercise to examine the data generation process and determine why.

Although not necessary in the one-dimensional case, we can then use k-NN to match treated with untreated participants based on their propensity score.
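The matching step itself can be sketched as a one-nearest-neighbor search with a caliper. The function name, the default caliper of 0.05, and the tiny hand-made arrays below are hypothetical. Strictly speaking, pairing each treated unit with a control estimates the average effect among the treated, which coincides with the ATE when the effect is the same for everyone.

```python
import numpy as np

def match_and_estimate(score, treated, outcome, caliper=0.05):
    """One-to-one nearest-neighbour matching (with replacement) on a score.

    Each treated unit is paired with the control whose score is closest;
    pairs farther apart than `caliper` are discarded. Returns the mean
    outcome difference across kept pairs and the list of index pairs.
    """
    treated_idx = np.flatnonzero(treated == 1)
    control_idx = np.flatnonzero(treated == 0)
    pairs, diffs = [], []
    for i in treated_idx:
        gaps = np.abs(score[control_idx] - score[i])
        j = control_idx[np.argmin(gaps)]  # the 1-nearest neighbour
        if gaps.min() <= caliper:
            pairs.append((int(i), int(j)))
            diffs.append(outcome[i] - outcome[j])
    return float(np.mean(diffs)), pairs

# Tiny hand-made example (hypothetical numbers).
score = np.array([0.80, 0.78, 0.30, 0.32, 0.95])
treated = np.array([1, 0, 1, 0, 1])
outcome = np.array([40.0, 50.0, 30.0, 41.0, 45.0])

effect, pairs = match_and_estimate(score, treated, outcome)
print(pairs)   # [(0, 1), (2, 3)]; unit 4 has no control within the caliper
print(effect)  # -10.5
```

Matching with replacement keeps the sketch short; a control can serve as the twin for more than one treated unit, which a stricter one-to-one scheme would disallow.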
If you recall from before, our linear regression yielded an ATE of -9.89, compared to our newly calculated value of -10.16. As we increase the complexity and number of covariates in our model, our propensity score matching ATE will likely get closer and closer to the underlying causal effect of -10.
Time-Invariant Effects Using Difference-in-Differences
While matching is excellent for de-biasing based on the variables we can see, it falls short when there are hidden factors, like a patient’s genetic predisposition or a hospital’s specific management style, that we have not recorded in our data. If these unobserved confounders are time-invariant (meaning they stay constant over the study period), we can use Difference-in-Differences (DiD) to cancel them out.
Instead of just comparing the treated group to the control group at a single point in time, DiD looks at two groups over two periods: before and after the treatment. The logic is simple yet elegant: we calculate the change in the control group and assume the treated group would have changed by that same amount if they had not received the treatment (the parallel trends assumption). Any additional change observed in the treated group is attributed to the treatment itself. The equation for the DiD estimator is as follows:
DiD = (E[Y|T=1, Post] - E[Y|T=1, Pre]) - (E[Y|T=0, Post] - E[Y|T=0, Pre])
While this formula may seem intimidating at first glance, it is best read as the difference in changes occurring before and after treatment. For example, imagine two ice cream shops in different towns. Before the weekend, Store A (our treatment group) sells 200 cones, and Store B (our control group) sells 300. On Saturday, a heat wave hits Store A’s town, but not Store B’s. By the end of the day, Store A’s sales jump to 500, while Store B’s sales rise to 400. A simple analysis of Store A would suggest the heat wave caused a +300 increase. However, the control shop (Store B) grew by +100 in the same period without any heat wave, perhaps due to a holiday or general summer weather.
The Difference-in-Differences approach subtracts this natural time trend of +100 from Store A’s total growth. It effectively cancels out any time-invariant confounders, such as the store’s location or its base popularity, that would have otherwise skewed our results. This reveals that the true causal impact of the heat wave was +200 units.
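The ice cream arithmetic can be written out as a tiny helper. The function name is hypothetical; the numbers are the ones from the example above.

```python
def difference_in_differences(treated_pre, treated_post, control_pre, control_post):
    """(change in treated group) minus (change in control group)."""
    return (treated_post - treated_pre) - (control_post - control_pre)

# Ice cream example from the text.
store_a_change = 500 - 200  # treated store: +300
store_b_change = 400 - 300  # control store: +100
heat_wave_effect = difference_in_differences(200, 500, 300, 400)
print(store_a_change, store_b_change, heat_wave_effect)  # 300 100 200
```

Note that Store B starting from a higher baseline (300 versus 200) has no influence on the result: only the changes enter the calculation, which is how fixed differences between the groups drop out.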
A major limitation of basic Difference-in-Differences (DiD) is that it does not account for factors that change over time. While the “change-in-change” logic successfully cancels out static, time-invariant confounders (like someone’s genetic history or a hospital’s geographic location), it remains vulnerable to time-varying confounders. These are factors that shift during the study period and affect the treatment and control groups differently.
In our heart attack study, for instance, even a DiD analysis could be biased if the hospitals administering the drug also underwent significant staffing changes or received upgraded equipment during the “Post” period. If we fail to account for these changing variables, the DiD estimator will incorrectly attribute their impact to the drug itself, leading to a “polluted” causal estimate.
It is important to note that the simple cross-sectional data structure we used for Regression and Matching is insufficient for this method. To calculate a “change in the change,” we need a temporal dimension in our dataset. Specifically, we need a variable indicating whether an observation occurred in the Pre-treatment or Post-treatment period for both the treated and control groups.
To address time-varying confounders, we move beyond simple subtraction and implement DiD within a Multiple Linear Regression framework. This allows us to explicitly “control” for time-varying factors, effectively isolating the treatment effect while holding external shifts constant.
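The regression model is defined as:
Severity = β0 + β1·Drug + β2·Post + β3·(Drug × Post) + β4·QualityOfCare + ε
Here β3, the coefficient on the Drug × Post interaction, is the difference-in-differences estimate.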

For this, a new synthetic dataset is constructed to reflect the required structure. I also added a Quality of Care variable for demonstration purposes. I did not include the full simulation code due to its length, but it essentially modifies the previous logic by duplicating our observations across two distinct time periods.

Now that we have our data in the correct format, we can fit a linear regression model using the specification just mentioned.
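As a simplified sketch of this kind of fit, the snippet below builds a two-period dataset and estimates the interaction coefficient with NumPy. It is not the article's dataset: it leaves out the Quality of Care variable, and the sample size, coefficients, and the hidden `frailty` confounder are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4_000            # patients, each observed in both periods
TRUE_EFFECT = -10.0  # assumed, matching the article's stated true effect

# Unrecorded, time-invariant frailty: raises severity AND the chance of treatment.
frailty = rng.normal(0, 1, n)
drug = rng.binomial(1, 1 / (1 + np.exp(-1.5 * frailty)))

# Long format: one row per patient per period (post = 0 before, 1 after).
drug_long = np.tile(drug, 2)
frailty_long = np.tile(frailty, 2)
post = np.repeat([0, 1], n)

severity = (
    50
    + 6 * frailty_long                # hidden confounder, never given to the model
    + 3 * post                        # common time trend
    + TRUE_EFFECT * drug_long * post  # the drug only acts in the post period
    + rng.normal(0, 5, 2 * n)
)

# severity ~ intercept + drug + post + drug:post
X = np.column_stack([np.ones(2 * n), drug_long, post, drug_long * post])
beta, *_ = np.linalg.lstsq(X, severity, rcond=None)
intercept, drug_coef, post_coef, did_coef = beta

print(f"drug (baseline group gap): {drug_coef:.2f}")
print(f"post (time trend):         {post_coef:.2f}")
print(f"drug:post (DiD estimate):  {did_coef:.2f}")
```

In this sketch the model never sees `frailty`, yet the interaction coefficient still lands near -10, because a confounder that is constant over time is absorbed by the drug main effect. A confounder that changes in the post period for one group only would not cancel this way.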

The R-squared value of 0.324 indicates that the model explains roughly 32.4 percent of the variance in heart attack severity. In causal analysis, this is common, as many unmeasured factors like genetics are treated as noise. The intercept of 48.71 represents the baseline severity for the control group during the pre-treatment period, with the quality of care index at zero. The drug coefficient of 12.75 confirms selection bias, showing the treated group initially had higher severity scores. Additionally, the quality of care coefficient suggests that each unit increase in that index corresponds to a 2.10-point reduction in severity.
The interaction term, drug:post, provides the difference-in-differences estimator, which shows an estimated drug effect of -6.58. This tells us the medication reduced severity after adjusting for group differences and time trends, though the estimate is notably smaller in magnitude than the true effect of -10. This discrepancy occurs because, in the data generation process, the quality of care improved specifically for the treated group during the post-treatment period. Since these two changes occurred simultaneously in the same group, they are strongly correlated, or nearly collinear.
The model essentially faces a mathematical stalemate where it cannot determine whether the improvement came from the drug or the better care, so it splits the credit between them. As with any linear regression, if two variables are perfectly correlated, a model might drop one entirely or produce highly unstable estimates. Still, all variables maintain reported p-values of 0.000, confirming that despite the split credit, the results remain statistically significant. In real data and analysis, we will deal with these kinds of situations, and it is important to know all the tools in your data science shed before you tackle a problem.
Conclusion and Final Thoughts
In this article, we explored the transition from standard ML to the logic of causal inference. We saw through synthetic examples that while simple differences in means can be misleading due to selection bias, methods like linear regression, propensity score matching, and difference-in-differences allow us to strip away confounders and isolate true impact.
Having these tools in our arsenal is not enough. As seen with our final model, even sophisticated methods can run into issues when interventions overlap. While these methods are powerful in adjusting for confounding, they require a deep understanding of their underlying mechanics. Relying on model outputs without acknowledging the reality of collinearity or time-varying factors can lead to misleading conclusions.
At the same time, knowing when and how to apply these tools can be an invaluable skill for any data scientist. In my opinion, one of the best parts of doing statistical programming for causal inference is the fact that most of the methods stem from a few fundamental statistical models, making implementation easier than one might expect.
The real world is undeniably messy and full of data issues, and it is rare that we will observe a perfectly clean causal signal. Causal machine learning is ultimately about exploiting the right data while having the confidence that our variables allow for true adjustment. This article is my first step in the documentation of my causal inference journey, and I plan to release a part two that dives deeper into more topics, including Instrumental Variables (IV), Panel Regression, Double Machine Learning (DML), and Meta-Learners.
Suggested Reading
Facure, Matheus. Causal Inference for the Brave and True. Available at: https://matheusfacure.github.io/python-causality-handbook/

