Correlation vs. Causation: Measuring True Impact with Propensity Score Matching

process in Data Science, especially when we’re running an A/B test to understand the effect of a given variable on those groups.

The problem is that the world is just… well, real. I mean, it is very nice to think of a controlled environment where we can isolate just one variable and measure its effect. But what happens most of the time is that life just runs over everything, and the next thing you know, your boss is asking you to compare the effect of the latest campaign on customers’ spending.

But you never prepared the data for the experiment. All you have is the ongoing data from before and after the campaign.

Enter Propensity Score Matching

In simple terms, Propensity Score Matching (PSM) is a statistical technique used to estimate whether a specific action (a “treatment”) actually caused an outcome.

Because we can’t go back in time and see what would have happened if someone had made a different choice, we find a “twin” in the data, someone who looks almost exactly like them but didn’t take the treatment action, and compare their outcomes instead. Finding these “statistical twins” helps us compare customers fairly, even when we haven’t run a perfectly randomized experiment.

The Problem With the Averages

Simple averages assume the groups were identical to begin with. When you compare a simple average of a treated group to a control group, you are also measuring all the pre-existing differences that led people to choose that treatment in the first place.

Suppose we want to test a new energy gel for runners. If we just compare everyone who used the gel to everyone who didn’t, we’re ignoring important factors like the runners’ levels of experience and knowledge. People who bought the gel might be more experienced, have better shoes, or even train harder and be supervised by a professional. They were already “predisposed” to run faster anyway.

PSM acknowledges these differences and acts like a scout:

The Scouting Report: For every runner who used the gel, the scout looks at their stats: age, years of experience, and average training miles.
Finding the Twin: The scout then looks through the group of runners who didn’t use the gel to find a “twin” with the same (or nearly the same) stats.
The Comparison: Now, you compare the finish times of these “twins.”

Did you notice how now we’re comparing similar groups? High performers vs. high performers, low performers vs. low performers. That way, we can separate out the other factors that could produce the effect (the confounders) and measure the true impact of the energy gel.
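To make the confounding problem concrete, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration (the 10-minute experience advantage, the 1-minute gel effect, the purchase probabilities); it is not data from this article. It compares the naive average difference with a like-for-like comparison inside each experience level.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20_000

# Assumption: experienced runners are far more likely to buy the gel
experienced = rng.random(n) < 0.5
used_gel = rng.random(n) < np.where(experienced, 0.8, 0.2)

# Assumption: experience saves 10 minutes, the gel itself saves only 1 minute
TRUE_GEL_EFFECT = -1.0
finish_time = (
    60.0
    - 10.0 * experienced
    + TRUE_GEL_EFFECT * used_gel
    + rng.normal(0.0, 3.0, n)
)

# Naive comparison: all gel users vs. all non-users
naive_diff = finish_time[used_gel].mean() - finish_time[~used_gel].mean()

# Like-for-like comparison: compare within each experience level, then average
within_diffs = []
for level in (True, False):
    mask = experienced == level
    diff = finish_time[mask & used_gel].mean() - finish_time[mask & ~used_gel].mean()
    within_diffs.append(diff)
adjusted_diff = float(np.mean(within_diffs))

print(f"Naive difference:    {naive_diff:.2f} min")
print(f"Adjusted difference: {adjusted_diff:.2f} min")
```

By construction, the naive difference comes out around seven minutes in favor of the gel, while the like-for-like comparison lands close to the one-minute effect that was built into the simulation. Matching on a propensity score follows the same idea when there are too many covariates to stratify by hand.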

Great. Let’s move on and learn how to implement this model.

Step-by-Step of PSM

Now we’ll go over the steps we must take to implement PSM on our data. This is important so we can build the intuition and learn the logical steps to take when we need to apply this to any dataset.

The first step is creating a simple Logistic Regression model. This is a well-known classification model that will try to predict the probability that the subject is in the treatment group. In simpler terms, what is the propensity of that individual to take the action being studied?
From step one, we’ll add the propensity score (probability) to the dataset.
Next, we’ll use the Nearest Neighbors algorithm to scan the control group and find the person with the closest score to each treated user.
As a “quality filter”, we add a threshold (a caliper) for calibration. If the distance to the “closest” match is still larger than that threshold, we toss the pair out. It’s better to have a smaller, well-matched sample than a large, biased one.
We evaluate the matched pairs using the Standardized Mean Difference (SMD), which checks whether two groups are actually comparable.

Let’s code then!

Dataset

For the purpose of this exercise, I’ll generate a dataset of 1000 rows with the following variables:

Age of the person
Past expenses with this company
A binary flag indicating the use of a mobile device
A binary flag indicating whether the person saw the advertising

   age   past_spend  is_mobile  saw_ad
0   29   557.288206          1       1
1   45   246.829612          0       1
2   24   679.609451          0       0
3   67  1039.030017          1       1
4   20   323.241117          0       1

You can find the code that generated this dataset in the GitHub repository.
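If you want to follow along without the repository, a stand-in dataset with the same four columns can be sketched like this. This is not the author’s generator: the distributions, the coefficients, and the seed are all assumptions chosen only to produce plausible-looking rows, so the numbers you get will differ from the ones shown in this article.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Assumed covariate distributions (hypothetical, for illustration only)
age = rng.integers(18, 70, size=n)                       # whole years, 18 to 69
is_mobile = rng.integers(0, 2, size=n)                   # 0/1 flag
past_spend = rng.gamma(shape=4.0, scale=130.0, size=n)   # positive, right-skewed

# Assumed exposure mechanism: older, higher-spending, mobile users
# are somewhat more likely to see the ad (this is what creates confounding)
logit = -0.5 + 0.02 * (age - 40) + 0.001 * (past_spend - 500) + 0.4 * is_mobile
p_ad = 1.0 / (1.0 + np.exp(-logit))
saw_ad = (rng.random(n) < p_ad).astype(int)

df = pd.DataFrame({
    "age": age,
    "past_spend": past_spend,
    "is_mobile": is_mobile,
    "saw_ad": saw_ad,
})
print(df.head())
```

The important design choice is that saw_ad depends on the covariates. If exposure were assigned by a coin flip, the groups would already be comparable and there would be nothing for PSM to correct.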

Code Implementation

Next, we’re going to implement PSM using Python. Let’s start by importing the modules.

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

Now, we can start by creating the propensity score.

Step 1: Calculating the Propensity Scores

In this step, we’ll just run a LogisticRegression model that takes into account the age, past_spend, and is_mobile variables and estimates the probability that this person saw the advertising.

Our goal is not to reach 99% accuracy in the prediction, but to balance the covariates, ensuring that the treated and control groups have nearly identical average characteristics (like age and expenses) so that any difference in the outcome can be attributed to the treatment rather than to pre-existing differences.

# Step 1: Calculate the Propensity Scores

# Define covariates and treatment
covariates = ['age', 'past_spend', 'is_mobile']
treatment_col = 'saw_ad'

# 1. Estimate Propensity Scores (probability of treatment)
lr = LogisticRegression()
X = df[covariates]
y = df[treatment_col]

# Fit a Logistic Regression
lr.fit(X, y)

# Store the probability of being in the 'Treatment' group
df['pscore'] = lr.predict_proba(X)[:, 1]

So, after we fit the model, we sliced the predict_proba() results to return only the column with the probabilities of being in the treatment group (prediction of saw_ad == 1).

Propensity Score added to the dataset. Image by the author.
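Before matching, it can help to check that the two groups share a common range of propensity scores, since a treated customer with no comparable control cannot get a good twin. The sketch below is an optional check that is not part of the original walkthrough; it uses a small synthetic table (the `toy` frame and its coefficients are assumptions) so that it runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: not the article's dataset)
rng = np.random.default_rng(1)
n = 1000
toy = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "past_spend": rng.gamma(4.0, 130.0, size=n),
    "is_mobile": rng.integers(0, 2, size=n),
})
logit = 0.03 * (toy["age"] - 40) + 0.4 * toy["is_mobile"]
toy["saw_ad"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Same modeling step as above: logistic regression, keep P(saw_ad == 1)
covariates = ["age", "past_spend", "is_mobile"]
model = LogisticRegression(max_iter=1000)
model.fit(toy[covariates], toy["saw_ad"])
toy["pscore"] = model.predict_proba(toy[covariates])[:, 1]

# Score range per group
summary = toy.groupby("saw_ad")["pscore"].agg(["min", "mean", "max"])
print(summary)

# Common support: the score interval that both groups cover
low = summary["min"].max()
high = summary["max"].min()
in_support = toy["pscore"].between(low, high)
print(f"Common support: [{low:.3f}, {high:.3f}], covers {in_support.mean():.1%} of rows")
```

If a large share of rows falls outside the common interval, the caliper step later on will discard many treated customers, and the final estimate only describes the ones that remain.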

Next, we’ll split the data into control and treatment.

Control: people who didn’t see the advertising.
Treatment: people who saw the advertising.

# 2. Split into Treatment and Control
treated = df[df[treatment_col] == 1].copy()
control = df[df[treatment_col] == 0].copy()

It’s time to find the statistical twins in this data.

Step 2: Finding the Matching Pairs

In this step, we’ll use NearestNeighbors, also from scikit-learn, to find the matching pairs for our observations. The idea is simple.

We have two groups with their propensity to be part of the treatment group, considering all the confounding variables.
So we find the one observation from the control dataset that best matches each one from the treatment dataset.
We use pscore and age for this match. It could be only the propensity score, but after looking at the matched pairs, I saw that adding age would give us a better match. Keep in mind that age is stored in whole years while pscore lies between 0 and 1, so with a caliper of 0.05 on the combined distance, a pair only passes the filter when both ages are identical.

# 3. Use Nearest Neighbors to find matches
# We use a 'caliper', or a threshold, to ensure matches aren't too far apart
caliper = 0.05
# Note: kneighbors() ignores `radius`; the caliper is enforced in the filter step
nn = NearestNeighbors(n_neighbors=1, radius=caliper)
nn.fit(control[['pscore', 'age']])

# Find the matching pairs
distances, indices = nn.kneighbors(treated[['pscore', 'age']])

Now that we have the pairs, we can calibrate the model to discard those that are not close enough to each other.

Step 3: Calibrating the Model

This code snippet filters distances and indices based on the caliper to identify valid matches, then extracts the original Pandas indices for the successfully matched control and treated observations. Any pair whose distance is over the threshold is discarded.

Then we just concatenate both datasets with the remaining observations that passed the quality control.

# 4. Filter out matches that are outside our 'caliper' (quality control)
matched_control_idx = [control.index[i[0]] for d, i in zip(distances, indices) if d[0] <= caliper]
matched_treated_idx = [treated.index[i] for i, d in enumerate(distances) if d[0] <= caliper]

# Combine the matched pairs into a new balanced dataframe
matched_df = pd.concat([df.loc[matched_treated_idx], df.loc[matched_control_idx]])
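One property of this approach is worth checking: each treated row is queried independently, so the same control row can be the nearest neighbor of several treated rows (matching with replacement). A quick way to see how often that happens is to count repeated labels in the list of matched control indices. The list below is a toy stand-in with made-up values so the snippet runs on its own.

```python
import pandas as pd

# Toy stand-in for the list of matched control row labels (hypothetical values)
matched_control_idx = [12, 40, 12, 7, 40, 12, 93]

# How many times each control row was picked as someone's twin
reuse_counts = pd.Series(matched_control_idx).value_counts()

n_pairs = len(matched_control_idx)
n_unique_controls = reuse_counts.size
max_reuse = int(reuse_counts.iloc[0])  # value_counts() sorts in descending order

print(f"Matched pairs:         {n_pairs}")
print(f"Distinct control rows: {n_unique_controls}")
print(f"Most reused control appears {max_reuse} times")
```

When a few controls are reused many times, the matched control group carries less independent information than its row count suggests, which is worth keeping in mind when reading the significance test later.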

Okay. We have a dataset with matched pairs of customers who saw the advertising and customers who didn’t. And the best thing is that we are now able to compare similar groups and isolate the effect of the advertising campaign.

print(matched_df.saw_ad.value_counts())

saw_ad
1    532
0    532
Name: count, dtype: int64

Let’s see if our model gave good matches.

Step 4: Evaluation

To evaluate a PSM model, the best checks are:

Standardized Mean Difference (SMD)
Check the standard deviation of the Propensity Score
Visualize the data overlap

Let’s begin by checking the propensity score statistics.

# Check standard deviation (spread around the mean) of the Propensity Score
matched_df[['pscore']].describe().T

Propensity Score stats. Image by the author.

These statistics suggest that our propensity score matching process has created a dataset where the treated and control groups have very similar propensity scores. The small standard deviation and the concentrated interquartile range (25%-75%) indicate good overlap and balance of propensity scores. This is a positive sign that our matching was effective in bringing the distributions of covariates closer together between the treated and control groups.

Moving on, to compare the means of other covariates like age and is_mobile after Propensity Score Matching, we can refer to the Standardized Mean Difference (SMD). A small SMD (typically below 0.1, or below 0.05 for a stricter cutoff) indicates that the means of the covariate are well-balanced between the treated and control groups, suggesting successful matching.

We’ll calculate the SMD metric using a custom function that takes the mean and standard deviation of a given covariate and calculates the metric.
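Written out, and assuming the equal-weight pooled standard deviation that the function below uses, the metric for one covariate is the following. The subscripts T (treated) and C (control) are notation introduced here only for illustration.

```latex
\mathrm{SMD} = \frac{\bar{x}_T - \bar{x}_C}{\sqrt{\dfrac{s_T^2 + s_C^2}{2}}}
```

Here \bar{x} is the group mean and s is the group standard deviation of that covariate, so the SMD expresses the gap between the group means in units of a typical standard deviation.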

def calculate_smd(df, covariate, treatment_col):
    treated_group = df[df[treatment_col] == 1][covariate]
    control_group = df[df[treatment_col] == 0][covariate]

    mean_treated = treated_group.mean()
    mean_control = control_group.mean()
    std_treated = treated_group.std()
    std_control = control_group.std()

    # Pooled standard deviation
    pooled_std = np.sqrt((std_treated**2 + std_control**2) / 2)

    if pooled_std == 0:
        return 0 # Avoid division by zero if there is no variance
    else:
        return (mean_treated - mean_control) / pooled_std

# Calculate SMD for each covariate
smd_results = {}
for cov in covariates:
    smd_results[cov] = calculate_smd(matched_df, cov, treatment_col)

smd_df = pd.DataFrame.from_dict(smd_results, orient='index', columns=['SMD'])

# Interpretation of SMD values
for index, row in smd_df.iterrows():
    smd_value = row['SMD']
    # Parentheses let the chained conditional expression span several lines
    interpretation = (
        "well-balanced (excellent)" if abs(smd_value) < 0.05 else
        "reasonably balanced (good)" if abs(smd_value) < 0.1 else
        "moderately balanced" if abs(smd_value) < 0.2 else
        "poorly balanced"
    )
    print(f"The covariate '{index}' has an SMD of {smd_value:.4f}, indicating it's {interpretation}.")

                 SMD
age         0.000000
past_spend  0.049338
is_mobile   0.000000

The covariate 'age' has an SMD of 0.0000, indicating it's well-balanced (excellent).
The covariate 'past_spend' has an SMD of 0.0493, indicating it's well-balanced (excellent).
The covariate 'is_mobile' has an SMD of 0.0000, indicating it's well-balanced (excellent).

SMD < 0.05 or 0.1: This is generally considered well-balanced or excellent balance. Most researchers aim for an SMD below 0.1, and ideally below 0.05.

We can see that our variables pass this test!

Finally, let’s check the distribution overlay between Control and Treatment.

# Control and Treatment Distribution Overlays
plt.figure(figsize=(10, 6))
sns.histplot(data=matched_df, x='past_spend', hue='saw_ad', kde=True, alpha=.4)
plt.title('Distribution of Past Spend for Treated vs. Control Groups')
plt.xlabel('Past Spend')
plt.ylabel('Density / Count')
plt.legend(title='Saw Ad', labels=['Control (0)', 'Treated (1)'])
plt.show()

Distributions overlay: they should sit on top of each other and be similar in shape. Image by the author.

It looks good. The distributions overlap closely and have a pretty similar shape.

This is a sample of the matched pairs. You can find the code to build this on GitHub.

Sample of the matched pairs dataset. Image by the author.

With that said, I believe we can conclude that this model is working properly, and we can move on to check the results.

Results

Okay, since we have matching groups and distributions, let’s move on to the results. We’ll check the following:

Difference of means between the two groups
T-Test to check for statistical difference
Cohen’s d to calculate the effect size

Here are the statistics of the matched dataset.

Stats on the final dataset. Image by the author.

After Propensity Score Matching, the estimated causal effect of seeing the ad (saw_ad) on past_spend can be inferred from the difference in means between the matched treated and control groups.

# Difference of averages
avg_past_spend_treated = matched_df[matched_df['saw_ad'] == 1]['past_spend'].mean()
avg_past_spend_control = matched_df[matched_df['saw_ad'] == 0]['past_spend'].mean()

past_spend_difference = avg_past_spend_treated - avg_past_spend_control

print(f"Average past_spend (Treated Group): {avg_past_spend_treated:.2f}")
print(f"Average past_spend (Control Group): {avg_past_spend_control:.2f}")
print(f"Difference in Average past_spend (Treated - Control): {past_spend_difference:.2f}")

Average past_spend (Treated Group): 541.97
Average past_spend (Control Group): 528.14
Difference in Average past_spend (Treated - Control): 13.82

This suggests that, on average, users who saw the ad (treated) spent approximately 13.82 more than users who didn’t see the ad (control), after accounting for the observed covariates.

Let’s check if the difference is statistically significant.

# T-Test
treated_spend = matched_df[matched_df['saw_ad'] == 1]['past_spend']
control_spend = matched_df[matched_df['saw_ad'] == 0]['past_spend']

# Welch's t-test (does not assume equal variances)
t_stat, p_value = stats.ttest_ind(treated_spend, control_spend, equal_var=False)

print(f"T-statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.3f}")

if p_value < 0.05:
    print("The difference in past_spend between treated and control groups is statistically significant (p < 0.05).")
else:
    print("The difference in past_spend between treated and control groups is NOT statistically significant (p >= 0.05).")

T-statistic: 0.805
P-value: 0.421
The difference in past_spend between treated and control groups is NOT statistically significant (p >= 0.05).

The difference is not significant: it is small relative to the spread of past_spend, whose standard deviation is still very high (~280) in both groups.
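The reported t-statistic can be roughly reproduced by hand from the figures already shown, which makes it easier to see why a 13.82 difference gets lost in the noise. This is a back-of-the-envelope sketch: it assumes both groups have a standard deviation of exactly 280 (the article only gives an approximate value) and uses the 532 matched rows per group.

```python
import math

# Figures reported in this article
mean_diff = 13.82       # treated mean minus control mean
n_per_group = 532       # matched rows in each group

# Assumption: both groups share the approximate standard deviation quoted above
std_per_group = 280.0

# Standard error of the difference between two group means
std_error = math.sqrt(
    std_per_group**2 / n_per_group + std_per_group**2 / n_per_group
)

# t-statistic: the difference expressed in standard errors
t_approx = mean_diff / std_error

print(f"Standard error: {std_error:.2f}")
print(f"Approximate t:  {t_approx:.3f}")
```

The standard error comes out at about 17, so the observed difference is less than one standard error away from zero, in line with the t-statistic of about 0.8 reported above.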

Let us also calculate the effect size using Cohen’s d.

# Cohen's d effect size

def cohens_d(df, outcome_col, treatment_col):
    treated_group = df[df[treatment_col] == 1][outcome_col]
    control_group = df[df[treatment_col] == 0][outcome_col]

    mean1, std1 = treated_group.mean(), treated_group.std()
    mean2, std2 = control_group.mean(), control_group.std()
    n1, n2 = len(treated_group), len(control_group)

    # Pooled standard deviation
    s_pooled = np.sqrt(((n1 - 1) * std1**2 + (n2 - 1) * std2**2) / (n1 + n2 - 2))

    if s_pooled == 0:
        return 0 # Avoid division by zero
    else:
        return (mean1 - mean2) / s_pooled

# Calculate Cohen's d for 'past_spend'
d_value = cohens_d(matched_df, 'past_spend', 'saw_ad')

print(f"Cohen's d for past_spend: {d_value:.3f}")

# Interpret Cohen's d
if abs(d_value) < 0.2:
    interpretation = "negligible effect"
elif abs(d_value) < 0.5:
    interpretation = "small effect"
elif abs(d_value) < 0.8:
    interpretation = "medium effect"
else:
    interpretation = "large effect"

print(f"This indicates a {interpretation}.")

Cohen's d for past_spend: 0.049
This indicates a negligible effect.

The difference is small, suggesting a negligible average treatment effect on past_spend in this matched sample.

With that, we conclude this article.

Before You Go

Causal inference is the area of Data Science that gives us the reasons why something happens, rather than just telling us whether it is likely to happen or not.

Many times, you may face this challenge of understanding why something works (or not) in a business. Companies love that, even more so if that information can save money or increase sales.

Just remember the basic steps to create your model.

Run a Logistic Regression to calculate propensity scores
Split the data into Control and Treatment
Run Nearest Neighbors to find the best match between the Control and Treatment groups, so you can isolate the true effect
Evaluate your model using SMD
Calculate your results
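The steps above can be sketched as one small, reusable function. This is a simplified illustration, not the exact code from this article: the names `psm_match` and `smd` are hypothetical, the data is synthetic, and the match uses the propensity score alone (the walkthrough above also adds age to the distance).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors


def psm_match(df, covariates, treatment_col, caliper=0.05):
    """1-nearest-neighbor matching on the propensity score, with replacement."""
    data = df.copy()

    # 1. Propensity scores from a logistic regression
    model = LogisticRegression(max_iter=1000)
    model.fit(data[covariates], data[treatment_col])
    data["pscore"] = model.predict_proba(data[covariates])[:, 1]

    # 2. Split into treatment and control
    treated = data[data[treatment_col] == 1]
    control = data[data[treatment_col] == 0]

    # 3. Nearest control row for every treated row
    nn = NearestNeighbors(n_neighbors=1).fit(control[["pscore"]])
    distances, indices = nn.kneighbors(treated[["pscore"]])

    # 4. Keep only the pairs that fall within the caliper
    keep = distances[:, 0] <= caliper
    matched_treated = treated[keep]
    matched_control = control.iloc[indices[keep, 0]]
    return pd.concat([matched_treated, matched_control])


def smd(df, covariate, treatment_col):
    """Standardized mean difference with an equal-weight pooled std."""
    t = df.loc[df[treatment_col] == 1, covariate]
    c = df.loc[df[treatment_col] == 0, covariate]
    pooled = np.sqrt((t.std() ** 2 + c.std() ** 2) / 2)
    return 0.0 if pooled == 0 else (t.mean() - c.mean()) / pooled


# Synthetic demo data (assumption: older and mobile users see the ad more often)
rng = np.random.default_rng(7)
n = 2000
demo = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "is_mobile": rng.integers(0, 2, size=n),
})
logit = 0.05 * (demo["age"] - 44) + 0.5 * demo["is_mobile"] - 0.25
demo["saw_ad"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# 5. Evaluate balance before and after matching
matched = psm_match(demo, ["age", "is_mobile"], "saw_ad")
for cov in ["age", "is_mobile"]:
    before = smd(demo, cov, "saw_ad")
    after = smd(matched, cov, "saw_ad")
    print(f"{cov}: SMD before = {before:.3f}, after = {after:.3f}")
```

Comparing the SMD before and after matching is a useful habit: the "before" value shows how unbalanced the raw groups were, and the "after" value shows how much of that imbalance the matching removed.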

If you liked this content, find out more about me on my website.

https://gustavorsantos.me