A Survival Analysis Guide with Python: Using Time-To-Event Models to Forecast Customer Lifetime

Statistics applies to many areas of knowledge, helping us deal with uncertainty, calculate probabilities, and support decisions along the way.

One of the areas that relies heavily on statistics is the medical industry, using tools like t-tests, A/B tests, or survival analysis. This last one is the subject of this article.

Survival analysis originated in the medical and biological sciences, where researchers were trying to model, as their main event, the death of a patient or organism. That’s the reason for the name.

However, statisticians understood that such analysis was so powerful that it could be applied to many other areas of life, and so it spread to the business domain, even more so after the surge of Data Science.

Let’s learn more about it.

Survival Analysis

Survival Analysis [SA] is a branch of statistics used to predict the amount of time it takes for a specific event to occur.[1]

Also known as time-to-event analysis, this study can determine how long it will take for something to happen while accounting for the fact that some events haven’t occurred yet by the time the data is collected.

The examples are not only in the medical and biological sciences, but everywhere.

Time until a machine fails
Time until a customer cancels a subscription
Time until the customer buys again

Now, given that we are trying to estimate a number, rather than a group or class, this means we’re dealing with a type of regression problem. So why can’t we go with OLS Linear Regression?

Why Use Survival Analysis?

Standard regression models like OLS or Logistic Regression struggle with survival data because they’re designed to handle completed events, not “ongoing” stories.

Imagine you want to predict who will finish a 10-mile race, but the input data comes from an event that’s still happening. The race is at the 2-hour mark, and you want to use the data you have so far to estimate something.

The regular regression algorithms will fail because:

OLS: You only have the data from those who have already finished the race. Using only their data will create a huge bias toward faster runners.
Logistic Regression: It can probably tell whether someone finished the race, but it treats those who finished in 30 minutes the same as those who finished in 8 hours.
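
The bias described for OLS can be seen in a small simulation. This is only a sketch with made-up numbers: the exponential finish times, the 3-hour scale, and the 2-hour cutoff are all assumptions chosen for illustration, not values from the article.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical finish times (hours) for 10,000 runners.
true_times = rng.exponential(scale=3.0, size=10_000)

# We look at the race after 2 hours: anyone still running is censored.
cutoff = 2.0
finished = true_times <= cutoff

# Naive approach: average only the runners who already finished.
naive_mean = true_times[finished].mean()
true_mean = true_times.mean()

print(f"Share finished by the cutoff: {finished.mean():.1%}")
print(f"Naive mean (finishers only):  {naive_mean:.2f} h")
print(f"True mean (all runners):      {true_mean:.2f} h")
```

Because every finisher crossed the line before the cutoff, the naive mean can never exceed 2 hours, no matter how slow the remaining runners turn out to be.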

The Basics of Survival Analysis

Let us go over a few important concepts for understanding Survival Analysis.

First, we must understand the birth and death of a data point.

Birth: The moment we start measuring that data point. For example, the moment a patient is diagnosed with cancer, or the day a person is hired by a company. Notice that the observations don’t all need to start at the same time.
Death: It happens at the occurrence of the event of interest. For example, the day the employee left the company.

Now, the interesting thing about SA is that the study or the observation can end before the event happens. In this case, we have another important concept: the censored data point.

Censoring (Non-death): If the study ends or a subject drops out before the event happens, the data is “censored,” meaning we only know they survived at least until that point.

Data can be censored in different ways, though.

Right Censoring: Most common. The event occurs after the observation period ends or the subject drops out.

Data point C is right-censored. Image by the author.

Left Censoring: The event occurred before the study started.
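
In practice, right-censoring usually shows up as a pair of columns: a duration and an event flag. The sketch below builds both from start and end dates. The `records` frame, its column names, and `study_end` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical subscription records; a missing end date means "still active".
records = pd.DataFrame({
    "customer": ["A", "B", "C"],
    "start": pd.to_datetime(["2023-01-01", "2023-03-01", "2023-06-01"]),
    "end": pd.to_datetime(["2023-07-01", None, "2023-09-01"]),
})
study_end = pd.Timestamp("2023-12-31")

# Event indicator: 1 if the churn was observed, 0 if right-censored.
records["event"] = records["end"].notna().astype(int)

# Duration: time to the event, or to the end of the study when censored.
last_seen = records["end"].fillna(study_end)
records["duration_days"] = (last_seen - records["start"]).dt.days

print(records[["customer", "duration_days", "event"]])
```

Customer B has no end date, so the row keeps a duration up to the end of the study and an event flag of 0, which is exactly the "survived at least until that point" information.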

Great. It is important to note that survival analysis is a way to estimate the probability of an event occurring as a function of time. By treating survival as a function of time, we can answer questions that a single probability score can’t, such as: “At what specific month does the risk of a customer churning peak?”

Now that we know the basics, let’s learn more about the functions involved in SA.

Survival Function

The survival function S(t) expresses the probability of the event not occurring as a function of time. It naturally decreases as time passes, since more and more individuals will experience the event.

So, applying it to our employee churn example, we would see the probability that an employee is still in the company after N years.

Survival Function. Image by the author.

Hazard Function

The hazard function indicates the risk of the event occurring at a given point in time. It is the counterpart of the survival function: it represents the risk of churn (instead of the probability of staying in the company). Note that it is a rate among those still at risk, not simply 1 - S(t).

This function answers the question: what is the risk that the employees who haven’t churned until now will do so at this point in time?
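
For readers who prefer notation, the two functions can be written with the standard textbook definitions for continuous time. The symbol T for the time of the event is an assumption of notation here, since the article itself only names S(t).

```latex
S(t) = P(T > t), \qquad
h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}, \qquad
S(t) = \exp\!\left(-\int_0^t h(u)\,du\right)
```

The last identity shows how the two are linked: the more hazard accumulates up to time t, the lower the survival probability at t.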

Choosing Your Model for Survival Analysis

As you see, SA is a topic that can get deep and dense real quick. But let’s try to keep it simple.

There are two main models used when performing survival analysis. One is the Kaplan-Meier, which is simpler but doesn’t consider the effect of additional predictor variables, and it requires a few assumptions to work.

The other one is the Cox Proportional Hazard model, which is the industry standard because it can take other variables into the model, it’s more stable mathematically, and it works well even when some assumptions are violated.

Let’s learn more about them.

Kaplan-Meier

Works well with right-censored data (remember? when the event occurs after the observation period ends)
Intuitive model
Non-parametric: doesn’t assume any distribution
Assumptions are required, such as: dropouts are not related to the event; entry time doesn’t affect survival risk; and event times are known precisely.
Returns a survival function that looks like a staircase

When to use:

Simple survival analysis without other covariates or predictors.
Great for quick visualizations.
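
To see where the staircase comes from, here is a minimal from-scratch sketch of the product-limit rule behind Kaplan-Meier: at every observed event time, the running survival is multiplied by (1 - events / subjects at risk). The function name `kaplan_meier` and the toy data are made up for illustration; this is not how lifelines is implemented internally.

```python
import numpy as np

def kaplan_meier(durations, events):
    """Return {event_time: survival_probability} using the product-limit rule."""
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=int)

    survival = 1.0
    curve = {}
    # Only times where at least one event was observed change the curve.
    for t in np.unique(durations[events == 1]):
        at_risk = np.sum(durations >= t)                  # still under observation at t
        died = np.sum((durations == t) & (events == 1))   # events exactly at t
        survival *= 1.0 - died / at_risk
        curve[float(t)] = survival
    return curve

# Toy data (made up): months observed and whether churn was seen (1) or censored (0).
curve = kaplan_meier([2, 3, 3, 5, 6], [1, 1, 0, 1, 0])
for t, s in curve.items():
    print(f"S({t:g}) = {s:.2f}")
```

With this toy input the curve steps down to 0.80, 0.60, and 0.30 at months 2, 3, and 5. The censored subjects never cause a step, but they do count in the at-risk group until they leave.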

Cox Proportional Hazard

Industry standard
Accepts additional predictors or covariates
Works well even when some assumptions are violated
Estimates a hazard function, which tends to be more stable than survival functions

When to use:

Estimate on data with multiple predictor (covariate) variables.
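
In its usual textbook form, the Cox model writes the hazard of an individual as a shared baseline hazard scaled by the covariates. The symbols below (h_0 for the baseline hazard, x_j for the covariates, beta_j for their coefficients) are standard notation assumed for this sketch.

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(\beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p\right),
\qquad
\mathrm{HR}_j = \exp(\beta_j)
```

The hazard ratio HR_j is the factor by which the hazard is multiplied when covariate x_j increases by one unit and the others stay fixed. The "proportional" in the name refers to this factor not depending on time.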

Next, let’s get our hands on some code.

Code

In this section, we will learn how to model an SA using both models previously presented.

The dataset chosen for this exercise is the Telco Customer Churn, which you can find in the UCI Machine Learning Repository under the Creative Commons license.

View of the dataset. Image by the author.

Next, let’s import the packages needed.

# Data
from ucimlrepo import fetch_ucirepo

# Data Wrangling
import pandas as pd
import numpy as np

# DataViz
import matplotlib.pyplot as plt
import seaborn as sns

# Lifelines Survival Analysis
from lifelines import KaplanMeierFitter
from lifelines import CoxPHFitter

# Fetch dataset
telco_churn = fetch_ucirepo(id=563)

# Data (as pandas dataframes)
X = telco_churn.data.features
y = telco_churn.data.targets

# Pandas df
df = pd.concat([X, y], axis=1)
df.head(3)
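
Before fitting anything, it can help to confirm that the duration and event columns are in the shape survival models expect. The helper below is a hypothetical sketch (the name `check_survival_columns` is not part of pandas or lifelines), shown on a tiny made-up frame that reuses the article’s column names.

```python
import pandas as pd

def check_survival_columns(df: pd.DataFrame, duration_col: str, event_col: str) -> None:
    """Raise ValueError if the duration/event columns are not usable for survival models."""
    if df[duration_col].isna().any() or df[event_col].isna().any():
        raise ValueError("Missing values in the duration or event column.")
    if (df[duration_col] < 0).any():
        raise ValueError("Durations must be non-negative.")
    if not set(df[event_col].unique()) <= {0, 1}:
        raise ValueError("Event column must contain only 0 (censored) and 1 (event).")

# Tiny made-up frame that reuses the article's column names.
sample = pd.DataFrame({"Subscription  Length": [38, 39, 37], "Churn": [0, 0, 1]})
check_survival_columns(sample, "Subscription  Length", "Churn")
print("Columns look usable.")
```

The same call could be pointed at the full dataframe, assuming its columns are named as in the code that follows.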

Implementing Kaplan-Meier

Now, as mentioned, the Kaplan-Meier [KM] model is really simple and easy to use, being a good choice for visualizations. All we need are two variables: one duration, which plays the role of the predictor, and one event label.

Then, we can instantiate the KM model and fit it to the data, using Subscription Length (total months of subscription) as the predictor, and Churn as the event observed.

# Instantiate K-M
kmf = KaplanMeierFitter()

# Fit the model
kmf.fit(df['Subscription  Length'],
        event_observed=df['Churn'],
        label='Customer Churn')

Done. Next, we can visualize the survival function.

# Plot survival curve
plt.figure(figsize=(12, 5))
kmf.plot_survival_function()
plt.title('Kaplan-Meier Survival Curve: Telco Customer Lifetime')
plt.xlabel('Time (months)')
plt.ylabel('Probability of Remaining Subscribed')
plt.grid(True)
plt.show()

This is so great! We can see that more than 90% of the customers stay with the Telecom company for about 35 months.

Kaplan-Meier model is great for visualizations. Image by the author.

If we want to confirm, we can easily code that and learn that 90% actually stay with the company for 34 months.

# Checking survival rate at 34 months
kmf.survival_function_at_times(34)

Customer Churn
34	0.900613

If we want to know the median time at which people churn, we can use KM’s attribute .median_survival_time_. This is the point in time (t) where the survival probability drops to 50%. In our case, it will be inf because the survival function never drops under 0.5. But if the result was 24 (for example), it would mean that half of your customers will have churned by month 24.

# Time (t) when Survival drops under 50%
median_survival = kmf.median_survival_time_
print(f"Median Customer Lifetime: {median_survival} months")
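
The inf result is easier to accept once the rule behind a median survival time is spelled out: take the first time at which the curve is at or below 0.5, and report infinity if it never gets there. The sketch below applies that rule to two made-up curves. The function `median_survival_time` is a hypothetical helper, and it assumes the curve is sorted by time.

```python
import numpy as np
import pandas as pd

def median_survival_time(survival_curve: pd.Series) -> float:
    """First time at which the survival probability is at or below 0.5, else inf."""
    below = survival_curve[survival_curve <= 0.5]
    return float(below.index[0]) if len(below) else np.inf

# Made-up survival curves indexed by month.
drops_below_half = pd.Series([1.0, 0.8, 0.6, 0.45, 0.3], index=[0, 6, 12, 24, 36])
stays_above_half = pd.Series([1.0, 0.95, 0.9, 0.7], index=[0, 12, 34, 47])

print(median_survival_time(drops_below_half))
print(median_survival_time(stays_above_half))
```

The first curve crosses 0.5 at month 24. The second never does, so the median is reported as infinite, which is the situation described above for the full customer base.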

We can also perform other analyses, such as comparisons between groups. Imagine that this Telco company classifies its customers into two groups:

Heavy-users: Frequency of Use > median
Soft-users: Frequency of Use <= median

We can compare both survival functions from these two groups.

# Column Groups
df['Heavy_User'] = np.where(df['Frequency of use'] > df['Frequency of use'].median(), 1, 0)
df.head()

plt.figure(figsize=(12, 5))

# Fit the model for Soft users and plot
soft = df[df.Heavy_User == 0]
kmf.fit(soft['Subscription  Length'], soft['Churn'], label='Soft User')
ax = kmf.plot_survival_function()

# Fit the model for Heavy users and plot on the same axes
heavy = df[df.Heavy_User == 1]
kmf.fit(heavy['Subscription  Length'], heavy['Churn'], label='Heavy User')
ax = kmf.plot_survival_function(ax=ax)

# Label the axes after plotting so the plot call cannot overwrite them
ax.set_title('Kaplan-Meier Survival Curve: Telco Customer Lifetime')
ax.set_xlabel('Time (months)')
ax.set_ylabel('Probability of Remaining Subscribed')

plt.show()

And there it is. While heavy users stay steady with the company throughout the whole timeframe, the soft users churn quickly after the 30th month. Their median survival time is 40 months.

Survival comparison between groups. Image by the author.

When comparing groups, you need to make sure that the difference is statistically significant. For that, the package lifelines has the log-rank test implemented. It’s a hypothesis test:

H0 (null hypothesis): The survival curves of the two populations don’t differ.
Ha (alternative hypothesis): The survival curves of the two populations are different.

from lifelines.statistics import logrank_test

# Perform the log-rank test
results = logrank_test(df[df.Heavy_User == 0]['Subscription  Length'],
                       df[df.Heavy_User == 1]['Subscription  Length'],
                       event_observed_A=df[df.Heavy_User == 0]['Churn'],
                       event_observed_B=df[df.Heavy_User == 1]['Churn'])

# Print results
print(f"P-value: {results.p_value}")
print(f"Test Statistic: {results.test_statistic}")

if results.p_value < 0.05:
    print("Result: Statistically significant difference between groups.")
else:
    print("Result: No significant difference detected.")

P-value: 7.23487469906141e-103
Test Statistic: 463.7794219211866
Result: Statistically significant difference between groups.

Implementing Cox Proportional Hazard

The first cool thing that you can do with the Cox Proportional Hazard [CPH] Model is checking how other variables can influence the survival of your observed individual.

Let’s break it down.

We start by choosing some covariates
We filter the dataset
Instantiate the model
Fit the model

# 1. Prepare the data
# Selecting the time, the event, and our chosen covariates
cols_to_use = [
    'Subscription  Length', # Time (t)
    'Churn',                 # Event (E)
    'Charge  Amount',        # Covariate 1
    'Complains',             # Covariate 2
    'Frequency of use'       # Covariate 3
]

# Dropping any missing values for the model
df_model = df[cols_to_use].dropna()

# 2. Initialize and fit the Cox model
# Use the penalizer to stabilize the math if the fit is not converging.
cph = CoxPHFitter(penalizer=0.1)
cph.fit(df_model,
        duration_col='Subscription  Length',
        event_col='Churn')

# 3. Display the results
cph.print_summary()

# 4. Visualize the impact of covariates
cph.plot()

This is our beautiful result.

How can we interpret this?

The dashed vertical line at 0.0 is the neutral point.

If a variable’s point sits at 0, it has no effect on churn.
To the Right (> 0): Increases the hazard (makes churn happen sooner).
To the Left (< 0): Decreases the hazard (makes the customer stay longer).
On the table, the most important column for business stakeholders is the Hazard Ratio, exp(coef). It tells us the multiplier effect on the risk of churn.

[TABLE] Complains (5.36): At any given time, a customer who complains has a churn hazard 5.36 times that of a customer who doesn’t complain (436% higher). This is a huge effect.

[GRAPHIC] Complains (High Hazard): This is our strongest predictor. Customers with complaints are roughly 5.4 times as likely to churn at any given moment compared to those who don’t complain.

[TABLE] Frequency of use (0.99): While the p-value says this is technically significant, an HR of 0.99 is effectively 1. It means the impact on churn is negligible (only a 1% change per one-unit increase).

[GRAPHIC] Frequency of Use (Neutral): The square is sitting almost exactly on the 0.0 line. In this specific model, how often a customer uses the service doesn’t meaningfully change when they churn.

[TABLE] Charge Amount (0.83): For every one-unit increase in charge, the risk of churn drops by 17% ($1 - 0.83 = 0.17$). Higher-paying customers are more stable.

[GRAPHIC] Charge Amount (Protective Factor): The square is to the left of the zero line. Higher charges are associated with a lower risk of churn.
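
The table and the graphic show the same numbers on two scales: the plot uses the coefficient, which is the log of the hazard ratio, while the table’s exp(coef) column is the ratio itself. A short sketch of the conversion, using the three hazard ratios quoted above (the dictionary is only an illustration, not output from the model):

```python
import math

# Hazard ratios as quoted in the summary above; the dictionary itself is illustrative.
hazard_ratios = {"Complains": 5.36, "Frequency of use": 0.99, "Charge Amount": 0.83}

for name, hr in hazard_ratios.items():
    coef = math.log(hr)              # position on the plot's log(HR) axis
    pct_change = (hr - 1.0) * 100.0  # percent change in hazard per one-unit increase
    print(f"{name:<18} coef={coef:+.2f}  HR={hr:.2f}  change={pct_change:+.0f}%")
```

A ratio above 1 gives a positive coefficient (right of the dashed line), and a ratio below 1 gives a negative one (left of the line).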

We can also take a look at both the Survival and the Hazard functions for this model.

Survival and Hazard functions from the CPH model. Image by the author.

The curve is similar to the KM model. Let’s compare the survival probability at the same 34th month.

# Extract the baseline survival probability at time 34
survival_at_34 = cph.baseline_survival_.loc[34]
print(f"Baseline Survival Probability at period 34: {survival_at_34.values[0]:.4f}")

Baseline Survival Probability at period 34: 0.9294

It’s almost 3 percentage points higher, at ~93%.

And to close this article, let’s pick two different customers, one without complaints and the other with complaints, and let’s compare their survival probabilities at the 34th month.

# 1. Pick two customers (or predict for new ones)
individual = df_model.iloc[[110, 111]]

# 2. Predict their full survival curves
pred_survival = cph.predict_survival_function(individual)

# 3. Get the values at time 34
prob110_at_34 = pred_survival.loc[34].values[0]
prob111_at_34 = pred_survival.loc[34].values[1]

print(f"Customer 110 (no complaints) Probability of 'Surviving' to period 34: {prob110_at_34:.2%}")
print(f"Customer 111 (with complaints) Probability of 'Surviving' to period 34: {prob111_at_34:.2%}")

Customer 110 (no complaints) Probability of 'Surviving' to period 34: 93.94%
Customer 111 (with complaints) Probability of 'Surviving' to period 34: 61.68%

Big difference, huh? More than 30 percentage points. And we can finally calculate the time in months when each customer is expected to churn.

# Time Until Churn (expected lifetime) by customer
pred_churn = cph.predict_expectation(df_model.iloc[[110, 111]])

# Get the values in months
prob110_churn = pred_churn.loc[110]
prob111_churn = pred_churn.loc[111]

print(f"Customer 110 (no complaints) expected churn at: {prob110_churn:.0f} months")
print(f"Customer 111 (with complaints) expected churn at: {prob111_churn:.0f} months")

Customer 110 (no complaints) expected churn at: 41 months
Customer 111 (with complaints) expected churn at: 31 months

Definitely, complaints make a difference in churn for this Telco company.
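
As a rough cross-check of what a hazard ratio does to a survival probability: under proportional hazards, multiplying the hazard by a constant factor raises the survival probability to that power. The sketch below applies this to the no-complaint survival at month 34 and the Complains hazard ratio quoted earlier. The function name is hypothetical, and the scenario (the same customer with only the complaint flag changed) is an assumption for illustration.

```python
def survival_under_hazard_ratio(reference_survival: float, hazard_ratio: float) -> float:
    """Survival probability when the hazard is `hazard_ratio` times the reference hazard."""
    return reference_survival ** hazard_ratio

# Values quoted earlier in the article: survival at month 34 for the customer without
# complaints, and the hazard ratio reported for Complains.
no_complaint = 0.9394
with_complaint = survival_under_hazard_ratio(no_complaint, 5.36)

print(f"Same customer, hypothetically with a complaint: {with_complaint:.2%}")
```

The result will not necessarily match the figure printed for customer 111, since the two real customers may also differ in charge amount and frequency of use.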

Before You Go

Well, survival analysis is much more than just a statistical function. Companies can use it to understand customer behavior.

The Kaplan-Meier and Cox Proportional Hazard models provide actionable insights into subscriber longevity. We’ve seen how variables like customer value and service complaints directly affect churn, allowing decision makers to pursue more targeted retention strategies.

Data professionals who understand these models can build a powerful tool for companies to improve their relationship with their customer base. Use these tools to stay ahead of the curve. Literally.

If you liked this content, find me on my website.

https://gustavorsantos.me

GitHub Repository

https://github.com/gurezende/Survival-Analysis

References

[1. Survival Analysis Definition] (https://en.wikipedia.org/wiki/Survival_analysis)

[2. The Complete Introduction to Survival Analysis in Python] (https://medium.com/data-science/the-complete-introduction-to-survival-analysis-in-python-7523e17737e6)

[3. Introduction to Customer Survival Analysis: Understanding Customer Lifetimes] (https://medium.com/@slavyolov/introduction-to-customer-survival-analysis-understanding-customer-lifetimes-6e4ba41d7724)

[4. Ultimate Guide to Survival Analysis] (https://www.graphpad.com/guides/survival-analysis)

[5. What is the difference between Kaplan-Meier (KM) and Cox Proportional Hazards (CPH) ratio?] (https://www.droracle.ai/articles/218904/what-is-the-difference-between-kaplan-meier-km-and-cox)

[6. Lifelines Documentation] (https://lifelines.readthedocs.io/en/latest/)

[7. Survival Analysis in R For Beginners] (https://www.datacamp.com/tutorial/survival-analysis-R)
