STA323 Project 1 · Q2 Task (2)

Survival Analysis Report

A blog summary of the Spark-based survival-analysis case study on customer churn, including event definition, Kaplan–Meier estimation, Cox modeling, AFT modeling, and customer lifetime value analysis.

1. Overview

Survival analysis studies the time until an event occurs while handling censored observations. In this project, customer tenure is used as the duration variable, and customer churn is treated as the event of interest. A record with Churn = Yes is an observed event, while a record with Churn = No is right-censored: the customer was still active when the data were collected, so only a lower bound on their time to churn is known.

Compared with ordinary churn classification, survival analysis is more informative because it uses both event occurrence and event timing.

2. Dataset and Preprocessing

The analysis follows the Spark survival-analysis tutorial with the Telco Customer Churn dataset. The cohort was restricted to month-to-month customers with internet service. The event indicator is defined as 1 for churned customers and 0 for censored customers.
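The cohort filter and event definition described above can be sketched in plain Python. The column names `Contract` and `InternetService` and their values are assumptions based on the public Telco churn CSV, and the four records are made up for illustration; the project itself used Spark.

```python
# Toy records (made up); column names and values are assumed to match the Telco CSV.
records = [
    {"Contract": "Month-to-month", "InternetService": "DSL", "tenure": 5, "Churn": "Yes"},
    {"Contract": "Month-to-month", "InternetService": "No", "tenure": 12, "Churn": "No"},
    {"Contract": "Two year", "InternetService": "Fiber optic", "tenure": 40, "Churn": "No"},
    {"Contract": "Month-to-month", "InternetService": "Fiber optic", "tenure": 20, "Churn": "No"},
]


def in_cohort(row):
    """Keep month-to-month customers who have some internet service."""
    return row["Contract"] == "Month-to-month" and row["InternetService"] != "No"


# duration = tenure in months; event = 1 for churned, 0 for right-censored
cohort = [
    {"duration": row["tenure"], "event": 1 if row["Churn"] == "Yes" else 0}
    for row in records
    if in_cohort(row)
]

for row in cohort:
    print(row)
```

Only the first and last toy records pass the filter; the last one is censored because the customer has not churned.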

7,043 raw records
3,351 analysis-cohort records
1,556 observed churn events
46.43% cohort churn rate

| Item | Value |
| --- | --- |
| Rows in raw dataset | 7043 |
| Rows in analysis cohort | 3351 |
| Churned customers in cohort | 1556 |
| Non-churned customers in cohort | 1795 |
| Mean tenure in cohort | 19.43 months |
| Median tenure in cohort | 13 months |
| Mean monthly charges | 73.59 |

| Churn indicator | Customer count | Avg. tenure (months) | Avg. monthly charges |
| --- | --- | --- | --- |
| 0 (censored) | 1795 | 23.62 | 71.17 |
| 1 (churned) | 1556 | 14.60 | 76.38 |
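The summary figures are mutually consistent, which can be checked with a few lines of arithmetic. This is only a sanity check on the reported numbers; because the per-group averages are already rounded, the pooled means match only approximately.

```python
# Counts and per-group averages as reported in the tables above.
n_censored, n_churned = 1795, 1556
avg_tenure = {"censored": 23.62, "churned": 14.60}
avg_charges = {"censored": 71.17, "churned": 76.38}

n_total = n_censored + n_churned
churn_rate = n_churned / n_total

# Pooled means are count-weighted averages of the group means.
pooled_tenure = (n_censored * avg_tenure["censored"] + n_churned * avg_tenure["churned"]) / n_total
pooled_charges = (n_censored * avg_charges["censored"] + n_churned * avg_charges["churned"]) / n_total

print(f"cohort size    : {n_total}")
print(f"churn rate     : {100 * churn_rate:.2f}%")
print(f"pooled tenure  : {pooled_tenure:.2f} months")
print(f"pooled charges : {pooled_charges:.2f}")
```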

3. Kaplan–Meier Survival Analysis

The Kaplan–Meier estimator was used to estimate the customer survival function over tenure months. The overall median survival time of the filtered cohort was 34 months.
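For readers unfamiliar with the estimator, here is a minimal pure-Python sketch of the product-limit calculation. The six observations are made up and are not the Telco cohort; the project presumably relied on a library implementation rather than code like this.

```python
def kaplan_meier(durations, events):
    """Return [(t, S(t))] evaluated at each distinct observed event time."""
    event_times = sorted({t for t, e in zip(durations, events) if e == 1})
    survival, curve = 1.0, []
    for t in event_times:
        # Customers still under observation just before time t
        at_risk = sum(1 for d in durations if d >= t)
        # Customers who churned exactly at time t
        churned = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        survival *= 1.0 - churned / at_risk
        curve.append((t, survival))
    return curve


def median_survival(curve):
    """First event time at which the survival estimate drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # the curve never reaches 0.5


# Toy data: tenure in months and event flag (1 = churned, 0 = censored)
toy_durations = [1, 2, 2, 3, 4, 5]
toy_events = [1, 1, 0, 1, 0, 1]

toy_curve = kaplan_meier(toy_durations, toy_events)
toy_median = median_survival(toy_curve)
print(toy_curve, toy_median)
```

Censored customers never trigger a drop in the curve, but they do count in the at-risk set until their last observed month, which is how the estimator uses partial information.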

| Result | Statistic / value | p-value |
| --- | --- | --- |
| Overall median survival time | 34 months | |
| Gender log-rank test | 2.0389 | 0.1533 |
| OnlineSecurity log-rank test | 141.6032 | 1.19 × 10⁻³² |

The gender difference was not statistically significant (p = 0.1533), whereas the survival curves differed sharply by OnlineSecurity status (p ≈ 1.19 × 10⁻³²).
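The reported p-values can be recovered from the test statistics. The sketch below assumes each log-rank test compares two groups, so the statistic is referred to a chi-square distribution with 1 degree of freedom; for that distribution the upper-tail probability equals erfc(sqrt(x / 2)).

```python
import math


def chi2_sf_1df(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Test statistics as reported in the table above
p_gender = chi2_sf_1df(2.0389)
p_online_security = chi2_sf_1df(141.6032)

print(f"gender         : p = {p_gender:.4f}")
print(f"OnlineSecurity : p = {p_online_security:.3g}")
```

Both values agree with the table, which supports the two-group assumption.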

4. Cox Proportional Hazards Model

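The Cox model estimates how each covariate scales the churn hazard, holding the other covariates fixed. The hazard ratio is exp(coefficient): a value below 1 indicates lower churn risk than the reference category, so a hazard ratio of 0.46 corresponds to a 54% lower hazard.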

| Covariate | Coefficient | Hazard ratio | p-value |
| --- | --- | --- | --- |
| dependents_Yes | -0.33 | 0.72 | < 0.005 |
| internetService_DSL | -0.22 | 0.80 | < 0.005 |
| onlineBackup_Yes | -0.78 | 0.46 | < 0.005 |
| techSupport_Yes | -0.64 | 0.53 | < 0.005 |

The most protective variables in this Cox model were onlineBackup_Yes and techSupport_Yes.
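The hazard ratios in the table are simply the exponentiated coefficients. A short sketch that recomputes them and expresses each as a percentage reduction in hazard:

```python
import math

# Coefficients as reported in the Cox table above
cox_coefficients = {
    "dependents_Yes": -0.33,
    "internetService_DSL": -0.22,
    "onlineBackup_Yes": -0.78,
    "techSupport_Yes": -0.64,
}

# Hazard ratio = exp(coefficient)
hazard_ratios = {name: math.exp(coef) for name, coef in cox_coefficients.items()}

for name, hr in sorted(hazard_ratios.items(), key=lambda item: item[1]):
    reduction = 100 * (1 - hr)
    print(f"{name:<22} HR = {hr:.2f}  ({reduction:.0f}% lower hazard)")
```

Sorting by hazard ratio puts onlineBackup_Yes and techSupport_Yes first, matching the ranking stated above.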

The proportional hazards assumption check suggested possible non-proportional behavior for some variables, so the Cox estimates should be interpreted with caution: they are informative, but the model's key assumption is not fully satisfied.

5. Accelerated Failure Time Model

The log-logistic AFT model describes how covariates accelerate or decelerate the time until churn. In this model, exp(coef) acts as a multiplier on time: a value greater than 1 means a longer expected time until churn.

| Covariate | Coefficient | exp(coef) | p-value |
| --- | --- | --- | --- |
| internetService_DSL | 0.38 | 1.47 | < 0.005 |
| onlineSecurity_Yes | 0.86 | 2.37 | < 0.005 |
| onlineBackup_Yes | 0.81 | 2.25 | < 0.005 |
| techSupport_Yes | 0.69 | 1.99 | < 0.005 |
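To make the "time multiplier" reading concrete, the sketch below uses the common log-logistic parameterization S(t) = 1 / (1 + (t / alpha)^beta), in which alpha is the median time and a covariate multiplies alpha by exp(coef). The baseline alpha and beta are hypothetical values chosen for illustration, not fitted parameters from this project; only the techSupport_Yes coefficient comes from the table.

```python
import math


def loglogistic_survival(t, alpha, beta):
    """Log-logistic survival function; alpha is the median, beta the shape."""
    return 1.0 / (1.0 + (t / alpha) ** beta)


baseline_alpha = 20.0  # hypothetical baseline median (months)
beta = 1.5             # hypothetical shape parameter
coef_tech_support = 0.69  # techSupport_Yes coefficient from the table above

# In an AFT model the covariate rescales time: alpha is multiplied by exp(coef).
alpha_with_support = baseline_alpha * math.exp(coef_tech_support)

s_without = loglogistic_survival(24, baseline_alpha, beta)
s_with = loglogistic_survival(24, alpha_with_support, beta)

print(f"median without tech support: {baseline_alpha:.1f} months")
print(f"median with tech support   : {alpha_with_support:.1f} months")
print(f"S(24) without vs with      : {s_without:.3f} vs {s_with:.3f}")
```

Whatever the baseline values, the median with tech support is exp(0.69), roughly twice, the baseline median.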

6. Customer Lifetime Value

The survival model was also used to estimate expected customer lifetime value. Predicted survival probabilities were multiplied by monthly profit and discounted over time.

| Contract month | Survival probability | Discounted expected profit | Cumulative NPV |
| --- | --- | --- | --- |
| 1 | 0.9481 | 28.21 | 28.21 |
| 12 | 0.8137 | 22.10 | 297.51 |
| 24 | 0.7235 | 17.78 | 533.25 |
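The per-month calculation can be sketched as follows. The monthly profit of 30 and the 10% annual discount rate (compounded monthly) are assumptions: the report does not state them, but these values reproduce the three table rows to within rounding.

```python
monthly_profit = 30.0        # assumed; not stated in the report
annual_discount_rate = 0.10  # assumed; not stated in the report

# Survival probabilities as reported in the table above
survival_by_month = {1: 0.9481, 12: 0.8137, 24: 0.7235}


def discounted_expected_profit(month, survival):
    """Expected profit in a given month, discounted back to month 0."""
    expected_profit = survival * monthly_profit
    discount_factor = (1 + annual_discount_rate / 12) ** month
    return expected_profit / discount_factor


discounted = {
    month: discounted_expected_profit(month, s)
    for month, s in survival_by_month.items()
}

for month, value in discounted.items():
    print(f"month {month:>2}: discounted expected profit = {value:.2f}")
```

The cumulative NPV column is the running sum of this quantity over every month up to the given one, so it cannot be rebuilt from the three rows shown here.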

7. Main Findings

  1. The filtered cohort has a high churn rate of 46.43%, so it is a suitable high-risk segment for survival modeling.
  2. Churned customers have shorter average tenure than censored customers.
  3. OnlineSecurity is strongly associated with better survival, while gender is not significant.
  4. TechSupport and OnlineBackup are associated with lower churn hazard.
  5. In the AFT model, OnlineSecurity, OnlineBackup, and TechSupport each roughly double the estimated time until churn (exp(coef) of 2.37, 2.25, and 1.99, respectively).