Standard Case: Customer Churn Prediction - Impact of Missing Demographic Data

Business Context

A telecom company wants to understand why customers are churning. It suspects that missing information in its CRM (such as income or age) is not random and may be correlated with churn. This analysis checks whether patterns of missing data in customer profiles can help predict the 'churn' target variable.

About the Action Set: dataSciencePilot

The dataSciencePilot action set provides automated machine learning (AutoML) and pipeline generation.

Data Preparation

Create a customer dataset with demographic information and a churn flag. The 'income' and 'last_complaint_category' variables contain missing values.

/* Assumes an active CAS session with the 'casuser' libref assigned */
DATA casuser.crm_churn_data;
    /* Declare character lengths up front so values are not truncated */
    LENGTH customer_id $ 10 last_complaint_category $ 20;
    /* List input: a single period is read as a missing value */
    INPUT customer_id $ age income tenure plan_type $ churn last_complaint_category $;
    CARDS;
CUST001 34 55000 24 Premium 0 Technical
CUST002 45 . 60 Basic 1 Billing
CUST003 28 48000 12 Basic 0 .
CUST004 52 120000 120 Premium 0 Technical
CUST005 21 . 6 Basic 1 .
CUST006 65 85000 84 Premium 0 Billing
CUST007 33 62000 30 Basic 0 Technical
CUST008 41 . 48 Premium 1 .
;
RUN;
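
Before running the action, it can help to confirm how many values are actually missing. The following is a minimal sketch, assuming the 'casuser' libref from the DATA step above is assigned; it uses only standard Base SAS procedures and is not part of the original walkthrough.

```sas
/* Numeric variables: count of non-missing (N) and missing (NMISS) values */
proc means data=casuser.crm_churn_data n nmiss;
    var age income tenure churn;
run;

/* Character variable: the MISSING option reports blank values as their own level */
proc freq data=casuser.crm_churn_data;
    tables last_complaint_category / missing;
run;
```

With the eight rows above, 'income' should show three missing values (CUST002, CUST005, CUST008) and 'last_complaint_category' three (CUST003, CUST005, CUST008).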

Implementation Steps

Load the customer data into CAS.

PROC CASUTIL;
    /* REPLACE overwrites any existing in-memory table with the same name */
    load DATA=casuser.crm_churn_data outcaslib='casuser' casout='crm_churn_data' replace;
RUN;
QUIT;

Run analyzeMissingPatterns to see how missing values in 'income' and 'last_complaint_category' relate to the 'churn' target. 'plan_type' and 'last_complaint_category' are treated as nominal variables.

PROC CAS;
    /* Make the action set available in this session before calling it */
    loadActionSet "dataSciencePilot";
    dataSciencePilot.analyzeMissingPatterns /
        table={name='crm_churn_data', caslib='casuser'},
        inputs={{name='age'}, {name='income'}, {name='tenure'}, {name='plan_type'}, {name='last_complaint_category'}},
        nominals={'plan_type', 'last_complaint_category'},
        target='churn',
        casOut={name='churn_missing_analysis', caslib='casuser', replace=true};
RUN;
QUIT;
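
To inspect what the action wrote, the output table can be listed with the table.columnInfo and table.fetch actions. This is a small sketch, assuming the run above completed and 'churn_missing_analysis' exists in 'casuser'; the row limit of 20 is an arbitrary choice, and no particular column layout is assumed.

```sas
proc cas;
    /* List the columns of the output table */
    table.columnInfo / table={name='churn_missing_analysis', caslib='casuser'};
    /* Print up to the first 20 rows */
    table.fetch / table={name='churn_missing_analysis', caslib='casuser'}, to=20;
quit;
```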

Expected Result

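The output table 'churn_missing_analysis' should contain results like 'TargetCounts' and 'TargetMeans'. We expect to see a higher churn rate (mean of 'churn' closer to 1) for the pattern where both 'income' and 'last_complaint_category' are missing, suggesting that incomplete customer profiles are a risk factor for churn. In this sample, both customers with that pattern (CUST005 and CUST008) churned. With only eight rows, the result is illustrative rather than statistically meaningful.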
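
As a sanity check independent of the action, the same relationship can be approximated by hand with missing-value flags. This is a minimal sketch in Base SAS, assuming the 'casuser' libref is assigned; the table name work.crm_flags and the flag variables income_missing and complaint_missing are hypothetical names introduced here for illustration.

```sas
/* Flag each row: 1 if the value is missing, 0 otherwise */
data work.crm_flags;
    set casuser.crm_churn_data;
    income_missing    = missing(income);
    complaint_missing = missing(last_complaint_category);
run;

/* Churn count and churn rate for each combination of the two flags */
proc means data=work.crm_flags n mean;
    class income_missing complaint_missing;
    var churn;
run;
```

For the eight sample rows, the four customers with complete profiles and the one missing only 'last_complaint_category' have a churn mean of 0, while the one missing only 'income' and the two missing both fields have a churn mean of 1.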

See the technical documentation for analyzeMissingPatterns