dataSciencePilot analyzeMissingPatterns

Standard Case: Customer Churn Prediction - Impact of Missing Demographic Data

Test Scenario & Use Case

Business Context

A telecom company wants to understand why customers are churning. They suspect that missing information in their CRM (like income or age) is not random and might be correlated with churn. This analysis aims to verify if patterns of missing data in customer profiles can help predict the 'churn' target variable.
About the Action Set: dataSciencePilot

Automated Machine Learning (AutoML) and pipeline generation.

Data Preparation

Create a small customer dataset (eight rows) with demographic information and a churn flag. The 'income' and 'last_complaint_category' columns contain missing values.

/* A single period (.) in the data lines marks a missing value. */
DATA casuser.crm_churn_data;
  LENGTH customer_id $ 10. last_complaint_category $ 20.;
  INPUT customer_id $ age income tenure plan_type $ churn last_complaint_category $;
  CARDS;
CUST001 34 55000 24 Premium 0 Technical
CUST002 45 . 60 Basic 1 Billing
CUST003 28 48000 12 Basic 0 .
CUST004 52 120000 120 Premium 0 Technical
CUST005 21 . 6 Basic 1 .
CUST006 65 85000 84 Premium 0 Billing
CUST007 33 62000 30 Basic 0 Technical
CUST008 41 . 48 Premium 1 .
;
RUN;

Implementation Steps

1. Load the customer data into CAS.
PROC CASUTIL;
  load DATA=casuser.crm_churn_data outcaslib='casuser' casout='crm_churn_data' replace;
RUN;
QUIT;
2. Run analyzeMissingPatterns to see how missing 'income' and 'last_complaint_category' values relate to the 'churn' target. 'plan_type' and 'last_complaint_category' are treated as nominal variables.
PROC CAS;
  dataSciencePilot.analyzeMissingPatterns /
    TABLE={name='crm_churn_data', caslib='casuser'},
    inputs={{name='age'}, {name='income'}, {name='tenure'}, {name='plan_type'}, {name='last_complaint_category'}},
    nominals={'plan_type', 'last_complaint_category'},
    target='churn',
    casOut={name='churn_missing_analysis', caslib='casuser', replace=true};
RUN;
QUIT;
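
To inspect what the action wrote, the output table can be printed with the `table.fetch` action. This is a minimal sketch, assuming the step above completed and the table 'churn_missing_analysis' exists in the 'casuser' caslib. The row limit of 20 is an arbitrary choice for illustration.

```sas
PROC CAS;
  /* table.fetch prints rows of an in-memory CAS table */
  table.fetch /
    table={name='churn_missing_analysis', caslib='casuser'},
    to=20;  /* illustrative row limit, not taken from the original example */
RUN;
QUIT;
```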

Expected Result


The output table 'churn_missing_analysis' should contain results like 'TargetCounts' and 'TargetMeans'. We expect a higher churn rate (mean of 'churn' closer to 1) for the pattern where both 'income' and 'last_complaint_category' are missing, which would suggest that incomplete customer profiles are a risk factor for churn. In the sample data, the two customers with both fields missing (CUST005 and CUST008) both churned.
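
As a cross-check that does not depend on the action's output layout, the same question can be approximated with a DATA step and PROC MEANS. This is a sketch under stated assumptions: the 'casuser' libref from the data preparation step is assigned, and the names `work.churn_flags`, `miss_income`, and `miss_complaint` are hypothetical, introduced only for this illustration.

```sas
/* Build one 0/1 indicator per column that has missing values. */
DATA work.churn_flags;                      /* hypothetical table name */
  SET casuser.crm_churn_data;
  /* MISSING() returns 1 for a missing numeric or character value, else 0 */
  miss_income    = MISSING(income);
  miss_complaint = MISSING(last_complaint_category);
RUN;

/* Count customers and average the churn flag for each missingness pattern. */
PROC MEANS DATA=work.churn_flags N MEAN NWAY;
  CLASS miss_income miss_complaint;
  VAR churn;
RUN;
```

Each row of the PROC MEANS output corresponds to one combination of the two indicators, so the churn mean for the group with both indicators equal to 1 can be compared against the other groups. With only eight rows, the group sizes are very small and the comparison is illustrative rather than statistically meaningful.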