Edge Case: Scoring Loan Default Risk with Missing Data
Test Scenario & Use Case
Business Context
A financial institution uses a BART classification model (trained with bartProbit) to assess loan default risk. The scoring process must handle incoming applications that are missing some fields, such as 'credit_history_years'. The goal is to verify that bartScore scores rows with missing predictor values and classifies applicants against a custom probability cutoff.
Create a binary classification training set for loan default, then a scoring set in which 'credit_history_years' is intentionally missing (.) for roughly 20% of applicants.
data mycas.loan_train;
   call streaminit(333);
   do i = 1 to 3000;
      loan_amount = 5000 + rand('UNIFORM') * 45000;
      income = 30000 + rand('UNIFORM') * 120000;
      credit_history_years = 1 + rand('UNIFORM') * 20;
      default_flag = (loan_amount / income > 0.4) or (credit_history_years < 3);
      if rand('UNIFORM') < 0.2 then default_flag = 1 - default_flag; /* add noise */
      output;
   end;
run;

data mycas.loan_applications_score;
   call streaminit(555);
   do application_id = 1 to 100;
      loan_amount = 10000 + rand('UNIFORM') * 50000;
      income = 40000 + rand('UNIFORM') * 100000;
      /* Introduce missing values for 20% of applicants */
      if rand('UNIFORM') < 0.2 then credit_history_years = .;
      else credit_history_years = 1 + rand('UNIFORM') * 15;
      output;
   end;
run;
Score the applications table, which contains missing values, using the classification parameters 'into' and 'intoCutpt' to generate a risk label at a 40% probability threshold.
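The training and scoring steps might look like the following sketch. Only 'into' and 'intoCutpt' are named in this scenario. The other parameter names ('store', 'restore', 'casOut', 'copyVars') are assumptions based on the usual conventions of the BART action set and should be checked against the action reference. The model store name 'loan_bart_store' is hypothetical, and the sketch assumes the 'mycas' libref points at the active caslib.

```sas
proc cas;
   /* Train the probit BART model and save it to a store table.
      The store name loan_bart_store is hypothetical. */
   bart.bartProbit /
      table={name="loan_train"},
      target="default_flag",
      inputs={"loan_amount", "income", "credit_history_years"},
      store={name="loan_bart_store", replace=true};

   /* Score the applications, including the rows where
      credit_history_years is missing. */
   bart.bartScore /
      table={name="loan_applications_score"},
      restore={name="loan_bart_store"},
      casOut={name="loan_risk_scored", replace=true},
      copyVars={"application_id"},   /* carry the key into the output */
      into="Risk_Label",             /* name of the classification column */
      intoCutpt=0.4;                 /* 40% probability threshold */
quit;
```

Copying 'application_id' into the output is what lets each scored row be matched back to its application.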
The action should complete without errors. The output table 'mycas.loan_risk_scored' must contain 100 rows, showing that the action scored records with a missing predictor value rather than dropping them. The table should contain 'application_id', the predicted default probability 'P_default_flag1', and the custom classification column 'Risk_Label', which should be 1 when 'P_default_flag1' is >= 0.4 and 0 otherwise. This demonstrates the action's robustness to incomplete data.
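These expectations can be checked with a short query such as the sketch below. It assumes the scored table is reachable through the 'mycas' libref and that 'Risk_Label' holds the level 0 or 1. CATS is used so the comparison works whether the column is numeric or character. The names 'n_rows' and 'n_mismatch' are illustrative.

```sas
proc sql;
   /* n_rows counts the scored rows; n_mismatch counts rows whose
      Risk_Label disagrees with the 0.4 cutoff applied to P_default_flag1. */
   select count(*) as n_rows,
          sum((P_default_flag1 >= 0.4) ne
              (input(cats(Risk_Label), best32.) = 1)) as n_mismatch
   from mycas.loan_risk_scored;
quit;
```

If the scenario passes, 'n_rows' should be 100 and 'n_mismatch' should be 0.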