dataPreprocess binning

Handling Missing Data and Outliers in Credit Risk Assessment

Scénario de test & Cas d'usage

Business Context

A financial institution is preparing a dataset for a credit scoring model. The input data, containing applicant 'income' and 'credit_score', is known to have missing values. The risk department requires that missing values be treated as a distinct category and that extreme income values do not distort the binning process.
About the Set : dataPreprocess

Data cleaning, imputation, and preprocessing.

Discover all actions of dataPreprocess
Data Preparation

Creation of a credit applicant dataset with deliberate missing values and outliers for 'income' and 'credit_score'.

Copied!
1DATA mycas.credit_applicants;
2 call streaminit(456);
3 DO applicant_id = 1 to 1000;
4 IF rand('UNIFORM') < 0.1 THEN income = .; /* 10% missing income */
5 ELSE income = 25000 + (rand('UNIFORM')**2) * 200000;
6 IF rand('UNIFORM') < 0.05 THEN credit_score = .; /* 5% missing score */
7 ELSE credit_score = 300 + floor(rand('UNIFORM') * 550);
8 OUTPUT;
9 END;
10 /* Add outliers */
11 applicant_id=1001; income=5000000; credit_score=820; OUTPUT;
12 applicant_id=1002; income=10000; credit_score=310; OUTPUT;
13RUN;

Étapes de réalisation

1
Execute bucket binning while explicitly handling missing values. A separate bin (ID 0) should be created for them. Define explicit start and end boundaries for 'income' to cap outliers.
Copied!
1PROC CAS;
2 dataPreprocess.binning /
3 TABLE={name='credit_applicants'},
4 inputs={{name='income'}, {name='credit_score'}},
5 method='BUCKET',
6 nBinsArray=4,
7 binMissing=true,
8 binStarts={30000, .},
9 binEnds={150000, .},
10 outVarsNamePrefix='binned_',
11 casOut={name='applicants_binned', replace=true},
12 casOutBinDetails={name='applicants_bin_details', replace=true};
13RUN;
14QUIT;
2
Check the bin details table to confirm the creation of the special missing value bin and verify the boundaries for the 'income' variable.
Copied!
1PROC CAS;
2 TABLE.fetch /
3 TABLE={name='applicants_bin_details'}
4 where="_Variable_ = 'income'";
5RUN;
6QUIT;
3
Count the observations in the missing bin (Bin 0) for both variables to ensure they match the number of missing values inserted in the data preparation step.
Copied!
1PROC CAS;
2 SIMPLE.freq /
3 TABLE={name='applicants_binned'}
4 inputs={'binned_income', 'binned_credit_score'};
5RUN;
6QUIT;

Expected Result


The output table 'applicants_binned' contains new variables 'binned_income' and 'binned_credit_score'. The frequency analysis will show a count for Bin 0 for both variables, corresponding to the missing values. The 'applicants_bin_details' table will show that for the 'income' variable, the first bin starts at 30000 and the last bin ends at 150000, effectively grouping all high-income outliers into the last bin.