Handling Missing Data and Outliers in Credit Risk Assessment

Business Context

A financial institution is preparing a dataset for a credit scoring model. The input data, containing applicant 'income' and 'credit_score', is known to have missing values. The risk department requires that missing values be treated as a distinct category and that extreme income values do not distort the binning process.

About the Set : dataPreprocess

Data cleaning, imputation, and preprocessing.

Discover all actions of dataPreprocess

Data Preparation

Creation of a credit applicant dataset with deliberate missing values and outliers for 'income' and 'credit_score'.

Copied!

1	DATA mycas.credit_applicants;
2	call streaminit(456);
3	DO applicant_id = 1 to 1000;
4	IF rand('UNIFORM') < 0.1 THEN income = .; /* 10% missing income */
5	ELSE income = 25000 + (rand('UNIFORM')*2) 200000;
6	IF rand('UNIFORM') < 0.05 THEN credit_score = .; /* 5% missing score */
7	ELSE credit_score = 300 + floor(rand('UNIFORM') * 550);
8	OUTPUT;
9	END;
10	/* Add outliers */
11	applicant_id=1001; income=5000000; credit_score=820; OUTPUT;
12	applicant_id=1002; income=10000; credit_score=310; OUTPUT;
13	RUN;

Étapes de réalisation

Execute bucket binning while explicitly handling missing values. A separate bin (ID 0) should be created for them. Define explicit start and end boundaries for 'income' to cap outliers.

Copied!

1	PROC CAS;
2	dataPreprocess.binning /
3	TABLE={name='credit_applicants'},
4	inputs={{name='income'}, {name='credit_score'}},
5	method='BUCKET',
6	nBinsArray=4,
7	binMissing=true,
8	binStarts={30000, .},
9	binEnds={150000, .},
10	outVarsNamePrefix='binned_',
11	casOut={name='applicants_binned', replace=true},
12	casOutBinDetails={name='applicants_bin_details', replace=true};
13	RUN;
14	QUIT;

Check the bin details table to confirm the creation of the special missing value bin and verify the boundaries for the 'income' variable.

Copied!

1	PROC CAS;
2	TABLE.fetch /
3	TABLE={name='applicants_bin_details'}
4	where="_Variable_ = 'income'";
5	RUN;
6	QUIT;

Count the observations in the missing bin (Bin 0) for both variables to ensure they match the number of missing values inserted in the data preparation step.

Copied!

1	PROC CAS;
2	SIMPLE.freq /
3	TABLE={name='applicants_binned'}
4	inputs={'binned_income', 'binned_credit_score'};
5	RUN;
6	QUIT;

Expected Result

The output table 'applicants_binned' contains new variables 'binned_income' and 'binned_credit_score'. The frequency analysis will show a count for Bin 0 for both variables, corresponding to the missing values. The 'applicants_bin_details' table will show that for the 'income' variable, the first bin starts at 30000 and the last bin ends at 150000, effectively grouping all high-income outliers into the last bin.

Voir la documentation technique de binning