Scénario de test & Cas d'usage
Data cleaning, imputation, and preprocessing.
Discover all actions of dataPreprocessCreation of a credit applicant dataset with deliberate missing values and outliers for 'income' and 'credit_score'.
| 1 | DATA mycas.credit_applicants; |
| 2 | call streaminit(456); |
| 3 | DO applicant_id = 1 to 1000; |
| 4 | IF rand('UNIFORM') < 0.1 THEN income = .; /* 10% missing income */ |
| 5 | ELSE income = 25000 + (rand('UNIFORM')**2) * 200000; |
| 6 | IF rand('UNIFORM') < 0.05 THEN credit_score = .; /* 5% missing score */ |
| 7 | ELSE credit_score = 300 + floor(rand('UNIFORM') * 550); |
| 8 | OUTPUT; |
| 9 | END; |
| 10 | /* Add outliers */ |
| 11 | applicant_id=1001; income=5000000; credit_score=820; OUTPUT; |
| 12 | applicant_id=1002; income=10000; credit_score=310; OUTPUT; |
| 13 | RUN; |
| 1 | PROC CAS; |
| 2 | dataPreprocess.binning / |
| 3 | TABLE={name='credit_applicants'}, |
| 4 | inputs={{name='income'}, {name='credit_score'}}, |
| 5 | method='BUCKET', |
| 6 | nBinsArray=4, |
| 7 | binMissing=true, |
| 8 | binStarts={30000, .}, |
| 9 | binEnds={150000, .}, |
| 10 | outVarsNamePrefix='binned_', |
| 11 | casOut={name='applicants_binned', replace=true}, |
| 12 | casOutBinDetails={name='applicants_bin_details', replace=true}; |
| 13 | RUN; |
| 14 | QUIT; |
| 1 | PROC CAS; |
| 2 | TABLE.fetch / |
| 3 | TABLE={name='applicants_bin_details'} |
| 4 | where="_Variable_ = 'income'"; |
| 5 | RUN; |
| 6 | QUIT; |
| 1 | PROC CAS; |
| 2 | SIMPLE.freq / |
| 3 | TABLE={name='applicants_binned'} |
| 4 | inputs={'binned_income', 'binned_credit_score'}; |
| 5 | RUN; |
| 6 | QUIT; |
The output table 'applicants_binned' contains new variables 'binned_income' and 'binned_credit_score'. The frequency analysis will show a count for Bin 0 for both variables, corresponding to the missing values. The 'applicants_bin_details' table will show that for the 'income' variable, the first bin starts at 30000 and the last bin ends at 150000, effectively grouping all high-income outliers into the last bin.