dataPreprocess

binning

Description

The `binning` action in the `dataPreprocess` action set is a powerful tool for unsupervised variable discretization. It groups continuous numerical variables into a smaller number of 'bins'. This is a common data preparation step for many machine learning algorithms, as it can help manage outliers, reduce noise, and handle non-linear relationships. The action supports several methods for creating bins, such as equal-width (bucket), equal-frequency (quantile), or user-defined cutpoints.

dataPreprocess.binning { binEnds={double-1, double-2, ...}, binMapping="LEFT"|"RIGHT", binMissing=TRUE|FALSE, binStarts={double-1, double-2, ...}, binWidths={double-1, double-2, ...}, casOut={...}, casOutBinDetails={...}, code={...}, copyAllVars=TRUE|FALSE, copyVars={"variable-name-1", "variable-name-2", ...}, cutPoints={double-1, double-2, ...}, freq="variable-name", fuzzyCompare=double, includeInputVars=TRUE|FALSE, includeMissingGroup=TRUE|FALSE, inputs={{...}, {...}}, method="BUCKET"|"CUTPTS"|"QUANTILE", nBinsArray={integer-1, integer-2, ...}|integer, noDataLowerUpperBound=TRUE|FALSE, outputTableOptions={...}, outVarsNamePrefix="string", outVarsNameSuffix="string", percentileDefinition=integer, percentileMaxIterations=integer, percentileTolerance=double, sasVarNameLength=TRUE|FALSE, table={...}, weight="variable-name" };
Settings
ParameterDescription
binEnds Specifies the bin end values. If applicable, they override the data maximum values.
binMapping Controls how to map values that fall at the boundary between consecutive bins. LEFT enables you to express the bins with [], (], ..., (] notation. RIGHT enables [), [), ..., [] notation.
binMissing When set to True, bins missing values into a separate bin. The ID for this bin is 0.
binStarts Specifies the bin start values. If applicable, they override the data minimum values.
binWidths Specifies the bin width.
casOut Specifies the output table to store the scored data.
casOutBinDetails Specifies the output table to store details about the created bins.
code Specifies settings for generating SAS DATA step scoring code.
copyAllVars When set to True, all variables from the input table are copied to the output table.
copyVars Specifies a list of variables to copy from the input table to the output table.
cutPoints Specifies the user-provided cutpoints for the 'CUTPTS' binning method.
freq Specifies the frequency variable for the analysis.
fuzzyCompare Specifies the fuzzy comparison threshold used to determine the distinctness of numeric values.
includeInputVars When set to True, the original analysis variables are included in the output table.
includeMissingGroup When set to True, missing values are allowed as group-by keys.
inputs Specifies the numerical variables to be binned.
method Specifies the binning technique to use: BUCKET (equal-width), QUANTILE (equal-frequency), or CUTPTS (user-defined).
nBinsArray Specifies the number of bins to create for each variable.
noDataLowerUpperBound When set to True, the global lower and upper bounds of the bin set are unlimited in the generated score code.
outputTableOptions Specifies options for the result tables, such as which tables to return.
outVarsNamePrefix Specifies a prefix to apply to the names of the generated binned variables.
outVarsNameSuffix Specifies a suffix to apply to the names of the generated binned variables.
percentileDefinition Specifies the percentile definition to use for the QUANTILE method (from 1 to 6).
percentileMaxIterations Specifies the maximum number of iterations for percentile computation.
percentileTolerance Specifies the tolerance for percentile computation.
sasVarNameLength When set to True, constrains the output variable names to a maximum length of 32 characters.
table Specifies the input CAS table containing the data to be processed.
weight Specifies the weight variable for the analysis.
Data Preparation View data prep sheet
Data Creation

This example creates a sample dataset `sample_data` in the active caslib. The table contains customer information, including age and income, which will be used in the binning examples.

Copied!
1DATA mycas.sample_data;
2 DO i = 1 to 100;
3 age = 20 + floor(rand('UNIFORM') * 50);
4 income = 30000 + floor(rand('UNIFORM') * 70000);
5 OUTPUT;
6 END;
7RUN;

Examples

This example performs a simple bucket (equal-width) binning on the `age` variable, dividing it into 5 bins. The results are stored in a new table named `binned_age`.

SAS® / CAS Code Code awaiting community validation
Copied!
1PROC CAS;
2 dataPreprocess.binning
3 TABLE={name='sample_data'},
4 inputs={{name='age'}},
5 method='BUCKET',
6 nBinsArray=5,
7 casOut={name='binned_age', replace=true};
8RUN;
Result :
The action generates an output table `binned_age` in the active caslib. This table contains the original variables plus a new variable, `bin_age`, which holds the bin number (from 1 to 5) for each observation's age.

This example demonstrates quantile (equal-frequency) binning on both `age` and `income`. It creates 4 bins for `age` and 10 for `income`. It also generates two output tables: `binned_customers` containing the scored data, and `bin_details` containing the metadata about the bins (like lower/upper bounds for each bin). The original input variables are also copied to the output table.

SAS® / CAS Code Code awaiting community validation
Copied!
1PROC CAS;
2 dataPreprocess.binning
3 TABLE={name='sample_data'},
4 inputs={{name='age'}, {name='income'}},
5 method='QUANTILE',
6 nBinsArray={4, 10},
7 includeInputVars=true,
8 casOut={name='binned_customers', replace=true},
9 casOutBinDetails={name='bin_details', replace=true};
10RUN;
Result :
Two tables are created in the active caslib. `binned_customers` contains the original `age` and `income` columns, plus two new columns: `bin_age` (with values from 1-4) and `bin_income` (with values from 1-10). The `bin_details` table contains detailed information about the cutoffs and statistics for each bin created for both variables.

This example uses the `CUTPTS` method to create custom bins for the `income` variable based on specific financial thresholds. It also demonstrates how to customize the output variable name using a prefix and suffix.

SAS® / CAS Code Code awaiting community validation
Copied!
1PROC CAS;
2 dataPreprocess.binning
3 TABLE={name='sample_data'},
4 inputs={{name='income'}},
5 method='CUTPTS',
6 cutPoints={50000, 75000, 90000},
7 outVarsNamePrefix='custom',
8 outVarsNameSuffix='group',
9 casOut={name='income_groups', replace=true};
10RUN;
Result :
An output table `income_groups` is created. It includes a new variable named `custom_income_group`. This variable will have values 1 for income <= 50000, 2 for income between 50000 and 75000, 3 for income between 75000 and 90000, and 4 for income > 90000.

Associated Scenarios

Use Case
Customer Segmentation for a Targeted Marketing Campaign

A retail company wants to segment its customer base to launch a new loyalty program. The marketing team needs to group customers into five distinct, equally-sized groups based o...

Use Case
Handling Missing Data and Outliers in Credit Risk Assessment

A financial institution is preparing a dataset for a credit scoring model. The input data, containing applicant 'income' and 'credit_score', is known to have missing values. The...

Use Case
Patient Grouping Based on Pre-defined Clinical Biomarker Thresholds

In a clinical trial analysis, researchers need to categorize patients into risk groups based on specific, medically-defined cut-off points for a 'biomarker_level'. These thresho...