dataPreprocess

catTrans

Description

The catTrans action groups and encodes categorical variables using unsupervised and supervised techniques. It is useful for feature engineering, for reducing cardinality, and for preparing data for modeling by replacing raw category levels with representations such as Weight of Evidence (WOE) values or grouped bins.

dataPreprocess.catTrans {
    arguments={<catTransArguments>},
    casOut={<casouttable>},
    casOutBinDetails={<casouttable>},
    casOutLevelBinMap={<casouttable>},
    code={<code>},
    copyAllVars=TRUE | FALSE,
    copyVars={"variable-name-1" <, "variable-name-2", ...>},
    distinctCountLimit=integer,
    evaluationStats=TRUE | FALSE,
    events={"string-1" <, "string-2", ...>},
    freq="variable-name",
    fuzzyCompare=double,
    includeInputVars=TRUE | FALSE,
    includeMissingGroup=TRUE | FALSE,
    inputs={{<casinvardesc-1>} <, {<casinvardesc-2>}, ...>},
    inputsInheritFormats=TRUE | FALSE,
    method="DTREE" | "GROUPRARE" | "ONEHOT" | "RTREE" | "WOE",
    outputTableOptions={<outputTableOptions>},
    outVarsNamePrefix="string",
    outVarsNameSuffix="string",
    sasVarNameLength=TRUE | FALSE,
    table={<castable>},
    targets={{<casinvardesc-1>} <, {<casinvardesc-2>}, ...>},
    targetsInheritFormats=TRUE | FALSE,
    weight="variable-name"
}

Settings
| Parameter | Description |
| --- | --- |
| arguments | Specifies the list of arguments to use for the transformation. |
| casOut | Specifies the output table to store the scored data. |
| casOutBinDetails | Specifies the output table for binning results information. |
| casOutLevelBinMap | Specifies the output table for nominal level to bin mapping information. |
| code | Specifies settings for generating SAS DATA step scoring code. |
| copyAllVars | When set to TRUE, copies all variables from the input table to the output table. |
| copyVars | Specifies a list of variables to copy from the input table to the output table. |
| distinctCountLimit | Specifies the limit for counting distinct values. |
| evaluationStats | When set to TRUE, computes a default set of evaluation statistics for the transformed variables. |
| events | Specifies a list of event values for the target variables, used in supervised methods. |
| freq | Specifies the frequency variable. |
| fuzzyCompare | Specifies the fuzzy comparison threshold for determining distinctness of numeric values. |
| includeInputVars | When set to TRUE, includes the original input variables in the output table. |
| includeMissingGroup | When set to TRUE, allows missing values as group-by keys. |
| inputs | Specifies the categorical variables to be transformed. |
| inputsInheritFormats | Specifies that the variables inherit formats from the source table. |
| method | Specifies the grouping and encoding technique. Options include GROUPRARE (unsupervised), ONEHOT, and supervised methods like DTREE, RTREE, and WOE. |
| outputTableOptions | Specifies options for the output result tables. |
| outVarsNamePrefix | Specifies a prefix for the names of the generated output variables. |
| outVarsNameSuffix | Specifies a suffix for the names of the generated output variables. |
| sasVarNameLength | When set to TRUE, constrains the output variable names to a maximum length of 32 characters. |
| table | Specifies the input CAS table containing the data to be processed. |
| targets | Specifies the target variable(s) for supervised grouping methods like WOE or DTREE. |
| targetsInheritFormats | Specifies that the target variables inherit formats from the source table. |
| weight | Specifies the weight variable for the analysis. |

Data Preparation
Data Creation

This example creates a sample dataset `customer_data` with a categorical variable `product_category` and a binary target `purchase`. This data will be used to demonstrate how to transform the categorical variable.

/* Build 1,000 rows with a random binary target and one of ten categories. */
/* Assumes a libref named casuser is already assigned to the CASUSER caslib. */
DATA casuser.customer_data;
    LENGTH product_category $ 20;
    DO i = 1 TO 1000;
        /* Binary target: 1 with probability 0.5, otherwise 0 */
        IF rand('UNIFORM') < 0.5 THEN purchase = 1; ELSE purchase = 0;
        /* Pick one of ten categories with equal probability */
        SELECT(rand('INTEGER', 1, 10));
            WHEN(1) product_category = 'Electronics';
            WHEN(2) product_category = 'Clothing';
            WHEN(3) product_category = 'Home Goods';
            WHEN(4) product_category = 'Books';
            WHEN(5) product_category = 'Sports';
            WHEN(6) product_category = 'Toys';
            WHEN(7) product_category = 'Groceries';
            WHEN(8) product_category = 'Automotive';
            WHEN(9) product_category = 'Music';
            OTHERWISE product_category = 'Other';
        END;
        OUTPUT;
    END;
RUN;

Examples

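This example uses the default `GROUPRARE` method to group infrequent product categories. Categories that represent less than 5% of the total observations will be grouped into a single 'rare' category. Note that the sample data above draws the ten categories with equal probability, so each holds roughly 10% of the rows and none is expected to fall below the 5% threshold; raise the threshold or skew the data to see levels actually being consolidated.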

PROC CAS;
    /* Group levels of product_category that fall below the rare threshold */
    dataPreprocess.catTrans /
        table={name='customer_data', caslib='casuser'},
        method='GROUPRARE',
        inputs={{name='product_category'}},
        casOut={name='customer_data_grouped', caslib='casuser', replace=true},
        outVarsNamePrefix='grouped',
        arguments={rareThresholdPercent=5};
RUN;
QUIT;
Result:
An output table `customer_data_grouped` is created in the `casuser` caslib. It contains the original variables plus a new variable, `grouped_product_category`, where rare categories are consolidated.
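
To make the idea of rare-level grouping concrete, here is a minimal sketch in plain Python. It is a conceptual illustration only, not the action's implementation: the function name `group_rare_levels`, the label `_RARE_`, and the sample counts are all hypothetical, and the exact rule that catTrans applies at the threshold boundary is not stated on this page.

```python
from collections import Counter


def group_rare_levels(values, threshold_percent=5.0, rare_label="_RARE_"):
    """Replace every level whose share of rows is below threshold_percent.

    Hypothetical helper for illustration; the strict '<' comparison is an
    assumption, not documented behavior of the catTrans action.
    """
    counts = Counter(values)
    total = len(values)
    rare = {
        level
        for level, n in counts.items()
        if 100.0 * n / total < threshold_percent
    }
    return [rare_label if v in rare else v for v in values]


# Made-up, deliberately skewed sample: two levels hold 3% of rows combined.
sample = ["Books"] * 48 + ["Toys"] * 49 + ["Music"] * 2 + ["Other"] * 1
grouped = group_rare_levels(sample)
print(Counter(grouped))
```

In this made-up sample, `Music` (2%) and `Other` (1%) fall below the 5% threshold and are merged into one level, while `Books` and `Toys` are left unchanged.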

This example demonstrates a supervised transformation using the `WOE` method. It groups the `product_category` variable based on its relationship with the binary target `purchase`. The action calculates the Weight of Evidence for each category group. The resulting table will include the WOE-transformed variable, which can be directly used in predictive models like logistic regression.

PROC CAS;
    /* Supervised grouping of product_category against the binary target */
    dataPreprocess.catTrans /
        table={name='customer_data', caslib='casuser'},
        method='WOE',
        inputs={{name='product_category'}},
        targets={{name='purchase'}},
        events={'1'},                      /* purchase = 1 is the event level */
        evaluationStats=true,
        casOut={name='customer_data_woe', caslib='casuser', replace=true},
        casOutBinDetails={name='bin_details_woe', caslib='casuser', replace=true},
        outVarsNamePrefix='woe',
        arguments={overrides={woeDefinition='EVENT', minPerNObsInBin=2}};
RUN;
QUIT;
Result:
Two tables are created. `customer_data_woe` contains the original data plus the new `woe_product_category` variable, and `bin_details_woe` contains detailed statistics for each bin, including WOE, Information Value (IV), and event/non-event counts.
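
For readers unfamiliar with WOE and IV, the following sketch shows the arithmetic behind those statistics in plain Python. It rests on several assumptions: it uses one common textbook convention, WOE = ln(share of events / share of non-events), whereas the sign convention selected by `woeDefinition='EVENT'` is not spelled out on this page; it computes one value per raw level, while the action first groups levels into bins; and the function name `woe_table` and the toy data are hypothetical.

```python
import math


def woe_table(levels, targets, event=1):
    """Per-level event counts, WOE, and IV contribution.

    Assumed convention: WOE = ln(P(level | event) / P(level | non-event)).
    Assumes every level has at least one event and one non-event;
    otherwise the logarithm is undefined and smoothing would be needed.
    """
    counts = {}
    for level, y in zip(levels, targets):
        ev, non = counts.get(level, (0, 0))
        counts[level] = (ev + 1, non) if y == event else (ev, non + 1)

    total_ev = sum(ev for ev, _ in counts.values())
    total_non = sum(non for _, non in counts.values())

    table = {}
    for level, (ev, non) in counts.items():
        p_ev = ev / total_ev      # share of all events in this level
        p_non = non / total_non   # share of all non-events in this level
        woe = math.log(p_ev / p_non)
        table[level] = {
            "events": ev,
            "nonevents": non,
            "woe": woe,
            "iv": (p_ev - p_non) * woe,
        }
    return table


# Toy data: level A has 3 events and 1 non-event, level B the reverse.
levels = ["A", "A", "A", "A", "B", "B", "B", "B"]
targets = [1, 1, 1, 0, 1, 0, 0, 0]
table = woe_table(levels, targets)
total_iv = sum(row["iv"] for row in table.values())
print(table, total_iv)
```

Under this convention a positive WOE marks a level where events are over-represented, and the total IV is the sum of the per-level contributions.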

FAQ

How does the action handle missing values?
What is the primary purpose of the catTrans action?
What are the different methods available for categorical variable transformation in the catTrans action?
How can I handle rare levels of a categorical variable using this action?
What is one-hot encoding and how is it implemented in the catTrans action?
Which grouping techniques are considered supervised and what do they require?
How does the 'WOE' method work and what are its key options?
Can I generate SAS DATA step code for scoring from this action?