The `catTrans` action, part of the `dataPreprocess` action set, groups and encodes categorical variables using unsupervised and supervised techniques. It supports feature engineering by reducing cardinality and by transforming categorical variables into representations better suited to modeling, such as grouped bins or Weight of Evidence (WOE) values.

| Parameter | Description |
|---|---|
| arguments | Specifies the list of arguments to use for the transformation. |
| casOut | Specifies the output table to store the scored data. |
| casOutBinDetails | Specifies the output table for binning results information. |
| casOutLevelBinMap | Specifies the output table for nominal level to bin mapping information. |
| code | Specifies settings for generating SAS DATA step scoring code. |
| copyAllVars | When set to True, copies all variables from the input table to the output table. |
| copyVars | Specifies a list of variables to copy from the input table to the output table. |
| distinctCountLimit | Specifies the limit for counting distinct values. |
| evaluationStats | When set to True, computes a default set of evaluation statistics for the transformed variables. |
| events | Specifies a list of event values for the target variables, used in supervised methods. |
| freq | Specifies the frequency variable. |
| fuzzyCompare | Specifies the fuzzy comparison threshold for determining distinctness of numeric values. |
| includeInputVars | When set to True, includes the original input variables in the output table. |
| includeMissingGroup | When set to True, allows missing values as group-by keys. |
| inputs | Specifies the categorical variables to be transformed. |
| inputsInheritFormats | Specifies that the variables inherit formats from the source table. |
| method | Specifies the grouping and encoding technique. Options include GROUPRARE (unsupervised), ONEHOT, and supervised methods like DTREE, RTREE, and WOE. |
| outputTableOptions | Specifies options for the output result tables. |
| outVarsNamePrefix | Specifies a prefix for the names of the generated output variables. |
| outVarsNameSuffix | Specifies a suffix for the names of the generated output variables. |
| sasVarNameLength | When set to True, constrains the output variable names to a maximum length of 32 characters. |
| table | Specifies the input CAS table containing the data to be processed. |
| targets | Specifies the target variable(s) for supervised grouping methods like WOE or DTREE. |
| targetsInheritFormats | Specifies that the target variables inherit formats from the source table. |
| weight | Specifies the weight variable for the analysis. |

This example creates a sample data set, `customer_data`, with a categorical variable `product_category` and a binary target `purchase`. The examples that follow use this data to demonstrate how to transform the categorical variable.

    /* Assumes a libref named casuser is already assigned to the CAS engine. */
    data casuser.customer_data;
        length product_category $ 20;
        do i = 1 to 1000;
            /* Binary target: 1 with probability 0.5, independent of the category. */
            if rand('UNIFORM') < 0.5 then purchase = 1; else purchase = 0;
            /* Draw one of ten categories with equal probability. */
            select(rand('INTEGER', 1, 10));
                when(1) product_category = 'Electronics';
                when(2) product_category = 'Clothing';
                when(3) product_category = 'Home Goods';
                when(4) product_category = 'Books';
                when(5) product_category = 'Sports';
                when(6) product_category = 'Toys';
                when(7) product_category = 'Groceries';
                when(8) product_category = 'Automotive';
                when(9) product_category = 'Music';
                otherwise product_category = 'Other';
            end;
            output;
        end;
    run;

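
Before transforming the variable, it can help to see how the rows are spread across categories and how the target varies within each one. The following is a minimal sketch using `PROC FREQ`; it assumes the `casuser` libref used in the step above is assigned and is not part of the `catTrans` action itself.

```sas
/* Inspect the generated data before applying any transformation. */
proc freq data=casuser.customer_data;
    /* One-way table: count and percent of rows per category. */
    tables product_category / nocum;
    /* Two-way table: row percentages give the purchase rate per category. */
    tables product_category * purchase / nocol nopercent;
run;
```

The one-way percentages are the figures to compare against a rare-level threshold, and the row percentages in the two-way table indicate how much signal a supervised method has to work with.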

This example uses the default `GROUPRARE` method to group infrequent product categories. Categories that account for less than 5% of the total observations are grouped into a single "rare" category. Because the sample data draws its ten categories with equal probability, each holds roughly 10% of the rows, so few or no levels are likely to fall below this threshold; raise the threshold or skew the category distribution to see the grouping take effect.

    proc cas;
        dataPreprocess.catTrans /
            table={name='customer_data', caslib='casuser'},
            method='GROUPRARE',
            inputs={{name='product_category'}},
            casOut={name='customer_data_grouped', caslib='casuser', replace=true},
            outVarsNamePrefix='grouped',
            arguments={rareThresholdPercent=5};
    run;
    quit;

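
To make the thresholding rule concrete, the same idea can be sketched by hand with `PROC SQL`. This is an illustration of the rule as described above, not the action's implementation: the macro variable `rare_pct`, the `work` table names, the output column `product_category_grp`, and the `_RARE_` label are all hypothetical names chosen for this sketch.

```sas
/* Hypothetical threshold, mirroring the 5% used in the text. */
%let rare_pct = 5;

proc sql;
    /* Share of rows held by each category. */
    create table work.category_share as
    select product_category,
           count(*) as n_obs,
           100 * calculated n_obs /
               (select count(*) from casuser.customer_data) as pct_obs
    from casuser.customer_data
    group by product_category;

    /* Relabel every category below the threshold as one combined level. */
    create table work.customer_data_manual as
    select d.*,
           case when s.pct_obs < &rare_pct then '_RARE_'
                else d.product_category
           end as product_category_grp length=20
    from casuser.customer_data as d
         inner join work.category_share as s
         on d.product_category = s.product_category;
quit;
```

Comparing `work.category_share` with the action's output is a quick way to check which levels were expected to be merged.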

This example demonstrates a supervised transformation using the `WOE` method. It groups the levels of `product_category` based on their relationship with the binary target `purchase`, with `1` specified as the event value, and calculates the Weight of Evidence for each group. The output table includes the WOE-transformed variable, which can be used directly in predictive models such as logistic regression. Because `purchase` is generated independently of `product_category` in the sample data, expect WOE values close to zero.

    proc cas;
        dataPreprocess.catTrans /
            table={name='customer_data', caslib='casuser'},
            method='WOE',
            inputs={{name='product_category'}},
            targets={{name='purchase'}},
            events={'1'},
            evaluationStats=true,
            casOut={name='customer_data_woe', caslib='casuser', replace=true},
            casOutBinDetails={name='bin_details_woe', caslib='casuser', replace=true},
            outVarsNamePrefix='woe',
            arguments={overrides={woeDefinition='EVENT', minPerNObsInBin=2}};
    run;
    quit;

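
For reference, one common textbook definition of Weight of Evidence for a group $g$ is shown below. It is given here as an assumption for orientation only: sign conventions differ between sources (some place non-events in the numerator), implementations often adjust for zero counts, and the exact formula the action applies for a given `woeDefinition` setting should be checked against its documentation.

```latex
$$
\mathrm{WOE}_g \;=\; \ln\!\left(
  \frac{n^{\mathrm{event}}_g \,/\, N^{\mathrm{event}}}
       {n^{\mathrm{nonevent}}_g \,/\, N^{\mathrm{nonevent}}}
\right)
$$
```

Here $n^{\mathrm{event}}_g$ and $n^{\mathrm{nonevent}}_g$ are the counts of event and non-event rows in group $g$, and $N^{\mathrm{event}}$ and $N^{\mathrm{nonevent}}$ are the corresponding totals over all groups. Under this convention, a group whose event share equals its non-event share has a WOE of zero, which is why data with no relationship between input and target yields values near zero.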