Test Scenario & Use Cases
Data cleaning, imputation, and preprocessing.
Create a dataset with a 'referrer' variable that follows a Zipf-like distribution (a few very common values, many very rare ones).
    DATA casuser.web_logs;
      LENGTH referrer $20;
      DO i = 1 TO 10000;
        rand_val = rand('Integer', 1, 100);
        /* Skewed distribution: Site_1 to Site_5 share about 80% of the rows, */
        /* while Site_6 to Site_500 share the remaining 20% */
        IF rand_val <= 80 THEN referrer = cats('Site_', rand('Integer', 1, 5));
        ELSE referrer = cats('Site_', rand('Integer', 6, 500)); /* Generates many distinct rare sites */
        OUTPUT;
      END;
    RUN;
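Before running the transformation, it can help to confirm that the generated data is as skewed as intended. The following is a minimal sketch, assuming the casuser libref is already assigned to the CAS session (the DATA step above needs it as well). PROC FREQ is used here only as a convenient check and is not part of the original scenario.

```sas
/* Check (not part of the original scenario): show the 10 most frequent referrer values. */
/* Assumes the casuser libref used by the DATA step is assigned.                        */
PROC FREQ DATA=casuser.web_logs ORDER=FREQ;
  TABLES referrer / MAXLEVELS=10 NOCUM;
RUN;
```

The Percent column of this listing can be compared with the rareThresholdPercent value passed to catTrans to anticipate which levels are likely to be treated as rare.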
Then group the rare levels of 'referrer' with the GROUPRARE method of dataPreprocess.catTrans, using a 5% threshold:
    PROC CAS;
      dataPreprocess.catTrans /
        table={name='web_logs', caslib='casuser'},
        method='GROUPRARE',
        inputs={{name='referrer'}},
        casOut={name='web_logs_reduced', caslib='casuser', replace=true},
        arguments={rareThresholdPercent=5};
    RUN;
    QUIT;
The output table 'web_logs_reduced' contains a transformed 'referrer' variable. The most frequent sites (Site_1 to Site_5) remain distinct, while the hundreds of infrequent sites (Site_6 and above) are consolidated into a single group (likely ID 0 or a specific label, depending on the format). This greatly reduces the cardinality of the variable for downstream modeling.
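Because the name of the transformed column and the coding of the rare group are not stated here, one way to confirm the result is to inspect the output table directly. The sketch below assumes the same CAS session and the table and caslib names used above. It is an illustrative check, not part of the original scenario.

```sas
/* Inspection sketch (not part of the original scenario). */
PROC CAS;
  /* List the columns of the output table to locate the transformed variable */
  table.columnInfo / table={name='web_logs_reduced', caslib='casuser'};
  /* Preview the first 10 rows to see how the levels are coded */
  table.fetch / table={name='web_logs_reduced', caslib='casuser'}, to=10;
RUN;
QUIT;
```

Once the transformed column is identified, a frequency count on it should show a handful of levels rather than several hundred.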