dataPreprocess catTrans

Web Analytics: Reducing High Cardinality with GroupRare

Test Scenario & Use Case

Business Context

An e-commerce platform is analyzing web server logs to understand its traffic sources. The referrer variable ('Referrer_URL' in the raw logs, 'referrer' in the example below) has extremely high cardinality, with thousands of unique referring sites. To make this variable usable in a clustering model, the Data Science team wants to keep only the most frequent referrers, using a 5% rarity threshold, and group all remaining infrequent 'long tail' sites into a single category labeled 'Other_LongTail'.

About the Set: dataPreprocess

Data cleaning, imputation, and preprocessing.

Data Preparation

Create a dataset with a 'referrer' variable that follows a skewed, long-tailed distribution (a few very common values and many rare ones).

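/* Assumes an active CAS session and a CAS libref named casuser. */
DATA casuser.web_logs;
    LENGTH referrer $20;
    DO i = 1 TO 10000;
        rand_val = rand('Integer', 1, 100);
        /* Skewed distribution: Site_1 to Site_5 share about half of the rows
           (roughly 10% each), so each one clears a 5% rarity threshold.
           Site_6 to Site_500 share the other half and are individually rare. */
        IF rand_val <= 50 THEN referrer = cats('Site_', rand('Integer', 1, 5));
        ELSE referrer = cats('Site_', rand('Integer', 6, 500)); /* Generates many unique rare sites */
        OUTPUT;
    END;
RUN;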
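
Before transforming the variable, it can help to confirm how many distinct levels it has and how the rows are spread across them. The following is a minimal sketch, assuming the casuser libref used in the DATA step is assigned; it relies only on standard PROC FREQ options.

```sas
/* Assumption: casuser is an assigned CAS libref and web_logs exists in it. */
proc freq data=casuser.web_logs nlevels order=freq;
    /* NLEVELS reports the number of distinct referrer values.          */
    /* ORDER=FREQ sorts levels by descending frequency, and MAXLEVELS=  */
    /* limits the displayed table to the first 10 levels.               */
    tables referrer / maxlevels=10;
run;
```

The Percent column of the resulting table shows which levels sit above or below the rarity threshold chosen for the next step.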

Implementation Steps

1. Apply the GroupRare method to consolidate rare levels
PROC CAS;
    dataPreprocess.catTrans /
        table={name='web_logs', caslib='casuser'},
        method='GROUPRARE',
        inputs={{name='referrer'}},
        casOut={name='web_logs_reduced', caslib='casuser', replace=true},
        arguments={rareThresholdPercent=5};
RUN;
QUIT;
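
To see what the action actually wrote, the output table can be inspected without assuming the name of the transformed column. This is a small sketch using the general-purpose table.columnInfo and table.fetch actions; the row limit of 10 is an arbitrary choice for illustration.

```sas
proc cas;
    /* List the columns of the output table; the name of the */
    /* transformed column is deliberately not assumed here.  */
    table.columnInfo /
        table={name='web_logs_reduced', caslib='casuser'};

    /* Preview the first 10 rows of the output table. */
    table.fetch /
        table={name='web_logs_reduced', caslib='casuser'},
        to=10;
quit;
```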

Expected Result


The output table 'web_logs_reduced' contains a transformed version of the 'referrer' variable. The most frequent sites (Site_1 to Site_5) remain distinct, while the hundreds of infrequent sites (Site_6 and above) are consolidated into a single group. That group is likely identified by a reserved ID such as 0 or by a specific label, depending on the output format. The result is a variable with far fewer levels for downstream modeling.
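
If a readable label such as 'Other_LongTail' is required, as in the business context, one option is to build the grouping by hand. The sketch below is not what catTrans does internally; it is a manual equivalent under stated assumptions. The table names work.referrer_share and work.web_logs_labeled and the column referrer_grouped are hypothetical, and treating a level at exactly 5% as frequent is an arbitrary choice.

```sas
/* Assumption: casuser is an assigned CAS libref and web_logs exists in it. */

/* Step 1: compute the share of rows held by each referrer.          */
/* The OUT= data set contains the variables referrer, COUNT, PERCENT. */
proc freq data=casuser.web_logs noprint;
    tables referrer / out=work.referrer_share;
run;

/* Step 2: keep frequent referrers as they are and relabel the rest. */
proc sql;
    create table work.web_logs_labeled as
    select a.referrer,
           case
               when b.percent >= 5 then a.referrer  /* frequent level: keep */
               else 'Other_LongTail'                /* rare level: group    */
           end as referrer_grouped length=20
    from casuser.web_logs as a
         inner join work.referrer_share as b
         on a.referrer = b.referrer;
quit;
```

The 5 in the CASE expression mirrors the rareThresholdPercent value used above, so the two approaches can be compared level by level.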