Web Analytics: Reducing High Cardinality with GroupRare

Business Context

An e-commerce platform is analyzing web server logs to understand traffic sources. The 'Referrer_URL' variable has extremely high cardinality (thousands of unique referring sites). To make this variable usable in a clustering model, the Data Science team wants to keep only the top 5% most frequent referrers and group all other 'long tail' infrequent sites into a single category labeled 'Other_LongTail'.

About the Action Set: dataPreprocess

Data cleaning, imputation, and preprocessing.

Data Preparation

Create a dataset with a 'referrer' variable that follows a Zipfian-like distribution (a few very common values, many very rare ones).

DATA casuser.web_logs;
    LENGTH referrer $20;
    DO i = 1 TO 10000;
        rand_val = rand('Integer', 1, 100);
        /* Skewed distribution: about half of the rows go to Site_1-Site_5   */
        /* (roughly 10% each); the rest is spread over Site_6-Site_500       */
        /* (roughly 0.1% each), which produces many rare levels.             */
        IF rand_val <= 50 THEN referrer = cats('Site_', rand('Integer', 1, 5));
        ELSE referrer = cats('Site_', rand('Integer', 6, 500));
        OUTPUT;
    END;
RUN;
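
Before transforming the variable, it can help to confirm the skew. The following is a minimal sketch, assuming the 'casuser' libref is already assigned to the CAS session (as the DATA step above requires). The limit of 10 displayed levels is an arbitrary choice for readability.

```sas
/* List referrers from most to least frequent */
PROC FREQ DATA=casuser.web_logs ORDER=FREQ;
    /* MAXLEVELS= limits the display to the 10 most frequent levels */
    TABLES referrer / MAXLEVELS=10 NOCUM;
RUN;
```

The Percent column gives each site's share of all rows. That share is presumably what a percentage-based rare threshold is compared against.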

Implementation Steps

Apply the GroupRare method to consolidate rare levels

PROC CAS;
    dataPreprocess.catTrans /
        TABLE={name='web_logs', caslib='casuser'},
        method='GROUPRARE',
        inputs={{name='referrer'}},
        casOut={name='web_logs_reduced', caslib='casuser', replace=true},
        arguments={rareThresholdPercent=5};
RUN;
QUIT;
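
To see what the action actually wrote, the output table can be inspected without assuming the name of the transformed column. This sketch uses the standard table.columnInfo and table.fetch actions; the row limit of 10 is an arbitrary choice.

```sas
PROC CAS;
    /* List the columns of the output table; the transformed column name is not assumed */
    table.columnInfo / table={name='web_logs_reduced', caslib='casuser'};
    /* Display the first 10 rows */
    table.fetch / table={name='web_logs_reduced', caslib='casuser'}, to=10;
RUN;
QUIT;
```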

Expected Result

The output table 'web_logs_reduced' contains a transformed 'referrer' variable. The most frequent sites (Site_1 to Site_5) remain distinct, while the hundreds of infrequent sites (Site_6 and above) are consolidated into a single group (likely ID 0 or a specific label, depending on the format). This sharply reduces the number of levels of the variable for downstream modeling.
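
If an explicit 'Other_LongTail' label is needed, as in the business context, the grouping can also be approximated by hand in Base SAS. This is a sketch of the idea, not a description of what catTrans does internally. It assumes the 5% cutoff is compared against each level's share of rows and that a share of exactly 5% counts as frequent. The names 'ref_freq', 'web_logs_manual' and 'referrer_grouped' are hypothetical.

```sas
/* Step 1: compute each referrer's share of rows (PERCENT column in the OUT= table) */
PROC FREQ DATA=casuser.web_logs NOPRINT;
    TABLES referrer / OUT=work.ref_freq;
RUN;

/* Step 2: keep frequent referrers as-is, relabel all others */
PROC SQL;
    CREATE TABLE work.web_logs_manual AS
    SELECT a.referrer,
           CASE WHEN f.PERCENT >= 5 THEN a.referrer
                ELSE 'Other_LongTail'
           END AS referrer_grouped LENGTH=20
    FROM casuser.web_logs AS a
         LEFT JOIN work.ref_freq AS f
         ON a.referrer = f.referrer;
QUIT;
```

Comparing the levels of 'referrer_grouped' with the output of catTrans is a quick way to check how the threshold is interpreted in a given environment.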

See the technical documentation for catTrans