boolRule brTrain

Performance Stress Test for Email Spam Filtering

Test Scenario & Use Case

Business Context

A large volume of emails must be classified as 'SPAM' or 'LEGIT'. The system must process many candidate terms and documents without timing out, while making efficient use of the available threads.
About the Action Set: boolRule

Extraction of Boolean rules for classification.

Data Preparation

Simulates 50,000 emails, each with five random term occurrences drawn from 500 term codes, and labels roughly 20% of them as SPAM, to stress the rule extraction algorithm.

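/* Assumes an active CAS session and a CASUSER libref bound to it. */
DATA casuser.email_terms;
    DO email_id = 1 TO 50000;
        /* Five random term occurrences per email, term codes 0 to 499 */
        DO t = 1 TO 5;
            term_code = int(rand('uniform') * 500);
            OUTPUT;
        END;
    END;
RUN;

DATA casuser.email_info;
    /* Declare the length first; otherwise 'SPAM' fixes it at 4 and 'LEGIT' is truncated to 'LEGI' */
    LENGTH label $5;
    DO email_id = 1 TO 50000;
        /* Roughly 20% of emails are labeled SPAM */
        IF rand('uniform') > 0.8 THEN label = 'SPAM';
        ELSE label = 'LEGIT';
        OUTPUT;
    END;
RUN;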
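Before running the stress test, it can be worth confirming that the simulated tables have the expected shape. The following is a minimal sketch, assuming both tables were created in the active caslib of the current CAS session. The use of `table.recordCount` and `simple.freq` here is an addition for illustration, not part of the original scenario.

```sas
PROC CAS;
    /* 50,000 emails x 5 terms each: email_terms should hold 250,000 rows */
    table.recordCount / table={name='email_terms'};

    /* One row per email: email_info should hold 50,000 rows */
    table.recordCount / table={name='email_info'};

    /* Label distribution: about 20% SPAM is expected from the 0.8 threshold */
    simple.freq / table={name='email_info'} inputs={'label'};
RUN;
```

If the frequency table shows 'LEGI' instead of 'LEGIT', the label variable was created with a length of 4 and the values were truncated.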

Implementation Steps

1. Run brTrain with performance tuning parameters (maxCandidates, nThreads)
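PROC CAS;
    /* Load the action set in case it is not already loaded in the session */
    loadactionset 'boolRule';
    boolRule.brTrain /
        table={name='email_terms'}
        docId='email_id'
        termId='term_code'
        docInfo={table={name='email_info'}, id='email_id', targets={'label'}}
        maxCandidates=1000   /* tuning: limit on candidate terms */
        nThreads=4           /* tuning: number of threads */
        casOut={name='spam_rules', replace=true};
RUN;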
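Once training finishes, a quick look at the output helps confirm that rules were actually produced. This is a sketch only, assuming the action wrote the `spam_rules` table named in the call above to the active caslib. The row limit of 10 is arbitrary, and the column layout of the output is not described in this scenario, so it is listed rather than assumed.

```sas
PROC CAS;
    /* Number of rows written to the output table */
    table.recordCount / table={name='spam_rules'};

    /* List the output columns, since their names are not shown in the scenario */
    table.columnInfo / table={name='spam_rules'};

    /* Preview the first 10 rows */
    table.fetch / table={name='spam_rules'} to=10;
RUN;
```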

Expected Result


The action completes without timing out, uses the four requested threads, and produces a rule set in spam_rules despite the high volume of noisy input data.