boolRule

brTrain

Description

The brTrain action extracts Boolean rules from a collection of documents. It is the training step of a supervised text categorization workflow and produces a human-readable model that explains the classification logic. The action analyzes the relationship between the terms present in documents and the documents' assigned categories (targets) to generate a set of IF-THEN rules. The brScore action can then apply these rules to classify new documents.

proc cas;
   boolRule.brTrain /
      table={name='cas_table_in'},
      docId='_document_',
      termId='_termnum_',
      docInfo={table={name='doc_info_table'}, id='_document_', targets={'target_var'}},
      casOuts={rules={name='rules_out', replace=true}};
run;

Settings
Parameter Description
table Specifies the input data table that contains the document-term information for rule extraction.
docId Specifies the variable in the input table that contains the document ID.
termId Specifies the variable in the input table that contains the term ID.
docInfo Specifies the table containing document metadata, including the target variables for classification.
termInfo Specifies the table containing term metadata, such as the term's text (label).
casOuts Specifies the output tables to be created, which can include the generated rules, the terms used in those rules, and the candidate terms considered.
gPositive Specifies the minimum g-score for a positive term to be considered for rule extraction. Higher values lead to more selective term inclusion.
gNegative Specifies the minimum g-score for a negative term to be considered. This helps in identifying terms that are indicative of a document NOT belonging to a category.
mPositive Specifies the 'm' value for computing estimated precision for positive terms, used in statistical calculations to smooth probability estimates.
mNegative Specifies the 'm' value for computing estimated precision for negative terms.
maxCandidates Specifies the maximum number of term candidates to be selected for each category during the rule generation process.
maxTriesIn Specifies the k-in value for the k-best search in the term ensemble process for creating individual rules.
maxTriesOut Specifies the k-out value for the k-best search in the rule ensemble process for creating the final rule set.
minSupports Specifies the minimum number of documents in which a term must appear to be considered for rule creation.
nThreads Specifies the number of threads to use per node for the computation.
useOldNames When set to TRUE, uses legacy variable names from the HPBOOLRULE procedure for the output tables.
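
The tuning parameters listed above can be combined in a single call. The following is a minimal sketch, assuming the 'doc_term_data' and 'doc_info_data' tables from the Data Preparation section are loaded in the active caslib. Every numeric value is an illustrative assumption, not a recommended or default setting, and the `rules` key inside casOuts is an assumption based on the boolRule action documentation.

```sas
proc cas;
   boolRule.brTrain /
      table={name='doc_term_data'},
      docId='docid',
      termId='termid',
      docInfo={table={name='doc_info_data'}, id='docid', targets={'category'}},
      minSupports=1,      /* illustrative: keep terms that appear in at least 1 document */
      maxCandidates=50,   /* illustrative: cap on term candidates per category */
      gPositive=1,        /* illustrative: minimum g-score for positive terms */
      gNegative=1,        /* illustrative: minimum g-score for negative terms */
      mPositive=1,        /* illustrative: m value for positive-term estimated precision */
      mNegative=1,        /* illustrative: m value for negative-term estimated precision */
      maxTriesIn=5,       /* illustrative: k-in for the term ensemble search */
      maxTriesOut=5,      /* illustrative: k-out for the rule ensemble search */
      nThreads=4,         /* illustrative: threads per node */
      casOuts={rules={name='rules_tuned', replace=true}};   /* 'rules_tuned' is a hypothetical table name */
run;
quit;
```

Loosening the thresholds, as in this sketch, is mainly useful on small samples; on a real corpus the values would normally be chosen by comparing the resulting rule sets.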
Data Preparation
Data Creation: Document-Term and Document-Category Data

First, we create two tables. 'doc_term_data' contains the sparse representation of documents, linking document IDs to term IDs. 'doc_info_data' contains the category (target) for each document. This setup is typical for text mining tasks where document content and metadata are stored separately.

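/* Assumes a CAS session is active and the mycas libref is bound to it, for example: libname mycas cas; */

/* Document-term pairs: one row per (document, term) occurrence */
DATA mycas.doc_term_data;
   INFILE DATALINES delimiter=',';
   INPUT docid termid;
   DATALINES;
1,1
1,2
2,2
2,3
3,3
3,4
4,1
4,4
;
RUN;

/* Document metadata: the target category of each document */
DATA mycas.doc_info_data;
   INFILE DATALINES delimiter=',';
   INPUT docid category $;
   DATALINES;
1,A
2,A
3,B
4,B
;
RUN;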
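
The termInfo parameter described in Settings needs a table that maps each term ID to its text. The following is a minimal sketch, assuming hypothetical labels for the four term IDs used above (the words are invented for illustration) and assuming termInfo takes `id` and `label` sub-parameters that name the ID and text columns. The table names 'term_info_data' and 'rules_labeled' are also hypothetical.

```sas
/* Hypothetical term labels for term IDs 1-4; the words are invented for illustration */
data mycas.term_info_data;
   infile datalines delimiter=',';
   length term $ 16;
   input termid term $;
   datalines;
1,printer
2,monitor
3,license
4,install
;
run;

proc cas;
   boolRule.brTrain /
      table={name='doc_term_data'},
      docId='docid',
      termId='termid',
      docInfo={table={name='doc_info_data'}, id='docid', targets={'category'}},
      /* Assumed sub-parameters: id = term ID column, label = term text column */
      termInfo={table={name='term_info_data'}, id='termid', label='term'},
      casOuts={rules={name='rules_labeled', replace=true}};
run;
quit;
```

Supplying term labels should make the generated rules easier to read, since they can then refer to term text rather than numeric IDs.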

Examples

This example performs a basic training operation. It uses the document-term data from 'doc_term_data' and the document category information from 'doc_info_data'. The action will identify rules that predict the 'category' variable and store them in the 'rules_out' table.

PROC CAS;
   /* Train Boolean rules that predict 'category' from the document-term pairs */
   boolRule.brTrain /
      table={name='doc_term_data'},
      docId='docid',
      termId='termid',
      docInfo={table={name='doc_info_data'}, id='docid', targets={'category'}},
      casOuts={rules={name='rules_out', replace=true}};
RUN;
QUIT;

This example demonstrates a more advanced use case. It requests all three output tables: 'rules_out' for the final rules, 'rule_terms_out' for the terms within each rule, and 'candidate_terms_out' for all terms considered. It sets targetType to 'MULTICLASS' and sets the statistical parameters 'gPositive' and 'mPositive' explicitly; a higher 'gPositive' requires a stronger statistical association before a term can enter a rule. On a table as small as the four-document sample, a threshold of 10 may leave few or no candidate terms, so lower it if the output tables come back empty.

PROC CAS;
   boolRule.brTrain /
      table={name='doc_term_data'},
      docId='docid',
      termId='termid',
      docInfo={table={name='doc_info_data'}, id='docid', targets={'category'}, targetType='MULTICLASS'},
      gPositive=10,
      mPositive=1,
      /* Request all three output tables */
      casOuts={rules={name='rules_out', replace=true},
               candidateTerms={name='candidate_terms_out', replace=true},
               ruleTerms={name='rule_terms_out', replace=true}};
RUN;
QUIT;

PROC PRINT DATA=mycas.rules_out;
RUN;
PROC PRINT DATA=mycas.rule_terms_out;
RUN;
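
The Description notes that brScore applies the extracted rules to new documents. The following is a minimal sketch of that step, assuming brScore accepts `table`, `docId`, `termId`, a `ruleTerms` input table, and a `casOut` output table; these parameter names are recalled from the boolRule action set and are worth checking against the SAS documentation. 'new_doc_term_data' and 'matches_out' are hypothetical table names.

```sas
proc cas;
   boolRule.brScore /
      /* Hypothetical table of new documents, in the same docid/termid layout as the training data */
      table={name='new_doc_term_data'},
      docId='docid',
      termId='termid',
      /* Rule terms produced by the brTrain example above */
      ruleTerms={name='rule_terms_out'},
      /* Hypothetical output table that receives the rule matches */
      casOut={name='matches_out', replace=true};
run;
quit;
```

The new documents would need to use the same term IDs as the training data, since the rules refer to terms by ID.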

Associated Scenarios

Use Case
Standard Classification of IT Support Tickets

An IT department wants to automatically categorize support tickets into 'Hardware' or 'Software' categories based on keywords found in the ticket descriptions to route them to t...

Use Case
Performance Stress Test for Email Spam Filtering

A huge volume of emails needs to be processed to flag 'SPAM' vs 'LEGIT'. The system must handle a large number of candidate terms and documents without timing out, optimizing th...

Use Case
Edge Case: Strict Filtering for Rare Events

In medical research, we want to identify specific symptom patterns for a rare disease. We need to ensure that rules are NOT created for statistically insignificant coincidences....