brTrain - WeAreCAS

Q: What is the purpose of the brTrain action in SAS Viya?

The brTrain action extracts Boolean rules from text data. It is part of the Boolean Rule (`boolRule`) action set.

Q: What does the `docId` parameter specify?

The `docId` parameter specifies the variable in the input data table that contains the document ID. Its default value is "_document_".

Q: What is the `docInfo` parameter used for?

The `docInfo` parameter specifies the information about the document table. It includes subparameters like `events`, `id` (which is required and specifies the document ID variable), `table` (the input data table with document info), `targets` (the target variables), and `targetType` (BINARY, MULTICLASS, or MULTILABEL).

Q: How do `gPositive` and `gNegative` parameters influence rule extraction?

The `gPositive` parameter sets the minimum g-score that a positive term must reach to be considered for rule extraction. The `gNegative` parameter sets the same threshold for negative terms. Both default to 8.
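
To make the threshold concrete, here is a minimal sketch in Python (not SAS), for illustration only. It assumes the g-score is a log-likelihood-ratio (G-test) statistic on a 2x2 table of term presence versus category membership. This page does not state the formula, so treat that definition, the function name `g_score`, and the counts as assumptions.

```python
import math

def g_score(tp, fp, fn, tn):
    """G-test statistic for a 2x2 table (assumed definition, not confirmed by this page).

    tp: in-category documents that contain the term
    fp: out-of-category documents that contain the term
    fn: in-category documents without the term
    tn: out-of-category documents without the term
    """
    n = tp + fp + fn + tn
    rows = (tp + fp, fn + tn)   # term present / term absent
    cols = (tp + fn, fp + tn)   # in category / not in category
    observed = ((tp, fp), (fn, tn))
    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:           # empty cells contribute nothing
                expected = rows[i] * cols[j] / n
                g += o * math.log(o / expected)
    return 2.0 * g

G_POSITIVE = 8  # default threshold quoted above

independent = g_score(10, 10, 10, 10)   # term unrelated to the category
separating = g_score(20, 0, 0, 20)      # term perfectly tracks the category
passes = separating >= G_POSITIVE
```

Under this assumed definition, a term that is spread evenly across categories scores 0 and would be rejected, while a term that occurs only inside the category scores well above the default threshold of 8.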

Q: What is the function of the `maxCandidates` parameter?

The `maxCandidates` parameter specifies the number of term candidates to be selected for each category during the rule creation process. The default value is 500.

Q: What are the `maxTriesIn` and `maxTriesOut` parameters?

`maxTriesIn` specifies the k-in value for k-best search in the term ensemble process for creating rules (default 150). `maxTriesOut` specifies the k-out value for k-best search in the rule ensemble process for creating a rule set (default 50).

Q: What does the `minSupports` parameter define?

The `minSupports` parameter specifies the minimum number of documents in which a term must appear to be used for creating a rule. The default value is 3.
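
The support filter can be sketched in a few lines of Python (not SAS). This is an illustration of the idea, not the action's implementation: the function name `terms_with_min_support` is hypothetical, and the input is the list of (document ID, term ID) pairs used in this page's data preparation step.

```python
from collections import defaultdict

def terms_with_min_support(doc_term_pairs, min_supports=3):
    """Return the term IDs that occur in at least `min_supports` distinct documents."""
    docs_per_term = defaultdict(set)
    for doc_id, term_id in doc_term_pairs:
        docs_per_term[term_id].add(doc_id)   # a set ignores repeated rows
    return {t for t, docs in docs_per_term.items() if len(docs) >= min_supports}

# (docid, termid) pairs from the sample document-term table
pairs = [(1, 1), (1, 2), (2, 2), (2, 3), (3, 3), (3, 4), (4, 1), (4, 4)]

kept_default = terms_with_min_support(pairs)                  # threshold 3
kept_relaxed = terms_with_min_support(pairs, min_supports=2)  # threshold 2
```

In this sketch every term appears in exactly two documents, so the default threshold of 3 keeps no terms, while a threshold of 2 keeps all four. On very small data sets, a lower `minSupports` value may therefore be worth trying.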

Q: What are the `mPositive` and `mNegative` parameters used for?

The `mPositive` parameter specifies the m value for computing estimated precision for positive terms (default 2), and `mNegative` specifies the m value for computing estimated precision for negative terms (default 4).
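
This page does not give the formula for estimated precision. As a hedged illustration in Python (not SAS), the sketch below assumes one simple smoothed form in which `m` is added to the denominator of ordinary precision. The function name `estimated_precision` and the counts are hypothetical, and the real computation in brTrain may differ.

```python
def estimated_precision(tp, fp, m):
    """Smoothed precision, assuming the form TP / (TP + FP + m).

    tp: documents matched by the term that belong to the category
    fp: documents matched by the term that do not belong to the category
    m:  smoothing value (mPositive or mNegative in the action)
    """
    return tp / (tp + fp + m)

rare_term = estimated_precision(tp=3, fp=0, m=2)     # 3 / 5
common_term = estimated_precision(tp=30, fp=0, m=2)  # 30 / 32
```

Both terms have a raw precision of 1.0, but under this assumed form the term seen in only three documents is rated lower than the one seen in thirty. That is the kind of smoothing a larger `m` would strengthen.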

Q: What information does the `termInfo` parameter require?

The `termInfo` parameter specifies information about the terms table. It requires the `id` subparameter for the term ID, and can optionally include `label` for the term's text and `table` for the input data table containing term information.

Q: What is the purpose of the `casOuts` parameter?

The `casOuts` parameter specifies the output data tables for the results. This can include `rules` (the generated rules), `ruleTerms` (the terms in each rule), and `candidateTerms` (the terms selected for rule creation).

Description

The brTrain action extracts Boolean rules from a collection of documents. It is a key part of supervised learning for text categorization, creating a human-readable model that explains the classification logic. This action analyzes the relationship between terms present in documents and their assigned categories (targets) to generate a set of IF-THEN rules. These rules can then be used by the brScore action to classify new documents.

proc cas;
   boolRule.brTrain /
      table={name='cas_table_in'},
      docId='_document_',
      termId='_termnum_',
      docInfo={table={name='doc_info_table'}, id='_document_', targets={'target_var'}},
      casOuts={rules={name='rules_out', replace=true}};
   run;
quit;
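
To show what applying an IF-THEN rule to a document looks like, here is a minimal sketch in Python (not SAS). The rule shape is an assumption made for illustration: a rule fires when all of its required terms are present and none of its excluded terms are. The class `BooleanRule`, the function `classify`, and the two rules are hypothetical and do not reflect the layout of the tables that brTrain writes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BooleanRule:
    """Hypothetical rule: all `required` terms present, no `excluded` term present."""
    required: frozenset
    excluded: frozenset
    category: str

    def matches(self, doc_terms):
        terms = set(doc_terms)
        return self.required <= terms and not (self.excluded & terms)

def classify(doc_terms, rules):
    """Return the categories of every rule that fires for the document."""
    return {rule.category for rule in rules if rule.matches(doc_terms)}

rules = [
    BooleanRule(frozenset({2}), frozenset(), "A"),     # IF term 2 THEN A
    BooleanRule(frozenset({4}), frozenset({2}), "B"),  # IF term 4 AND NOT term 2 THEN B
]

labels_doc1 = classify({1, 2}, rules)  # contains term 2
labels_doc3 = classify({3, 4}, rules)  # contains term 4 but not term 2
```

A document can match several rules or none at all. How the scoring action resolves those cases is not covered by this sketch.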

Settings

Parameter	Description
table	Specifies the input data table that contains the document-term information for rule extraction.
docId	Specifies the variable in the input table that contains the document ID.
termId	Specifies the variable in the input table that contains the term ID.
docInfo	Specifies the table containing document metadata, including the target variables for classification.
termInfo	Specifies the table containing term metadata, such as the term's text (label).
casOuts	Specifies the output tables to be created, which can include the generated rules, the terms used in those rules, and the candidate terms considered.
gPositive	Specifies the minimum g-score for a positive term to be considered for rule extraction. Higher values lead to more selective term inclusion.
gNegative	Specifies the minimum g-score for a negative term to be considered. This helps in identifying terms that are indicative of a document NOT belonging to a category.
mPositive	Specifies the 'm' value for computing estimated precision for positive terms, used in statistical calculations to smooth probability estimates.
mNegative	Specifies the 'm' value for computing estimated precision for negative terms.
maxCandidates	Specifies the maximum number of term candidates to be selected for each category during the rule generation process.
maxTriesIn	Specifies the k-in value for the k-best search in the term ensemble process for creating individual rules.
maxTriesOut	Specifies the k-out value for the k-best search in the rule ensemble process for creating the final rule set.
minSupports	Specifies the minimum number of documents in which a term must appear to be considered for rule creation.
nThreads	Specifies the number of threads to use per node for the computation.
useOldNames	When set to TRUE, uses legacy variable names from the HPBOOLRULE procedure for the output tables.

Data Preparation

Data Creation: Document-Term and Document-Category Data

First, we create two tables. 'doc_term_data' contains the sparse representation of documents, linking document IDs to term IDs. 'doc_info_data' contains the category (target) for each document. This setup is typical for text mining tasks where document content and metadata are stored separately.

/* Document-term pairs: one row per (document, term) occurrence */
DATA mycas.doc_term_data;
   INFILE DATALINES delimiter=',';
   INPUT docid termid;
   DATALINES;
1,1
1,2
2,2
2,3
3,3
3,4
4,1
4,4
;
RUN;

/* Document metadata: one category (target) per document */
DATA mycas.doc_info_data;
   INFILE DATALINES delimiter=',';
   INPUT docid category $;
   DATALINES;
1,A
2,A
3,B
4,B
;
RUN;
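
To see how the two tables relate, the following Python sketch (not SAS) joins the document-term pairs to the document categories and counts, for each term, how many documents of each category contain it. It only illustrates the data layout: the function name `term_category_counts` is hypothetical, and the sketch says nothing about how brTrain processes these tables internally.

```python
from collections import Counter

# Same rows as doc_term_data and doc_info_data above
doc_terms = [(1, 1), (1, 2), (2, 2), (2, 3), (3, 3), (3, 4), (4, 1), (4, 4)]
doc_category = {1: "A", 2: "A", 3: "B", 4: "B"}

def term_category_counts(pairs, categories):
    """Count, for each (term, category), the distinct documents that contain the term."""
    unique_pairs = set(pairs)  # drop duplicate (doc, term) rows
    return Counter((term, categories[doc]) for doc, term in unique_pairs)

counts = term_category_counts(doc_terms, doc_category)
term2_in_a = counts[(2, "A")]  # documents 1 and 2
term2_in_b = counts[(2, "B")]  # no category B document contains term 2
```

In this toy data, term 2 occurs only in category A documents and term 4 only in category B documents, while terms 1 and 3 are split evenly between the two categories.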

Examples

Basic Rule Training

This example performs a basic training operation. It uses the document-term data from 'doc_term_data' and the document category information from 'doc_info_data'. The action identifies rules that predict the 'category' variable and stores them in the 'rules_out' table.

SAS / CAS code (awaiting community validation)

PROC CAS;
   boolRule.brTrain /
      TABLE={name='doc_term_data'},
      docId='docid',
      termId='termid',
      docInfo={TABLE={name='doc_info_data'}, id='docid', targets={'category'}},
      /* casOuts takes one subparameter per output table */
      casOuts={rules={name='rules_out', replace=true}};
RUN;
QUIT;

Detailed Rule Training with Multiple Outputs and Tuned Parameters

This example demonstrates a more advanced use case. It specifies all three possible output tables: 'rules_out' for the final rules, 'rule_terms_out' for the terms within each rule, and 'candidate_terms_out' for all terms considered. It also overrides the statistical parameters 'gPositive' and 'mPositive'. Raising 'gPositive' from its default of 8 to 10 makes positive-term selection stricter, because a term needs a higher g-score to be considered for a rule.

SAS / CAS code (awaiting community validation)

PROC CAS;
   boolRule.brTrain /
      TABLE={name='doc_term_data'},
      docId='docid',
      termId='termid',
      docInfo={TABLE={name='doc_info_data'}, id='docid', targets={'category'}, targetType='MULTICLASS'},
      gPositive=10,
      mPositive=1,
      /* Request all three output tables through casOuts */
      casOuts={rules={name='rules_out', replace=true},
               candidateTerms={name='candidate_terms_out', replace=true},
               ruleTerms={name='rule_terms_out', replace=true}};
RUN;
QUIT;

PROC PRINT DATA=mycas.rules_out;
RUN;
PROC PRINT DATA=mycas.rule_terms_out;
RUN;

Associated Scenarios

Use Case

Standard Classification of IT Support Tickets

An IT department wants to automatically categorize support tickets into 'Hardware' or 'Software' categories based on keywords found in the ticket descriptions to route them to t...

Use Case

Performance Stress Test for Email Spam Filtering

A huge volume of emails needs to be processed to flag 'SPAM' vs 'LEGIT'. The system must handle a large number of candidate terms and documents without timing out, optimizing th...

Use Case

Edge Case: Strict Filtering for Rare Events

In medical research, we want to identify specific symptom patterns for a rare disease. We need to ensure that rules are NOT created for statistically insignificant coincidences....

Related Actions

boolRule

brScore

The brScore action scores text data based on a set of Boolean rules. These ru...
