Rule Extraction for Binary Targets with BOOLRULE

The BOOLRULE procedure extracts 'IF... THEN...' rules from a dataset. This example first creates a table of customer reviews (`mycas.reviews`) that includes a binary sentiment flag (`positive`). `PROC TEXTMINE` then converts the raw text into a structured representation: a term-by-document table (`mycas.reviews_bow`) and a terms table (`mycas.reviews_terms`). Finally, `PROC BOOLRULE` generates rules that predict sentiment from the presence of individual terms or combinations of terms. The metrics reported for each rule (F1, precision, recall) indicate how well the rule separates positive reviews from negative ones.
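As a hand check on what these metrics mean, the sketch below scores one hypothetical rule, "the review text contains the substring love", against the nine sample reviews used in the examples that follow. The table names `reviews_check` and `rule_metrics`, the variable `rule_fires`, and the rule itself are assumptions made for illustration. The numbers come from the standard definitions of precision, recall, and F1, not from `PROC BOOLRULE` output, which builds its rules from parsed terms rather than substrings.

```sas
/* Hand-made illustration in the WORK library; no CAS session is needed. */
/* The rule "text contains 'love'" is hypothetical, not a BOOLRULE rule.  */
data reviews_check;
   infile datalines delimiter='|' missover;
   length text $300;
   input text $ positive;
   /* 1 when the hypothetical rule fires (case-insensitive substring match) */
   rule_fires = (find(text, 'love', 'i') > 0);
   datalines;
This is the greatest phone ever! love it!|1
The phone's battery life is too short and screen resolution is low.|0
The screen resolution is low, but I love this tv.|1
The movie itself is great and I like it, although the resolution is low.|1
The movie's story is boring and the acting is poor.|0
I watched this movie on tv, it's not good on a small screen.|0
watched the movie first and loved it, the book is even better!|1
I like the story in this book, they should put it on screen.|1
I love the author, but this book is a waste of time, don't buy it.|0
;
run;

/* Accumulate the confusion counts, then derive the three metrics */
data rule_metrics;
   set reviews_check end=last;
   tp + (rule_fires = 1 and positive = 1);   /* true positives  */
   fp + (rule_fires = 1 and positive = 0);   /* false positives */
   fn + (rule_fires = 0 and positive = 1);   /* false negatives */
   if last then do;
      precision = tp / (tp + fp);
      recall    = tp / (tp + fn);
      f1        = 2 * precision * recall / (precision + recall);
      output;
   end;
   keep tp fp fn precision recall f1;
run;

proc print data=rule_metrics noobs;
run;
```

For this rule the step counts 3 true positives, 1 false positive, and 2 false negatives, which gives a precision of 0.75, a recall of 0.60, and an F1 of about 0.67.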
Data Analysis

Type: CREATION_INTERNE (data created internally)

Examples use internally generated data via DATALINES blocks or temporary CAS tables.

1 Code Block
PROC BOOLRULE Data
Explanation:
This example shows the minimal use of `PROC BOOLRULE` with a binary target. It creates the data, preprocesses the text with `PROC TEXTMINE`, and then runs `PROC BOOLRULE` to extract rules. The `MINSUPPORTS`, `MPOS`, and `GPOS` options are all set to 1, low thresholds chosen so that this nine-review sample can still produce rules. The rules are then copied to a local table and printed, and the CAS tables are dropped.
/* Start a CAS session and assign the 'mycas' libref to its active caslib */
cas;
LIBNAME mycas cas;

/* Create the review table */
DATA mycas.reviews;
   INFILE DATALINES delimiter='|' missover;
   LENGTH text $300 category $20;
   INPUT text$ positive category$ did;
   DATALINES;
   This is the greatest phone ever! love it!|1|electronics|1
   The phone's battery life is too short and screen resolution is low.|0|electronics|2
   The screen resolution is low, but I love this tv.|1|electronics|3
   The movie itself is great and I like it, although the resolution is low.|1|movies|4
   The movie's story is boring and the acting is poor.|0|movies|5
   I watched this movie on tv, it's not good on a small screen. |0|movies|6
   watched the movie first and loved it, the book is even better!|1|books |7
   I like the story in this book, they should put it on screen.|1|books|8
   I love the author, but this book is a waste of time, don't buy it.|0|books|9
   ;
RUN;

/* Preprocess the text with PROC TEXTMINE */
PROC TEXTMINE DATA=mycas.reviews;
   doc_id did;
   var text;
   parse
      nonoungroups
      notagging
      entities = none
      outparent = mycas.reviews_bow
      outterms = mycas.reviews_terms
      reducef = 1;
RUN;

/* Extract rules with PROC BOOLRULE for the binary target 'positive' */
PROC BOOLRULE
   DATA = mycas.reviews_bow
   docid = _document_
   termid = _termnum_
   docinfo = mycas.reviews
   terminfo = mycas.reviews_terms
   minsupports = 1
   mpos = 1
   gpos = 1;
   docinfo
      id = did
      targets = (positive);
   terminfo
      id = key
      label = term;
   OUTPUT
      ruleterms = mycas.ruleterms_basic
      rules = mycas.rules_basic;
RUN;

/* Copy the generated rules to the WORK library and print them */
DATA rules_basic;
   SET mycas.rules_basic;
RUN;

PROC PRINT DATA=rules_basic;
   title 'Basic Example: Rules for Positive Sentiment';
   var target ruleid rule F1 precision recall;
RUN;

/* Cleanup. 'mycas' is a libref, not a caslib name, so INCASLIB= is omitted */
/* and each table is dropped from the session's active caslib.              */
PROC CASUTIL;
   droptable casdata='reviews' quiet;
   droptable casdata='reviews_bow' quiet;
   droptable casdata='reviews_terms' quiet;
   droptable casdata='ruleterms_basic' quiet;
   droptable casdata='rules_basic' quiet;
QUIT;

2 Code Block
PROC BOOLRULE Data
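Explanation:
This example adds further options to the `PROC BOOLRULE` statement to refine rule generation. `NOCUTOFF` is intended to prevent truncation of candidate rules, `MINRULELEN` and `MAXRULELEN` to bound rule complexity, and `NTHREADS` to set the number of threads. Confirm that the `PROC BOOLRULE` statement in your SAS Viya release supports these four options before relying on them, because the procedure stops with a syntax error when it meets an unrecognized option.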
/* Start a CAS session and assign the 'mycas' libref to its active caslib */
cas;
LIBNAME mycas cas;

/* Create the review table */
DATA mycas.reviews;
   INFILE DATALINES delimiter='|' missover;
   LENGTH text $300 category $20;
   INPUT text$ positive category$ did;
   DATALINES;
   This is the greatest phone ever! love it!|1|electronics|1
   The phone's battery life is too short and screen resolution is low.|0|electronics|2
   The screen resolution is low, but I love this tv.|1|electronics|3
   The movie itself is great and I like it, although the resolution is low.|1|movies|4
   The movie's story is boring and the acting is poor.|0|movies|5
   I watched this movie on tv, it's not good on a small screen. |0|movies|6
   watched the movie first and loved it, the book is even better!|1|books |7
   I like the story in this book, they should put it on screen.|1|books|8
   I love the author, but this book is a waste of time, don't buy it.|0|books|9
   ;
RUN;

/* Preprocess the text with PROC TEXTMINE */
PROC TEXTMINE DATA=mycas.reviews;
   doc_id did;
   var text;
   parse
      nonoungroups
      notagging
      entities = none
      outparent = mycas.reviews_bow
      outterms = mycas.reviews_terms
      reducef = 1;
RUN;

/* Extract rules with additional options.                                  */
/* NOCUTOFF, MINRULELEN=, MAXRULELEN=, and NTHREADS= should be checked     */
/* against the PROC BOOLRULE syntax of your release before running this.   */
PROC BOOLRULE
   DATA = mycas.reviews_bow
   docid = _document_
   termid = _termnum_
   docinfo = mycas.reviews
   terminfo = mycas.reviews_terms
   minsupports = 1
   mpos = 1
   gpos = 1
   nocutoff        /* do not truncate candidate rules */
   minrulelen = 1  /* minimum rule length */
   maxrulelen = 3  /* maximum rule length */
   nthreads = 4;   /* use 4 threads */
   docinfo
      id = did
      targets = (positive);
   terminfo
      id = key
      label = term;
   OUTPUT
      ruleterms = mycas.ruleterms_intermediaire
      rules = mycas.rules_intermediaire;
RUN;

/* Copy the generated rules to the WORK library and print them */
DATA rules_intermediaire;
   SET mycas.rules_intermediaire;
RUN;

PROC PRINT DATA=rules_intermediaire;
   title 'Intermediate Example: Rules with Common Options';
   var target ruleid rule F1 precision recall;
RUN;

/* Cleanup. 'mycas' is a libref, not a caslib name, so INCASLIB= is omitted */
/* and each table is dropped from the session's active caslib.              */
PROC CASUTIL;
   droptable casdata='reviews' quiet;
   droptable casdata='reviews_bow' quiet;
   droptable casdata='reviews_terms' quiet;
   droptable casdata='ruleterms_intermediaire' quiet;
   droptable casdata='rules_intermediaire' quiet;
QUIT;

3 Code Block
PROC BOOLRULE Data
Explanation:
This example filters the review data to a single category ('electronics') before extracting rules, so the resulting rules describe that subset rather than the whole table. The filtered table then feeds both `PROC TEXTMINE` and the `DOCINFO=` input of `PROC BOOLRULE`, which shows how rule extraction fits into a larger data analysis workflow. Note that filtering leaves only three of the nine sample reviews.
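Before filtering, it can help to see how many positive and negative reviews each category holds, since a small or one-sided subset leaves little for rule extraction to work with. The sketch below is a self-contained check in the WORK library; the table name `reviews_meta` is an assumption made for illustration, and its rows simply repeat the `did`, `positive`, and `category` values of the nine sample reviews.

```sas
/* Illustrative check in the WORK library; no CAS session is needed.  */
/* Rows repeat the did, positive, and category values of the sample.  */
data reviews_meta;
   infile datalines delimiter='|' missover;
   length category $20;
   input did positive category $;
   datalines;
1|1|electronics
2|0|electronics
3|1|electronics
4|1|movies
5|0|movies
6|0|movies
7|1|books
8|1|books
9|0|books
;
run;

/* Cross-tabulate category by sentiment, showing counts only */
proc freq data=reviews_meta;
   tables category*positive / norow nocol nopercent;
run;
```

The electronics row of the resulting table contains two positive reviews and one negative review, which is the entire input of the filtered example.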
/* Start a CAS session and assign the 'mycas' libref to its active caslib */
cas;
LIBNAME mycas cas;

/* Create the review table */
DATA mycas.reviews;
   INFILE DATALINES delimiter='|' missover;
   LENGTH text $300 category $20;
   INPUT text$ positive category$ did;
   DATALINES;
   This is the greatest phone ever! love it!|1|electronics|1
   The phone's battery life is too short and screen resolution is low.|0|electronics|2
   The screen resolution is low, but I love this tv.|1|electronics|3
   The movie itself is great and I like it, although the resolution is low.|1|movies|4
   The movie's story is boring and the acting is poor.|0|movies|5
   I watched this movie on tv, it's not good on a small screen. |0|movies|6
   watched the movie first and loved it, the book is even better!|1|books |7
   I like the story in this book, they should put it on screen.|1|books|8
   I love the author, but this book is a waste of time, don't buy it.|0|books|9
   ;
RUN;

/* Keep only the reviews of one category (here, 'electronics') */
DATA mycas.reviews_filtered;
   SET mycas.reviews;
   where category = 'electronics';
RUN;

/* Preprocess the filtered text with PROC TEXTMINE */
PROC TEXTMINE DATA=mycas.reviews_filtered;
   doc_id did;
   var text;
   parse
      nonoungroups
      notagging
      entities = none
      outparent = mycas.reviews_bow_filtered
      outterms = mycas.reviews_terms_filtered
      reducef = 1;
RUN;

/* Extract rules with PROC BOOLRULE from the filtered data */
PROC BOOLRULE
   DATA = mycas.reviews_bow_filtered
   docid = _document_
   termid = _termnum_
   docinfo = mycas.reviews_filtered
   terminfo = mycas.reviews_terms_filtered
   minsupports = 1
   mpos = 1
   gpos = 1;
   docinfo
      id = did
      targets = (positive);
   terminfo
      id = key
      label = term;
   OUTPUT
      ruleterms = mycas.ruleterms_avance
      rules = mycas.rules_avance;
RUN;

/* Copy the generated rules to the WORK library and print them */
DATA rules_avance;
   SET mycas.rules_avance;
RUN;

PROC PRINT DATA=rules_avance;
   title 'Advanced Example: Rules for Electronics Reviews';
   var target ruleid rule F1 precision recall;
RUN;

/* Cleanup. 'mycas' is a libref, not a caslib name, so INCASLIB= is omitted */
/* and each table is dropped from the session's active caslib.              */
PROC CASUTIL;
   droptable casdata='reviews' quiet;
   droptable casdata='reviews_filtered' quiet;
   droptable casdata='reviews_bow_filtered' quiet;
   droptable casdata='reviews_terms_filtered' quiet;
   droptable casdata='ruleterms_avance' quiet;
   droptable casdata='rules_avance' quiet;
QUIT;

4 Code Block
PROC BOOLRULE Data
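Explanation:
This example adds explicit CAS session management: it sets the host and port, starts a named session with its own session options, and terminates that session at the end. The thresholds (`MINSUPPORTS`, `MPOS`, `GPOS`) are adjusted for the larger table. `MINSUPPORTS` is understood here as a minimum document count rather than a proportion, so a target share such as 10% has to be converted to a count (2 of the 20 reviews); likewise, `MPOS` and `GPOS` are scoring parameters rather than percentages, so check the `PROC BOOLRULE` documentation before changing them. `MAXRULELEN` and `NTHREADS` are intended to limit rule complexity and set the thread count; confirm that your release supports both. The input table grows from 9 to 20 reviews, which is still a toy sample rather than a large-data test.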
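If a threshold is easier to reason about as a share of the documents, it can be converted to a document count first. The sketch below does that in a `DATA _NULL_` step, assuming that `MINSUPPORTS=` expects a count; the variable names and the macro variable `minsup` are hypothetical and chosen for illustration only.

```sas
/* Illustrative conversion of a proportional threshold to a document count. */
/* Assumption: MINSUPPORTS= takes a count of documents, not a proportion.   */
data _null_;
   n_docs       = 20;     /* number of reviews in the enlarged sample      */
   target_share = 0.10;   /* desired minimum support as a share of n_docs  */
   /* Round up, and never go below one document */
   min_docs = max(1, ceil(target_share * n_docs));
   /* Store the result in the hypothetical macro variable 'minsup' */
   call symputx('minsup', min_docs);
run;

/* Write the derived value to the log */
%put Derived MINSUPPORTS value: &minsup;
```

With 20 reviews and a 10% share the derived value is 2, which could then be passed to the procedure as `minsupports = &minsup`.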
/* Start a new, named CAS session for explicit control */
options casport=5570 cashost='localhost';
cas my_new_session sessopts=(caslib=casuser timeout=1800 locale="en_US");
libname mycas cas;

/* Create the review table (20 rows, enlarged from the 9-row sample) */
data mycas.reviews_large;
   infile datalines delimiter='|' missover;
   length text $300 category $20;
   input text$ positive category$ did;
   datalines;
   This is the greatest phone ever! love it!|1|electronics|1
   The phone's battery life is too short and screen resolution is low.|0|electronics|2
   The screen resolution is low, but I love this tv.|1|electronics|3
   The movie itself is great and I like it, although the resolution is low.|1|movies|4
   The movie's story is boring and the acting is poor.|0|movies|5
   I watched this movie on tv, it's not good on a small screen. |0|movies|6
   watched the movie first and loved it, the book is even better!|1|books |7
   I like the story in this book, they should put it on screen.|1|books|8
   I love the author, but this book is a waste of time, don't buy it.|0|books|9
   Fantastic product, highly recommend!|1|electronics|10
   Battery drains too fast, very disappointed.|0|electronics|11
   Excellent display and sound quality.|1|electronics|12
   A captivating story, a must-watch!|1|movies|13
   Terrible plot and acting, waste of time.|0|movies|14
   Enjoyed reading every single page.|1|books|15
   Not worth the hype, prefer other titles.|0|books|16
   Best purchase this year!|1|electronics|17
   Worst experience ever, totally unreliable.|0|electronics|18
   Absolutely brilliant, a masterpiece.|1|movies|19
   Couldn't put it down, truly amazing.|1|books|20
   ;
run;

/* Preprocess the text with PROC TEXTMINE */
proc textmine data=mycas.reviews_large;
   doc_id did;
   var text;
   parse
      nonoungroups
      notagging
      entities = none
      outparent = mycas.reviews_bow_large
      outterms = mycas.reviews_terms_large
      reducef = 1;
run;

/* Extract rules with PROC BOOLRULE.                                        */
/* The option meanings in the comments below, and support for MAXRULELEN=   */
/* and NTHREADS=, should be verified against the PROC BOOLRULE              */
/* documentation for your release.                                          */
proc boolrule
   data = mycas.reviews_bow_large
   docid = _document_
   termid = _termnum_
   docinfo = mycas.reviews_large
   terminfo = mycas.reviews_terms_large
   minsupports = 2  /* minimum document count: 2 of the 20 reviews (10%) */
   mpos = 0.5       /* m value used to estimate precision for positive terms */
   gpos = 0.5       /* minimum g-score for a positive term */
   maxrulelen = 2   /* maximum rule length */
   nthreads = 8;    /* use 8 threads */
   docinfo
      id = did
      targets = (positive);
   terminfo
      id = key
      label = term;
   output
      ruleterms = mycas.ruleterms_cas
      rules = mycas.rules_cas;
run;

/* Copy the generated rules to the WORK library and print them */
data rules_cas;
   set mycas.rules_cas;
run;

proc print data=rules_cas;
   title 'Viya/CAS Example: Rules with Session Management and Performance Settings';
   var target ruleid rule F1 precision recall;
run;

/* Cleanup and end of the CAS session. 'mycas' is a libref, not a caslib    */
/* name, so INCASLIB= is omitted and the active caslib (casuser) is used.   */
proc casutil;
   droptable casdata='reviews_large' quiet;
   droptable casdata='reviews_bow_large' quiet;
   droptable casdata='reviews_terms_large' quiet;
   droptable casdata='ruleterms_cas' quiet;
   droptable casdata='rules_cas' quiet;
quit;
cas my_new_session terminate;

This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.
Copyright Info : Copyright © SAS Institute Inc. All rights reserved.