The TEXTMINE procedure parses and prepares textual data for analysis. This example demonstrates the `SELECT` statement with its `IGNORE` option, which follows the `PARSE` statement and controls which terms are excluded from the analysis. By naming parts of speech (for example 'prep', 'det', 'prop') and, with `group="entities"`, entity types (for example 'prop_misc') to ignore, the user can filter out noise and focus on the key concepts of the text. The examples also use synonym lists and stop lists for further data preparation. Input data is created internally via a DATA step with datalines, and results are stored in CAS tables (for example `mycas.outterms`, `mycas.outparent`, `mycas.outchild`, `mycas.outconfig`, with a suffix per example).
Data Analysis
Type : INTERNAL_CREATION
Examples use fictitious textual data generated using a DATA step with datalines. Synonym and stop word tables are also created via datalines.
1 Code Block
PROC TEXTMINE Data
Explanation : This example runs the TEXTMINE procedure with basic parsing options and no `SELECT` filtering. It creates a data table (`mycas.CarNominations_Basic`), a synonym list (`mycas.synds_Basic`), and a stop list (`mycas.stopList_Basic`). The `TEXTMINE` procedure then extracts terms, applying the synonyms and stop words but without ignoring any parts of speech or entities. The results are saved in `mycas.outterms_basic` and printed.
/* Create the sample data table */
data mycas.CarNominations_Basic;
    infile datalines delimiter='|' missover;
    length text $70;
    input text$ i;
    datalines;
The Ford Taurus is the World Car of the Year. |1
Hyundai won the award last year. |2
Toyota sold the Toyota Tacoma in bright green. |3
The Ford Taurus is sold in all colors except for lime green. |4
The Honda Insight was World Car of the Year in 2008. |5
;
run;

/* Create the synonym list */
data mycas.synds_Basic;
    infile datalines delimiter=',';
    length Term $13;
    input Term $ TermRole $ Parent $ ParentRole$;
    datalines;
honda insight, VEHICLE , honda, COMPANY,
ford taurus, VEHICLE, ford, COMPANY,
toyota tacoma, VEHICLE, toyota, COMPANY,
;
run;

/* Create the stop list */
data mycas.stopList_Basic;
    infile datalines delimiter='|' missover;
    length term $25 role $40;
    input term$ role$;
    datalines;
toyota| COMPANY
;
run;

/* Run PROC TEXTMINE without any SELECT ... / IGNORE filtering */
proc textmine data=mycas.CarNominations_Basic;
    doc_id i;
    var text;
    parse
        termwgt  = none
        cellwgt  = none
        reducef  = 1
        entities = std
        synonym  = mycas.synds_Basic
        stop     = mycas.stopList_Basic
        outterms = mycas.outterms_basic
        ;
run;

/* Copy the extracted terms to WORK and print them */
data work.outterms_basic; set mycas.outterms_basic; run;
proc print data=work.outterms_basic; title 'Terms extracted without filtering'; run;
2 Code Block
PROC TEXTMINE Data
Explanation : This intermediate example uses the `SELECT` statement with the `IGNORE` option to exclude determiners ('det') and prepositions ('prep') from the analysis. This refines the list of extracted terms by removing words that usually carry little meaning for semantic analysis. The data and lists are the same as in the previous example, re-created under new table names.
/* Create the sample data table (same content as Example 1) */
data mycas.CarNominations_Intermediate;
    infile datalines delimiter='|' missover;
    length text $70;
    input text$ i;
    datalines;
The Ford Taurus is the World Car of the Year. |1
Hyundai won the award last year. |2
Toyota sold the Toyota Tacoma in bright green. |3
The Ford Taurus is sold in all colors except for lime green. |4
The Honda Insight was World Car of the Year in 2008. |5
;
run;

/* Create the synonym list (same content as Example 1) */
data mycas.synds_Intermediate;
    infile datalines delimiter=',';
    length Term $13;
    input Term $ TermRole $ Parent $ ParentRole$;
    datalines;
honda insight, VEHICLE , honda, COMPANY,
ford taurus, VEHICLE, ford, COMPANY,
toyota tacoma, VEHICLE, toyota, COMPANY,
;
run;

/* Create the stop list (same content as Example 1) */
data mycas.stopList_Intermediate;
    infile datalines delimiter='|' missover;
    length term $25 role $40;
    input term$ role$;
    datalines;
toyota| COMPANY
;
run;

/* Run PROC TEXTMINE, ignoring determiners and prepositions */
proc textmine data=mycas.CarNominations_Intermediate;
    doc_id i;
    var text;
    parse
        termwgt  = none
        cellwgt  = none
        reducef  = 1
        entities = std
        synonym  = mycas.synds_Intermediate
        stop     = mycas.stopList_Intermediate
        outterms = mycas.outterms_intermediate
        ;
    select "det" "prep" / ignore;
run;

/* Copy the extracted terms to WORK and print them */
data work.outterms_intermediate; set mycas.outterms_intermediate; run;
proc print data=work.outterms_intermediate; title 'Terms after removing determiners and prepositions'; run;
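One way to see what the `SELECT ... / IGNORE` filter changed is to tabulate the terms by role before and after filtering. The sketch below assumes that Examples 1 and 2 have both been run, so that `work.outterms_basic` and `work.outterms_intermediate` exist, and that the OUTTERMS table carries a `Role` column holding the part-of-speech or entity label; check the column names in your own output before relying on it.

```sas
/* Assumption: Examples 1 and 2 have run and their WORK copies exist. */
/* Assumption: the OUTTERMS table has a Role column (part of speech or entity). */
proc freq data=work.outterms_basic;
    tables role / nocum nopercent;
    title 'Term roles without filtering';
run;

proc freq data=work.outterms_intermediate;
    tables role / nocum nopercent;
    title 'Term roles after ignoring det and prep';
run;

/* Clear the title so it does not carry over to later output */
title;
```

Comparing the two frequency tables shows how the ignored roles are treated in the filtered run.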
3 Code Block
PROC TEXTMINE Data
Explanation : This advanced example shows how to ignore several parts of speech and entity types at once. It uses two `SELECT` statements with the `IGNORE` option: one for proper nouns ('prop') and adjectives ('adj'), and another for 'VEHICLE' type entities. The `group="entities"` option is what makes the second statement target entities rather than parts of speech. Together they give fine control over the analysis by removing whole categories of terms considered irrelevant for a given objective.
/* Create the sample data table (same content as Example 1) */
data mycas.CarNominations_Advanced;
    infile datalines delimiter='|' missover;
    length text $70;
    input text$ i;
    datalines;
The Ford Taurus is the World Car of the Year. |1
Hyundai won the award last year. |2
Toyota sold the Toyota Tacoma in bright green. |3
The Ford Taurus is sold in all colors except for lime green. |4
The Honda Insight was World Car of the Year in 2008. |5
;
run;

/* Create the synonym list (same content as Example 1) */
data mycas.synds_Advanced;
    infile datalines delimiter=',';
    length Term $13;
    input Term $ TermRole $ Parent $ ParentRole$;
    datalines;
honda insight, VEHICLE , honda, COMPANY,
ford taurus, VEHICLE, ford, COMPANY,
toyota tacoma, VEHICLE, toyota, COMPANY,
;
run;

/* Create the stop list (same content as Example 1) */
data mycas.stopList_Advanced;
    infile datalines delimiter='|' missover;
    length term $25 role $40;
    input term$ role$;
    datalines;
toyota| COMPANY
;
run;

/* Run PROC TEXTMINE, ignoring specific parts of speech and entities */
proc textmine data=mycas.CarNominations_Advanced;
    doc_id i;
    var text;
    parse
        termwgt  = none
        cellwgt  = none
        reducef  = 1
        entities = std
        synonym  = mycas.synds_Advanced
        stop     = mycas.stopList_Advanced
        outterms = mycas.outterms_advanced
        ;
    select "prop" "adj" / ignore;                /* Ignore proper nouns and adjectives */
    select "VEHICLE" / group="entities" ignore;  /* Ignore entities of type VEHICLE */
run;

/* Copy the extracted terms to WORK and print them */
data work.outterms_advanced; set mycas.outterms_advanced; run;
proc print data=work.outterms_advanced; title 'Terms after advanced filtering (proper nouns, adjectives, VEHICLE entities)'; run;
4 Code Block
PROC TEXTMINE Data
Explanation : This example highlights integration with the SAS Viya environment and the CAS engine. It begins by starting a CAS session and defining a `mycas` library. It then writes a small pipe-delimited file to simulate external data and loads it into a CAS table (`ExternalCarData` in the `casuser` caslib) via `PROC CASUTIL`. The `TEXTMINE` procedure is applied to this CAS table, using more elaborate synonym and stop lists that are also stored in CAS. Combined filters ignore specific parts of speech and entities; the same pattern applies to larger data in a distributed environment. The `termwgt=tf` and `cellwgt=log` options are set to request term and cell weighting.
/* Connect to a CAS session (if not already connected) */
cas;

/* Create a CAS libref bound to the casuser caslib, where the data is loaded below */
libname mycas cas caslib=casuser sessref=casauto;

/* Write a local pipe-delimited file to simulate external data */
filename temp_data temp lrecl=200;
data _null_;
    file temp_data;
    put 'text|i';
    put 'The electric car is the future.|1';
    put 'Batteries are key for green energy.|2';
    put 'Old gasoline cars are slowly disappearing.|3';
    put 'Green technology is innovating rapidly.|4';
    put 'New car models prioritize efficiency.|5';
run;

/* Load the client-side file into CAS. The temporary file has no .csv */
/* extension, so the file type and the '|' delimiter are given explicitly. */
proc casutil;
    load file="%sysfunc(pathname(temp_data))"
         outcaslib="casuser" casout="ExternalCarData"
         importoptions=(filetype="csv" delimiter="|" getnames=true)
         replace;
quit;

/* Create an enriched synonym list; lengths are declared for every column */
/* so that values longer than 8 characters are not truncated */
data mycas.synds_CAS;
    infile datalines delimiter=',';
    length Term $15 TermRole $12 Parent $20 ParentRole $12;
    input Term $ TermRole $ Parent $ ParentRole$;
    datalines;
electric car, VEHICLE , electric_mobility, TECHNOLOGY,
green energy, CONCEPT, sustainability, THEME,
gasoline cars, VEHICLE, old_technology, TECHNOLOGY,
;
run;

/* Create an extended stop list */
data mycas.stopList_CAS;
    infile datalines delimiter='|' missover;
    length term $25 role $40;
    input term$ role$;
    datalines;
is| VERB
the| DET
of| PREP
car| NOUN
;
run;

/* Run PROC TEXTMINE on the CAS table with combined filtering */
proc textmine data=mycas.ExternalCarData;
    doc_id i;
    var text;
    parse
        termwgt   = tf   /* Check this value against your release; documented TERMWGT= values include ENTROPY, MI and NONE */
        cellwgt   = log
        reducef   = 2
        entities  = std
        synonym   = mycas.synds_CAS
        stop      = mycas.stopList_CAS
        outterms  = mycas.outterms_cas
        outparent = mycas.outparent_cas
        outchild  = mycas.outchild_cas
        outconfig = mycas.outconfig_cas
        ;
    select "VERB" "DET" "PREP" / ignore;            /* Ignore verbs, determiners and prepositions */
    select "TECHNOLOGY" / group="entities" ignore;  /* Ignore entities of category 'TECHNOLOGY' */
run;

/* Copy the extracted terms to WORK and print them */
data work.outterms_cas; set mycas.outterms_cas; run;
proc print data=work.outterms_cas; title 'Terms extracted from the CAS table after advanced filtering'; run;
This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.
« Use the outconfig table to verify your parsing settings. It provides a snapshot of the stop lists, synonym tables, and POS filters applied during execution. If your final clusters or topics seem "noisy," review this table to see if specific POS tags like AUX (auxiliary verbs) or CONJ (conjunctions) should be added to your IGNORE list. »
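A minimal sketch of that check, assuming Example 4 has been run so that `mycas.outconfig_cas` exists. It only copies the table to WORK and prints it; no column names are assumed, so inspect the printed output to see which settings your release records.

```sas
/* Assumption: Example 4 has run and produced mycas.outconfig_cas. */
data work.outconfig_cas; set mycas.outconfig_cas; run;

proc print data=work.outconfig_cas noobs;
    title 'Parsing configuration recorded by PROC TEXTMINE';
run;

/* Clear the title so it does not carry over to later output */
title;
```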
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. WeAreCAS is an independent community site and is not affiliated with SAS Institute Inc.