Selecting and Ignoring Parts of Speech and Entities in PROC TEXTMINE

The TEXTMINE procedure explores and prepares textual data for analysis. These examples show how the `SELECT` statement, with its `IGNORE` option, works alongside the `PARSE` statement to control which terms are excluded from the analysis. By ignoring parts of speech (for example 'prep', 'det', 'prop') and entities (for example 'prop_misc'), the user can filter out noise and focus on the key concepts of the text. The examples also use synonym lists and stop lists for further data preparation. Input data is created with DATA steps and datalines, and results are stored in CAS tables (for example `mycas.outterms_basic`; the last example also writes parent, child, and configuration tables).
Data Analysis

Type : INTERNAL_CREATION


Examples use fictitious textual data generated using a DATA step with datalines. Synonym and stop word tables are also created via datalines.

1 Code Block
PROC TEXTMINE Data
Explanation :
This baseline example runs the TEXTMINE procedure with no part-of-speech or entity filtering. It creates a data table (`mycas.CarNominations_Basic`), a synonym list (`mycas.synds_Basic`), and a stop list (`mycas.stopList_Basic`). The `TEXTMINE` procedure then extracts terms, applying the synonyms and stop words but no `SELECT` statement. The results are saved in `mycas.outterms_basic` and printed.
/* Assumes an active CAS session and a CAS engine libref named mycas */

/* Create the sample data table */
DATA mycas.CarNominations_Basic;
   INFILE DATALINES delimiter='|' missover;
   LENGTH text $70;
   INPUT text$ i;
   DATALINES;
The Ford Taurus is the World Car of the Year. |1
Hyundai won the award last year. |2
Toyota sold the Toyota Tacoma in bright green. |3
The Ford Taurus is sold in all colors except for lime green. |4
The Honda Insight was World Car of the Year in 2008. |5
;
RUN;

/* Create the synonym list */
DATA mycas.synds_Basic;
   INFILE DATALINES delimiter=',';
   LENGTH Term $13;
   INPUT Term $ TermRole $ Parent $ ParentRole$;
   DATALINES;
honda insight, VEHICLE , honda, COMPANY,
ford taurus, VEHICLE, ford, COMPANY,
toyota tacoma, VEHICLE, toyota, COMPANY,
;
RUN;

/* Create the stop list */
DATA mycas.stopList_Basic;
   INFILE DATALINES delimiter='|' missover;
   LENGTH term $25 role $40;
   INPUT term$ role$;
   DATALINES;
toyota| COMPANY
;
RUN;

/* Run PROC TEXTMINE with no SELECT filtering */
PROC TEXTMINE DATA=mycas.CarNominations_Basic;
   doc_id i;
   var text;
   parse
      termwgt  = none
      cellwgt  = none
      reducef  = 1
      entities = std
      synonym  = mycas.synds_Basic
      stop     = mycas.stopList_Basic
      outterms = mycas.outterms_basic
      ;
RUN;

/* Print the extracted terms */
DATA work.outterms_basic;
   SET mycas.outterms_basic;
RUN;
PROC PRINT DATA=work.outterms_basic;
   title 'Terms extracted without filtering';
RUN;
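Before choosing which parts of speech to ignore, it can help to see which roles actually occur in the baseline output. The following is a minimal sketch, assuming that `work.outterms_basic` from the example above exists and that the terms table exposes a `Role` column; the title text is illustrative.

```sas
/* Sketch: count extracted terms by role to decide what to ignore.   */
/* Assumes work.outterms_basic exists and contains a Role column.    */
PROC FREQ DATA=work.outterms_basic ORDER=FREQ;
   TABLES role / NOCUM;
   TITLE 'Extracted terms by role';
RUN;
TITLE;
```

Roles that dominate the table without carrying meaning for the analysis are natural candidates for a `SELECT ... / ignore` statement.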
2 Code Block
PROC TEXTMINE Data
Explanation :
This intermediate example uses the `SELECT` statement with the `IGNORE` option to exclude determiners ('det') and prepositions ('prep') from the analysis. This refines the list of extracted terms by removing words that are often less significant for semantic analysis. The data and lists have the same content as in the previous example but are re-created under new table names.
/* Create the sample data table (same content as example 1) */
DATA mycas.CarNominations_Intermediate;
   INFILE DATALINES delimiter='|' missover;
   LENGTH text $70;
   INPUT text$ i;
   DATALINES;
The Ford Taurus is the World Car of the Year. |1
Hyundai won the award last year. |2
Toyota sold the Toyota Tacoma in bright green. |3
The Ford Taurus is sold in all colors except for lime green. |4
The Honda Insight was World Car of the Year in 2008. |5
;
RUN;

/* Create the synonym list (same content as example 1) */
DATA mycas.synds_Intermediate;
   INFILE DATALINES delimiter=',';
   LENGTH Term $13;
   INPUT Term $ TermRole $ Parent $ ParentRole$;
   DATALINES;
honda insight, VEHICLE , honda, COMPANY,
ford taurus, VEHICLE, ford, COMPANY,
toyota tacoma, VEHICLE, toyota, COMPANY,
;
RUN;

/* Create the stop list (same content as example 1) */
DATA mycas.stopList_Intermediate;
   INFILE DATALINES delimiter='|' missover;
   LENGTH term $25 role $40;
   INPUT term$ role$;
   DATALINES;
toyota| COMPANY
;
RUN;

/* Run PROC TEXTMINE, ignoring determiners and prepositions */
PROC TEXTMINE DATA=mycas.CarNominations_Intermediate;
   doc_id i;
   var text;
   parse
      termwgt  = none
      cellwgt  = none
      reducef  = 1
      entities = std
      synonym  = mycas.synds_Intermediate
      stop     = mycas.stopList_Intermediate
      outterms = mycas.outterms_intermediate
      ;
   select "det" "prep" / ignore;
RUN;

/* Print the extracted terms */
DATA work.outterms_intermediate;
   SET mycas.outterms_intermediate;
RUN;
PROC PRINT DATA=work.outterms_intermediate;
   title 'Terms after removing determiners and prepositions';
RUN;
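The `SELECT` statement also accepts `KEEP` in place of `IGNORE`, which inverts the logic: instead of listing what to drop, you list the only labels to retain. The sketch below reuses the tables from the example above. The output table name `mycas.outterms_keep` is hypothetical, and the labels "noun" and "verb" are assumed to match the part-of-speech tags produced by your parsing configuration.

```sas
/* Sketch: keep only nouns and verbs instead of listing tags to ignore. */
/* mycas.outterms_keep is a hypothetical output table name.             */
PROC TEXTMINE DATA=mycas.CarNominations_Intermediate;
   doc_id i;
   var text;
   parse
      termwgt  = none
      cellwgt  = none
      reducef  = 1
      entities = std
      synonym  = mycas.synds_Intermediate
      stop     = mycas.stopList_Intermediate
      outterms = mycas.outterms_keep
      ;
   select "noun" "verb" / keep;   /* assumed POS labels */
RUN;
```

A keep list tends to be shorter than an ignore list when only a few word classes matter, at the cost of silently dropping any tag you did not anticipate.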
3 Code Block
PROC TEXTMINE Data
Explanation :
This advanced example ignores several parts of speech and an entity type in the same run. It uses two `SELECT` statements with the `IGNORE` option: one for proper nouns ('prop') and adjectives ('adj'), and another for 'VEHICLE' entities. The `group="entities"` option is what makes the second statement match entity types rather than part-of-speech tags. Together they give fine control over the analysis by removing entire categories of terms considered irrelevant for a given objective.
/* Create the sample data table (same content as example 1) */
DATA mycas.CarNominations_Advanced;
   INFILE DATALINES delimiter='|' missover;
   LENGTH text $70;
   INPUT text$ i;
   DATALINES;
The Ford Taurus is the World Car of the Year. |1
Hyundai won the award last year. |2
Toyota sold the Toyota Tacoma in bright green. |3
The Ford Taurus is sold in all colors except for lime green. |4
The Honda Insight was World Car of the Year in 2008. |5
;
RUN;

/* Create the synonym list (same content as example 1) */
DATA mycas.synds_Advanced;
   INFILE DATALINES delimiter=',';
   LENGTH Term $13;
   INPUT Term $ TermRole $ Parent $ ParentRole$;
   DATALINES;
honda insight, VEHICLE , honda, COMPANY,
ford taurus, VEHICLE, ford, COMPANY,
toyota tacoma, VEHICLE, toyota, COMPANY,
;
RUN;

/* Create the stop list (same content as example 1) */
DATA mycas.stopList_Advanced;
   INFILE DATALINES delimiter='|' missover;
   LENGTH term $25 role $40;
   INPUT term$ role$;
   DATALINES;
toyota| COMPANY
;
RUN;

/* Run PROC TEXTMINE, ignoring specific parts of speech and entities */
PROC TEXTMINE DATA=mycas.CarNominations_Advanced;
   doc_id i;
   var text;
   parse
      termwgt  = none
      cellwgt  = none
      reducef  = 1
      entities = std
      synonym  = mycas.synds_Advanced
      stop     = mycas.stopList_Advanced
      outterms = mycas.outterms_advanced
      ;
   select "prop" "adj" / ignore;                 /* Ignore proper nouns and adjectives */
   select "VEHICLE" / group="entities" ignore;   /* Ignore VEHICLE entities */
RUN;

/* Print the extracted terms */
DATA work.outterms_advanced;
   SET mycas.outterms_advanced;
RUN;
PROC PRINT DATA=work.outterms_advanced;
   title 'Terms after advanced filtering (proper nouns, adjectives, VEHICLE entities)';
RUN;
4 Code Block
PROC TEXTMINE Data
Explanation :
This example shows the same filtering applied to data loaded into the CAS engine in SAS Viya. It starts a CAS session if needed and defines the `mycas` library. It then simulates external data by writing a small pipe-delimited text file (fileref `temp_data`) and loading it with `PROC CASUTIL` into the CAS table `ExternalCarData` in the `casuser` caslib. The `TEXTMINE` procedure is applied to this table with a larger synonym list and stop list, both also stored in CAS. Two `SELECT` statements ignore specific parts of speech and entities, and the same pattern applies unchanged to much larger tables. The `termwgt` and `cellwgt` options on the `PARSE` statement request term and cell weighting instead of the unweighted counts used in the earlier examples.
/* Start a CAS session if one is not already active (default name: CASAUTO) */
cas;

/* Assign a CAS engine libref that points at the CASUSER caslib */
LIBNAME mycas cas sessref=casauto caslib=casuser;

/* Write a small pipe-delimited file to simulate external data */
filename temp_data temp lrecl=200;
DATA _null_;
   file temp_data;
   put 'text|i';
   put 'The electric car is the future.|1';
   put 'Batteries are key for green energy.|2';
   put 'Old gasoline cars are slowly disappearing.|3';
   put 'Green technology is innovating rapidly.|4';
   put 'New car models prioritize efficiency.|5';
RUN;

/* Upload the client-side file into CAS. The file type and delimiter are  */
/* stated explicitly because the temporary file has no .csv extension.    */
PROC CASUTIL;
   load file="%sysfunc(pathname(temp_data))"
        casout="ExternalCarData" outcaslib="casuser"
        importoptions=(filetype="csv" delimiter="|" getnames=true)
        replace;
QUIT;

/* Create the extended synonym list. Every character variable gets an     */
/* explicit length; list input would otherwise truncate values at 8 bytes */
DATA mycas.synds_CAS;
   INFILE DATALINES delimiter=',';
   LENGTH Term $15 TermRole $12 Parent $20 ParentRole $12;
   INPUT Term $ TermRole $ Parent $ ParentRole$;
   DATALINES;
electric car, VEHICLE , electric_mobility, TECHNOLOGY,
green energy, CONCEPT, sustainability, THEME,
gasoline cars, VEHICLE, old_technology, TECHNOLOGY,
;
RUN;

/* Create the extended stop list */
DATA mycas.stopList_CAS;
   INFILE DATALINES delimiter='|' missover;
   LENGTH term $25 role $40;
   INPUT term$ role$;
   DATALINES;
is| VERB
the| DET
of| PREP
car| NOUN
;
RUN;

/* Run PROC TEXTMINE on the CAS table with combined filtering */
PROC TEXTMINE DATA=mycas.ExternalCarData;
   doc_id i;
   var text;
   parse
      termwgt   = entropy   /* documented TERMWGT= values: ENTROPY, MI, NONE */
      cellwgt   = log
      reducef   = 2
      entities  = std
      synonym   = mycas.synds_CAS
      stop      = mycas.stopList_CAS
      outterms  = mycas.outterms_cas
      outparent = mycas.outparent_cas
      outchild  = mycas.outchild_cas
      outconfig = mycas.outconfig_cas
      ;
   select "VERB" "DET" "PREP" / ignore;             /* Ignore verbs, determiners, and prepositions */
   select "TECHNOLOGY" / group="entities" ignore;   /* Ignore terms whose entity role is TECHNOLOGY (a custom role from the synonym list) */
RUN;

/* Print the terms extracted from the CAS table */
DATA work.outterms_cas;
   SET mycas.outterms_cas;
RUN;
PROC PRINT DATA=work.outterms_cas;
   title 'Terms extracted from the CAS table after advanced filtering';
RUN;
This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.
Copyright Info : Copyright © SAS Institute Inc. All rights reserved.


Expert Advice
Expert
Michael
Viya infrastructure manager.
« Use the outconfig table to verify your parsing settings. It provides a snapshot of the stop lists, synonym tables, and POS filters applied during execution. If your final clusters or topics seem "noisy," review this table to see if specific POS tags like AUX (auxiliary verbs) or CONJ (conjunctions) should be added to your IGNORE list. »
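The advice above can be followed with a plain print of the configuration table. This is a minimal sketch, assuming the last example has been run so that `mycas.outconfig_cas` exists; the title text is illustrative, and the exact columns in the table depend on your release.

```sas
/* Sketch: print the parsing configuration recorded during the last run. */
/* Assumes mycas.outconfig_cas was written by the OUTCONFIG= option.     */
PROC PRINT DATA=mycas.outconfig_cas;
   TITLE 'Parsing configuration recorded by PROC TEXTMINE';
RUN;
TITLE;
```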