
The TEXTMINE procedure

The TEXTMINE procedure in SAS Viya 4 analyzes textual data on the Cloud Analytic Services (CAS) server, so parsing runs in distributed memory and scales to large document collections. Its main functions are tokenization (breaking text into words or phrases), stop-word filtering, stemming, detection of multiword terms such as noun groups, part-of-speech (POS) tagging, and named entity extraction. It transforms unstructured text into a numerical representation (for example, a term-by-document matrix) that can then be used for predictive modeling or clustering.
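As a rough illustration of the numerical representation mentioned above, the sketch below asks the PARSE statement to write the term-by-document matrix to a CAS table and prints its first rows. It assumes an active CAS session and a CAS libref named mycas. The table names sketch_docs and sketch_parent and the variable did are hypothetical, chosen only for this example.

```sas
/* Sketch only: assumes an active CAS session and a CAS libref named mycas. */
/* The table names sketch_docs and sketch_parent are hypothetical.          */
data mycas.sketch_docs;
   infile datalines dlm='|' truncover;
   input did text :$100.;
datalines;
1|Text analysis turns documents into numbers.
2|Clustering groups similar documents.
3|Predictive models need numeric inputs.
;
run;

proc textmine data=mycas.sketch_docs;
   doc_id did;                          /* unique document identifier */
   variables text;                      /* column that holds the text */
   parse
      reducef   = 1                     /* keep terms found in at least one document */
      outparent = mycas.sketch_parent;  /* term-by-document matrix in sparse form */
run;

/* Inspect the first rows of the sparse matrix */
proc print data=mycas.sketch_parent(obs=20);
run;
```

Each row of the output table links one term to one document. The stored values may be weighted rather than raw counts, depending on the weighting options in effect.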


Examples use generated data (datalines) or tables created via DATA steps in CAS memory.

1 Code Block
PROC TEXTMINE Data
Explanation :
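This example shows basic parsing of a small collection of text documents. It creates an in-memory CAS table and applies the TEXTMINE procedure with default parsing settings, writing the extracted terms to an output table. Because the collection has only three documents, REDUCEF=1 is set so that terms appearing in a single document are kept.
/* Start a CAS session and bind the libref mycas to it */
cas;
libname mycas cas;

/* Load three short French documents into an in-memory CAS table */
data mycas.docs;
   infile datalines dlm='|' truncover;
   input doc_id text :$100.;   /* numeric document ID, delimited text */
datalines;
1|Ceci est un document sur l'analyse de texte.
2|L'analyse de texte est très utile pour l'exploration de données.
3|SAS Viya offre de puissants outils d'analyse de données.
;
run;

/* LANGUAGE=FRENCH assumes French language support is installed */
proc textmine data=mycas.docs language=french;
   doc_id doc_id;                   /* unique document identifier */
   variables text;                  /* column that holds the text */
   parse
      reducef  = 1                  /* keep terms found in at least 1 document */
      outterms = mycas.docs_terms;  /* table of extracted terms */
run;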
2 Code Block
PROC TEXTMINE Data
Explanation :
This example illustrates common parsing options of the TEXTMINE procedure. A stop list, passed as a CAS table through the STOP= option of the PARSE statement, removes words that carry little meaning. Stemming, which reduces words to their base form and helps group related terms, is applied by default; the NOSTEMMING option turns it off.
cas;
libname mycas cas;

data mycas.docs_inter;
   infile datalines dlm='|' truncover;
   input doc_id text :$200.;
datalines;
1|Les données massives sont importantes pour l'apprentissage automatique et l'analyse prédictive.
2|L'apprentissage automatique et l'intelligence artificielle révolutionnent le traitement des données.
3|Le traitement des données est un domaine clé de l'analyse statistique et de l'intelligence artificielle.
;
run;

/* Small illustrative stop list; the variable is named term, as the STOP= table is expected to provide */
data mycas.stop_fr;
   length term $20;
   input term $;
datalines;
le
la
les
de
des
et
est
un
pour
;
run;

proc textmine data=mycas.docs_inter language=french;
   doc_id doc_id;
   variables text;
   parse
      stop     = mycas.stop_fr      /* exclude the stop-list terms */
      reducef  = 1                  /* keep terms found in at least 1 document */
      outterms = mycas.inter_terms; /* stemming stays on (no NOSTEMMING) */
run;
3 Code Block
PROC TEXTMINE Data
Explanation :
This advanced example demonstrates the extraction of multiword terms (sequences of words such as 'intelligence artificielle' or 'machine learning'). PROC TEXTMINE captures these through noun-group extraction, which the PARSE statement performs unless NONOUNGROUPS is specified, and the resulting terms are written to the OUTTERMS= table alongside single words. The ENTITIES=STD option additionally requests standard named entities such as people, places, and organizations. Custom entity or concept extraction requires other SAS text analytics tools.
cas;
libname mycas cas;

data mycas.docs_adv;
   infile datalines dlm='|' truncover;
   input doc_id text :$200.;
datalines;
1|La conférence SAS Global Forum 2024 a présenté des innovations en intelligence artificielle.
2|Dr. John Smith, expert en machine learning, a donné une présentation clé sur l'analyse de sentiments.
3|Le siège social de SAS est à Cary, en Caroline du Nord, USA.
;
run;

proc textmine data=mycas.docs_adv language=french;
   doc_id doc_id;
   variables text;
   parse
      entities = std               /* request standard named entities */
      reducef  = 1                 /* keep terms found in at least 1 document */
      outterms = mycas.adv_terms;  /* single terms, noun groups, and entities */
run;

proc print data=mycas.adv_terms;
   title "Extracted terms, noun groups, and entities";
run;
title;
4 Code Block
CAS Action (textParse.tpParse) Data
Explanation :
This example illustrates a more direct approach to text mining in the SAS Viya environment: calling a CAS action from PROC CAS instead of using a procedure. The `tpParse` action of the `textParse` action set tokenizes and normalizes the text directly on the CAS server and writes one row per term occurrence to an offset table. Parameter values are kept minimal here and should be checked against the action documentation for your release.
/* Start a CAS session named sess and bind the libref mycas to it */
cas sess;
libname mycas cas sessref=sess;

/* Load the documents into CAS */
data mycas.cas_data;
   infile datalines dlm='|' truncover;
   input doc_id text :$100.;
datalines;
1|Le traitement du langage naturel est une branche de l'intelligence artificielle.
2|L'IA et le machine learning transforment l'industrie de la technologie.
3|SAS Viya est une plateforme d'analyse unifiée pour les données et l'IA.
;
run;

/* Parse the text with the tpParse action; tables resolve in the active caslib */
proc cas;
   loadactionset "textParse";
   textParse.tpParse /
      table={name="cas_data"},
      docId="doc_id",
      text="text",
      language="french",
      offset={name="tokens_cas", replace=true};
run;
quit;

/* Display the term positions written by the action */
proc print data=mycas.tokens_cas;
   title "Term offsets written by the tpParse action";
run;
title;
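An offset table on its own lists term positions, not a term-by-document matrix. As a hedged sketch of the next step, the `tpAccumulate` action of the same `textParse` action set can aggregate an offset table into term and parent tables. The code assumes that tokens_cas is an offset table produced by `textParse.tpParse` in the active caslib; the output table names acc_terms and acc_parent are hypothetical.

```sas
/* Sketch only: assumes tokens_cas is a tpParse offset table in the active caslib. */
/* The output table names acc_terms and acc_parent are hypothetical.               */
proc cas;
   loadactionset "textParse";
   textParse.tpAccumulate /
      offset={name="tokens_cas"},                /* input: term positions from tpParse */
      terms={name="acc_terms", replace=true},    /* output: one row per term */
      parent={name="acc_parent", replace=true},  /* output: sparse term-by-document matrix */
      reduce=1;                                  /* keep terms found in at least 1 document */
run;
quit;
```

Splitting parsing and accumulation this way lets the same offset table be re-accumulated with different thresholds or stop lists without parsing the text again.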
This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.
Copyright Info : Copyright © SAS Institute Inc. All rights reserved.


Expert Advice
Michael, Viya Infrastructure Manager
« In the world of SAS Viya, PROC TEXTMINE is the cornerstone of unstructured data processing. Unlike traditional text parsing, this procedure leverages the distributed, in-memory architecture of Cloud Analytic Services (CAS) to transform raw text into mathematical representations at massive scale. By converting sentences into numerical vectors, it bridges the gap between human language and machine learning. »