The TEXTMINE procedure

The TEXTMINE procedure in SAS Viya 4 analyzes textual data. It runs on the Cloud Analytic Services (CAS) engine, so it can process large volumes of text. Its parsing capabilities include tokenization (breaking text into words or phrases), stop-word filtering, stemming, detection of multiword terms such as noun groups, part-of-speech (POS) tagging, and named entity extraction. The procedure turns unstructured text into a numerical representation (for example, a term-document matrix) that can then feed predictive modeling or clustering. Execution takes place entirely on the CAS server and benefits from its distributed in-memory processing.
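To make the idea of a term-document matrix concrete, here is a minimal sketch in Base SAS. It is not what PROC TEXTMINE does internally: it uses naive splitting on blanks and punctuation, with no stemming, tagging, or stop list. The table names (`work.docs`, `work.tokens`, `work.term_doc`) and the sample sentences are hypothetical.

```sas
/* Hypothetical sample documents; all names here are illustrative. */
data work.docs;
   infile datalines dlm='|' truncover;
   input doc_id text :$100.;
   datalines;
1|Text analysis turns text into numbers.
2|Numbers feed predictive models.
3|Predictive models use text analysis.
;
run;

/* Naive tokenization: lowercase, split on blanks and punctuation, */
/* and write one row per token.                                    */
data work.tokens;
   set work.docs;
   length term $32;
   do i = 1 to countw(text, ' .,;:!?');
      term = lowcase(scan(text, i, ' .,;:!?'));
      output;
   end;
   keep doc_id term;
run;

/* Term-by-document counts in long (sparse) form: one row per */
/* term/document pair that actually occurs, with its COUNT.   */
proc freq data=work.tokens noprint;
   tables term*doc_id / out=work.term_doc(drop=percent);
run;

proc print data=work.term_doc noobs;
run;
```

Each row of `work.term_doc` is one nonzero cell of the matrix. A real parser adds normalization on top of this, which is the part the TEXTMINE procedure handles on the CAS server.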

Data Analysis

Type: internally created

Examples use generated data (datalines) or tables created via DATA steps in CAS memory.

1 Code Block

PROC TEXTMINE Data

Explanation:
This example shows basic tokenization of a small collection of text documents. It creates an in-memory CAS table and runs the TEXTMINE procedure with its default parsing settings to extract terms.

cas;
libname mycas cas;

data mycas.docs;
   infile datalines dlm='|' truncover;
   input doc_id text :$100.;
   datalines;
1|This is a document about text analysis.
2|Text analysis is very useful for data exploration.
3|SAS Viya offers powerful data analysis tools.
;
run;

proc textmine data=mycas.docs;
   doc_id doc_id;     /* variable that identifies each document */
   variables text;    /* variable that holds the text to parse  */
   /* REDUCEF=1 keeps terms that occur in only one document, */
   /* which a three-document sample needs.                   */
   parse reducef=1 outterms=mycas.terms;
run;

proc print data=mycas.terms;
run;

2 Code Block

PROC TEXTMINE Data

Explanation:
This example illustrates common TEXTMINE parsing options. It filters out stop words, which carry little meaning, and applies stemming to reduce words to their base form, which helps group related terms and supports theme analysis.

cas;
libname mycas cas;

data mycas.docs_inter;
   infile datalines dlm='|' truncover;
   input doc_id text :$200.;
   datalines;
1|Big data is important for machine learning and predictive analytics.
2|Machine learning and artificial intelligence are revolutionizing data processing.
3|Data processing is a key field of statistical analysis and artificial intelligence.
;
run;

/* Custom stop list: a table with a character variable named TERM. */
data mycas.stoplist;
   length term $32;
   input term $ @@;
   datalines;
is a for and are of
;
run;

proc textmine data=mycas.docs_inter;
   doc_id doc_id;
   variables text;
   /* STOP= names the stop-list table. Stemming is applied by default; */
   /* the NOSTEMMING option would turn it off.                         */
   parse stop=mycas.stoplist reducef=1 outterms=mycas.terms_inter;
run;
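Term tables are usually an intermediate step. To obtain the numerical document representation mentioned in the introduction, the SVD statement can project each document onto a small number of dimensions. The following is a sketch under assumptions: the six-sentence corpus, the table names (`docs_svd`, `terms_svd`, `parent_svd`, `docpro`), and the choice of `K=2` are all illustrative, and a real corpus would need far more documents for the dimensions to be meaningful.

```sas
cas;
libname mycas cas;

/* Hypothetical six-document corpus; names are illustrative. */
data mycas.docs_svd;
   infile datalines dlm='|' truncover;
   input doc_id text :$200.;
   datalines;
1|Machine learning models learn patterns from data.
2|Predictive analysis relies on machine learning models.
3|Text analysis extracts terms from documents.
4|Documents contain terms that text analysis can count.
5|Data processing prepares data for predictive analysis.
6|Artificial intelligence builds on machine learning and data.
;
run;

proc textmine data=mycas.docs_svd;
   doc_id doc_id;
   variables text;
   /* REDUCEF=1 keeps rare terms, which matters for a corpus this small. */
   parse reducef=1 outterms=mycas.terms_svd outparent=mycas.parent_svd;
   /* K=2 requests two dimensions (an arbitrary choice for this sketch); */
   /* OUTDOCPRO= receives the projected documents.                       */
   svd k=2 outdocpro=mycas.docpro;
run;

proc print data=mycas.docpro;
run;
```

The `docpro` table can then serve as input to a clustering or predictive modeling procedure, with the document ID as the join key back to the original text.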

3 Code Block

PROC TEXTMINE Data

Explanation:
This advanced example captures multiword expressions such as 'artificial intelligence' or 'machine learning'. The TEXTMINE procedure has no dedicated n-gram statement. Instead, the PARSE statement detects noun groups by default and, with ENTITIES=STD, also extracts standard entities such as people, places, and organizations. Both are written to the OUTTERMS= table alongside single-word terms, which captures richer semantic relationships than single words alone.

cas;
libname mycas cas;

data mycas.docs_adv;
   infile datalines dlm='|' truncover;
   input doc_id text :$200.;
   datalines;
1|The SAS Global Forum 2024 conference presented innovations in artificial intelligence.
2|Dr. John Smith, a machine learning expert, gave a keynote presentation on sentiment analysis.
3|SAS headquarters is in Cary, North Carolina, USA.
;
run;

proc textmine data=mycas.docs_adv;
   doc_id doc_id;
   variables text;
   /* Noun groups are detected by default; ENTITIES=STD adds standard entities. */
   /* REDUCEF=1 keeps terms that occur in only one document.                    */
   parse entities=std reducef=1 outterms=mycas.terms_adv;
run;

proc print data=mycas.terms_adv;
   title "Extracted terms, noun groups, and entities";
run;

4 Code Block

CAS Action (textParse.tpParse) Data

Explanation:
This example takes a more direct approach to text mining in SAS Viya by calling a CAS action instead of a procedure. The `tpParse` action in the `textParse` action set tokenizes and normalizes text directly on the CAS server, which is the same kind of parsing that the TEXTMINE procedure performs. It writes one row per term occurrence to the table named in its `offset` parameter.

cas;
libname mycas cas;

/* Load the data into CAS */
data mycas.cas_data;
   infile datalines dlm='|' truncover;
   input doc_id text :$100.;
   datalines;
1|Natural language processing is a branch of artificial intelligence.
2|AI and machine learning are transforming the technology industry.
3|SAS Viya is a unified analytics platform for data and AI.
;
run;

/* Tokenize with the tpParse CAS action */
proc cas;
   loadactionset "textParse";
   textParse.tpParse /
      table={name="cas_data"}
      docId="doc_id"
      text="text"
      offset={name="tokens_cas", replace=true};
run;
quit;

/* Display the tokens produced by the CAS action */
proc print data=mycas.tokens_cas;
   title "Tokens produced by the tpParse CAS action";
run;

This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.

Expert Advice

Michael

Viya Infrastructure Manager.

"In the world of SAS Viya, PROC TEXTMINE is the cornerstone of unstructured data processing. Unlike traditional text parsing, this procedure leverages the distributed, in-memory architecture of Cloud Analytic Services (CAS) to transform raw text into mathematical representations at massive scale. By converting sentences into numerical vectors, it bridges the gap between human language and machine learning."