searchAnalytics

buildTermIndex

Description

The `buildTermIndex` action creates a term index table from a table of significant terms. This index is essential for features like autocomplete and search joins, as it pre-processes terms to optimize search performance. It can operate on specific fields and supports multiple languages for tokenization.

searchAnalytics.buildTermIndex /
   casOut={...}
   fields={"string-1" <, "string-2", ...>}
   language="ARABIC" | "CHINESE" | "CROATIAN" | "CZECH" | "DANISH" | "DUTCH" | "ENGLISH"
          | "FARSI" | "FINNISH" | "FRENCH" | "GERMAN" | "GREEK" | "HEBREW" | "HINDI"
          | "HUNGARIAN" | "INDONESIAN" | "ITALIAN" | "JAPANESE" | "KAZAKH" | "KOREAN"
          | "NORWEGIAN" | "POLISH" | "PORTUGUESE" | "ROMANIAN" | "RUSSIAN" | "SLOVAK"
          | "SLOVENE" | "SPANISH" | "SWEDISH" | "TAGALOG" | "THAI" | "TURKISH"
          | "UNIVERSAL" | "VIETNAMESE"
   table={...}
   tokenize=TRUE | FALSE;

Settings
Parameter | Description
casOut | Specifies the output table that stores the term index. This table contains the indexed terms and their associated data, ready for use by other search analytics actions. This parameter is required.
fields | Specifies a list of columns from the input table that contain the terms to be indexed. If not specified, the action might use a default column, typically `_Term_`.
language | Specifies the language for tokenization, which breaks text down into individual terms. This affects how terms are processed and indexed. The default is 'UNIVERSAL' for language-independent tokenization.
table | Specifies the input table containing the significant terms to be indexed. This table is typically the output of the `significantTerms` action. This parameter is required. Alias: `index`.
tokenize | When set to TRUE, the action tokenizes the content of the specified `fields`. When FALSE, it assumes the fields already contain single, ready-to-index terms. Default: FALSE.
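
To show how these settings combine, here is a minimal sketch of a call that leaves `tokenize` at its documented default of FALSE, so each `_Term_` value is treated as one ready-to-index term. It assumes an active CAS session and that the `significant_terms` table from the Data Preparation section is in the active caslib. The output name `term_index_raw` is hypothetical, and the explicit `loadActionSet` call is a precaution that may not be needed in every environment.

```sas
/* Sketch only: assumes an active CAS session and an existing
   significant_terms table in the active caslib. */
proc cas;
   /* Load the action set in case the session has not loaded it yet */
   builtins.loadActionSet / actionSet='searchAnalytics';

   searchAnalytics.buildTermIndex /
      table={name='significant_terms'},               /* required input table   */
      casOut={name='term_index_raw', replace=true},   /* hypothetical output name */
      fields={'_Term_'},                              /* column holding the terms */
      tokenize=false;                                 /* values are indexed as-is */
quit;
```

Because `language` is omitted, the documented default ('UNIVERSAL') applies.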
Data Preparation
Data Creation: Table of Significant Terms

This SAS code creates a sample CAS table named 'significant_terms' in the 'casuser' caslib. This table simulates the output of a `significantTerms` action and contains a list of terms that will be indexed by the `buildTermIndex` action.
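
The DATA step writes through a libref named `casuser`, which has to exist before the step runs. The page does not show that setup, so the following is only a sketch of one common way to do it. The session name `mysess` is hypothetical, and connection options depend on your environment.

```sas
/* Assumption: no CAS session or casuser libref has been created yet. */
cas mysess;                           /* start a CAS session (hypothetical name) */
libname casuser cas caslib=casuser;   /* bind the casuser caslib to a libref     */
```

If your environment already assigns librefs for caslibs, these statements can be skipped.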

/* Create a one-column CAS table of sample terms */
DATA casuser.significant_terms;
   LENGTH _Term_ $ 50;
   INFILE DATALINES dsd;   /* DSD: comma-delimited, so embedded spaces are kept */
   INPUT _Term_ $;
   DATALINES;
SAS Viya
Cloud Analytic Services
Text Analytics
Machine Learning
Data Science
Search Index
Natural Language Processing
;
RUN;

Examples

This example demonstrates a basic use of the `buildTermIndex` action. It takes the `significant_terms` table, tokenizes the `_Term_` column, and creates a new output table named `term_index` containing the indexed terms.

SAS® / CAS Code (awaiting community validation)
PROC CAS;
   /* Tokenize the _Term_ column and write the index to term_index */
   searchAnalytics.buildTermIndex /
      TABLE={name='significant_terms'},
      casOut={name='term_index', replace=true},
      fields={'_Term_'},
      tokenize=true;
RUN;
QUIT;
Result:
The action will produce an output table named `term_index` in the active caslib. This table contains the indexed representation of the terms from the input table, which can be used for subsequent search operations.
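
This page does not list the columns of the output table, so one way to see what was produced is to inspect it directly. The sketch below uses the general-purpose `table.columnInfo` and `table.fetch` actions and assumes `term_index` is in the active caslib.

```sas
proc cas;
   /* List the columns that buildTermIndex wrote to the output table */
   table.columnInfo / table={name='term_index'};

   /* Preview the first rows of the index */
   table.fetch / table={name='term_index'} to=10;
quit;
```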

This example first creates a sample table of French terms, then builds a term index with `language='FRENCH'` so that tokenization follows French rules instead of the language-independent default. The choice matters wherever processing is language-specific, for example stop words, stemming, and character normalization.

SAS® / CAS Code (awaiting community validation)
/* Create a one-column CAS table of sample French terms */
DATA casuser.significant_terms_french;
   LENGTH _Term_ $ 50;
   INFILE DATALINES dsd;
   INPUT _Term_ $;
   DATALINES;
Science des données
Apprentissage automatique
Traitement du langage naturel
Index de recherche
;
RUN;

PROC CAS;
   /* Tokenize the _Term_ column using French tokenization */
   searchAnalytics.buildTermIndex /
      TABLE={name='significant_terms_french'},
      casOut={name='term_index_french', replace=true},
      fields={'_Term_'},
      tokenize=true,
      language='FRENCH';
RUN;
QUIT;
Result:
An output table named `term_index_french` is created. The terms within are indexed according to French linguistic rules, improving the accuracy of search-related tasks for that language.
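
To check what the language setting changes for your own data, you can index the same input with `language='UNIVERSAL'` and compare the two outputs side by side. This is a sketch under the same assumptions as the example above. The output name `term_index_universal` is hypothetical, and whether the two tables differ for this small sample is not stated on this page.

```sas
proc cas;
   /* Index the same French terms with language-independent tokenization */
   searchAnalytics.buildTermIndex /
      table={name='significant_terms_french'},
      casOut={name='term_index_universal', replace=true},  /* hypothetical name */
      fields={'_Term_'},
      tokenize=true,
      language='UNIVERSAL';

   /* Preview both indexes for a manual comparison */
   table.fetch / table={name='term_index_french'} to=20;
   table.fetch / table={name='term_index_universal'} to=20;
quit;
```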

FAQ

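What is the purpose of the buildTermIndex action?
It creates a term index table from a table of significant terms, for use by features such as autocomplete and search joins.

What are the required parameters for the buildTermIndex action?
`casOut` (the output table) and `table` (the input table, alias `index`).

What does the 'language' parameter control?
The language used for tokenization. The default is 'UNIVERSAL', which is language-independent.

What is the function of the 'fields' parameter?
It lists the input table columns that contain the terms to index. If it is omitted, the action might use a default column, typically `_Term_`.

What is the default behavior of the 'tokenize' parameter?
The default is FALSE: the action assumes the fields already contain single, ready-to-index terms and does not tokenize them.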