searchAnalytics

buildTermIndex

Description

The `buildTermIndex` action creates a term index table from a table of significant terms. This index is essential for features like autocomplete and search joins, as it pre-processes terms to optimize search performance. It can operate on specific fields and supports multiple languages for tokenization.

searchAnalytics.buildTermIndex /
   casOut={...}
   fields={"string-1" <, "string-2", ...>}
   language="ARABIC" | "CHINESE" | "CROATIAN" | "CZECH" | "DANISH" | "DUTCH" | "ENGLISH" | "FARSI" | "FINNISH" | "FRENCH" | "GERMAN" | "GREEK" | "HEBREW" | "HINDI" | "HUNGARIAN" | "INDONESIAN" | "ITALIAN" | "JAPANESE" | "KAZAKH" | "KOREAN" | "NORWEGIAN" | "POLISH" | "PORTUGUESE" | "ROMANIAN" | "RUSSIAN" | "SLOVAK" | "SLOVENE" | "SPANISH" | "SWEDISH" | "TAGALOG" | "THAI" | "TURKISH" | "UNIVERSAL" | "VIETNAMESE"
   table={...}
   tokenize=TRUE | FALSE;

Settings

| Parameter | Description |
| --- | --- |
| `casOut` | Specifies the output table that stores the term index. This table contains the indexed terms and their associated data, ready for use by other search analytics actions. Required. |
| `fields` | Specifies a list of columns from the input table that contain the terms to be indexed. If not specified, the action might use a default column, typically `_Term_`. |
| `language` | Specifies the language for tokenization, which breaks text down into individual terms. This affects how terms are processed and indexed. Default: 'UNIVERSAL' (language-independent tokenization). |
| `table` | Specifies the input table containing the significant terms to be indexed. This table is typically the output of the `significantTerms` action. Required. Alias: `index`. |
| `tokenize` | When set to TRUE, the action tokenizes the content of the specified `fields`. When FALSE, it assumes the fields already contain single, ready-to-index terms. Default: FALSE. |

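
As an illustration of the `tokenize` default, here is a minimal sketch that assumes the parameters behave as described in the settings above. The table names `single_terms` and `term_index_single` are hypothetical, and the code assumes an active CAS session with a CAS-engine libref named `casuser`. Each row already holds one ready-to-index term, so `tokenize` stays at FALSE (written out explicitly for clarity).

```sas
/* Hypothetical input: one single-word term per row, so no tokenization is needed. */
data casuser.single_terms;
   length _Term_ $ 50;
   input _Term_ $;
   datalines;
analytics
cloud
index
search
;
run;

proc cas;
   /* Load the action set in case it is not already available in the session. */
   loadactionset "searchAnalytics";

   searchAnalytics.buildTermIndex /
      table={name='single_terms', caslib='casuser'},
      casOut={name='term_index_single', replace=true},
      fields={'_Term_'},
      tokenize=false;   /* FALSE is the documented default; shown explicitly */
run;
quit;
```

If the input column held multi-word phrases instead, `tokenize=true` would be the setting to try, as in the examples further down.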
Data Preparation
Data Creation: Table of Significant Terms

This SAS code creates a sample CAS table named 'significant_terms' in the 'casuser' caslib. This table simulates the output of a `significantTerms` action and contains a list of terms that will be indexed by the `buildTermIndex` action.

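/* Assumes an active CAS session and a CAS-engine libref named casuser. */
DATA casuser.significant_terms;
   LENGTH _Term_ $ 50;
   /* DSD makes the comma the delimiter, so embedded blanks stay inside one value. */
   INFILE DATALINES dsd;
   INPUT _Term_ $;
   DATALINES;
SAS Viya
Cloud Analytic Services
Text Analytics
Machine Learning
Data Science
Search Index
Natural Language Processing
;
RUN;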

Examples

This example demonstrates a basic use of the `buildTermIndex` action. It takes the `significant_terms` table, tokenizes the `_Term_` column, and creates a new output table named `term_index` containing the indexed terms.

SAS® / CAS Code (awaiting community validation)
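PROC CAS;
   /* Load the action set in case it is not already available in the session. */
   loadactionset "searchAnalytics";

   searchAnalytics.buildTermIndex /
      /* The input table was created in the casuser caslib. */
      table={name='significant_terms', caslib='casuser'},
      casOut={name='term_index', replace=true},
      fields={'_Term_'},
      tokenize=true;
RUN;
QUIT;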
Result:
The action will produce an output table named `term_index` in the active caslib. This table contains the indexed representation of the terms from the input table, which can be used for subsequent search operations.
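
The layout of the output table is not described here, so it is worth inspecting before feeding it to another action. The sketch below uses standard actions from the `table` action set and assumes that `term_index` was created in the active caslib by the example above.

```sas
proc cas;
   /* List the columns the action produced. */
   table.columnInfo / table={name='term_index'};

   /* Report the number of rows in the index table. */
   table.recordCount / table={name='term_index'};

   /* Preview the first 10 rows. */
   table.fetch / table={name='term_index'}, to=10;
run;
quit;
```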

This example first creates a sample table with French terms, then builds a term index for a specific language. Setting `language='FRENCH'` tells the action to apply French tokenization rules when it builds the index, which matters wherever processing depends on language-specific features such as stop words, stemming, and character normalization.

SAS® / CAS Code (awaiting community validation)
/* Assumes an active CAS session and a CAS-engine libref named casuser. */
DATA casuser.significant_terms_french;
   LENGTH _Term_ $ 50;
   INFILE DATALINES dsd;
   INPUT _Term_ $;
   DATALINES;
Science des données
Apprentissage automatique
Traitement du langage naturel
Index de recherche
;
RUN;

PROC CAS;
   /* Load the action set in case it is not already available in the session. */
   loadactionset "searchAnalytics";

   searchAnalytics.buildTermIndex /
      table={name='significant_terms_french', caslib='casuser'},
      casOut={name='term_index_french', replace=true},
      fields={'_Term_'},
      tokenize=true,
      language='FRENCH';
RUN;
QUIT;
Result:
An output table named `term_index_french` is created. The terms within are indexed according to French linguistic rules, improving the accuracy of search-related tasks for that language.
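
When several languages need indexing, the same call can be driven from a CASL list. This is a sketch only: it assumes both sample tables from this page exist in the `casuser` caslib, the variable names `jobs`, `job`, and `rc` are illustrative, and `ENGLISH` is chosen for the first table as an example (the basic example above relies on the default language instead).

```sas
proc cas;
   loadactionset "searchAnalytics";

   /* Each entry pairs a language with its input and output table names. */
   jobs = {
      {lang='ENGLISH', src='significant_terms',        dst='term_index'},
      {lang='FRENCH',  src='significant_terms_french', dst='term_index_french'}
   };

   do job over jobs;
      /* status=rc captures the action's return status for the check below. */
      searchAnalytics.buildTermIndex status=rc /
         table={name=job.src, caslib='casuser'},
         casOut={name=job.dst, replace=true},
         fields={'_Term_'},
         tokenize=true,
         language=job.lang;

      /* A severity of 0 means the action finished without warnings or errors. */
      if rc.severity = 0 then
         print "Built " || job.dst || " (" || job.lang || ")";
      else
         print "buildTermIndex reported a problem for " || job.src;
   end;
run;
quit;
```

Checking the status after each call keeps a failure in one language from going unnoticed while the loop continues with the next.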

FAQ

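What is the purpose of the buildTermIndex action?
It creates a term index table from a table of significant terms. The index supports features such as autocomplete and search joins.

What are the required parameters for the buildTermIndex action?
`casOut` (the output table) and `table` (the input table of significant terms).

What does the 'language' parameter control?
The language used for tokenization, which affects how terms are processed and indexed. The default is 'UNIVERSAL'.

What is the function of the 'fields' parameter?
It lists the input-table columns that contain the terms to index. If it is omitted, the action might use a default column, typically `_Term_`.

What is the default behavior of the 'tokenize' parameter?
The default is FALSE: the action assumes the fields already contain single, ready-to-index terms and does not tokenize them.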

Associated Scenarios

Use Case
E-commerce Product Search Autocomplete

An online bookstore wants to optimize their search engine. They need to create a term index from a list of popular book titles to enable an efficient autocomplete feature. The d...

Use Case
High Volume Server Log Indexing

The IT Operations team needs to index a large volume of system logs to identify recurring error patterns. The test aims to validate the performance and stability of the action w...

Use Case
Multi-lingual Support Tickets with Dirty Data

A global customer support platform handles tickets in Spanish. The raw data includes special characters, empty fields, and some fields that are already pre-processed tags. The t...