buildTermIndex

Q: What is the purpose of the buildTermIndex action?

The buildTermIndex action creates a term index table for significant terms.

Q: What are the required parameters for the buildTermIndex action?

The required parameters are 'casOut', which specifies the output table to store the term list, and 'table', which specifies the input index table.

Q: What does the 'language' parameter control?

The 'language' parameter specifies the language to use for the index field tokenizer. The default value is UNIVERSAL.

Q: What is the function of the 'fields' parameter?

The 'fields' parameter specifies an optional list of fields where term frequency should be counted.

Q: What is the default behavior of the 'tokenize' parameter?

The 'tokenize' parameter specifies whether the index field is tokenized. By default, its value is FALSE.

Description

The `buildTermIndex` action creates a term index table from a table of significant terms. The resulting index can support search features such as autocomplete and search joins, because the terms are processed once, ahead of query time. The action can be limited to specific fields, and its tokenizer supports multiple languages.

searchAnalytics.buildTermIndex /
   casOut={...}
   fields={"string-1" <, "string-2", ...>}
   language="ARABIC" | "CHINESE" | "CROATIAN" | "CZECH" | "DANISH" | "DUTCH" |
            "ENGLISH" | "FARSI" | "FINNISH" | "FRENCH" | "GERMAN" | "GREEK" |
            "HEBREW" | "HINDI" | "HUNGARIAN" | "INDONESIAN" | "ITALIAN" | "JAPANESE" |
            "KAZAKH" | "KOREAN" | "NORWEGIAN" | "POLISH" | "PORTUGUESE" | "ROMANIAN" |
            "RUSSIAN" | "SLOVAK" | "SLOVENE" | "SPANISH" | "SWEDISH" | "TAGALOG" |
            "THAI" | "TURKISH" | "UNIVERSAL" | "VIETNAMESE"
   table={...}
   tokenize=TRUE | FALSE;
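The parameter list can differ between releases, so it may be worth confirming it against your own server. The sketch below assumes an active CAS session and uses the `builtins.help` action to print the parameters that the server reports for `buildTermIndex`.

```sas
PROC CAS;
   /* Load the action set in case it is not already loaded in this session */
   loadactionset 'searchAnalytics';

   /* Ask the server to describe the action and its parameters */
   builtins.help / actionSet='searchAnalytics', action='buildTermIndex';
RUN;
QUIT;
```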

Settings

Parameter	Description
casOut	Specifies the output table to store the term index. This table will contain the indexed terms and their associated data, ready for use by other search analytics actions. This parameter is required.
fields	Specifies a list of columns from the input table that contain the terms to be indexed. If not specified, the action might use a default column, typically `_Term_`.
language	Specifies the language for tokenization, which breaks down text into individual terms. This affects how terms are processed and indexed. The default is 'UNIVERSAL' for language-independent tokenization.
table	Specifies the input table containing the significant terms to be indexed. This table is typically the output of the `significantTerms` action. This parameter is required. Alias: `index`.
tokenize	When set to TRUE, the action tokenizes the content of the specified `fields`. If FALSE, it assumes the fields already contain single, ready-to-index terms. Default: FALSE.
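As a minimal sketch of the settings above, the call below supplies only the two required parameters, so `language` falls back to UNIVERSAL and `tokenize` to FALSE. It assumes an active CAS session and an input table named `significant_terms` in the `casuser` caslib (such as the one built in the data preparation step); the output name `term_index_defaults` is a hypothetical choice for illustration.

```sas
PROC CAS;
   /* Load the action set in case it is not already loaded in this session */
   loadactionset 'searchAnalytics';

   /* Only the required parameters are given; language and tokenize use their defaults */
   searchAnalytics.buildTermIndex /
      table={name='significant_terms', caslib='casuser'},
      casOut={name='term_index_defaults', replace=true};   /* hypothetical output name */
RUN;
QUIT;
```

With `tokenize` left at FALSE, each value in the input is treated as a single, ready-to-index term, which suits input that has already been split into terms.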

Data Preparation

Data Creation: Table of Significant Terms

This SAS code creates a sample CAS table named 'significant_terms' in the 'casuser' caslib. This table simulates the output of a `significantTerms` action and contains a list of terms that will be indexed by the `buildTermIndex` action.

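/* Assumes an active CAS session and a CASUSER libref that points to the casuser caslib */
DATA casuser.significant_terms;
   LENGTH _Term_ $ 50;
   /* DSD makes the comma the delimiter, so embedded blanks stay inside one term */
   INFILE DATALINES dsd;
   INPUT _Term_ $;
   DATALINES;
SAS Viya
Cloud Analytic Services
Text Analytics
Machine Learning
Data Science
Search Index
Natural Language Processing
;
RUN;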

Examples

Basic Term Index Creation

This example demonstrates a basic use of the `buildTermIndex` action. It takes the `significant_terms` table, tokenizes the `_Term_` column, and writes the indexed terms to a new output table named `term_index`.

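PROC CAS;
   /* Load the action set in case it is not already loaded in this session */
   loadactionset 'searchAnalytics';

   searchAnalytics.buildTermIndex /
      /* The input table was created in the casuser caslib */
      table={name='significant_terms', caslib='casuser'},
      casOut={name='term_index', replace=true},
      fields={'_Term_'},
      tokenize=true;
RUN;
QUIT;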

Result:
The action will produce an output table named `term_index` in the active caslib. This table contains the indexed representation of the terms from the input table, which can be used for subsequent search operations.
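The column layout of the output table is not described here, so a quick way to see what the action produced is to inspect the table directly. The sketch below assumes the basic example has already run in the same CAS session and that `term_index` sits in the active caslib; it uses the standard `table.columnInfo` and `table.fetch` actions.

```sas
PROC CAS;
   /* List the columns of the output table and their types */
   table.columnInfo / table={name='term_index'};

   /* Show the first 10 rows of the output table */
   table.fetch / table={name='term_index'}, to=10;
RUN;
QUIT;
```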

Language-Specific Term Indexing

This example first creates a sample table of French terms, then builds a term index for that language. Setting `language='FRENCH'` tells the index field tokenizer to apply French rules when `tokenize=true`. This matters because tokenization is language dependent, and language-specific handling may include features such as stop words, stemming, and character normalization.

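/* Assumes an active CAS session and a CASUSER libref that points to the casuser caslib */
DATA casuser.significant_terms_french;
   LENGTH _Term_ $ 50;
   /* DSD makes the comma the delimiter, so embedded blanks stay inside one term */
   INFILE DATALINES dsd;
   INPUT _Term_ $;
   DATALINES;
Science des données
Apprentissage automatique
Traitement du langage naturel
Index de recherche
;
RUN;

PROC CAS;
   /* Load the action set in case it is not already loaded in this session */
   loadactionset 'searchAnalytics';

   searchAnalytics.buildTermIndex /
      /* The input table was created in the casuser caslib */
      table={name='significant_terms_french', caslib='casuser'},
      casOut={name='term_index_french', replace=true},
      fields={'_Term_'},
      tokenize=true,
      language='FRENCH';
RUN;
QUIT;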

Result:
An output table named `term_index_french` is created. The terms within are indexed according to French linguistic rules, improving the accuracy of search-related tasks for that language.
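When the action runs as one step in a longer program, it can help to check whether it succeeded before later steps use the output table. The sketch below is one possible pattern, not part of the original example: it reruns the French call with the CASL `result=` and `status=` options and tests the severity code. The variable names `r` and `rc` are arbitrary, and an active CAS session is assumed.

```sas
PROC CAS;
   /* Load the action set in case it is not already loaded in this session */
   loadactionset 'searchAnalytics';

   /* Capture the action results in r and the completion status in rc */
   searchAnalytics.buildTermIndex result=r status=rc /
      table={name='significant_terms_french', caslib='casuser'},
      casOut={name='term_index_french', replace=true},
      fields={'_Term_'},
      tokenize=true,
      language='FRENCH';

   /* A severity above 0 indicates a warning or an error */
   if rc.severity > 0 then do;
      print 'buildTermIndex reported a warning or error:';
      print rc;
   end;
   else do;
      print 'buildTermIndex completed normally.';
   end;
RUN;
QUIT;
```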

Associated Scenarios

Use Case

E-commerce Product Search Autocomplete

An online bookstore wants to optimize their search engine. They need to create a term index from a list of popular book titles to enable an efficient autocomplete feature. The d...

Use Case

High Volume Server Log Indexing

The IT Operations team needs to index a large volume of system logs to identify recurring error patterns. The test aims to validate the performance and stability of the action w...

Use Case

Multi-lingual Support Tickets with Dirty Data

A global customer support platform handles tickets in Spanish. The raw data includes special characters, empty fields, and some fields that are already pre-processed tags. The t...

Related Actions

searchAnalytics

buildAutoComplete
