searchAnalytics buildTermIndex

Multi-lingual Support Tickets with Dirty Data

Test Scenario & Use Case

Business Context

A global customer support platform handles tickets written in Spanish. The raw data includes special characters (such as ¡, ¿, and accented letters), empty fields, and some values that are already pre-processed tags. The test verifies that indexing is robust to mixed data quality and checks the behavior of the tokenize parameter.

About the Set: searchAnalytics

Data indexing and search functionalities.

Data Preparation

Creation of a dataset with Spanish text, empty values, and pre-tokenized tags.

DATA casuser.dirty_support;
    LENGTH comment $ 100 category $ 50;
    INFILE DATALINES dsd;
    INPUT comment $ category $;
    DATALINES;
¡Hola! Necesito ayuda,General
,Billing
Error de conexión,Technical
¿Por qué?,General
PRE_TAGGED_ITEM,Legacy
;
RUN;
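
Before indexing, it can help to confirm that the edge cases loaded as intended, in particular the row whose comment is empty. The following is a minimal sketch, assuming the casuser libref is still assigned in the current session as it is in the DATA step above; it only prints the table and does not change it.

```sas
/* Print the prepared table to check the edge cases by eye:        */
/* the second row should show a blank comment with category        */
/* 'Billing', and the Spanish punctuation should appear unchanged. */
PROC PRINT DATA=casuser.dirty_support;
RUN;
```

If the accented or inverted punctuation characters look garbled in the output, the session encoding is a likely place to look before running the indexing steps.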

Implementation Steps

1. Load the dataset containing Spanish text and edge cases.
/* Data loaded in the Data Preparation step */

2. Index the 'comment' field with Spanish settings (tokenize=TRUE).
PROC CAS;
    searchAnalytics.buildTermIndex /
        TABLE={name='dirty_support'},
        casOut={name='support_index_es', replace=true},
        fields={'comment'},
        tokenize=true,
        language='SPANISH';
RUN;

3. Index the 'category' field relying on default tokenize=FALSE (treating content as whole terms).
PROC CAS;
    searchAnalytics.buildTermIndex /
        TABLE={name='dirty_support'},
        casOut={name='category_index', replace=true},
        fields={'category'};
RUN;

Expected Result


Two indexes are created. 'support_index_es' correctly handles Spanish characters and skips empty rows. 'category_index' treats the 'category' values as single terms (not tokenized), validating the default parameter behavior.
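
One way to check this result is to fetch a few rows from each output table and compare them by eye. The sketch below uses the standard table.fetch action and assumes both output tables are in the active caslib under the names given in the steps above. It makes no assumption about the column layout that buildTermIndex produces, so the output needs to be read manually.

```sas
/* Sketch only: inspect the two index tables created in steps 2 and 3. */
PROC CAS;
    /* Show up to 20 rows of the Spanish comment index */
    table.fetch / table={name='support_index_es'}, to=20;
    /* Show up to 20 rows of the category index */
    table.fetch / table={name='category_index'}, to=20;
RUN;
```

In the first table, look for terms derived from the Spanish comments and for the absence of any entry from the empty comment row. In the second, each category value such as 'General' or 'Technical' would be expected to appear as a single whole term.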