Multi-lingual Support Tickets with Dirty Data

Business Context

A global customer support platform handles tickets in Spanish. The raw data includes special characters (inverted punctuation and accented letters), an empty comment field, and a value that is already a pre-processed tag. The test verifies robustness against mixed data quality and the behavior of the tokenize parameter.

About the Action Set: searchAnalytics

Data indexing and search functionalities.

Data Preparation

Creation of a dataset with Spanish text, empty values, and pre-tokenized tags.

DATA casuser.dirty_support;
  LENGTH comment $ 100 category $ 50;
  INFILE DATALINES DSD;
  INPUT comment $ category $;
  DATALINES;
¡Hola! Necesito ayuda,General
,Billing
Error de conexión,Technical
¿Por qué?,General
PRE_TAGGED_ITEM,Legacy
;
RUN;
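
Before indexing, it can help to confirm that the edge cases were read as intended. The following is a minimal sketch, assuming the casuser libref is already assigned to a CAS session; it is not part of the original test. With the DSD option, the line that starts with a comma should produce a missing comment for the Billing row.

```sas
/* Sketch only: list all rows to check the Spanish characters visually. */
/* Row order in a CAS table is not guaranteed.                          */
PROC PRINT DATA=casuser.dirty_support;
RUN;

/* Show only the rows whose comment is empty (the Billing row is expected). */
PROC PRINT DATA=casuser.dirty_support;
  WHERE MISSING(comment);
RUN;
```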

Implementation Steps

Load the dataset containing Spanish text and edge cases.

/* Data loaded in data_prep step */

Index the 'comment' field with Spanish settings (tokenize=TRUE).

PROC CAS;
  searchAnalytics.buildTermIndex /
    TABLE={name='dirty_support'}
    casOut={name='support_index_es', replace=true}
    fields={'comment'}
    tokenize=true
    language='SPANISH';
RUN;

Index the 'category' field relying on default tokenize=FALSE (treating content as whole terms).

PROC CAS;
  searchAnalytics.buildTermIndex /
    TABLE={name='dirty_support'}
    casOut={name='category_index', replace=true}
    fields={'category'};
RUN;

Expected Result

Two indexes are created. 'support_index_es' correctly handles Spanish characters and skips empty rows. 'category_index' treats the 'category' values as single terms (not tokenized), validating the default parameter behavior.
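
One way to check this result is to inspect both output tables with general-purpose actions from the table action set. This is a hedged sketch rather than part of the original test: it assumes both tables were written to the active caslib, and it makes no assumption about the column layout that buildTermIndex produces, so the rows are fetched as-is for manual review.

```sas
PROC CAS;
  /* Row and column counts for each index table. */
  table.tableInfo / name='support_index_es';
  table.tableInfo / name='category_index';

  /* Fetch up to 20 rows from each table for a visual check. */
  table.fetch / table={name='support_index_es'} to=20;
  table.fetch / table={name='category_index'} to=20;
RUN;
QUIT;
```

If the test behaves as described, the rows fetched from category_index should show values such as General or Technical as whole terms.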

See the technical documentation for buildTermIndex