Multi-Language Support Ticket Classification (Japanese)

Business Context

A global software vendor needs to classify support tickets originating from Japan. The ticket text contains CJK characters, which require specific tokenization strategies. This scenario validates the compileCategory action's ability to handle non-English languages and specific tokenizer settings.

Data Preparation
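
The DATA step in this section writes to a libref named mycas, which the scenario itself never defines. A minimal setup sketch follows, assuming a reachable CAS server and a personal caslib named casuser. The session name mysess is hypothetical.

```sas
/* Assumption: a CAS server is reachable from this SAS session. */
/* mysess is a hypothetical session name chosen for illustration. */
CAS mysess;

/* Bind the mycas libref to the casuser caslib so that            */
/* mycas.jp_support_rules is created as an in-memory CAS table.   */
LIBNAME mycas CAS sessref=mysess caslib=casuser;
```

If the environment already assigns a CAS libref at startup, this step can be skipped.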

Create a rule table containing Japanese keywords for Network and Hardware issues.

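DATA mycas.jp_support_rules;
   LENGTH config $200;
   /* DSD strips the outer double quotes and keeps the embedded commas */
   INFILE DATALINES dsd;
   INPUT config $;
   /* One rule per line; the terminating semicolon must sit on its own line */
   DATALINES;
"CATEGORY:Network,(OR, 'ネットワーク', '接続', '遅延')"
"CATEGORY:Hardware,(OR, 'ハードウェア', '故障', '画面')"
;
RUN;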
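
The load step in the next section reads jp_support_rules.sashdat from the casuser caslib, but the DATA step above only creates an in-memory table. One way to bridge that gap, sketched here under the assumption that the in-memory table lives in casuser and has not been saved yet, is the table.save action:

```sas
PROC CAS;
   /* Assumption: jp_support_rules is an in-memory table in casuser. */
   /* Persist it as a .sashdat file so that loadTable can find it.   */
   table.save /
      table={name='jp_support_rules', caslib='casuser'}
      name='jp_support_rules.sashdat'
      caslib='casuser'
      replace=true;
RUN;
```

If the .sashdat file already exists in the caslib's data source, this step is unnecessary.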

Implementation Steps

Load the Japanese rule definitions.

PROC CAS;
   table.loadTable /
      path='jp_support_rules.sashdat'
      caslib='casuser'
      casout={name='jp_support_rules', replace=true};
RUN;

Compile the category model, specifying the Japanese language and the BASIC tokenizer.

PROC CAS;
   textRuleDevelop.compileCategory /
      table={name='jp_support_rules'}
      config='config'
      language='Japanese'
      tokenizer='BASIC'
      casOut={name='jp_support_model', replace=true};
RUN;

Expected Result

The action must compile the Japanese rules without encoding errors and write the compiled model to the 'jp_support_model' table. Because the model is compiled with language='Japanese' and tokenizer='BASIC', it should be suited to CJK text processing.
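
A quick way to check this outcome is to inspect the tables after compilation. The sketch below assumes both tables are in the active caslib. It confirms only that the model table exists and that the Japanese keywords were stored intact; it does not validate classification quality.

```sas
PROC CAS;
   /* Confirm that the compiled model table exists and report its size */
   table.tableInfo / name='jp_support_model';

   /* Display the rule rows to check that the Japanese characters */
   /* were stored without corruption                              */
   table.fetch / table={name='jp_support_rules'} to=5;
RUN;
```

Garbled or replaced characters in the fetched rules usually point to a session encoding mismatch rather than to the compile step itself.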

See the technical documentation for compileCategory