textRuleDevelop.compileConcept

Japanese Log Analysis with Basic Tokenizer

Test Scenario & Use Case

Business Context

An IT support center in Tokyo analyzes server logs written in Japanese. The standard tokenizer struggles with their technical jargon and log formats, so they use the 'BASIC' tokenizer, which splits text on simple whitespace/punctuation delimiters, for better accuracy on error codes mixed with Japanese characters.
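To see why delimiter-based splitting helps here, consider a rough Python sketch of the idea. This is not SAS's internal BASIC tokenizer, only an illustration under the assumption that whitespace and bracket/comma punctuation act as delimiters, so a mixed token such as `503-FAIL` survives intact:

```python
import re

# Hypothetical illustration of delimiter-based ("basic") tokenization.
# NOT SAS's actual BASIC tokenizer -- just a sketch of the principle:
# split on whitespace and bracket/comma punctuation, so a mixed
# ASCII/Japanese token like '503-FAIL' is kept as one unit.
def basic_tokenize(text: str) -> list[str]:
    return [tok for tok in re.split(r"[\s\[\](),]+", text) if tok]

line = "2024-05-01 12:03:44 [server1] 503-FAIL 致命的なエラー が発生"
print(basic_tokenize(line))
# -> ['2024-05-01', '12:03:44', 'server1', '503-FAIL', '致命的なエラー', 'が発生']
```

A dictionary-based Japanese tokenizer, by contrast, might segment `503-FAIL` or the surrounding Japanese text differently, which is exactly what the support center wants to avoid for its error codes.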
Data Preparation

Creating Japanese rules for error detection, including mixed ASCII and Japanese characters.

/* Rules table: one rule per row, rule code and rule logic delimited by '|' */
DATA casuser.jp_log_rules;
   LENGTH rule_code $32 rule_logic $200;
   INFILE DATALINES DELIMITER='|';
   INPUT rule_code $ rule_logic $;
   DATALINES;
ERR_CRIT|CONCEPT:致命的なエラー
ERR_WARN|CONCEPT:警告
ERR_CODE|REGEX:[0-9]{3}-FAIL
;
RUN;

Implementation Steps

1
Compile the model, selecting the BASIC tokenizer for this Japanese (CJK) use case.
PROC CAS;
   textRuleDevelop.compileConcept /
      TABLE={caslib="casuser", name="jp_log_rules"},
      ruleId="rule_code",
      config="rule_logic",
      language="JAPANESE",
      tokenizer="BASIC",
      casOut={caslib="casuser", name="jp_error_model", replace=TRUE};
RUN;

Expected Result

The model compiles successfully with the 'BASIC' tokenizer, and the system acknowledges the 'JAPANESE' language setting. When the compiled model is later applied, the text '致命的なエラー' (fatal error) is correctly identified even when surrounded by log timestamps.
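As a standalone sanity check, the regular expression from the ERR_CODE rule can be exercised outside of SAS. The log line below is a hypothetical example, not output from the scenario:

```python
import re

# Check the ERR_CODE pattern from the rules table against a
# hypothetical mixed ASCII/Japanese log line (illustration only).
pattern = re.compile(r"[0-9]{3}-FAIL")
log_line = "2024-05-01 12:03:44 [server1] 503-FAIL 致命的なエラー"
print(pattern.search(log_line).group())
# -> 503-FAIL
```

The match confirms that three digits followed by `-FAIL` are picked out even with timestamps and Japanese text on the same line.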