Edge Case: Handling Messy Data with Weighted Scoring

Business Context

A financial institution is analyzing unstructured text from various sources (news feeds, reports) to assess investment risk. The data is inconsistent: some entries are empty, some contain XML-like tags, and the risk model relies on weighted rules where certain keywords carry more significance than others.

About the Action Set: textRuleScore

Rule-based scoring of text documents.

Discover all actions of textRuleScore

Data Preparation

Create a dataset with challenging data, including a completely empty text field, a null value, and text with XML tags. A mock weighted model is created.


/* Source feed: pipe-delimited records; FEED_3 has no content */
DATA mycas.risk_feed;
  LENGTH feed_id $ 10 content $ 1024;
  /* truncover keeps short records from reading past the end of the line */
  INFILE DATALINES truncover dsd dlm='|';
  INPUT feed_id $ content $;
DATALINES;
FEED_1|Breaking News: Alpha Corp announces record profits, shares surge.
FEED_2|Market volatility is a major concern. Analysts predict a downturn.
FEED_3|
FEED_4|This is a neutral report on quarterly earnings.
FEED_5|Rumors of regulatory investigation cause stock to plummet.
;
RUN;

/* Mock weighted model: a placeholder table, not a compiled rule model */
DATA mycas.risk_model_weighted;
  LENGTH _mco_ 8;
  _mco_ = 778899;
RUN;
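
Before scoring, it can help to confirm that the empty record actually reached the CAS table. The following is a minimal sketch, assuming the table is available in the active caslib under the name risk_feed; the where filter and the fetchVars list are illustrative choices, not part of the original scenario.

```sas
PROC CAS;
  /* Fetch only the rows whose content is blank (missing).        */
  /* Assumption: risk_feed lives in the session's active caslib.  */
  table.fetch /
    table={name='risk_feed', where="content = ' '"},
    fetchVars={'feed_id'};
RUN;
QUIT;
```

With the data above, FEED_3 is the record expected to appear in the fetched rows.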

Implementation Steps

First run: set 'docType' to 'TEXT' and 'scoringAlgorithm' to 'WEIGHTED'. This tests how the action treats XML-like tags as plain text and how it processes the empty/null records. A custom delimiter is set for the grouped match output.


PROC CAS;
  /* Run 1: treat content as plain text and apply weighted scoring */
  textRuleScore.applyCategory /
    TABLE={name='risk_feed'},
    docId='feed_id',
    text='content',
    docType='TEXT',
    model={name='risk_model_weighted'},
    scoringAlgorithm='WEIGHTED',
    casOut={name='risk_results_weighted', replace=true},
    groupedMatchOut={name='risk_grouped_matches', replace=true},
    /* Custom separator between matched terms in the grouped output */
    matchDelimiter=' || ';
RUN;
QUIT;

Second run: set 'docType' to 'XML' to see how the action behaves when the content is not well-formed XML. This is a negative test case.


PROC CAS;
  /* Run 2 (negative test): declare the content as XML */
  textRuleScore.applyCategory /
    TABLE={name='risk_feed'},
    docId='feed_id',
    text='content',
    docType='XML',
    model={name='risk_model_weighted'},
    casOut={name='risk_results_xml', replace=true};
RUN;
QUIT;
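
For a negative test, reading the log by eye is fragile. The sketch below repeats the second run but captures the action's status through the CASL result= and status= options, so the outcome can be checked programmatically. The variable names r and rc are arbitrary, and whether this particular action reports malformed XML as a warning or as an error is an assumption to verify in your environment.

```sas
PROC CAS;
  /* Same call as the second run, with result and status captured */
  textRuleScore.applyCategory result=r status=rc /
    table={name='risk_feed'},
    docId='feed_id',
    text='content',
    docType='XML',
    model={name='risk_model_weighted'},
    casOut={name='risk_results_xml', replace=true};

  /* rc.severity: 0 = normal, 1 = warning, 2 = error */
  PRINT 'severity=' rc.severity ' reason=' rc.reason;
  IF rc.severity > 0 THEN DO;
    PRINT 'status message: ' rc.status;
  END;
RUN;
QUIT;
```

A non-zero severity here would support the expectation that the action validates its input; a severity of 0 would suggest the content was scored without XML validation.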

Expected Result

For step 1, the action runs successfully. The 'risk_results_weighted' table shows weighted scores, and the empty/null record (FEED_3) is processed without error, yielding a zero score. The 'risk_grouped_matches' table uses ' || ' to separate matched terms.

For step 2, the action should ideally produce a warning or error in the log indicating that the content is not valid XML, which would demonstrate the action's input validation. The output table 'risk_results_xml' may be empty or contain partial results, depending on the error-handling logic.
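
The expectations above can be checked with a few general-purpose CAS actions. This is a minimal sketch, assuming all output tables were written to the active caslib; the column layout of the output tables is not specified in this scenario, so the tables are fetched whole rather than by column name.

```sas
PROC CAS;
  /* Step 1 outputs: inspect scores and grouped matches */
  table.fetch / table={name='risk_results_weighted'}, to=10;
  table.fetch / table={name='risk_grouped_matches'}, to=10;

  /* Step 2 output: the table may not exist if the action failed */
  table.tableExists result=r / name='risk_results_xml';
  IF r.exists > 0 THEN DO;
    /* Table exists: report how many rows were written */
    simple.numRows / table={name='risk_results_xml'};
  END;
  ELSE DO;
    PRINT 'risk_results_xml was not created';
  END;
RUN;
QUIT;
```

Checking for the table before counting its rows keeps the verification step from failing when the negative test stops the action before any output is written.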

See the technical documentation for applyCategory