textRuleScore applyCategory

Edge Case: Handling Messy Data with Weighted Scoring

Test Scenario & Use Case

Business Context

A financial institution is analyzing unstructured text from various sources (news feeds, reports) to assess investment risk. The data is inconsistent: some entries are empty, some contain XML-like tags, and the risk model relies on weighted rules where certain keywords carry more significance than others.

About the Set: textRuleScore

Rule-based scoring of text documents.

Data Preparation

Create a dataset with challenging data, including a completely empty text field, a null value, and text with XML-like tags. A mock weighted model table is also created.
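
The DATA steps in this section write to a libref named mycas, which the scenario does not define. A minimal setup sketch, assuming a CAS session named mysess and the casuser caslib (both are placeholders, not part of the original scenario; substitute your own):

```sas
/* Assumption: the session name and caslib are placeholders */
cas mysess;                          /* start a CAS session */
libname mycas cas caslib=casuser;    /* bind the mycas libref to that caslib */
```

The PROC CAS steps refer to tables by name only, so this sketch also assumes that the caslib behind mycas is the session's active caslib.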

/* Feed records; FEED_3 has an empty content field */
DATA mycas.risk_feed;
   LENGTH feed_id $ 10 content $ 1024;
   INFILE DATALINES truncover dsd dlm='|';
   INPUT feed_id $ content $;
   DATALINES;
FEED_1|Breaking News: Alpha Corp announces record profits, shares surge.
FEED_2|Market volatility is a major concern. Analysts predict a downturn.
FEED_3|
FEED_4|This is a neutral report on quarterly earnings.
FEED_5|Rumors of regulatory investigation cause stock to plummet.
;
RUN;

/* Mock model table holding a placeholder _mco_ value */
DATA mycas.risk_model_weighted;
   LENGTH _mco_ 8;   /* numeric lengths are given in bytes; 'long' is not a valid length */
   _mco_ = 778899;
RUN;

Implementation Steps

1
Run the action with docType='TEXT' and scoringAlgorithm='WEIGHTED'. This tests whether the XML-like tags are handled as plain text and how the empty/null records are processed. A custom delimiter (' || ') is used for the grouped match output.
PROC CAS;
   textRuleScore.applyCategory /
      TABLE={name='risk_feed'},
      docId='feed_id',
      text='content',
      docType='TEXT',
      model={name='risk_model_weighted'},
      scoringAlgorithm='WEIGHTED',
      casOut={name='risk_results_weighted', replace=true},
      groupedMatchOut={name='risk_grouped_matches', replace=true},
      matchDelimiter=' || ';
RUN;
QUIT;
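
To inspect what step 1 wrote, the two output tables can be fetched in the same session. A minimal sketch using the table.fetch action; the limit of 20 rows is an arbitrary choice, and no output column names are assumed here:

```sas
PROC CAS;
   /* Scores written to the casOut table */
   table.fetch / table={name='risk_results_weighted'}, to=20;
   /* Grouped matches; matched terms are expected to be joined by ' || ' */
   table.fetch / table={name='risk_grouped_matches'}, to=20;
RUN;
QUIT;
```

Looking for FEED_3 in the first listing is a quick way to see how the empty record was scored.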
2
Run the action again with docType='XML' to see how it behaves when the content is not well-formed XML. This is a negative test case.
PROC CAS;
   textRuleScore.applyCategory /
      TABLE={name='risk_feed'},
      docId='feed_id',
      text='content',
      docType='XML',
      model={name='risk_model_weighted'},
      casOut={name='risk_results_xml', replace=true};
RUN;
QUIT;
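
Because step 2 is a negative test, it can help to capture the action's return status rather than reading the log by hand. A sketch using the result= and status= options of a PROC CAS action call; the variable names r and rc are arbitrary, and the severity that the action reports for malformed XML is not stated in this scenario:

```sas
PROC CAS;
   loadactionset 'textRuleScore';   /* make sure the action set is loaded */
   textRuleScore.applyCategory result=r status=rc /
      table={name='risk_feed'},
      docId='feed_id',
      text='content',
      docType='XML',
      model={name='risk_model_weighted'},
      casOut={name='risk_results_xml', replace=true};
   /* severity: 0 = normal, 1 = warning, 2 = error */
   if (rc.severity == 0) then do;
      print 'applyCategory completed normally';
   end;
   else do;
      print 'applyCategory reported: ' rc.status;
   end;
RUN;
QUIT;
```

Branching on rc.severity keeps the test outcome explicit whichever way the action treats the input.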

Expected Result


For step 1, the action runs successfully. The 'risk_results_weighted' table shows weighted scores, and the empty/null records (FEED_3) are processed without error, yielding zero scores. The 'risk_grouped_matches' table uses ' || ' to separate matched terms.

For step 2, the action should ideally produce a warning or error in the log indicating that the content is not valid XML, which would demonstrate the action's input validation. The output table 'risk_results_xml' may be empty or contain partial results, depending on the error handling logic.
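
Since the expected result leaves open whether 'risk_results_xml' is created at all, one way to check is sketched here with the table.tableExists and table.recordCount actions (the variable name r is arbitrary):

```sas
PROC CAS;
   table.tableExists result=r / name='risk_results_xml';
   if (r.exists > 0) then do;
      /* The table exists; report how many rows it holds */
      table.recordCount / table={name='risk_results_xml'};
   end;
   else do;
      print 'risk_results_xml was not created';
   end;
RUN;
QUIT;
```

A row count of zero would correspond to the "empty" outcome described above, while a nonzero count would point to partial results.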