textRuleScore applyConcept

Edge Case: Financial Compliance Screening with Overlapping Concepts and Filtering

Test Scenario & Use Case

Business Context

A financial compliance team must screen internal communications for mentions of specific, restricted projects ('Project Chimera') to prevent information leaks. They need to distinguish these from common, overlapping terms ('project') and filter out irrelevant, noisy concepts.

About the Set: textRuleScore

Rule-based scoring of text documents.

Data Preparation

Create a 'communications' table with tricky data: overlapping terms, null/empty text, and non-English text. The LITI model ('compliance_liti') defines a generic concept to be dropped and a specific, longer concept to be matched.

DATA mycas.communications;
  LENGTH msg_id $ 10 msg_text $ 300;
  /* TRUNCOVER keeps the empty msg4 text as a missing value instead of reading past the record */
  INFILE DATALINES delimiter='|' truncover;
  INPUT msg_id $ msg_text $;
  DATALINES;
msg1|The team discussed Project Chimera in the meeting.
msg2|This new project is very demanding.
msg3|Let's talk about the Project Chimera funding.
msg4|
msg5|Ceci est un test dans une autre langue.
msg6|Just a regular project update.
;
RUN;

DATA mycas.compliance_liti;
  LENGTH model_id $ 10 model_txt $ 200;
  /* The '|' delimiter is required so model_txt is read whole, including the space in 'Project Chimera' */
  INFILE DATALINES delimiter='|';
  INPUT model_id $ model_txt $;
  DATALINES;
comp1|CONCEPT:GENERIC_PROJECT@project
comp1|CONCEPT:RESTRICTED_PROJECT@Project Chimera
;
RUN;

Implementation Steps

1. Load the communication data and the compliance LITI model into CAS.
/* Data is prepared and loaded in the Data Preparation step */
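
The DATA steps in the Data Preparation section write to a `mycas` libref, which the page never defines. A minimal sketch of the setup they appear to assume is shown here, to be run before those DATA steps. The session name `mysess` and the `casuser` caslib are assumptions chosen for illustration, not values taken from the original scenario.

```sas
/* Start a CAS session; the name 'mysess' is an assumption */
cas mysess;

/* Bind the 'mycas' libref used by the DATA steps to a caslib.
   'casuser' is an assumption; substitute the caslib used at your site. */
libname mycas cas caslib=casuser;
```

Because the applyConcept call refers to its tables by name only, the caslib bound to `mycas` would also need to be the session's active caslib for the tables to be found.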
2. Run applyConcept with 'matchType' set to 'LONGEST' to ensure 'Project Chimera' is matched over just 'project'. Use 'dropConcepts' to filter out the noisy 'GENERIC_PROJECT' concept from the results.
PROC CAS;
  textRuleScore.applyConcept /
    table={name='communications'},
    docId='msg_id',
    text='msg_text',
    model={name='compliance_liti'},
    matchType='LONGEST',
    dropConcepts={'GENERIC_PROJECT'},
    casOut={name='compliance_hits', replace=true};
RUN;
QUIT;

Expected Result


The action runs without errors. The empty record ('msg4') and the non-English record ('msg5') produce no matches, and 'msg2' and 'msg6' match only 'GENERIC_PROJECT', which 'dropConcepts' removes. The output table 'compliance_hits' therefore contains exactly two rows, one for 'msg1' and one for 'msg3', both identifying the concept 'RESTRICTED_PROJECT'. The 'GENERIC_PROJECT' concept is not present in the output, demonstrating that both 'matchType=LONGEST' and 'dropConcepts' worked as expected.
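
One way to check this expectation is to count and display the rows of the output table. The sketch below uses the `table.recordCount` and `table.fetch` actions. It assumes that 'compliance_hits' sits in the session's active caslib, and it does not rely on any particular output column names, since the page does not list them.

```sas
proc cas;
  /* Count the rows in the scored output; the scenario expects 2 */
  table.recordCount / table={name='compliance_hits'};

  /* Display the hits to confirm that only msg1 and msg3 appear,
     each tagged with RESTRICTED_PROJECT */
  table.fetch / table={name='compliance_hits'}, to=10;
run;
quit;
```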