Edge Case: Financial Compliance Screening with Overlapping Concepts and Filtering

Business Context

A financial compliance team must screen internal communications for mentions of specific, restricted projects ('Project Chimera') to prevent information leaks. They need to distinguish these from common, overlapping terms ('project') and filter out irrelevant, noisy concepts.

About the Set : textRuleScore

Rule-based scoring of text documents.

Discover all actions of textRuleScore

Data Preparation

Create a 'communications' table with tricky data: overlapping terms, null/empty text, and non-English text. The LITI model ('compliance_liti') defines a generic concept to be dropped and a specific, longer concept to be matched.
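The DATA steps in this section write to a libref named mycas, which the page does not define. A minimal setup sketch follows, assuming a SAS Viya environment with CAS access; the session name mysess and the choice of the CASUSER caslib are illustrative assumptions, not part of the original scenario.

```sas
/* Start a CAS session. The name "mysess" is an arbitrary, hypothetical choice. */
cas mysess;

/* Bind the libref "mycas" to a caslib so that DATA steps writing to
   mycas.<table> create in-memory CAS tables.
   Assumption: CASUSER is also the active caslib, so PROC CAS can find
   the tables by name alone. */
libname mycas cas caslib=casuser;
```

If a different caslib is used, the table references in the scoring step would likely need a matching caslib= option.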


/* Messages to screen; '|' separates the ID from the text. */
DATA mycas.communications;
    LENGTH msg_id $ 10 msg_text $ 300;
    /* TRUNCOVER guards against msg4's empty text field making INPUT read on into the next line. */
    INFILE DATALINES DELIMITER='|' TRUNCOVER;
    INPUT msg_id $ msg_text $;
    DATALINES;
msg1|The team discussed Project Chimera in the meeting.
msg2|This new project is very demanding.
msg3|Let's talk about the Project Chimera funding.
msg4|
msg5|Ceci est un test dans une autre langue.
msg6|Just a regular project update.
;
RUN;

/* Concept rules, one per row, using the same '|' delimiter. */
DATA mycas.compliance_liti;
    LENGTH model_id $ 10 model_txt $ 200;
    /* The delimiter must be declared here too; otherwise each line is split on blanks. */
    INFILE DATALINES DELIMITER='|' TRUNCOVER;
    INPUT model_id $ model_txt $;
    DATALINES;
comp1|CONCEPT:GENERIC_PROJECT@project
comp1|CONCEPT:RESTRICTED_PROJECT@Project Chimera
;
RUN;

Implementation Steps

Load the communication data and the compliance LITI model into CAS.


/* Data is prepared and loaded in the data_prep step */

Run applyConcept with 'matchType' set to 'LONGEST' to ensure 'Project Chimera' is matched over just 'project'. Use 'dropConcepts' to filter out the noisy 'GENERIC_PROJECT' concept from the results.


PROC CAS;
    textRuleScore.applyConcept /
        table={name='communications'},
        docId='msg_id',
        text='msg_text',
        model={name='compliance_liti'},
        matchType='LONGEST',                 /* prefer 'Project Chimera' over 'project' */
        dropConcepts={'GENERIC_PROJECT'},    /* filter the noisy generic concept */
        casOut={name='compliance_hits', replace=true};
RUN;
QUIT;
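To see what the two options contribute, the same call can be rerun without them and the results compared. This is only a sketch: it assumes 'ALL' is an accepted matchType value that returns every match, and the output table name compliance_hits_all is hypothetical.

```sas
PROC CAS;
    /* Comparison run. Assumption: matchType='ALL' returns every match,
       including the shorter 'project' inside 'Project Chimera'.
       dropConcepts is omitted so GENERIC_PROJECT hits are kept. */
    textRuleScore.applyConcept /
        table={name='communications'},
        docId='msg_id',
        text='msg_text',
        model={name='compliance_liti'},
        matchType='ALL',
        casOut={name='compliance_hits_all', replace=true};  /* hypothetical table name */
RUN;
QUIT;
```

Rows that appear in compliance_hits_all but not in compliance_hits would be the ones removed by the longest-match rule or by the dropConcepts filter.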

Expected Result

The action runs without errors. The empty record ('msg4') and the French record ('msg5') produce no matches, and 'msg2' and 'msg6' match only 'GENERIC_PROJECT', which is dropped. The output table 'compliance_hits' therefore contains exactly two rows, one for 'msg1' and one for 'msg3', both identifying the concept 'RESTRICTED_PROJECT'. The 'GENERIC_PROJECT' concept is absent from the output, which shows that both 'matchType=LONGEST' and 'dropConcepts' worked as expected.
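The expected result can be checked with a short inspection step. The sketch below assumes the scoring step has already created compliance_hits in the active caslib; it uses only the general-purpose table actions and does not depend on the output column names.

```sas
PROC CAS;
    /* Row count of the output table; the scenario expects 2. */
    table.recordCount / table={name='compliance_hits'};

    /* Display the rows to confirm that only msg1 and msg3 appear
       and that both carry the RESTRICTED_PROJECT concept. */
    table.fetch / table={name='compliance_hits'};
RUN;
QUIT;
```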

See the technical documentation for applyConcept