textRuleScore applyCategory

Performance Case: High-Volume E-Discovery Document Screening

Test Scenario & Use Case

Business Context

A law firm is conducting e-discovery and needs to categorize millions of documents (emails, memos) as 'Relevant', 'Privileged', or 'Non-Relevant' based on a complex legal rule set. Speed and efficiency are critical.

About the Action Set: textRuleScore

Rule-based scoring of text documents.

Data Preparation

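Simulate a very large dataset of 2 million documents. The text does not need to be complex; a high volume of rows is enough to test the action's throughput. Every 100th document (1% of rows) contains privileged-sounding text, and the rest contain a routine status message. A simple mock model table is also created.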

DATA mycas.legal_docs_large;
   LENGTH doc_uuid $ 36 doc_content $ 256;
   DO i = 1 TO 2000000;
      doc_uuid = uuidgen();
      /* Every 100th document carries privileged-sounding text */
      IF mod(i, 100) = 0 THEN doc_content = 'This document discusses the confidential merger agreement and is privileged.';
      ELSE doc_content = 'Please find attached the weekly status report.';
      OUTPUT;
   END;
RUN;

/* Mock model table: a placeholder only, not a compiled category model */
DATA mycas.legal_model;
   LENGTH _mco_ 8;   /* numeric lengths are given in bytes; 'long' is not a valid LENGTH value */
   _mco_ = 445566;
RUN;
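
Before scoring, it can help to confirm that the simulated table has the intended mix of documents. The following is a minimal sketch using the simple.freq action; it assumes the mycas libref points to the session's active caslib, so the table can be referenced by name alone, as the scoring step does.

```sas
PROC CAS;
   /* Count rows for each distinct value of doc_content */
   simple.freq /
      table={name='legal_docs_large'},
      inputs={'doc_content'};
RUN;
QUIT;
```

Because the DATA step assigns the privileged text to every 100th row, the frequency table should show 20,000 privileged rows against 1,980,000 status-report rows.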

Implementation Steps

1. Run applyCategory on the 2-million-row table. Only the primary output table (casOut) is requested to maximize performance; the more detailed 'matchOut' and 'groupedMatchOut' tables are omitted.
PROC CAS;
   textRuleScore.applyCategory /
      table={name='legal_docs_large'},
      docId='doc_uuid',
      text='doc_content',
      model={name='legal_model'},
      casOut={name='legal_docs_categorized', replace=true};
RUN;
QUIT;
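
Since this is a performance case, it is useful to capture timing and the action's return status along with the run. The sketch below is one way to do that, not the only one: it turns on the metrics session option so that the server logs elapsed time and memory for each action, and uses the CASL status= option to capture the outcome. The variable name rc is an arbitrary choice for illustration.

```sas
PROC CAS;
   /* Ask the server to log timing and memory metrics for each action */
   sessionProp.setSessOpt / metrics=true;

   /* Same call as in step 1; status= stores the action's return status in rc */
   textRuleScore.applyCategory status=rc /
      table={name='legal_docs_large'},
      docId='doc_uuid',
      text='doc_content',
      model={name='legal_model'},
      casOut={name='legal_docs_categorized', replace=true};

   /* A severity of 0 indicates the action completed without warnings or errors */
   print "applyCategory severity: " rc.severity;
RUN;
QUIT;
```

The metrics notes appear in the SAS log after the action finishes, which gives a concrete number to compare between runs in place of a subjective judgment of speed.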

Expected Result


The action successfully processes the 2 million documents without errors and in a timely manner. A single output table, 'legal_docs_categorized', is created. The test validates the scalability and stability of the action under a heavy load.
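
The expected result can be checked by inspecting the output table after the run. This is a minimal sketch under the same assumption as the scoring step, namely that the table lives in the active caslib. It does not assert a specific row count, because the number of output rows depends on how many documents match a category in the model.

```sas
PROC CAS;
   /* Confirm the output table exists and report its row and column counts */
   table.tableInfo / name='legal_docs_categorized';

   /* Look at the first few categorized rows */
   table.fetch /
      table={name='legal_docs_categorized'},
      to=5;
RUN;
QUIT;
```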