Performance Case: High-Volume E-Discovery Document Screening

Business Context

A law firm is conducting e-discovery and must categorize millions of documents (emails, memos) as 'Relevant', 'Privileged', or 'Non-Relevant' according to a complex legal rule set. Throughput and stability under load are critical.

About the Action Set: textRuleScore

Rule-based scoring of text documents.

Discover all actions of textRuleScore

Data Preparation

Simulate a very large dataset of documents. We don't need complex text, just a high volume of rows to test the action's throughput. A simple mock model is also created.

/* Generate 2,000,000 synthetic documents; every 100th row carries 'privileged' text. */
DATA mycas.legal_docs_large;
  LENGTH doc_uuid $ 36 doc_content $ 256;
  DO i = 1 TO 2000000;
    doc_uuid = uuidgen();
    IF mod(i, 100) = 0 THEN
      doc_content = 'This document discusses the confidential merger agreement and is privileged.';
    ELSE
      doc_content = 'Please find attached the weekly status report.';
    OUTPUT;
  END;
RUN;

/* Mock model table. The value is a placeholder, not a compiled rule model. */
DATA mycas.legal_model;
  LENGTH _mco_ 8;  /* numeric length in bytes; 'long' is not a valid LENGTH value */
  _mco_ = 445566;
RUN;
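
Before scoring, it may be worth confirming that the generated table has the expected size, since a throughput test is only meaningful if the input volume is what was intended. The following is a minimal sketch, assuming the table sits in the session's active caslib and still contains the loop counter column i; with the generation logic above, the two counts should come out as 2,000,000 and 20,000.

```sas
PROC CAS;
  /* Total number of rows in the generated table. */
  simple.numRows / table={name='legal_docs_large'};

  /* Rows seeded with the 'privileged' text (every 100th value of i). */
  simple.numRows / table={name='legal_docs_large', where="mod(i, 100) = 0"};
RUN;
QUIT;
```

A different total would suggest the DATA step did not run as a single pass, which is worth resolving before interpreting any timing.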

Execution Steps

Run applyCategory on the 2 million row table. Only the primary output table is requested to maximize performance, omitting the more detailed 'matchOut' and 'groupedMatchOut' tables.

PROC CAS;
  /* Score every document; only casOut is requested to keep output overhead low. */
  textRuleScore.applyCategory /
    table={name='legal_docs_large'},
    docId='doc_uuid',
    text='doc_content',
    model={name='legal_model'},
    casOut={name='legal_docs_categorized', replace=true};
RUN;
QUIT;

Expected Result

The action successfully processes the 2 million documents without errors and in a timely manner. A single output table, 'legal_docs_categorized', is created. The test validates the scalability and stability of the action under a heavy load.
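
One way to check the expected result is to inspect the output table after the action finishes. This is a sketch only, assuming the same CAS session and active caslib as the scoring step; it lists the output columns instead of assuming their names.

```sas
PROC CAS;
  /* Confirm the output table exists and report its row and column counts. */
  table.tableInfo / name='legal_docs_categorized';

  /* List the columns the action actually produced. */
  table.columnInfo / table={name='legal_docs_categorized'};
RUN;
QUIT;
```

If tableInfo reports that the table is missing, the scoring step did not complete and the log from the applyCategory call is the place to look.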

See the technical documentation for applyCategory