High-Volume Fraud Detection with Parallel Forest Training

Business Context

A credit card processor needs to validate a Random Forest fraud-detection model on a large dataset. Because of the data volume and the complexity of the forest, the team wants to use parallel processing to speed up 10-fold cross-validation.

Data Preparation

Simulate a larger dataset of 100,000 credit card transactions, of which roughly 2% are flagged as fraud.

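/* Simulate 100,000 transactions with a fixed seed for reproducibility */
DATA casuser.fraud_transactions;
  call streaminit(777);
  DO i = 1 to 100000;
    amount = rand('exponential', 50);          /* transaction amount, scale 50 */
    merchant_cat = rand('integer', 1, 20);     /* merchant category, 1 to 20 */
    time_since_last = rand('uniform', 0, 60);  /* time since previous transaction, 0 to 60 */
    /* Flag about 2% of transactions as fraud */
    IF rand('uniform') < 0.02 THEN is_fraud = 1;
    ELSE is_fraud = 0;
    OUTPUT;
  END;
RUN;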
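Because the fraud flag is drawn at random, it can be worth confirming the class balance before cross-validating. The following is a minimal sketch, assuming the `casuser` libref is assigned to the CAS session as in the DATA step above; the fraud share should land close to the 2% threshold used in the simulation, but the exact count depends on the random stream.

```sas
/* Tabulate the target to confirm the simulated fraud rate */
PROC FREQ data=casuser.fraud_transactions;
  tables is_fraud;
RUN;
```

With only about 2% positives, each of the 10 hold-out sets of 10,000 rows would be expected to contain on the order of 200 fraud cases.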

Implementation Steps

Load the large transaction dataset.

PROC CAS;
  /* Run a DATA step inside CAS that rewrites the table in place */
  dataStep.runCode / code="
    data casuser.fraud_transactions;
      set casuser.fraud_transactions;
    run;
  ";
QUIT;

Run 10-fold cross-validation with the 'FOREST' model type, explicitly enabling parallel folds and requesting four subsession workers.

PROC CAS;
  /* 10-fold cross-validation of a forest model, with folds run in parallel */
  mlTools.crossValidate /
    TABLE={name="fraud_transactions"}
    modelType="FOREST"
    kFolds=10
    parallelFolds=TRUE
    nSubsessionWorkers=4
    casOut={name="cv_fraud_scored", replace=TRUE}
    trainOptions={
      target="is_fraud",
      inputs={"amount", "merchant_cat", "time_since_last"},
      nominals={"is_fraud", "merchant_cat"}
    };
QUIT;
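Once the action finishes, the table named in `casOut` can be inspected to confirm that it was created and to see which columns it contains. This is a sketch only: it assumes `cv_fraud_scored` was written to the active caslib (no caslib is given in the call above), and it makes no assumption about the names of the scored columns, which is why it lists them rather than referencing them.

```sas
PROC CAS;
  /* Report the row count and basic properties of the scored table */
  table.tableInfo / name="cv_fraud_scored";

  /* List the columns the action wrote to the scored table */
  table.columnInfo / table={name="cv_fraud_scored"};
QUIT;
```

The row count reported by `tableInfo` can then be compared with the 100,000 input transactions.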

Expected Result

The action distributes the 10 folds across the available worker nodes, so execution time should be lower than for a serial run of the same folds. The output table cv_fraud_scored contains scored data for all 100,000 records, compiled from the validation hold-out set of each fold.
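One way to check the speed-up on a given deployment is to rerun the same cross-validation with parallel folds disabled and compare the "real time" notes that the SAS log prints for each PROC CAS step. The sketch below reuses only parameters from the parallel call; it assumes that `parallelFolds=FALSE` is accepted and that `nSubsessionWorkers` can be omitted in that case. The output table name `cv_fraud_scored_serial` is a hypothetical name chosen for illustration.

```sas
PROC CAS;
  /* Serial baseline: same model and folds, processed one fold at a time */
  mlTools.crossValidate /
    table={name="fraud_transactions"}
    modelType="FOREST"
    kFolds=10
    parallelFolds=FALSE  /* assumption: disables parallel fold execution */
    casOut={name="cv_fraud_scored_serial", replace=TRUE}  /* hypothetical table name */
    trainOptions={
      target="is_fraud",
      inputs={"amount", "merchant_cat", "time_since_last"},
      nominals={"is_fraud", "merchant_cat"}
    };
QUIT;
```

Running the baseline in its own PROC CAS step keeps its timing note separate from the parallel run. The size of the difference will depend on the number of workers and the load on the CAS server.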

See the technical documentation for crossValidate.