mlTools crossValidate

High-Volume Fraud Detection with Parallel Forest Training

Test Scenario & Use Case

Business Context

A credit card processor needs to validate a Random Forest fraud detection model on a large dataset. Because of the data volume and the complexity of the forest, it wants to use parallel processing to speed up 10-fold cross-validation.

Data Preparation

Simulate a large dataset of 100,000 credit card transactions, in which roughly 2% of the records are flagged as fraud.

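/* Simulate 100,000 transactions with a reproducible random stream */
DATA casuser.fraud_transactions;
   call streaminit(777);                          /* fixed seed for repeatability */
   DO i = 1 to 100000;
      amount = rand('exponential', 50);           /* transaction amount, scale 50 */
      merchant_cat = rand('integer', 1, 20);      /* merchant category, 1 to 20 */
      time_since_last = rand('uniform', 0, 60);   /* time since last transaction, 0 to 60 */
      /* flag about 2% of transactions as fraud */
      IF rand('uniform') < 0.02 THEN is_fraud = 1;
      ELSE is_fraud = 0;
      OUTPUT;
   END;
RUN;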
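Before training, it can help to confirm that the simulated class balance is close to the intended 2%. The following is a minimal sketch, assuming an active CAS session and that the `casuser` libref used above maps to the CASUSER caslib; it relies only on the `simple.freq` action.

```sas
/* Assumption: a CAS session is active and the table lives in the CASUSER caslib. */
proc cas;
   /* Frequency of each is_fraud level; about 2% of rows should have is_fraud = 1. */
   simple.freq /
      table={caslib="casuser", name="fraud_transactions"},
      inputs={"is_fraud"};
quit;
```

With a fraud rate this low, each of the 10 folds holds only about 200 fraud cases on average, which is worth keeping in mind when reading per-fold results.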

Implementation Steps

1. Load the large transaction dataset.
PROC CAS;
   /* Run a DATA step inside CAS that rewrites the table in place */
   dataStep.runCode / code="
      data casuser.fraud_transactions;
         set casuser.fraud_transactions;
      run;
   ";
QUIT;
2. Run 10-fold cross-validation with the FOREST model type, explicitly enabling parallel folds (parallelFolds=TRUE) and allocating worker nodes (nSubsessionWorkers=4).
PROC CAS;
   mlTools.crossValidate /
      TABLE={name="fraud_transactions"}
      modelType="FOREST"
      kFolds=10
      parallelFolds=TRUE
      nSubsessionWorkers=4
      casOut={name="cv_fraud_scored", replace=TRUE}
      trainOptions={
         target="is_fraud",
         inputs={"amount", "merchant_cat", "time_since_last"},
         nominals={"is_fraud", "merchant_cat"}
      };
QUIT;
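To judge the benefit of parallel folds, the same call can be repeated as a serial baseline. This is a sketch only: it assumes that setting `parallelFolds=FALSE` makes the folds run one after another, and the output table name `cv_fraud_scored_serial` is hypothetical, chosen so that the parallel results are not overwritten.

```sas
/* Serial baseline sketch: same model and folds, parallel folds disabled. */
proc cas;
   mlTools.crossValidate /
      table={name="fraud_transactions"}
      modelType="FOREST"
      kFolds=10
      parallelFolds=FALSE   /* assumption: folds are then processed serially */
      casOut={name="cv_fraud_scored_serial", replace=TRUE}   /* hypothetical name */
      trainOptions={
         target="is_fraud",
         inputs={"amount", "merchant_cat", "time_since_last"},
         nominals={"is_fraud", "merchant_cat"}
      };
quit;
```

The SAS log reports the real time used by each PROC CAS step, so the two runs can be compared directly from the log.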

Expected Result


The system distributes the 10 folds across the available worker nodes, so the run should finish faster than it would with the folds executed serially. The output table cv_fraud_scored contains scored data for all 100,000 records, compiled from the validation hold-out set of each fold.
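The expected result can be checked with two standard CAS actions. This is a minimal sketch, assuming the output table sits in the active caslib; the names of the scored columns are not stated here, so the example lists them rather than guessing.

```sas
proc cas;
   /* Row count of the scored table; it should match the 100,000 input records. */
   simple.numRows / table={name="cv_fraud_scored"};

   /* List the columns to see which prediction variables the action wrote. */
   table.columnInfo / table={name="cv_fraud_scored"};
quit;
```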