bart bartProbit

Performance Test: Large-Scale Fraud Detection with Distributed Chains

Test Scenario & Use Case

Business Context

A credit card company needs to process a very large dataset of transactions to build a fraud detection model. Due to the data volume and urgency, the model training must be efficient. This scenario tests the action's performance under load and its ability to leverage a distributed computing environment.

About the Set: bart

Bayesian Additive Regression Trees models.

Data Preparation

Create a large dataset (1.5 million transactions) with features like transaction amount, time of day, and merchant category. The target 'Is_Fraud' is a rare event.

DATA transactions_large;
   /* Declare the length up front; otherwise 'Services' is truncated to 6 characters */
   LENGTH Merchant_Category $8;
   CALL streaminit(555);
   DO i = 1 TO 1500000;
      Transaction_Amount = 5 + rand('Exponential', 50);
      Time_Of_Day = floor(rand('Uniform') * 24); /* 0-23 hours */
      IF rand('Uniform') < 0.3 THEN Merchant_Category = 'Online';
      ELSE IF rand('Uniform') < 0.6 THEN Merchant_Category = 'Retail';
      ELSE Merchant_Category = 'Services';
      /* Fraud is more likely with high amounts and late at night */
      /* IFN returns a numeric value; IFC would return a character value */
      logit_p = -8 + (Transaction_Amount / 100) + ifn(Time_Of_Day < 6, 1.5, 0) + ifn(Merchant_Category = 'Online', 0.5, 0);
      p = 1 / (1 + exp(-logit_p));
      Is_Fraud = rand('Bernoulli', p);
      OUTPUT;
   END;
RUN;

/* Copy the table to CAS; assumes an active CAS session and a CASUSER libref */
DATA casuser.transactions_large;
   SET transactions_large;
RUN;
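
Before training, it can be worth confirming that the simulated target really is a rare event. The sketch below assumes the WORK copy of transactions_large created by the DATA step above still exists. Because the merchant category is assigned from two independent uniform draws, about 30% of rows should be 'Online', with the remaining 70% split roughly 60/40 between the other two categories. The fraud rate should come out well below 1%.

```sas
/* Sanity check on the simulated data (run after the DATA steps above) */
proc freq data=transactions_large;
   /* One-way frequency tables for the target and the nominal input */
   tables Is_Fraud Merchant_Category / nocum;
run;
```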

Implementation Steps

1. Promote the transaction table, already loaded into CASUSER by the DATA step above, to global scope.

PROC CAS;
   table.promote / caslib='CASUSER', name='transactions_large';
RUN;
QUIT;
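
One way to confirm the promotion is to query the table's scope with table.tableExists. This is a minimal sketch, assuming the table lives in the CASUSER caslib as in the step above. The result variable name r is arbitrary.

```sas
proc cas;
   table.tableExists result=r / caslib='CASUSER', name='transactions_large';
   /* r.exists: 0 = not found, 1 = session scope, 2 = global scope */
   if r.exists == 2 then print "transactions_large is in global scope.";
   else print "Unexpected tableExists code: " r.exists;
run;
quit;
```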

2. Run bart.bartProbit with performance-oriented parameters: distribute chains across 4 workers, set a maximum training time, and enable the in-memory options.

PROC CAS;
   bart.bartProbit /
      table={name='transactions_large'},
      target='Is_Fraud',
      inputs={'Transaction_Amount', 'Time_Of_Day'},
      nominals={'Merchant_Category'},
      nTree=250,              /* More trees for complex interactions */
      nBI=500,                /* Burn-in iterations */
      nMC=2500,               /* MCMC iterations after burn-in */
      distributeChains=4,     /* Simulate running on 4 workers */
      trainInMem=TRUE,        /* Use more memory for speed */
      obsLeafMapInMem=TRUE,
      maxTrainTime=1800,      /* Set a 30-minute time limit */
      seed=911,
      outputTables={names={ModelInfo='fraud_model_info', NObs='fraud_nobs', VariableSelection='fraud_var_select'}};
RUN;
QUIT;
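
To compare the run against the 30-minute limit, the call can be wrapped with a wall-clock timer and a status check. This is a sketch only. It passes a reduced parameter list, so every other option falls back to the action's defaults. The variable names t0, r, s, and elapsed are illustrative. In practice the same wrapper would go around the full call from the step above.

```sas
proc cas;
   t0 = datetime();                        /* wall-clock start, in seconds */
   bart.bartProbit result=r status=s /
      table={name='transactions_large'},
      target='Is_Fraud',
      inputs={'Transaction_Amount', 'Time_Of_Day'},
      nominals={'Merchant_Category'},
      distributeChains=4,
      maxTrainTime=1800,
      seed=911;
   elapsed = datetime() - t0;
   print "Elapsed seconds: " elapsed;
   print "Status severity: " s.severity;   /* 0 = no warning or error */
run;
quit;
```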

3. Verify that the specified output tables were created.

PROC CAS;
   table.tableInfo / name='fraud_model_info';
   table.tableInfo / name='fraud_nobs';
   table.tableInfo / name='fraud_var_select';
RUN;
QUIT;
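
table.tableInfo only reports that the tables exist and how large they are. To look at their contents, a short table.fetch call is one option. This sketch assumes the output tables were written to the active caslib. The row limit of 20 is arbitrary.

```sas
proc cas;
   /* Print up to 20 rows from two of the output tables */
   table.fetch / table={name='fraud_model_info'}, to=20;
   table.fetch / table={name='fraud_var_select'}, to=20;
run;
quit;
```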

Expected Result


The action should complete the training within the 30-minute 'maxTrainTime' limit. The use of 'distributeChains' should demonstrate the action's capability to handle distributed computation (in a real grid environment). The log should indicate that the training data was stored in memory. Three CAS tables ('fraud_model_info', 'fraud_nobs', 'fraud_var_select') should be created containing the results.