Performance Test: Large-Scale Fraud Detection with Distributed Chains

Business Context

A credit card company needs to process a very large dataset of transactions to build a fraud detection model. Due to the data volume and urgency, the model training must be efficient. This scenario tests the action's performance under load and its ability to leverage a distributed computing environment.

About the Action Set: bart

Bayesian Additive Regression Trees models.

Data Preparation

Create a large dataset (1.5 million transactions) with features like transaction amount, time of day, and merchant category. The target 'Is_Fraud' is a rare event.

/* Assumes an active CAS session and a CASUSER libref bound to the CASUSER caslib */
DATA transactions_large;
   /* Declare the length up front; otherwise the first assignment ('Online') fixes it at 6 and 'Services' is truncated */
   LENGTH Merchant_Category $8;
   call streaminit(555);
   DO i = 1 to 1500000;
      Transaction_Amount = 5 + rand('Exponential', 50);
      Time_Of_Day = floor(rand('Uniform') * 24); /* 0-23 hours */
      IF rand('Uniform') < 0.3 THEN Merchant_Category = 'Online';
      ELSE IF rand('Uniform') < 0.6 THEN Merchant_Category = 'Retail';
      ELSE Merchant_Category = 'Services';
      /* Fraud is more likely with high amounts, in the early hours (00:00-05:59), and online */
      /* IFN returns a numeric value; IFC returns character and forces an implicit conversion */
      logit_p = -8 + (Transaction_Amount / 100) + ifn(Time_Of_Day < 6, 1.5, 0) + ifn(Merchant_Category='Online', 0.5, 0);
      p = 1 / (1 + exp(-logit_p));
      Is_Fraud = rand('Bernoulli', p);
      OUTPUT;
   END;
RUN;

/* Copy the WORK table into the CASUSER caslib */
DATA casuser.transactions_large;
   SET transactions_large;
RUN;
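
Because Is_Fraud is meant to be a rare event, it can help to confirm the class balance before training. The following is a minimal sketch of such a check, not part of the original scenario. It assumes the table sits in the CASUSER caslib and that the simple action set is available in the session.

```sas
PROC CAS;
   /* Frequency counts for the target and the nominal input (assumes caslib CASUSER) */
   simple.freq /
      table={caslib='CASUSER', name='transactions_large'},
      inputs={'Is_Fraud', 'Merchant_Category'};
RUN;
QUIT;
```

With an intercept of -8 in the logit, rows with Is_Fraud = 1 should be a small fraction of the 1.5 million rows, and Merchant_Category should show three levels.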

Implementation Steps

Make the large transaction dataset available in CAS by promoting the table to global scope.

PROC CAS;
   /* table.promote takes caslib= and name= directly, after the slash */
   table.promote / caslib='CASUSER', name='transactions_large';
RUN;
QUIT;

Run bart.bartProbit using performance-oriented parameters: distribute chains across 4 workers, set a max training time, and enable in-memory options.

PROC CAS;
   bart.bartProbit /
      table={name='transactions_large'},
      target='Is_Fraud',
      inputs={'Transaction_Amount', 'Time_Of_Day'},
      nominals={'Merchant_Category'},
      nTree=250,              /* More trees for complex interactions */
      nBI=500,                /* Burn-in iterations */
      nMC=2500,               /* Main-simulation MCMC iterations */
      distributeChains=4,     /* Distribute chains across 4 workers */
      trainInMem=TRUE,        /* Use more memory for speed */
      obsLeafMapInMem=TRUE,   /* Keep the observation-to-leaf map in memory */
      maxTrainTime=1800,      /* 30-minute time limit, in seconds */
      seed=911,
      outputTables={names={ModelInfo='fraud_model_info', NObs='fraud_nobs', VariableSelection='fraud_var_select'}};
RUN;
QUIT;

Verify that the specified output tables were created.

PROC CAS;
   /* table.tableInfo identifies the table with name=, after the slash */
   table.tableInfo / name='fraud_model_info';
   table.tableInfo / name='fraud_nobs';
   table.tableInfo / name='fraud_var_select';
RUN;
QUIT;
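
For an automated test, the same check can be expressed as a loop that reports each table explicitly. This is a hedged sketch: it assumes the output tables were written to the active caslib, and the variable names `expected`, `t`, and `r` are illustrative only.

```sas
PROC CAS;
   /* Names taken from the outputTables parameter of the training step */
   expected = {'fraud_model_info', 'fraud_nobs', 'fraud_var_select'};
   do t over expected;
      /* table.tableExists returns exists=0 when the table is not found */
      table.tableExists result=r / name=t;
      if r.exists = 0 then print 'MISSING: ' || t;
      else print 'Found: ' || t;
   end;
RUN;
QUIT;
```

Unlike table.tableInfo, which raises an error for a missing table, this form keeps going and lists every missing table in one pass.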

Expected Result

The action should complete training within the 30-minute 'maxTrainTime' limit. Setting 'distributeChains' to 4 should demonstrate that the action can distribute the computation across workers (on a real multi-node grid). The log should indicate that the training data was stored in memory. Three CAS tables ('fraud_model_info', 'fraud_nobs', 'fraud_var_select') should be created and should contain the results.
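
To confirm that the three tables actually contain results, their first rows can be printed. The sketch below assumes the tables are in the active caslib. Their column layout is not described here, so the code only displays them and makes no assertion about their content.

```sas
PROC CAS;
   /* Print up to 20 rows of each result table */
   table.fetch / table={name='fraud_model_info'}, to=20;
   table.fetch / table={name='fraud_nobs'}, to=20;
   table.fetch / table={name='fraud_var_select'}, to=20;
RUN;
QUIT;
```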

See the technical documentation for bartProbit