Performance Test: Large-Scale Fraud Detection with Distributed Chains
Test Scenario & Use Case
Business Context
A credit card company needs to process a very large dataset of transactions to build a fraud detection model. Due to the data volume and urgency, the model training must be efficient. This scenario tests the action's performance under load and its ability to leverage a distributed computing environment.
Create a large dataset (1.5 million transactions) with features like transaction amount, time of day, and merchant category. The target 'Is_Fraud' is a rare event.
/* Assumes an active CAS session and a CAS libref named casuser. */
data transactions_large;
   length Merchant_Category $8; /* Long enough for 'Services'; avoids truncation to 'Servic' */
   call streaminit(555);
   do i = 1 to 1500000;
      Transaction_Amount = 5 + rand('Exponential', 50);
      Time_Of_Day = floor(rand('Uniform') * 24); /* 0-23 hours */
      u = rand('Uniform'); /* Single draw so the category split is 30/30/40 */
      if u < 0.3 then Merchant_Category = 'Online';
      else if u < 0.6 then Merchant_Category = 'Retail';
      else Merchant_Category = 'Services';
      /* Fraud is more likely with high amounts and late at night */
      /* IFN returns a numeric value; IFC returns character and forces a conversion */
      logit_p = -8 + (Transaction_Amount / 100)
                + ifn(Time_Of_Day < 6, 1.5, 0)
                + ifn(Merchant_Category = 'Online', 0.5, 0);
      p = 1 / (1 + exp(-logit_p));
      Is_Fraud = rand('Bernoulli', p);
      output;
   end;
   drop i u logit_p p; /* Keep only the model columns */
run;

/* Load the generated data into CAS */
data casuser.transactions_large;
   set transactions_large;
run;
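To see why 'Is_Fraud' is a rare event under this simulation, the fraud probability can be worked out by hand for two representative transactions. This is a rough illustration only. It assumes rand('Exponential', 50) is read as a scale of 50, so the mean amount is 55, and it uses x for the amount, h for the hour, and c for the merchant category (notation introduced here, not in the original).

```latex
\[
\operatorname{logit}(p) = -8 + \frac{x}{100}
  + 1.5\,\mathbf{1}[h < 6] + 0.5\,\mathbf{1}[c = \text{Online}],
\qquad
p = \frac{1}{1 + e^{-\operatorname{logit}(p)}}
\]

% Average-amount daytime retail transaction: x = 55, h >= 6, c = Retail
\[
\operatorname{logit}(p) = -8 + 0.55 = -7.45
\quad\Rightarrow\quad
p = \frac{1}{1 + e^{7.45}} \approx 5.8 \times 10^{-4}
\]

% Average-amount online transaction before 06:00: x = 55, h < 6, c = Online
\[
\operatorname{logit}(p) = -8 + 0.55 + 1.5 + 0.5 = -5.45
\quad\Rightarrow\quad
p = \frac{1}{1 + e^{5.45}} \approx 4.3 \times 10^{-3}
\]
```

Even the higher-risk profile stays below half a percent at the average amount, so the positive class is heavily outnumbered and only very large amounts push the probability up appreciably.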
Run bart.bartProbit using performance-oriented parameters: distribute chains across 4 workers, set a max training time, and enable in-memory options.
proc cas;
   bart.bartProbit /
      table={name='transactions_large', caslib='casuser'}, /* Table loaded in the previous step */
      target='Is_Fraud',
      inputs={'Transaction_Amount', 'Time_Of_Day'},
      nominals={'Merchant_Category'},
      nTree=250, /* More trees for complex interactions */
      nBI=500,
      nMC=2500,
      distributeChains=4, /* Simulate running on 4 workers */
      trainInMem=TRUE, /* Use more memory for speed */
      obsLeafMapInMem=TRUE,
      maxTrainTime=1800, /* Set a 30-minute time limit (seconds) */
      seed=911,
      outputTables={names={ModelInfo='fraud_model_info',
                           NObs='fraud_nobs',
                           VariableSelection='fraud_var_select'}};
   run;
quit;
The action should complete the training within the 30-minute 'maxTrainTime' limit. The use of 'distributeChains' should demonstrate the action's capability to handle distributed computation (in a real grid environment). The log should indicate that the training data was stored in memory. Three CAS tables ('fraud_model_info', 'fraud_nobs', 'fraud_var_select') should be created containing the results.
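The check for the three result tables could be scripted along the following lines. This is a minimal sketch, not part of the original test. It assumes it runs in the same CAS session as the training step and that the output tables were written to the active caslib.

```sas
/* Sketch: confirm that the three expected output tables exist. */
proc cas;
   expected = {'fraud_model_info', 'fraud_nobs', 'fraud_var_select'};
   do tbl over expected;
      table.tableExists result=r / name=tbl;
      /* r.exists is 0 when the table is not found, nonzero otherwise */
      if r.exists = 0 then print 'MISSING: ' || tbl;
      else print 'Found: ' || tbl;
   end;
quit;
```

Any 'MISSING' line in the log would mean the corresponding expected result was not produced, for example because the action stopped before writing its output tables.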