Performance Test: Large-Scale Fraud Detection with Distributed Chains
Test Scenario & Use Case
Business Context
A credit card company needs to process a very large dataset of transactions to build a fraud detection model. Because of the data volume and the urgency, model training must be efficient. This scenario tests the performance of the bart.bartProbit action under load and its ability to use a distributed computing environment.
Create a large dataset (1.5 million transactions) with features like transaction amount, time of day, and merchant category. The target 'Is_Fraud' is a rare event.
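data transactions_large;
   /* Declare the length up front; otherwise the first assignment ('Online')
      fixes it at 6 characters and 'Services' is truncated */
   length Merchant_Category $8;
   call streaminit(555);
   do i = 1 to 1500000;
      Transaction_Amount = 5 + rand('Exponential', 50);
      Time_Of_Day = floor(rand('Uniform') * 24); /* 0-23 hours */
      /* The second draw is independent of the first, so the mix is
         roughly 30% Online, 42% Retail, 28% Services */
      if rand('Uniform') < 0.3 then Merchant_Category = 'Online';
      else if rand('Uniform') < 0.6 then Merchant_Category = 'Retail';
      else Merchant_Category = 'Services';
      /* Fraud is more likely with high amounts and late at night.
         IFN returns a numeric value (IFC returns character). */
      logit_p = -8 + (Transaction_Amount / 100)
                + ifn(Time_Of_Day < 6, 1.5, 0)
                + ifn(Merchant_Category = 'Online', 0.5, 0);
      p = 1 / (1 + exp(-logit_p));
      Is_Fraud = rand('Bernoulli', p);
      output;
   end;
run;

/* Copy the table to CAS; assumes a CAS session and a CASUSER libref already exist */
data casuser.transactions_large;
   set transactions_large;
run;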
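With an intercept of -8 on the logit scale, fraud should make up well under one percent of the generated rows, so it can be worth confirming the class balance before training. The following is a minimal sketch, assuming the WORK table transactions_large created above; it relies only on PROC FREQ and PROC MEANS from Base SAS and is not part of the original scenario.

```sas
/* Sanity check: how rare is Is_Fraud in the generated data? */
proc freq data=transactions_large;
   tables Is_Fraud / nocum;
run;

/* Compare amount and hour of day between fraud and non-fraud rows */
proc means data=transactions_large n mean min max;
   class Is_Fraud;
   var Transaction_Amount Time_Of_Day;
run;
```

If the fraud count turns out to be only a handful of rows, the intercept in logit_p is the natural knob to adjust before running the performance test.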
Run bart.bartProbit using performance-oriented parameters: distribute chains across 4 workers, set a max training time, and enable in-memory options.
proc cas;
   bart.bartProbit /
      table={name='transactions_large'},
      target='Is_Fraud',
      inputs={'Transaction_Amount', 'Time_Of_Day'},
      nominals={'Merchant_Category'},
      nTree=250,              /* More trees for complex interactions */
      nBI=500,
      nMC=2500,
      distributeChains=4,     /* Simulate running on 4 workers */
      trainInMem=TRUE,        /* Use more memory for speed */
      obsLeafMapInMem=TRUE,
      maxTrainTime=1800,      /* Set a 30-minute time limit */
      seed=911,
      outputTables={names={ModelInfo='fraud_model_info',
                           NObs='fraud_nobs',
                           VariableSelection='fraud_var_select'}};
run;
quit;
The action should complete training within the 30-minute limit set by 'maxTrainTime' (1800 seconds). On a real grid environment, 'distributeChains' should show that the action can spread its chains across workers. The log should indicate that the training data was stored in memory. Three CAS tables ('fraud_model_info', 'fraud_nobs', 'fraud_var_select') should be created containing the results.
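The table checks in the expected results can be scripted rather than inspected by hand. The sketch below assumes the three output tables were written to the active caslib (no caslib was given in 'outputTables') and uses the standard table.tableExists and table.fetch actions; the variable names outTables, t, and r are illustrative only.

```sas
proc cas;
   /* Assumption: the output tables live in the active caslib */
   outTables = {'fraud_model_info', 'fraud_nobs', 'fraud_var_select'};

   /* tableExists reports a nonzero value when the table is found */
   do t over outTables;
      table.tableExists result=r / name=t;
      print t ' exists=' r.exists;
   end;

   /* Look at the model summary and the variable selection results */
   table.fetch / table={name='fraud_model_info'};
   table.fetch / table={name='fraud_var_select'};
run;
quit;
```

The elapsed time and the in-memory messages still have to be read from the SAS log of the training step itself.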