bart bartGauss

Performance Case: Large-Scale Customer Value Modeling with Time Constraints

Test Scenario & Use Case

Business Context

A financial services company needs to model the potential future value of a large customer base (millions of records). The modeling process must be completed within a strict 45-minute time window to fit into a nightly batch process. The model will use demographic and behavioral data to predict customer value.
About the Set: bart

Bayesian Additive Regression Trees models.

Data Preparation

Creates a large dataset 'customer_value_large' (simulated with 50,000 records) with a continuous target 'future_value' and several numeric and categorical predictors. A partitioning variable 'data_role' is included.

DATA customer_value_large;
   /* Fix the random seed so the simulated data is reproducible */
   call streaminit(567);
   DO customer_id = 1 to 50000;
      age = 20 + floor(rand('UNIFORM') * 50);
      income = 30000 + rand('UNIFORM') * 150000;
      months_active = 1 + floor(rand('UNIFORM') * 120);
      product_count = 1 + floor(rand('UNIFORM') * 5);
      region = byte(65 + floor(rand('UNIFORM') * 5)); /* A-E */
      /* Target: income and age effects, a months_active x product_count interaction, plus Gaussian noise */
      future_value = 500 + (income / 100) + (months_active * 10) * (product_count) - (age * 5) + rand('NORMAL', 0, 200);
      /* Roughly 70% of rows go to TRAIN, the rest to TEST */
      IF rand('UNIFORM') < 0.7 THEN data_role = 'TRAIN';
      ELSE data_role = 'TEST';
      OUTPUT;
   END;
RUN;
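
Before loading the data, it can help to confirm that the simulation produced what the description promises. The following is a minimal sketch, assuming the DATA step above ran in the same SAS session and left 'customer_value_large' in the WORK library. It uses only standard Base SAS procedures and is not part of the original scenario.

```sas
/* Check the partition split: data_role should be close to 70% TRAIN / 30% TEST */
proc freq data=customer_value_large;
   tables data_role region;
run;

/* Summarize the target and the numeric predictors */
proc means data=customer_value_large n mean std min max;
   var future_value age income months_active product_count;
run;
```

Under the simulation formula, the theoretical mean of 'future_value' is 500 + 1050 + 1815 - 222.5 = 3142.5, so the sample mean reported by PROC MEANS should land near that value.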

Implementation Steps

1. Load the large customer dataset into the CAS server.
PROC CASUTIL;
   load DATA=customer_value_large casout='customer_value_large' replace;
RUN;
QUIT;
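
To confirm the load succeeded before training, the in-memory table can be inspected. This is a small sketch, assuming an active CAS session and that the table was loaded into the active caslib, as in the step above.

```sas
proc casutil;
   /* List in-memory tables in the active caslib, then describe the loaded table */
   list tables;
   contents casdata='customer_value_large';
quit;
```

The listing should show 'customer_value_large' with 50,000 rows.
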
2. Run bartGauss with performance tuning options: binning for numeric inputs, a time limit, and data partitioning using a variable.
PROC CAS;
   LOADACTIONSET 'bart';
   bart.bartGauss /
      TABLE={name='customer_value_large'},
      target='future_value',
      inputs={'age', 'income', 'months_active', 'product_count', 'region'},
      /* Treat region as a categorical input */
      nominals={'region'},
      /* Use the data_role column to split rows into training and test partitions */
      partByVar={name='data_role', train='TRAIN', test='TEST'},
      nTree=100,
      nBins=50,
      quantileBin=true,
      maxTrainTime=2700, /* 45 minutes */
      seed=2025,
      outputTables={names={'ModelInfo', 'FitStatistics'}};
RUN;
QUIT;
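
One way to check the run against the 45-minute budget is to time the step from the client side. The sketch below is an assumption-laden illustration, not part of the original scenario: the macro variable names 't_start' and 't_elapsed' are hypothetical, and it measures wall-clock time for the whole PROC CAS step, including action-set loading, rather than training time alone.

```sas
/* Record the wall-clock start time (hypothetical macro variable name) */
%let t_start = %sysfunc(datetime());

/* ... submit the PROC CAS step that calls bart.bartGauss here ... */

/* Compute and report elapsed seconds for comparison with the 2700-second limit */
%let t_elapsed = %sysevalf(%sysfunc(datetime()) - &t_start);
%put NOTE: bartGauss step took &t_elapsed seconds against a limit of 2700.;
```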

Expected Result


The action must finish training in under 2,700 seconds (45 minutes). The 'FitStatistics' table should be generated and report model performance metrics (such as ASE) for both the training and testing partitions. Binning the numeric inputs with 'nBins' and 'quantileBin' should let the action handle the data volume efficiently. The log should confirm that the run ended normally rather than being cut off by the time limit.
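
To review the results, the output tables can be fetched after the run. This is a minimal sketch, assuming the action wrote 'FitStatistics' and 'ModelInfo' as in-memory tables in the active caslib under the names given in 'outputTables'. The exact columns they contain are not specified here.

```sas
proc cas;
   /* Assumes bartGauss wrote these tables to the active caslib */
   table.fetch / table={name='FitStatistics'};
   table.fetch / table={name='ModelInfo'};
run;
quit;
```

The fetched 'FitStatistics' rows are where the training and testing metrics can be compared.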