bart

bartProbit

Description

The bartProbit action fits a probit Bayesian Additive Regression Trees (BART) model to data with a binary response variable. This is particularly useful for classification problems in which the outcome is one of two categories (e.g., yes/no, success/failure, 0/1). The probit model assumes that the binary outcome is determined by an unobserved continuous latent variable whose error term follows a standard normal distribution: the outcome is positive when the latent variable exceeds zero. The BART model itself is a nonparametric ensemble method that sums many simple regression trees to form a powerful predictive model, offering a flexible alternative to traditional parametric models.
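The latent-variable formulation above can be sketched numerically: the probability of the positive class is the standard normal CDF applied to the model's prediction on the latent scale. A minimal Python illustration, where the latent value passed in is a hypothetical stand-in for the sum-of-trees output:

```python
import math

def normal_cdf(z):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probit_probability(latent):
    # Probit link: P(y = 1 | x) = Phi(f(x)), where f(x) is the
    # (hypothetical) sum-of-trees prediction on the latent scale
    return normal_cdf(latent)

# A latent value of 0 corresponds to a 50% probability
print(probit_probability(0.0))   # 0.5
```

Larger latent values map monotonically to probabilities closer to 1, which is what makes the latent-variable view equivalent to a classification rule.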

bart.bartProbit { alpha=double, applyRowOrder=TRUE | FALSE, attributes={{...}, ...}, class={{...}, ...}, differences={{...}, ...}, display={...}, distributeChains=integer, freq="variable-name", inputs={{...}, ...}, leafSigmaK=double, margins={{...}, ...}, maxTrainTime=double, minLeafSize=integer, missing="MACBIG" | "MACSMALL" | "NONE" | "SEPARATE", model={...}, nBI=integer, nBins=integer, nClassLevelsPrint=integer, nMC=integer, nominals={{...}, ...}, nThin=integer, nTree=integer, obsLeafMapInMem=TRUE | FALSE, offset="variable-name", orderSplit=integer, output={...}, outputMargins={...}, outputTables={...}, partByFrac={...}, partByVar={...}, quantileBin=TRUE | FALSE, sampleSummary={...}, seed=64-bit-integer, store={...}, table={...}, target="variable-name", trainInMem=TRUE | FALSE, treePrior={...} };
Settings
alpha
  Specifies the significance level for constructing equal-tail credible limits for predictive margins. Default: 0.05.
applyRowOrder
  When set to TRUE, applies a specific row order for processing. Default: FALSE.
attributes
  Changes the attributes of variables used in the action. Alias: attribute.
class
  Specifies the classification variables to be used as explanatory variables. Alias: classVars.
differences
  Specifies differences of predictive margins. Alias: diffs.
display
  Specifies a list of results tables to be displayed on the client.
distributeChains
  Specifies a distributed mode that divides the MCMC sampling in a grid environment. A value of 0 runs a single chain with data partitioned across workers. A value greater than 0 replicates the data to that many workers, each running a separate chain. Minimum: 0.
freq
  Names the numeric variable that contains the frequency of occurrence for each observation.
inputs
  Specifies the input variables to use in the analysis. Alias: input.
leafSigmaK
  Specifies the value used to determine the prior variance for the leaf parameter. Default: 2. Must be greater than 0.
margins
  Specifies a predictive margin for scenario analysis. Alias: scenarios.
maxTrainTime
  Specifies an upper limit in seconds for MCMC sampling. Must be greater than 0. Alias: maxTime.
minLeafSize
  Specifies the minimum number of observations that each child of a split must contain. Default: 5. Minimum: 1. Alias: leafSize.
missing
  Specifies how to handle missing values in predictor variables. Options are MACBIG, MACSMALL, NONE, and SEPARATE. Default: SEPARATE.
model
  Specifies the model by naming the dependent variable and explanatory effects.
nBI
  Specifies the number of burn-in iterations before saving samples. Default: 100. Minimum: 1. Alias: burnin.
nBins
  Specifies the number of bins for continuous input variables. Default: 50. Minimum: 2.
nClassLevelsPrint
  Limits the display of class variable levels. A value of 0 suppresses all levels. Minimum: 0.
nMC
  Specifies the number of MCMC iterations, excluding burn-in. This is the MCMC sample size if the thinning rate is 1. Default: 1000. Minimum: 1.
nominals
  Specifies the nominal input variables to use in the analysis. Alias: nominal.
nThin
  Specifies the thinning rate of the simulation. Default: 1. Minimum: 1. Alias: thin.
nTree
  Specifies the number of trees in the sum-of-trees ensemble. Default: 200. Minimum: 1.
obsLeafMapInMem
  When set to TRUE, stores a mapping of each observation to terminal nodes in memory. Default: FALSE.
offset
  Specifies a numeric offset variable.
orderSplit
  Specifies the minimum cardinality for which a categorical input uses splitting rules based on level ordering. Default: 50. Must be greater than 0.
output
  Creates an output table containing observation-wise statistics computed after model fitting.
outputMargins
  Specifies an output table for predictive margins.
outputTables
  Lists the names of results tables to save as CAS tables. Alias: displayOut.
partByFrac
  Partitions the data by specifying the fraction to be used for testing.
partByVar
  Partitions the data into training and testing roles based on a variable's values.
quantileBin
  When set to TRUE, bin boundaries are set at quantiles of numeric inputs. Default: TRUE. Aliases: qbin, qtbin.
sampleSummary
  Creates an output table containing a summary of the sum-of-trees ensemble samples.
seed
  Specifies the seed for the pseudorandom number generator. Default: 0.
store
  Stores the fitted model in a binary table for later scoring. Aliases: savemodel, save, savestate.
table
  Specifies the input data table for the analysis.
target
  Specifies the target (dependent) variable for the model.
trainInMem
  When set to TRUE, stores training data in memory. Default: FALSE.
treePrior
  Specifies the regularization prior for the sum-of-trees ensemble.
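The quantileBin and nBins settings above control how continuous inputs are discretized into candidate split points. As a rough illustration of quantile-based binning in general (a generic sketch, not the action's exact algorithm), boundaries can be placed at equally spaced empirical quantiles so that each bin holds roughly the same number of observations:

```python
def quantile_bin_edges(values, n_bins):
    # Place bin boundaries at equally spaced quantiles of the data,
    # so each bin holds roughly the same number of observations.
    ordered = sorted(values)
    n = len(ordered)
    edges = []
    for k in range(1, n_bins):
        # empirical quantile at k / n_bins (nearest-rank style)
        idx = min(n - 1, int(round(k * n / n_bins)))
        edges.append(ordered[idx])
    return edges

# 100 values 0..99 split into 4 equal-count bins
print(quantile_bin_edges(range(100), 4))   # [25, 50, 75]
```

With quantileBin=FALSE, boundaries would instead be equally spaced over the variable's range, which can waste bins on sparse regions of skewed inputs.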
Data Preparation
Data Creation

This example creates a sample dataset named 'purchase_data'. The dataset simulates customer purchasing behavior, where 'Purchased' is a binary outcome (1 for purchase, 0 for no purchase). Predictors include 'Age', 'Income', and 'Gender'. This data is then loaded into a CAS table for modeling.

DATA purchase_data;
   LENGTH Gender $ 6;  /* prevents 'Female' from truncating to length 4 */
   call streaminit(123);
   DO i = 1 to 1000;
      Age = 20 + int(rand('Uniform') * 50);
      Income = 30000 + int(rand('Uniform') * 70000);
      IF rand('Uniform') > 0.5 THEN Gender = 'Male';
      ELSE Gender = 'Female';
      logit_p = -4 + 0.05*Age + 0.00002*Income + (Gender='Female');
      p = 1 / (1 + exp(-logit_p));
      Purchased = rand('Bernoulli', p);
      OUTPUT;
   END;
RUN;

DATA casuser.purchase_data;
   SET purchase_data;
RUN;
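For reference, the purchase probability defined by the DATA step can be transcribed into Python and evaluated for a single hypothetical customer (a 35-year-old male earning 50,000). The generating model here is logistic; the probit BART model fit in the examples does not need to match the generating link.

```python
import math

def purchase_probability(age, income, gender):
    # Same linear predictor as the DATA step; the (Gender='Female')
    # indicator in SAS adds 1 for female customers
    logit_p = -4 + 0.05 * age + 0.00002 * income + (1 if gender == "Female" else 0)
    return 1 / (1 + math.exp(-logit_p))

print(round(purchase_probability(35, 50000, "Male"), 4))   # 0.2227
```

Working through one case like this makes it easy to check that the simulated outcome rates are in a sensible range before fitting.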

Examples

This example fits a basic probit BART model using the 'bart.bartProbit' action. It models the binary 'Purchased' variable as a function of 'Age', 'Income', and 'Gender'. The model uses default settings, including 200 trees and 1000 MCMC iterations after a 100-iteration burn-in.

SAS® / CAS Code
PROC CAS;
   bart.bartProbit TABLE={name='purchase_data'},
      model={depVars={{name='Purchased'}},
             effects={{vars={'Age', 'Income', 'Gender'}}}
      };
RUN;
QUIT;
Result:
The action will produce several output tables, including 'ModelInfo' with details about the model, 'NObs' showing the number of observations used, and 'VariableSelection' indicating the importance of each predictor. Since this is a Bayesian procedure, results will vary slightly with each run.
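The run length quoted above is governed by the nBI, nMC, and nThin parameters. Assuming the usual MCMC convention that nMC counts post-burn-in iterations and thinning retains every nThin-th draw (an assumption about the bookkeeping, not a statement of the action's internals), the arithmetic looks like this:

```python
def mcmc_sample_plan(n_bi=100, n_mc=1000, n_thin=1):
    # Assumed convention: nMC counts post-burn-in iterations and
    # thinning keeps every nThin-th of them.
    total_iterations = n_bi + n_mc
    retained_samples = n_mc // n_thin
    return total_iterations, retained_samples

print(mcmc_sample_plan())              # (1100, 1000) with the defaults
print(mcmc_sample_plan(500, 2000, 2))  # (2500, 1000)
```

So the default settings run 1,100 total iterations and keep 1,000 posterior samples.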

This example demonstrates a more complex use case. It partitions the data, using 70% for training and 30% for testing. The model is configured with 100 trees ('nTree'), 500 burn-in iterations ('nBI'), and 2000 sampling iterations ('nMC'). The resulting model is saved to a CAS table named 'bart_probit_model' for future scoring, and observation-wise predictions for the input data are saved to 'bart_probit_output'.

SAS® / CAS Code
PROC CAS;
   bart.bartProbit TABLE={name='purchase_data'},
      target='Purchased',
      inputs={'Age', 'Income'},
      nominals={'Gender'},
      partByFrac={test=0.3, seed=456},
      nTree=100,
      nBI=500,
      nMC=2000,
      seed=54321,
      OUTPUT={casOut={name='bart_probit_output', replace=true}, pred='PredictedProb', resid='Residual'},
      store={name='bart_probit_model', replace=true};
RUN;
QUIT;
Result:
This run will produce standard output tables like 'ModelInfo' and 'VariableSelection'. Additionally, it will create two new CAS tables in the active caslib: 'bart_probit_model', which contains the binary representation of the fitted model, and 'bart_probit_output', which includes the original data plus new columns for predicted probabilities ('PredictedProb') and residuals ('Residual'). The 'FitStatistics' table will show metrics for both the training and testing partitions.
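Once 'bart_probit_output' has been fetched to the client, the 'PredictedProb' column can be scored against the observed target. A hypothetical Python sketch using a 0.5 cutoff (the row structure below is illustrative, not the table's exact schema):

```python
def misclassification_rate(rows, cutoff=0.5):
    # rows: list of dicts holding the observed target and the
    # 'PredictedProb' column named in the OUTPUT= statement
    errors = 0
    for row in rows:
        predicted = 1 if row["PredictedProb"] >= cutoff else 0
        errors += predicted != row["Purchased"]
    return errors / len(rows)

sample = [
    {"Purchased": 1, "PredictedProb": 0.81},
    {"Purchased": 0, "PredictedProb": 0.35},
    {"Purchased": 1, "PredictedProb": 0.42},  # missed purchase
    {"Purchased": 0, "PredictedProb": 0.12},
]
print(misclassification_rate(sample))   # 0.25
```

Computing the rate separately for the training and testing partitions gives a quick cross-check on the 'FitStatistics' table.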

FAQ

What is the primary function of the bart.bartProbit action?
How can I specify the number of trees in the model?
What are the burn-in and sampling iterations and how are they controlled?
How does the action handle missing values in predictor variables?
Can I save the trained model for later use?

Associated Scenarios

Use Case
Standard Case: Customer Response Prediction for a Marketing Campaign

A financial services company wants to predict which customers are likely to respond to a new loan offer. The goal is to build a reliable classification model using customer demo...

Use Case
Performance Test: Large-Scale Fraud Detection with Distributed Chains

A credit card company needs to process a very large dataset of transactions to build a fraud detection model. Due to the data volume and urgency, the model training must be effi...

Use Case
Edge Case: Handling Missing Data in Clinical Trial Analysis

A pharmaceutical research company is analyzing patient data from a clinical trial to predict patient recovery. The dataset is incomplete, with missing values for a key biomarker...