The bartProbit action fits a probit Bayesian Additive Regression Trees (BART) model to data where the response variable is binary. This is particularly useful for classification problems where the outcome is one of two categories (e.g., yes/no, success/failure, 0/1). The probit model assumes that the binary outcome is determined by an unobserved continuous latent variable that follows a standard normal distribution. BART itself is a nonparametric ensemble method that sums many simple regression trees to form a flexible predictive model, offering an alternative to traditional parametric models.
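In standard probit-BART notation (a general formulation of the model class, not a description of this action's internals), the model can be written as

$$
P(Y = 1 \mid x) \;=\; \Phi\big(f(x)\big), \qquad f(x) \;=\; \sum_{j=1}^{m} g(x;\, T_j, M_j),
$$

where $\Phi$ is the standard normal cumulative distribution function and each $g(x;\, T_j, M_j)$ is the prediction of regression tree $T_j$ with leaf parameters $M_j$. Equivalently, $Y = 1$ exactly when a latent variable $Z = f(x) + \varepsilon$, with $\varepsilon \sim N(0, 1)$, exceeds zero; this latent-variable form is what makes probit (rather than logit) links convenient for MCMC sampling.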
| Parameter | Description |
|---|---|
| alpha | Specifies the significance level for constructing equal-tail credible limits for predictive margins. Default: 0.05. |
| applyRowOrder | When set to TRUE, applies a specific row order for processing. Default: FALSE. |
| attributes | Changes the attributes of variables used in the action. Alias: attribute. |
| class | Specifies the classification variables to be used as explanatory variables. Alias: classVars. |
| differences | Specifies differences of predictive margins. Alias: diffs. |
| display | Specifies a list of results tables to be displayed on the client. |
| distributeChains | Specifies a distributed mode that divides the MCMC sampling in a grid environment. A value of 0 runs a single chain with data partitioned across workers. A value greater than 0 replicates the data to that many workers, each running a separate chain. Minimum: 0. |
| freq | Names the numeric variable that contains the frequency of occurrence for each observation. |
| inputs | Specifies the input variables to use in the analysis. Alias: input. |
| leafSigmaK | Specifies the value used to determine the prior variance for the leaf parameter. Default: 2. Must be greater than 0. |
| margins | Specifies a predictive margin for scenario analysis. Alias: scenarios. |
| maxTrainTime | Specifies an upper limit in seconds for MCMC sampling. Alias: maxTime. Must be greater than 0. |
| minLeafSize | Specifies the minimum number of observations that each child of a split must contain. Default: 5. Minimum: 1. Alias: leafSize. |
| missing | Specifies how to handle missing values in predictor variables. Options include MACBIG, MACSMALL, NONE, and SEPARATE. Default: SEPARATE. |
| model | Specifies the model by naming the dependent variable and explanatory effects. |
| nBI | Specifies the number of burn-in iterations before saving samples. Default: 100. Minimum: 1. Alias: burnin. |
| nBins | Specifies the number of bins for continuous input variables. Default: 50. Minimum: 2. |
| nClassLevelsPrint | Limits the display of class variable levels. A value of 0 suppresses all levels. Minimum: 0. |
| nMC | Specifies the number of MCMC iterations, excluding burn-in. This is the MCMC sample size if thinning is 1. Default: 1000. Minimum: 1. |
| nominals | Specifies the nominal input variables to use in the analysis. Alias: nominal. |
| nThin | Specifies the thinning rate of the simulation. Default: 1. Minimum: 1. Alias: thin. |
| nTree | Specifies the number of trees in the sum-of-trees ensemble. Default: 200. Minimum: 1. |
| obsLeafMapInMem | When set to TRUE, stores a mapping of each observation to terminal nodes in memory. Default: FALSE. |
| offset | Specifies a numeric offset variable. |
| orderSplit | Specifies the minimum cardinality for which a categorical input uses splitting rules based on level ordering. Default: 50. Must be greater than 0. |
| output | Creates an output table containing observation-wise statistics computed after model fitting. |
| outputMargins | Specifies an output table for predictive margins. |
| outputTables | Lists the names of results tables to save as CAS tables. Alias: displayOut. |
| partByFrac | Partitions the data by specifying the fraction to be used for testing. |
| partByVar | Partitions the data into training and testing roles based on a variable's values. |
| quantileBin | When set to TRUE, bin boundaries are set at quantiles of numeric inputs. Default: TRUE. Aliases: qbin, qtbin. |
| sampleSummary | Creates an output table containing a summary of the sum-of-trees ensemble samples. |
| seed | Specifies the seed for the pseudorandom number generator. Default: 0. |
| store | Stores the fitted model in a binary table for later scoring. Aliases: savemodel, save, savestate. |
| table | Specifies the input data table for the analysis. |
| target | Specifies the target (dependent) variable for the model. |
| trainInMem | When set to TRUE, stores training data in memory. Default: FALSE. |
| treePrior | Specifies the regularization prior for the sum-of-trees ensemble. |
This example creates a sample dataset named 'purchase_data'. The dataset simulates customer purchasing behavior, where 'Purchased' is a binary outcome (1 for purchase, 0 for no purchase). Predictors include 'Age', 'Income', and 'Gender'. This data is then loaded into a CAS table for modeling.
```sas
data purchase_data;
   call streaminit(123);
   length Gender $6;   /* without this, Gender takes length 4 from 'Male' and
                          'Female' is truncated, so (Gender='Female') is never true */
   do i = 1 to 1000;
      Age = 20 + int(rand('Uniform') * 50);
      Income = 30000 + int(rand('Uniform') * 70000);
      if rand('Uniform') > 0.5 then Gender = 'Male';
      else Gender = 'Female';
      logit_p = -4 + 0.05*Age + 0.00002*Income + (Gender='Female');
      p = 1 / (1 + exp(-logit_p));
      Purchased = rand('Bernoulli', p);
      output;
   end;
run;

data casuser.purchase_data;
   set purchase_data;
run;
```
This example fits a basic probit BART model using the 'bart.bartProbit' action. It models the binary 'Purchased' variable as a function of 'Age', 'Income', and 'Gender'. The model uses default settings, including 200 trees and 1000 MCMC iterations after a 100-iteration burn-in.
```sas
proc cas;
   bart.bartProbit table={name='purchase_data'},
      model={depVars={{name='Purchased'}},
             effects={{vars={'Age', 'Income', 'Gender'}}}};
run;
quit;
```
This example demonstrates a more complex use case. It partitions the data, holding out 30% for testing and using the remaining 70% for training. The model is configured with 100 trees ('nTree'), 500 burn-in iterations ('nBI'), and 2000 sampling iterations ('nMC'). The fitted model is saved to a CAS table named 'bart_probit_model' for future scoring, and observation-wise predictions for the input data are saved to 'bart_probit_output'.
```sas
proc cas;
   bart.bartProbit table={name='purchase_data'},
      target='Purchased',
      inputs={'Age', 'Income'},
      nominals={'Gender'},
      partByFrac={test=0.3, seed=456},
      nTree=100,
      nBI=500,
      nMC=2000,
      seed=54321,
      output={casOut={name='bart_probit_output', replace=true},
              pred='PredictedProb', resid='Residual'},
      store={name='bart_probit_model', replace=true};
run;
quit;
```
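Because the 'store' parameter writes the fitted model as a binary analytic store table, it can later be applied to new data without refitting. The sketch below assumes that the saved store is scored with the astore action set's score action; the table names are carried over from the example above, but the exact scoring interface for bartProbit stores should be confirmed against your deployment's documentation.

```sas
proc cas;
   /* Score new observations with the stored model (sketch; assumes
      astore.score accepts the binary table saved by the store= parameter) */
   astore.score table={name='purchase_data'},
      rstore={name='bart_probit_model'},
      out={name='scored_purchases', replace=true};
run;
quit;
```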
A financial services company wants to predict which customers are likely to respond to a new loan offer. The goal is to build a reliable classification model using customer demo...
A credit card company needs to process a very large dataset of transactions to build a fraud detection model. Due to the data volume and urgency, the model training must be effi...
A pharmaceutical research company is analyzing patient data from a clinical trial to predict patient recovery. The dataset is incomplete, with missing values for a key biomarker...