The bartProbit action fits a probit Bayesian Additive Regression Trees (BART) model to data where the response variable is binary. This is particularly useful for classification problems where the outcome is one of two categories (e.g., yes/no, success/failure, 0/1). The probit model assumes that the binary outcome is determined by an unobserved continuous latent variable that follows a standard normal distribution. BART itself is a nonparametric ensemble method that sums many simple regression trees to form a flexible predictive model, offering an alternative to traditional parametric models.
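In standard probit-BART notation (a general formulation of the model class, not a description of this action's internals), the model can be written as

$$
P(Y = 1 \mid x) \;=\; \Phi\big(f(x)\big), \qquad f(x) \;=\; \sum_{j=1}^{m} g(x;\, T_j, M_j),
$$

where $\Phi$ is the standard normal cumulative distribution function and each $g(x;\, T_j, M_j)$ is the prediction of regression tree $T_j$ with leaf parameters $M_j$. Equivalently, $Y = 1$ exactly when a latent variable $Z = f(x) + \varepsilon$, with $\varepsilon \sim N(0, 1)$, exceeds zero; this latent-variable form is what makes probit (rather than logit) links convenient for MCMC sampling.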
| Parameter | Description |
|---|---|
| alpha | Specifies the significance level for constructing equal-tail credible limits for predictive margins. Default: 0.05. |
| applyRowOrder | When set to TRUE, applies a specific row order for processing. Default: FALSE. |
| attributes | Changes the attributes of variables used in the action. Alias: attribute. |
| class | Specifies the classification variables to be used as explanatory variables. Alias: classVars. |
| differences | Specifies differences of predictive margins. Alias: diffs. |
| display | Specifies a list of results tables to be displayed on the client. |
| distributeChains | Specifies a distributed mode that divides the MCMC sampling in a grid environment. A value of 0 runs a single chain with data partitioned across workers. A value greater than 0 replicates the data to that many workers, each running a separate chain. Minimum: 0. |
| freq | Names the numeric variable that contains the frequency of occurrence for each observation. |
| inputs | Specifies the input variables to use in the analysis. Alias: input. |
| leafSigmaK | Specifies the value used to determine the prior variance for the leaf parameter. Default: 2. Must be greater than 0. |
| margins | Specifies a predictive margin for scenario analysis. Alias: scenarios. |
| maxTrainTime | Specifies an upper limit in seconds for MCMC sampling. Alias: maxTime. Must be greater than 0. |
| minLeafSize | Specifies the minimum number of observations that each child of a split must contain. Default: 5. Minimum: 1. Alias: leafSize. |
| missing | Specifies how to handle missing values in predictor variables. Options include MACBIG, MACSMALL, NONE, and SEPARATE. Default: SEPARATE. |
| model | Specifies the model by naming the dependent variable and explanatory effects. |
| nBI | Specifies the number of burn-in iterations before saving samples. Default: 100. Minimum: 1. Alias: burnin. |
| nBins | Specifies the number of bins for continuous input variables. Default: 50. Minimum: 2. |
| nClassLevelsPrint | Limits the display of class variable levels. A value of 0 suppresses all levels. Minimum: 0. |
| nMC | Specifies the number of MCMC iterations, excluding burn-in. This is the MCMC sample size if thinning is 1. Default: 1000. Minimum: 1. |
| nominals | Specifies the nominal input variables to use in the analysis. Alias: nominal. |
| nThin | Specifies the thinning rate of the simulation. Default: 1. Minimum: 1. Alias: thin. |
| nTree | Specifies the number of trees in the sum-of-trees ensemble. Default: 200. Minimum: 1. |
| obsLeafMapInMem | When set to TRUE, stores a mapping of each observation to terminal nodes in memory. Default: FALSE. |
| offset | Specifies a numeric offset variable. |
| orderSplit | Specifies the minimum cardinality for which a categorical input uses splitting rules based on level ordering. Default: 50. Must be greater than 0. |
| output | Creates an output table containing observation-wise statistics computed after model fitting. |
| outputMargins | Specifies an output table for predictive margins. |
| outputTables | Lists the names of results tables to save as CAS tables. Alias: displayOut. |
| partByFrac | Partitions the data by specifying the fraction to be used for testing. |
| partByVar | Partitions the data into training and testing roles based on a variable's values. |
| quantileBin | When set to TRUE, bin boundaries are set at quantiles of numeric inputs. Default: TRUE. Aliases: qbin, qtbin. |
| sampleSummary | Creates an output table containing a summary of the sum-of-trees ensemble samples. |
| seed | Specifies the seed for the pseudorandom number generator. Default: 0. |
| store | Stores the fitted model in a binary table for later scoring. Aliases: savemodel, save, savestate. |
| table | Specifies the input data table for the analysis. |
| target | Specifies the target (dependent) variable for the model. |
| trainInMem | When set to TRUE, stores training data in memory. Default: FALSE. |
| treePrior | Specifies the regularization prior for the sum-of-trees ensemble. |
This example creates a sample dataset named 'purchase_data'. The dataset simulates customer purchasing behavior, where 'Purchased' is a binary outcome (1 for purchase, 0 for no purchase). Predictors include 'Age', 'Income', and 'Gender'. This data is then loaded into a CAS table for modeling.
```sas
data purchase_data;
   call streaminit(123);
   length Gender $6;   /* without this, Gender takes length 4 from 'Male' and
                          'Female' is truncated, so (Gender='Female') is never true */
   do i = 1 to 1000;
      Age = 20 + int(rand('Uniform') * 50);
      Income = 30000 + int(rand('Uniform') * 70000);
      if rand('Uniform') > 0.5 then Gender = 'Male';
      else Gender = 'Female';
      logit_p = -4 + 0.05*Age + 0.00002*Income + (Gender='Female');
      p = 1 / (1 + exp(-logit_p));
      Purchased = rand('Bernoulli', p);
      output;
   end;
run;

data casuser.purchase_data;
   set purchase_data;
run;
```
This example fits a basic probit BART model using the 'bart.bartProbit' action. It models the binary 'Purchased' variable as a function of 'Age', 'Income', and 'Gender'. The model uses default settings, including 200 trees and 1000 MCMC iterations after a 100-iteration burn-in.
```sas
proc cas;
   bart.bartProbit table={name='purchase_data'},
      model={depVars={{name='Purchased'}},
             effects={{vars={'Age', 'Income', 'Gender'}}}};
run;
quit;
```
This example demonstrates a more complex use case. It partitions the data, holding out 30% for testing and using the remaining 70% for training. The model is configured with 100 trees ('nTree'), 500 burn-in iterations ('nBI'), and 2000 sampling iterations ('nMC'). The fitted model is saved to a CAS table named 'bart_probit_model' for future scoring, and observation-wise predictions for the input data are saved to 'bart_probit_output'.
```sas
proc cas;
   bart.bartProbit table={name='purchase_data'},
      target='Purchased',
      inputs={'Age', 'Income'},
      nominals={'Gender'},
      partByFrac={test=0.3, seed=456},
      nTree=100,
      nBI=500,
      nMC=2000,
      seed=54321,
      output={casOut={name='bart_probit_output', replace=true},
              pred='PredictedProb', resid='Residual'},
      store={name='bart_probit_model', replace=true};
run;
quit;
```
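Because the 'store' parameter writes the fitted model as a binary analytic store table, it can later be applied to new data without refitting. The sketch below assumes that the saved store is scored with the astore action set's score action; the table names are carried over from the example above, but the exact scoring interface for bartProbit stores should be confirmed against your deployment's documentation.

```sas
proc cas;
   /* Score new observations with the stored model (sketch; assumes
      astore.score accepts the binary table saved by the store= parameter) */
   astore.score table={name='purchase_data'},
      rstore={name='bart_probit_model'},
      out={name='scored_purchases', replace=true};
run;
quit;
```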
A financial services company wants to predict which customers are likely to respond to a new loan offer. The goal is to build a reliable classification model using customer demo...
A credit card company needs to process a very large dataset of transactions to build a fraud detection model. Due to the data volume and urgency, the model training must be effi...
A pharmaceutical research company is analyzing patient data from a clinical trial to predict patient recovery. The dataset is incomplete, with missing values for a key biomarker...