bartGauss

At a glance
When traditional linear regression methods fail to capture the nuances of high-dimensional datasets, the bartGauss action provides a robust alternative within the SAS Viya environment. By leveraging a sum-of-trees ensemble approach, this action lets analysts fit Bayesian additive regression tree (BART) models tailored to continuous outcomes that follow a normal distribution. It excels at automatically detecting interactions and non-linear patterns, significantly reducing the burden of manual feature engineering. To help you integrate this algorithm into your pipelines, we have curated a list of frequently asked questions covering configuration details, memory management, and model interpretation.

Description

The bartGauss action fits Bayesian additive regression trees (BART) models for a continuous response variable that is assumed to follow a normal distribution. BART is a non-parametric regression method that uses a sum of regression trees to model the relationship between predictors and a response. It is particularly effective for capturing complex, non-linear relationships and interactions in the data without requiring pre-specification of the model form. The method is Bayesian, meaning it uses priors for the model parameters and provides a full posterior distribution for predictions, allowing for robust uncertainty quantification.
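To make the uncertainty quantification concrete, the following Python sketch (not SAS code, and entirely independent of the bartGauss implementation) fakes posterior draws for a single prediction and summarizes them the way a Bayesian method reports results: a posterior mean plus equal-tail credible limits. All numbers are fabricated for illustration.

```python
import random
import statistics

# Toy illustration: each MCMC draw of a BART model yields one sum-of-trees
# fit, so a prediction for a new point is a whole posterior sample rather
# than a single number. Here we fabricate 2000 posterior draws for one
# observation and summarize them with a posterior mean and equal-tail
# credible limits at significance level alpha.
random.seed(1)
draws = sorted(5.0 + random.gauss(0.0, 0.7) for _ in range(2000))

alpha = 0.05
lo = draws[int((alpha / 2) * len(draws))]
hi = draws[int((1 - alpha / 2) * len(draws)) - 1]
mean = statistics.fmean(draws)

print(f"posterior mean: {mean:.2f}")
print(f"{100 * (1 - alpha):.0f}% credible interval: ({lo:.2f}, {hi:.2f})")
```

A point estimate alone would hide how confident the model is; the interval width carries that information.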

bart.bartGauss result=results status=rc /
    alpha=double,
    attributes={{name="variable-name", format="string", formattedLength=integer, label="string", nfd=integer, nfl=integer}, ...},
    class={{vars={"variable-name-1", ...}, descending=TRUE | FALSE, order="FORMATTED" | "FREQ" | "FREQFORMATTED" | "FREQINTERNAL" | "INTERNAL", ref="FIRST" | "LAST" | double | "string"}, ...},
    distributeChains=integer,
    freq="variable-name",
    inputs={{name="variable-name", ...}, ...},
    leafSigmaK=double,
    maxTrainTime=double,
    minLeafSize=integer,
    missing="MACBIG" | "MACSMALL" | "NONE" | "SEPARATE",
    model={depVars={{name="variable-name"}, ...}, effects={{vars={"string-1", ...}}, ...}},
    nBI=integer,
    nBins=integer,
    nClassLevelsPrint=integer,
    nMC=integer,
    nMCDist=integer,
    nominals={{name="variable-name", ...}, ...},
    nThin=integer,
    nTree=integer,
    obsLeafMapInMem=TRUE | FALSE,
    orderSplit=integer,
    output={alpha=double, avgOnly=TRUE | FALSE, casOut={caslib="string", ...}, copyVars="ALL" | "ALL_MODEL" | "ALL_NUMERIC" | {"variable-name-1", ...}, lcl="string", pred="string", resid="string", role="string", ucl="string"},
    outputTables={names={"string-1", ...} | {key-1={casouttable-1}, ...}, ...},
    partByFrac={seed=integer, test=double},
    partByVar={name="variable-name", test="string", train="string"},
    quantileBin=TRUE | FALSE,
    sampleSummary={casout={caslib="string", ...}, avgNode="string", propAccepted="string", sampSaved="string", variance="string"},
    seed=64-bit-integer,
    sigmaDF=double,
    sigmaLambda=double,
    sigmaQuantile=double,
    store={caslib="string", name="table-name", ...},
    table={name="table-name", ...},
    target="variable-name",
    trainInMem=TRUE | FALSE,
    treePrior={depthBase=double, depthPower=double, pPrune=double, pSplit=double},
    varAutoCorr={integer-1, ...},
    varEst=double;
Settings
Parameter / Description
alpha Specifies the significance level for constructing equal-tail credible limits for predictive margins.
attributes Changes the attributes of variables used in the action.
class Names the classification variables to use as explanatory variables in the analysis.
distributeChains Specifies a distributed mode that divides the MCMC sampling in a grid environment. When you specify a value of 0, a single chain is run, and each worker node is assigned a portion of the training data.
freq Names the numeric variable that contains the frequency of occurrence for each observation.
inputs Specifies the input variables to use in the analysis.
leafSigmaK Specifies the value used to determine the prior variance for the leaf parameter.
maxTrainTime Specifies an upper limit (in seconds) on the time for MCMC sampling.
minLeafSize Specifies the minimum number of observations that each child of a split must contain in the training data for the split to be considered.
missing Specifies how to handle missing values in predictor variables. 'SEPARATE' is often a good default as it treats missingness as a potentially informative category.
model Defines the model structure, including the dependent variable (target) and the explanatory variables (effects).
nBI Specifies the number of burn-in iterations to perform before the action starts to save samples for prediction. These initial samples are discarded to allow the Markov chain to reach its stationary distribution.
nBins Specifies the number of bins to use for discretizing continuous input variables, which can improve performance.
nClassLevelsPrint Limits the display of class levels in the output tables. A value of 0 suppresses all levels.
nMC Specifies the number of MCMC iterations to perform after the burn-in phase. This is the main sample size for posterior inference.
nMCDist Specifies the number of MCMC iterations for each chain when running in distributed mode.
nominals Specifies the nominal (categorical) input variables to use in the analysis.
nThin Specifies the thinning rate of the simulation, which saves one sample every 'nThin' iterations to reduce autocorrelation in the saved chain.
nTree Specifies the number of trees in the sum-of-trees ensemble. A larger number of trees can capture more complex patterns but increases computation time.
obsLeafMapInMem When set to True, stores a mapping of each observation to terminal nodes in memory, which can speed up certain post-processing tasks.
orderSplit Specifies the minimum cardinality for which a categorical input uses splitting rules according to level ordering.
output Creates an output table containing observation-wise statistics, such as predicted values and residuals.
outputTables Lists the names of results tables (e.g., ModelInfo, VarImp) to save as CAS tables on the server.
partByFrac Specifies the fraction of the data to be used for testing, allowing for random partitioning.
partByVar Names a variable in the input table whose values are used to partition the data into training and testing roles.
quantileBin When set to True, bin boundaries are set at quantiles of numeric inputs, which can handle skewed distributions better than equal-width bins.
sampleSummary Creates an output table that contains a summary of the sum-of-trees ensemble samples.
seed Specifies a seed for the pseudorandom number generator to ensure reproducibility of the analysis.
sigmaDF Specifies the degrees of freedom of the scaled inverse chi-square prior for the error variance parameter.
sigmaLambda Specifies the scale parameter of the scaled inverse chi-square prior for the error variance parameter.
sigmaQuantile Specifies the quantile level to determine the scale parameter of the inverse chi-square prior for the error variance.
store Saves the fitted model to a binary object in a CAS table, which can be used later for scoring new data with the bart.bartScore action.
table Specifies the input data table for the analysis.
target Specifies the target (dependent or response) variable for the model.
trainInMem When set to True, stores the training data in memory to potentially speed up the training process.
treePrior Specifies the parameters of the regularization prior for the tree structure, controlling its complexity.
varAutoCorr Specifies the autocorrelation lags to compute for the variance parameter in the MCMC chain diagnostics.
varEst Specifies an initial value for the error variance. If not specified, it's estimated from an initial linear regression.
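One plausible reading of how the nBI, nMC, and nThin settings above interact can be sketched in plain Python. This is an illustration of the sample bookkeeping, not the action's internal code, and the nThin=10 value is an assumption (the examples in this article do not set it).

```python
# Sketch of MCMC sample bookkeeping implied by the parameter descriptions:
# nBI burn-in draws are discarded first, then nMC draws follow, of which
# one in every nThin is saved to reduce autocorrelation in the chain.
def saved_samples(nBI: int, nMC: int, nThin: int) -> dict:
    total_iterations = nBI + nMC   # burn-in phase plus main sampling phase
    kept = nMC // nThin            # one sample saved every nThin iterations
    return {"total_iterations": total_iterations, "saved": kept}

# Using the nBI and nMC values from the advanced example in this article,
# with an assumed thinning rate of 10:
print(saved_samples(nBI=500, nMC=2000, nThin=10))
# → {'total_iterations': 2500, 'saved': 200}
```

The saved count, not the raw iteration count, determines how many posterior draws are available for credible intervals and diagnostics.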
Data Preparation
Data Creation

This example creates a sample dataset named 'sample_data'. The target variable 'y' is generated from a combination of linear and non-linear functions of the predictors 'x1', 'x2', 'x3', and 'c1', with added Gaussian noise. This makes it suitable for the bartGauss action, which models a normally distributed response.

DATA sample_data;
    CALL streaminit(123);
    LENGTH c1 $1;
    DO i = 1 TO 1000;
        x1 = rand('UNIFORM');
        x2 = rand('UNIFORM') * 2 - 1;
        x3 = rand('NORMAL');
        IF rand('UNIFORM') < 0.5 THEN c1 = 'A';
        ELSE c1 = 'B';
        /* ifn (numeric), not ifc (character), for the class offset */
        y = 10 * sin(3.14 * x1) + 20 * (x2 - 0.5)**2 + 10 * x3
            + ifn(c1 = 'A', 5, 0) + rand('NORMAL');
        OUTPUT;
    END;
    DROP i;
RUN;
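For readers who want to inspect the same data-generating process outside SAS, here is an assumed-equivalent Python replication of the DATA step above. Python's random streams differ from streaminit, so the values will not match the SAS output row for row.

```python
import math
import random

# Replicates the structure of the SAS DATA step: the response mixes a sine
# term, a quadratic term, a linear term, a class offset for level 'A', and
# standard normal noise.
random.seed(123)
rows = []
for _ in range(1000):
    x1 = random.random()               # Uniform(0, 1)
    x2 = random.random() * 2 - 1       # Uniform(-1, 1)
    x3 = random.gauss(0, 1)            # Normal(0, 1)
    c1 = "A" if random.random() < 0.5 else "B"
    y = (10 * math.sin(3.14 * x1)
         + 20 * (x2 - 0.5) ** 2
         + 10 * x3
         + (5 if c1 == "A" else 0)
         + random.gauss(0, 1))
    rows.append((x1, x2, x3, c1, y))

print(len(rows))  # 1000 simulated observations
```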

Examples

This example demonstrates a basic call to the bartGauss action. It loads the 'sample_data' table into a CAS session, then fits a BART model with 'y' as the target and 'x1', 'x2', 'x3', and 'c1' as predictors. This is the simplest way to run the action, relying on default settings for the number of trees, MCMC iterations, and other hyperparameters.

SAS® / CAS Code
PROC CAS;
    loadactionset 'bart';
    bart.bartGauss /
        table='sample_data',
        target='y',
        inputs={'x1', 'x2', 'x3', 'c1'};
RUN;

This example shows a more advanced usage of the bartGauss action. It specifies the number of trees (nTree=100), burn-in iterations (nBI=500), and MCMC samples (nMC=2000). It also partitions the data, using 25% for testing (partByFrac). The fitted model is saved to a CAS table named 'bart_model_store' for later use. Additionally, an output table 'bart_predictions' is created to store predicted values and residuals for each observation.

SAS® / CAS Code
PROC CAS;
    loadactionset 'bart';
    bart.bartGauss /
        table={name='sample_data'},
        target='y',
        inputs={'x1', 'x2', 'x3'},
        nominals={'c1'},
        nTree=100,
        nBI=500,
        nMC=2000,
        seed=456,
        partByFrac={test=0.25, seed=123},
        store={name='bart_model_store', replace=true},
        output={casOut={name='bart_predictions', replace=true},
                pred='predicted_y', resid='residual_y'};
RUN;
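Once the 'bart_predictions' table is fetched to the client, fit statistics can be computed from its observation-wise columns. This Python sketch uses the predicted_y and residual_y column names chosen in the example above; the three rows are stand-in values, not real model output.

```python
import math

# Hypothetical post-processing of the 'bart_predictions' output table.
# Each row holds the posterior-mean prediction and the residual for one
# observation; the RMSE is the square root of the mean squared residual.
rows = [
    {"predicted_y": 12.1, "residual_y": -0.4},
    {"predicted_y": 8.7,  "residual_y": 1.1},
    {"predicted_y": 15.3, "residual_y": 0.2},
]

rmse = math.sqrt(sum(r["residual_y"] ** 2 for r in rows) / len(rows))
print(f"RMSE: {rmse:.3f}")
# → RMSE: 0.686
```

The same pattern applies to any held-out partition created with partByFrac: filter to the test role first, then aggregate the residuals.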

FAQ

What is the purpose of the bart.bartGauss action?
It fits Bayesian additive regression tree (BART) models for a continuous response that is assumed to follow a normal distribution, using a sum of regression trees to capture non-linear relationships and interactions without a pre-specified model form.

Which parameter is used to specify the input data table?
The table parameter names the input CAS table for the analysis.

How can I define the model's dependent and independent variables?
Use target for the response variable and inputs (plus nominals for categorical predictors) for the explanatory variables; alternatively, the model parameter defines depVars and effects directly.

What does the 'nTree' parameter control?
The number of trees in the sum-of-trees ensemble. More trees can capture more complex patterns but increase computation time.

How are the MCMC iterations managed in this action?
nBI sets the number of burn-in iterations that are discarded, nMC sets the number of iterations used for posterior inference, and nThin controls the thinning rate of the saved chain.

How can the trained model be saved for future scoring?
The store parameter saves the fitted model as a binary object in a CAS table, which the bart.bartScore action can later use to score new data.

How does the action handle missing values in predictor variables?
The missing parameter selects the strategy ('MACBIG', 'MACSMALL', 'NONE', or 'SEPARATE'); 'SEPARATE' treats missingness as its own, potentially informative, category.

Associated Scenarios

Use Case
Standard Case: Predicting Remaining Useful Life of Industrial Equipment

An industrial manufacturing company wants to predict the Remaining Useful Life (RUL) in hours for a critical machine component. The goal is to move from a fixed-schedule mainten...

Use Case
Performance Case: Large-Scale Customer Value Modeling with Time Constraints

A financial services company needs to model the potential future value of a large customer base (millions of records). The modeling process must be completed within a strict 45-...

Use Case
Edge Case: Modeling Patient Response with Incomplete Biometric Data

A clinical research organization is analyzing patient trial data. Some biometric sensor readings, which are important predictors for treatment effectiveness, are missing due to ...