bartGauss

At a glance
When traditional linear regression methods fail to capture the nuances of high-dimensional datasets, the bartGauss action provides a robust alternative within the SAS Viya environment. By leveraging a sum-of-trees ensemble approach, this action lets analysts fit Bayesian additive regression tree (BART) models tailored to continuous outcomes that follow a normal distribution. It excels at automatically detecting interactions and non-linear patterns, significantly reducing the burden of manual feature engineering. To help you integrate this algorithm into your pipelines, we have curated a list of frequently asked questions covering configuration details, memory management, and model interpretation.

Description

The bartGauss action fits Bayesian additive regression trees (BART) models for a continuous response variable that is assumed to follow a normal distribution. BART is a non-parametric regression method that uses a sum of regression trees to model the relationship between predictors and a response. It is particularly effective for capturing complex, non-linear relationships and interactions in the data without requiring pre-specification of the model form. The method is Bayesian, meaning it uses priors for the model parameters and provides a full posterior distribution for predictions, allowing for robust uncertainty quantification.
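To make the uncertainty quantification concrete, the following Python sketch (not SAS code, and entirely independent of the bartGauss implementation) fakes posterior draws for a single prediction and summarizes them the way a Bayesian method reports results: a posterior mean plus equal-tail credible limits. All numbers are fabricated for illustration.

```python
import random
import statistics

# Toy illustration: each MCMC draw of a BART model yields one sum-of-trees
# fit, so a prediction for a new point is a whole posterior sample rather
# than a single number. Here we fabricate 2000 posterior draws for one
# observation and summarize them with a posterior mean and equal-tail
# credible limits at significance level alpha.
random.seed(1)
draws = sorted(5.0 + random.gauss(0.0, 0.7) for _ in range(2000))

alpha = 0.05
lo = draws[int((alpha / 2) * len(draws))]
hi = draws[int((1 - alpha / 2) * len(draws)) - 1]
mean = statistics.fmean(draws)

print(f"posterior mean: {mean:.2f}")
print(f"{100 * (1 - alpha):.0f}% credible interval: ({lo:.2f}, {hi:.2f})")
```

A point estimate alone would hide how confident the model is; the interval width carries that information.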

bart.bartGauss result=results status=rc /
    alpha=double,
    attributes={{name="variable-name", format="string", formattedLength=integer, label="string", nfd=integer, nfl=integer}, ...},
    class={{vars={"variable-name-1", ...}, descending=TRUE | FALSE, order="FORMATTED" | "FREQ" | "FREQFORMATTED" | "FREQINTERNAL" | "INTERNAL", ref="FIRST" | "LAST" | double | "string"}, ...},
    distributeChains=integer,
    freq="variable-name",
    inputs={{name="variable-name", ...}, ...},
    leafSigmaK=double,
    maxTrainTime=double,
    minLeafSize=integer,
    missing="MACBIG" | "MACSMALL" | "NONE" | "SEPARATE",
    model={depVars={{name="variable-name"}, ...}, effects={{vars={"string-1", ...}}, ...}},
    nBI=integer,
    nBins=integer,
    nClassLevelsPrint=integer,
    nMC=integer,
    nMCDist=integer,
    nominals={{name="variable-name", ...}, ...},
    nThin=integer,
    nTree=integer,
    obsLeafMapInMem=TRUE | FALSE,
    orderSplit=integer,
    output={alpha=double, avgOnly=TRUE | FALSE, casOut={caslib="string", ...}, copyVars="ALL" | "ALL_MODEL" | "ALL_NUMERIC" | {"variable-name-1", ...}, lcl="string", pred="string", resid="string", role="string", ucl="string"},
    outputTables={names={"string-1", ...} | {key-1={casouttable-1}, ...}, ...},
    partByFrac={seed=integer, test=double},
    partByVar={name="variable-name", test="string", train="string"},
    quantileBin=TRUE | FALSE,
    sampleSummary={casout={caslib="string", ...}, avgNode="string", propAccepted="string", sampSaved="string", variance="string"},
    seed=64-bit-integer,
    sigmaDF=double,
    sigmaLambda=double,
    sigmaQuantile=double,
    store={caslib="string", name="table-name", ...},
    table={name="table-name", ...},
    target="variable-name",
    trainInMem=TRUE | FALSE,
    treePrior={depthBase=double, depthPower=double, pPrune=double, pSplit=double},
    varAutoCorr={integer-1, ...},
    varEst=double;
Settings
Parameter / Description
alpha Specifies the significance level for constructing equal-tail credible limits for predictive margins.
attributes Changes the attributes of variables used in the action.
class Names the classification variables to use as explanatory variables in the analysis.
distributeChains Specifies a distributed mode that divides the MCMC sampling in a grid environment. When you specify a value of 0, a single chain is run, and each worker node is assigned a portion of the training data.
freq Names the numeric variable that contains the frequency of occurrence for each observation.
inputs Specifies the input variables to use in the analysis.
leafSigmaK Specifies the value used to determine the prior variance for the leaf parameter.
maxTrainTime Specifies an upper limit (in seconds) on the time for MCMC sampling.
minLeafSize Specifies the minimum number of observations that each child of a split must contain in the training data for the split to be considered.
missing Specifies how to handle missing values in predictor variables. 'SEPARATE' is often a good default as it treats missingness as a potentially informative category.
model Defines the model structure, including the dependent variable (target) and the explanatory variables (effects).
nBI Specifies the number of burn-in iterations to perform before the action starts to save samples for prediction. These initial samples are discarded to allow the Markov chain to reach its stationary distribution.
nBins Specifies the number of bins to use for discretizing continuous input variables, which can improve performance.
nClassLevelsPrint Limits the display of class levels in the output tables. A value of 0 suppresses all levels.
nMC Specifies the number of MCMC iterations to perform after the burn-in phase. This is the main sample size for posterior inference.
nMCDist Specifies the number of MCMC iterations for each chain when running in distributed mode.
nominals Specifies the nominal (categorical) input variables to use in the analysis.
nThin Specifies the thinning rate of the simulation, which saves one sample every 'nThin' iterations to reduce autocorrelation in the saved chain.
nTree Specifies the number of trees in the sum-of-trees ensemble. A larger number of trees can capture more complex patterns but increases computation time.
obsLeafMapInMem When set to True, stores a mapping of each observation to terminal nodes in memory, which can speed up certain post-processing tasks.
orderSplit Specifies the minimum cardinality for which a categorical input uses splitting rules according to level ordering.
output Creates an output table containing observation-wise statistics, such as predicted values and residuals.
outputTables Lists the names of results tables (e.g., ModelInfo, VarImp) to save as CAS tables on the server.
partByFrac Specifies the fraction of the data to be used for testing, allowing for random partitioning.
partByVar Names a variable in the input table whose values are used to partition the data into training and testing roles.
quantileBin When set to True, bin boundaries are set at quantiles of numeric inputs, which can handle skewed distributions better than equal-width bins.
sampleSummary Creates an output table that contains a summary of the sum-of-trees ensemble samples.
seed Specifies a seed for the pseudorandom number generator to ensure reproducibility of the analysis.
sigmaDF Specifies the degrees of freedom of the scaled inverse chi-square prior for the error variance parameter.
sigmaLambda Specifies the scale parameter of the scaled inverse chi-square prior for the error variance parameter.
sigmaQuantile Specifies the quantile level to determine the scale parameter of the inverse chi-square prior for the error variance.
store Saves the fitted model to a binary object in a CAS table, which can be used later for scoring new data with the bart.bartScore action.
table Specifies the input data table for the analysis.
target Specifies the target (dependent or response) variable for the model.
trainInMem When set to True, stores the training data in memory to potentially speed up the training process.
treePrior Specifies the parameters of the regularization prior for the tree structure, controlling its complexity.
varAutoCorr Specifies the autocorrelation lags to compute for the variance parameter in the MCMC chain diagnostics.
varEst Specifies an initial value for the error variance. If not specified, it's estimated from an initial linear regression.
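One plausible reading of how the nBI, nMC, and nThin settings above interact can be sketched in plain Python. This is an illustration of the sample bookkeeping, not the action's internal code, and the nThin=10 value is an assumption (the examples in this article do not set it).

```python
# Sketch of MCMC sample bookkeeping implied by the parameter descriptions:
# nBI burn-in draws are discarded first, then nMC draws follow, of which
# one in every nThin is saved to reduce autocorrelation in the chain.
def saved_samples(nBI: int, nMC: int, nThin: int) -> dict:
    total_iterations = nBI + nMC   # burn-in phase plus main sampling phase
    kept = nMC // nThin            # one sample saved every nThin iterations
    return {"total_iterations": total_iterations, "saved": kept}

# Using the nBI and nMC values from the advanced example in this article,
# with an assumed thinning rate of 10:
print(saved_samples(nBI=500, nMC=2000, nThin=10))
# → {'total_iterations': 2500, 'saved': 200}
```

The saved count, not the raw iteration count, determines how many posterior draws are available for credible intervals and diagnostics.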
Data Preparation
Data Creation

This example creates a sample dataset named 'sample_data'. The target variable 'y' is generated from a combination of linear and non-linear functions of the predictors 'x1', 'x2', 'x3', and 'c1', with added Gaussian noise. This makes it suitable for the bartGauss action, which models a normally distributed response.

DATA sample_data;
    CALL streaminit(123);
    LENGTH c1 $1;
    DO i = 1 TO 1000;
        x1 = rand('UNIFORM');
        x2 = rand('UNIFORM') * 2 - 1;
        x3 = rand('NORMAL');
        IF rand('UNIFORM') < 0.5 THEN c1 = 'A';
        ELSE c1 = 'B';
        /* ifn (numeric), not ifc (character), for the class offset */
        y = 10 * sin(3.14 * x1) + 20 * (x2 - 0.5)**2 + 10 * x3
            + ifn(c1 = 'A', 5, 0) + rand('NORMAL');
        OUTPUT;
    END;
    DROP i;
RUN;
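For readers who want to inspect the same data-generating process outside SAS, here is an assumed-equivalent Python replication of the DATA step above. Python's random streams differ from streaminit, so the values will not match the SAS output row for row.

```python
import math
import random

# Replicates the structure of the SAS DATA step: the response mixes a sine
# term, a quadratic term, a linear term, a class offset for level 'A', and
# standard normal noise.
random.seed(123)
rows = []
for _ in range(1000):
    x1 = random.random()               # Uniform(0, 1)
    x2 = random.random() * 2 - 1       # Uniform(-1, 1)
    x3 = random.gauss(0, 1)            # Normal(0, 1)
    c1 = "A" if random.random() < 0.5 else "B"
    y = (10 * math.sin(3.14 * x1)
         + 20 * (x2 - 0.5) ** 2
         + 10 * x3
         + (5 if c1 == "A" else 0)
         + random.gauss(0, 1))
    rows.append((x1, x2, x3, c1, y))

print(len(rows))  # 1000 simulated observations
```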

Examples

This example demonstrates a basic call to the bartGauss action. It loads the 'sample_data' table into a CAS session, then fits a BART model with 'y' as the target and 'x1', 'x2', 'x3', and 'c1' as predictors. This is the simplest way to run the action, relying on default settings for the number of trees, MCMC iterations, and other hyperparameters.

SAS® / CAS Code
PROC CAS;
    loadactionset 'bart';
    bart.bartGauss /
        table='sample_data',
        target='y',
        inputs={'x1', 'x2', 'x3', 'c1'};
RUN;

This example shows a more advanced usage of the bartGauss action. It specifies the number of trees (nTree=100), burn-in iterations (nBI=500), and MCMC samples (nMC=2000). It also partitions the data, using 25% for testing (partByFrac). The fitted model is saved to a CAS table named 'bart_model_store' for later use. Additionally, an output table 'bart_predictions' is created to store predicted values and residuals for each observation.

SAS® / CAS Code
PROC CAS;
    loadactionset 'bart';
    bart.bartGauss /
        table={name='sample_data'},
        target='y',
        inputs={'x1', 'x2', 'x3'},
        nominals={'c1'},
        nTree=100,
        nBI=500,
        nMC=2000,
        seed=456,
        partByFrac={test=0.25, seed=123},
        store={name='bart_model_store', replace=true},
        output={casOut={name='bart_predictions', replace=true},
                pred='predicted_y', resid='residual_y'};
RUN;
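Once the 'bart_predictions' table is fetched to the client, fit statistics can be computed from its observation-wise columns. This Python sketch uses the predicted_y and residual_y column names chosen in the example above; the three rows are stand-in values, not real model output.

```python
import math

# Hypothetical post-processing of the 'bart_predictions' output table.
# Each row holds the posterior-mean prediction and the residual for one
# observation; the RMSE is the square root of the mean squared residual.
rows = [
    {"predicted_y": 12.1, "residual_y": -0.4},
    {"predicted_y": 8.7,  "residual_y": 1.1},
    {"predicted_y": 15.3, "residual_y": 0.2},
]

rmse = math.sqrt(sum(r["residual_y"] ** 2 for r in rows) / len(rows))
print(f"RMSE: {rmse:.3f}")
# → RMSE: 0.686
```

The same pattern applies to any held-out partition created with partByFrac: filter to the test role first, then aggregate the residuals.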

FAQ

What is the purpose of the bart.bartGauss action?
It fits Bayesian additive regression tree (BART) models for a continuous response that is assumed to follow a normal distribution, using a sum of regression trees to capture non-linear relationships and interactions without a pre-specified model form.

Which parameter is used to specify the input data table?
The table parameter names the input CAS table for the analysis.

How can I define the model's dependent and independent variables?
Use target for the response variable and inputs (plus nominals for categorical predictors) for the explanatory variables; alternatively, the model parameter defines depVars and effects directly.

What does the 'nTree' parameter control?
The number of trees in the sum-of-trees ensemble. More trees can capture more complex patterns but increase computation time.

How are the MCMC iterations managed in this action?
nBI sets the number of burn-in iterations that are discarded, nMC sets the number of iterations used for posterior inference, and nThin controls the thinning rate of the saved chain.

How can the trained model be saved for future scoring?
The store parameter saves the fitted model as a binary object in a CAS table, which the bart.bartScore action can later use to score new data.

How does the action handle missing values in predictor variables?
The missing parameter selects the strategy ('MACBIG', 'MACSMALL', 'NONE', or 'SEPARATE'); 'SEPARATE' treats missingness as its own, potentially informative, category.

Associated Scenarios

Use Case
Standard Case: Predicting Remaining Useful Life of Industrial Equipment

An industrial manufacturing company wants to predict the Remaining Useful Life (RUL) in hours for a critical machine component. The goal is to move from a fixed-schedule mainten...

Use Case
Performance Case: Large-Scale Customer Value Modeling with Time Constraints

A financial services company needs to model the potential future value of a large customer base (millions of records). The modeling process must be completed within a strict 45-...

Use Case
Edge Case: Modeling Patient Response with Incomplete Biometric Data

A clinical research organization is analyzing patient trial data. Some biometric sensor readings, which are important predictors for treatment effectiveness, are missing due to ...