alpha: specifies the value to use for minimal cost-complexity pruning for regression trees. Minimum value: 0
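For orientation, minimal cost-complexity pruning is usually described as scoring a tree by its error plus a penalty of alpha for every leaf. The sketch below assumes that textbook definition. The function name and the error and leaf counts are illustrative and are not taken from the SAS implementation.

```python
def cost_complexity(tree_error: float, n_leaves: int, alpha: float) -> float:
    """Textbook cost-complexity measure: error + alpha * number of leaves."""
    if alpha < 0:
        raise ValueError("alpha must be >= 0")
    return tree_error + alpha * n_leaves

# Hypothetical candidate trees: (error, number of leaves)
full_tree = (0.10, 8)
pruned_tree = (0.14, 3)

for alpha in (0.0, 0.01):
    full = cost_complexity(*full_tree, alpha)
    pruned = cost_complexity(*pruned_tree, alpha)
    better = "full" if full <= pruned else "pruned"
    print(f"alpha={alpha}: full={full:.3f}, pruned={pruned:.3f} -> keep {better} tree")
```

With alpha at 0 there is no penalty and the larger tree wins. A larger alpha favors smaller trees.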

applyRowOrder: specifies that the action uses a prespecified row ordering. This requires using the orderby and groupby parameters on a preliminary table.partition action call. Alias: reproducibleRowOrder Default: FALSE

attributes: specifies temporary attributes, such as a format, to apply to input variables. For more information about specifying the attributes parameter, see the common casinvardesc parameter. Aliases: attribute, attrs, attr, varAttrs
  format: specifies the format to apply to the variable.
  formattedLength: specifies the length of the format field plus the length of the format precision.
  label: specifies the descriptive label for the variable.
  name: specifies the name for the variable.
  nfd: specifies the length of the format precision.
  nfl: specifies the length of the format field.

binOrder: by default, the bin order is preserved for numeric variables. When set to False, the bin order is ignored for numeric variables. Default: TRUE

bootstrap: specifies the fraction of the data for the bootstrap sample. Default: 0.63212055882 Range: (0–1]
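The default fraction matches 1 − 1/e to the digits shown. That is the limiting share of distinct rows obtained when n rows are drawn with replacement from n rows. The source does not state this rationale, so the sketch below is only a numerical check of that observation. The function name and seed are illustrative.

```python
import math
import random

# Limiting share of distinct rows when n rows are drawn with replacement
limit = 1 - 1 / math.e
print(limit)

def distinct_fraction(n: int, seed: int = 1) -> float:
    """Draw n row indices with replacement and return the share of distinct ones."""
    rng = random.Random(seed)
    return len({rng.randrange(n) for _ in range(n)}) / n

# The simulated share approaches the limit as n grows
print(distinct_fraction(100_000))
```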

casOut: specifies the table to store the decision tree model in. When not specified, a random name is generated. For more information about specifying the casOut parameter, see the common casouttable parameter.

cfLev: specifies the aggressiveness of tree pruning according to the C4.5 algorithm. Default: 0.25

code: requests that the action produce SAS score code. Additional parameters can be specified. For more information about specifying the code parameter, see the common codegen parameter.
  casOut: specifies the settings for an output table. For more information about specifying the casOut parameter, see the common casouttable parameter.
  comment: when set to True, adds comments to the DATA step code. Default: FALSE
  fmtWdth: specifies the width to use for formatting derived numbers such as parameter estimates in the DATA step code. Alias: fmtWidth Default: 20 Range: 0–32
  indentSize: specifies the number of spaces to indent the DATA step code for each level. Default: 3 Range: 0–10
  labelId: specifies the label ID to use in array names and statement labels in the DATA step code. By default, a random positive integer is used.
  lineSize: specifies the line size for the generated code. Default: 120 Range: 64–254
  noTrim: when set to True, bases the comparison of variables with formatted values on the full format width with padding. By default, leading and trailing blanks are removed from the formatted values. Default: FALSE
  tabForm: when set to True, generates the code in a way that is appropriate for storing in a table. Alias: tableForm Default: FALSE

codeInteractions: requests that the action produce SAS score code to create variables encoding interactions. You must also request variable interactions of at least degree 2. The viicodegen value can be one or more of the following:
  casOut: specifies the settings for an output table. For more information about specifying the casOut parameter, see the common casouttable parameter.
  comment: when set to True, adds comments to the DATA step code. Default: FALSE
  fmtWdth: specifies the width to use for formatting derived numbers such as parameter estimates in the DATA step code. Alias: fmtWidth Default: 20 Range: 0–32
  indentSize: specifies the number of spaces to indent the DATA step code for each level. Default: 3 Range: 0–10
  labelId: specifies the label ID to use in array names and statement labels in the DATA step code. By default, a random positive integer is used.
  lineSize: specifies the line size for the generated code. Default: 120 Range: 64–254
  noTrim: when set to True, bases the comparison of variables with formatted values on the full format width with padding. By default, leading and trailing blanks are removed from the formatted values. Default: FALSE
  tabForm: when set to True, generates the code in a way that is appropriate for storing in a table. Alias: tableForm Default: FALSE

crit: specifies the split criterion for each tree node.

encodeName: specifies whether to encode the variable names such as predicted probabilities of a binary or nominal target in the generated casout table. The predicted probabilities are named with the prefix P_ instead of _DT_P_. Default: FALSE

event: specifies the event values of the target variable. This parameter is combined with the eventFreq parameter to specify the frequency for each specific event. Observations with the specified event are reweighted with the value from the eventFreq parameter. Specifying this parameter is useful for rare-event sampling.

eventFreq: specifies the frequency for each corresponding event in the event parameter.

freq: specifies a numeric variable that contains the frequency of occurrence of each observation.

greedy: by default, a greedy search or exhaustive search is used to determine the best split for each variable of each tree node. When set to False, a fast and efficient algorithm that is based on clustering is applied. Setting this parameter to False is recommended for variables with high cardinality. Default: TRUE

includeMissing: by default, observations with missing values are included. When set to False, observations with missing values for the variables used in the tree model are ignored when scoring. Default: TRUE

inputs: specifies the input variables to use in the analysis. For more information about specifying the inputs parameter, see the common casinvardesc parameter. Alias: input
  format: specifies the format to apply to the variable.
  formattedLength: specifies the length of the format field plus the length of the format precision.
  label: specifies the descriptive label for the variable.
  name: specifies the name for the variable.
  nfd: specifies the length of the format precision.
  nfl: specifies the length of the format field.

isolation: when set to True, trains an isolation forest. Default: FALSE

leafSize: specifies the minimum number of observations on each node. Default: 5 Minimum value: 1

loh: specifies the number of variables to split on using the LOH method. Default: 0

m: specifies the number of input variables to consider for splitting on a node. The variables are selected at random from the input variables for each tree. By default, a forest uses the square root of the number of input variables, rounded up to the nearest integer. For gradient boosting, the number of input variables is used. Minimum value: 1
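The default for a forest described above can be reproduced with a short calculation. This is a minimal sketch of the stated rule. The function name is illustrative, and integer arithmetic is used to avoid floating-point rounding for large input counts.

```python
import math

def default_m_forest(n_inputs: int) -> int:
    """Square root of the number of inputs, rounded up to the nearest integer."""
    if n_inputs < 1:
        raise ValueError("n_inputs must be >= 1")
    # isqrt(n - 1) + 1 equals ceil(sqrt(n)) for every integer n >= 1
    return math.isqrt(n_inputs - 1) + 1

for p in (2, 9, 10, 100):
    print(p, default_m_forest(p))
```

For example, 10 input variables give a default of 4, because the square root of 10 is about 3.16.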

maxBranch: specifies the maximum number of children (branches) allowed for each level of the tree. Default: 2 Minimum value: 1

maxLevel: specifies the maximum number of tree levels. Default: 6 Minimum value: 1

mergeBin: by default, when the largest value in one bin matches the lowest value in a neighboring bin, the values are merged into the lower bin. When set to False, the action does not try to merge bins. Default: TRUE

minUseInSearch: specifies a threshold for utilizing missing values in the split search when the missing parameter is set to USEINSEARCH. If the number of observations in which the splitting variable has missing values in a node is greater than or equal to the specified value, then the action initiates the USEINSEARCH policy. Otherwise, the missing values are assigned to a popular branch. Default: 1
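The threshold rule above amounts to a per-node check. The sketch below only restates that rule. The function name and the "POPULAR_BRANCH" label are hypothetical and are not SAS identifiers.

```python
def missing_policy_for_node(n_missing: int, min_use_in_search: int = 1) -> str:
    """Return the handling applied to missing values of the splitting variable in a node."""
    if n_missing >= min_use_in_search:
        return "USEINSEARCH"     # missing values take part in the split search
    return "POPULAR_BRANCH"      # hypothetical label: assign to a popular branch

print(missing_policy_for_node(n_missing=3, min_use_in_search=5))
print(missing_policy_for_node(n_missing=5, min_use_in_search=5))
```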

missing: specifies the missing policy to handle missing values. Default: USEINSEARCH
  MACSMALL: specifies to treat the missing values for numeric variables as the smallest machine value and to treat missing values for nominal variables as a separate level.
  USEINSEARCH: specifies to incorporate missing values in the calculation of the worth of a splitting rule, and consequently to produce a splitting rule that associates missing values with a branch that maximizes the worth of the split.

modelId: specifies the model ID variable name to use when generating SAS score code. By default, DT_ is prefixed to the target variable name.

nBins: specifies the number of bins to use for numeric variables in the calculation of the decision tree. Default: 50 Minimum value: 1

nBinsTarget: specifies the number of bins to use for a numerical target variable. Default: 0

nominals: specifies the nominal input variables to use in the analysis. For more information about specifying the nominals parameter, see the common casinvardesc parameter. Alias: nominal
  format: specifies the format to apply to the variable.
  formattedLength: specifies the length of the format field plus the length of the format precision.
  label: specifies the descriptive label for the variable.
  name: specifies the name for the variable.
  nfd: specifies the length of the format precision.
  nfl: specifies the length of the format field.

nominalSearch: specifies the method for finding a split on a nominal input. Alias: nomSearch
  handling: CLASSIC | ENHANCED
  maxCategories: specifies the maximum number of levels for a splitting rule to include. Aliases: maxCats, maxLevels, maxValues, cluster, minCardCluster Default: 128 Minimum value: 0
  shrinkage: specifies how much weight to give the category average in the sort method. Default: 10 Minimum value: 0
  sort: specifies the minimum cardinality of an input to use the sort method. Alias: minCardSort Default: 10 Minimum value: 0
  sortBy: COUNT | TARGET

nTree: specifies the number of trees to create. Alias: nTrees Default: 50 Minimum value: 1

oob: when set to True, specifies that the out-of-bag (OOB) error is computed when building a forest. This also generates a result table with the OOB error for each tree. Default: FALSE

prune: when set to True, uses a C4.5 pruning method for classification trees or minimal cost-complexity pruning for regression trees. Default: FALSE

quantileBin: when set to True, places bin boundaries at quantiles of numerical inputs instead of using bins of equal width. Aliases: qbin, qtbin Default: TRUE
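The difference between the two binning styles is easiest to see on skewed data. The sketch below is illustrative only. It uses Python's statistics.quantiles as a stand-in, and the quantile method that the action actually uses is not described in the source.

```python
import statistics

def equal_width_boundaries(values, n_bins):
    """Interior boundaries that split [min, max] into n_bins bins of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + k * width for k in range(1, n_bins)]

def quantile_boundaries(values, n_bins):
    """Interior boundaries that put roughly the same number of values in each bin."""
    return statistics.quantiles(values, n=n_bins)

values = [1, 2, 3, 4, 100]  # one large outlier
print(equal_width_boundaries(values, 4))  # boundaries spread evenly up to the outlier
print(quantile_boundaries(values, 4))     # boundaries follow where the data lies
```

With equal-width bins, the four small values all fall into the first bin. Quantile boundaries separate them.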

rbaImp: when set to True, computes variable importance using the random branch assignments (RBA) method. Default: FALSE

sampleN: specifies the sample size. Default: 100 Range: (0–MACINT]

saveState: specifies the table to store the generated aStore model. For more information about specifying the saveState parameter, see the common casouttable parameter.

seed: specifies the seed for the random number generator. By default, the random number stream is based on the computer clock. Negative values also result in random number streams based on the computer clock. If you want a reproducible random number sequence between runs, specify a value that is greater than zero. Default: 0 Range: 0–MACINT

table: specifies the settings for an input table.
  Long form: table={name="table-name"}
  Shortcut form: table="table-name"
  The castable value can be one or more of the following:
  caslib: specifies the caslib for the input table that you want to use with the action. By default, the active caslib is used. Specify a value only if you need to access a table from a different caslib.
  computedOnDemand: when set to True, creates the computed variables when the table is loaded instead of when the action begins. Alias: compOnDemand Default: FALSE
  computedVars: specifies the names of the computed variables to create. Specify an expression for each variable in the computedVarsProgram parameter. If you do not specify this parameter, then all variables from computedVarsProgram are automatically included. Alias: compVars
    format: specifies the format to apply to the variable.
    formattedLength: specifies the length of the format field plus the length of the format precision.
    label: specifies the descriptive label for the variable.
    name: specifies the name for the variable.
    nfd: specifies the length of the format precision.
    nfl: specifies the length of the format field.
  computedVarsProgram: specifies an expression for each computed variable that you include in the computedVars parameter. Alias: compPgm
  dataSourceOptions: specifies data source options. Aliases: options, dataSource
  importOptions: specifies the settings for reading a table from a data source. Alias: import For more information about specifying the importOptions parameter, see the common importOptions parameter.
  name: specifies the name of the input table.
  singlePass: when set to True, does not create a transient table on the server. Setting this parameter to True can be efficient, but the data might not have stable ordering upon repeated runs. Default: FALSE
  vars: specifies the variables to use in the action.
    format: specifies the format to apply to the variable.
    formattedLength: specifies the length of the format field plus the length of the format precision.
    label: specifies the descriptive label for the variable.
    name: specifies the name for the variable.
    nfd: specifies the length of the format precision.
    nfl: specifies the length of the format field.
  where: specifies an expression for subsetting the input data.
  whereTable: specifies an input table that contains rows to use as a WHERE filter. If the vars parameter is not specified, then all the variable names that are common to the input table and the filtering table are used to find matching rows. If the where parameter for the input table and this parameter are specified, then this filtering table is applied first.
    casLib: specifies the caslib for the filter table. By default, the active caslib is used.
    dataSourceOptions: specifies data source options. Aliases: options, dataSource For more information about specifying the dataSourceOptions parameter, see the common dataSourceOptions parameter.
    importOptions: specifies the settings for reading a table from a data source. Alias: import For more information about specifying the importOptions parameter, see the common importOptions parameter.
    name: specifies the name of the filter table.
    vars: specifies the variable names to use from the filter table.
      format: specifies the format to apply to the variable.
      formattedLength: specifies the length of the format field plus the length of the format precision.
      label: specifies the descriptive label for the variable.
      name: specifies the name for the variable.
      nfd: specifies the length of the format precision.
      nfl: specifies the length of the format field.
    where: specifies an expression for subsetting the data from the filter table.

target: specifies the target or response variable for training. If the variable is numeric, is not listed in the nominals parameter, and the nBinsTarget parameter is not specified, then a regression tree is trained.

varImp: specifies whether the variable importance information is generated. The importance value is determined by the total Gini reduction. Default: FALSE
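To make "Gini reduction" concrete, the sketch below computes the standard Gini impurity of a node and the reduction achieved by one split. It follows the textbook formulas. How the action accumulates these reductions across trees is not described in the source, and the function names are illustrative.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_reduction(parent, children):
    """Parent impurity minus the size-weighted impurity of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * gini(child) for child in children)
    return gini(parent) - weighted

parent = ["F", "F", "M", "M"]
print(gini(parent))                                        # impurity before the split
print(gini_reduction(parent, [["F", "F"], ["M", "M"]]))    # a split that separates the classes
print(gini_reduction(parent, [["F", "M"], ["F", "M"]]))    # a split that separates nothing
```

A variable whose splits produce large reductions accumulates a high total and therefore a high importance.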

varIntImp: requests variable interaction importance and specifies the maximum degree of interaction. Default: 1 Range: 0–3

vote: specifies the vote strategy for classification. The specified value influences the generated SAS scoring code and the OOB error computations. Default: MAJORITY
  MAJORITY: specifies to use the majority vote to predict.
  PROB: specifies to use the average probability from the forest to predict.
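The two strategies can disagree when a few trees are very confident. The sketch below illustrates the general idea for a binary target. The 0.5 threshold, the tie handling, and the function names are assumptions and do not come from the SAS implementation.

```python
def majority_vote(tree_probs, threshold=0.5):
    """Each tree casts one vote for the event; predict the event if most trees vote for it."""
    votes = sum(1 for p in tree_probs if p >= threshold)
    return 1 if votes > len(tree_probs) / 2 else 0

def prob_vote(tree_probs, threshold=0.5):
    """Average the per-tree event probabilities, then apply the threshold once."""
    average = sum(tree_probs) / len(tree_probs)
    return 1 if average >= threshold else 0

# Hypothetical event probabilities from three trees for one observation
tree_probs = [0.9, 0.4, 0.45]
print(majority_vote(tree_probs))  # only one of three trees votes for the event
print(prob_vote(tree_probs))      # the average probability is above 0.5
```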

weight: specifies a numeric variable that contains the weight of each observation.

forestTrain - WeAreCAS

Description

The forestTrain action trains a forest model, which is an ensemble of decision trees used for classification, regression, or isolation forest tasks. This action requires a SAS Visual Data Mining and Machine Learning license. It provides options for bootstrap sampling, various splitting criteria (CHAID, GINI, etc.), pruning (C4.5 or cost-complexity), and handling missing values. It can also generate SAS score code, create computed variables, and save the model as an analytic store (aStore).

Settings

Parameter	Description
alpha	Specifies the value to use for minimal cost-complexity pruning for regression trees.
applyRowOrder	Specifies that the action uses a prespecified row ordering. Requires using orderby and groupby on a preliminary table.partition call.
attributes	Specifies temporary attributes, such as a format, to apply to input variables.
binOrder	When set to True (default), the bin order is preserved for numeric variables.
bootstrap	Specifies the fraction of the data for the bootstrap sample. Range: (0–1].
casOut	Specifies the table to store the decision tree model in.
cfLev	Specifies the aggressiveness of tree pruning according to the C4.5 algorithm.
code	Requests that the action produce SAS score code.
codeInteractions	Requests that the action produce SAS score code to create variables encoding interactions.
crit	Specifies the split criterion for each tree node (e.g., GINI, CHAID, VARIANCE).
encodeName	Specifies whether to encode the variable names such as predicted probabilities of a binary or nominal target in the generated casout table.
event	Specifies the event values of the target variable for use with eventFreq.
eventFreq	Specifies the frequency for each corresponding event in the event parameter (useful for rare-event sampling).
freq	Specifies a numeric variable that contains the frequency of occurrence of each observation.
greedy	When set to True (default), a greedy/exhaustive search is used. False uses a fast clustering-based algorithm.
includeMissing	By default, observations with missing values are included. If False, they are ignored during scoring.
inputs	Specifies the input variables to use in the analysis.
isolation	Specifies training an isolation forest (default False).
leafSize	Specifies the minimum number of observations on each node (default 5).
loh	Specifies the number of variables to split on using the LOH method.
m	Specifies the number of input variables to consider for splitting on a node.
maxBranch	Specifies the maximum number of children (branches) allowed for each level of the tree (default 2).
maxLevel	Specifies the maximum number of tree levels (default 6).
mergeBin	When set to True (default), merges bins where the largest value matches the lowest value of the neighbor.
minUseInSearch	Specifies a threshold for utilizing missing values in the split search when missing='USEINSEARCH'.
missing	Specifies the missing policy ('MACSMALL' or 'USEINSEARCH').
modelId	Specifies the model ID variable name to use when generating SAS score code.
nBins	Specifies the number of bins to use for numeric variables (default 50).
nBinsTarget	Specifies the number of bins to use for a numerical target variable.
nominals	Specifies the nominal input variables to use in the analysis.
nominalSearch	Specifies the method for finding a split on a nominal input (e.g., handling='ENHANCED').
nTree	Specifies the number of trees to create (default 50).
oob	When set to True, specifies that the out-of-bag error is computed.
prune	When set to True, uses a C4.5 pruning method for classification trees or minimal cost-complexity pruning for regression trees.
quantileBin	Specifies bin boundaries at quantiles of numerical inputs instead of bins of equal width (default True).
rbaImp	Specifies variable importance using the random branch assignments (RBA) method.
sampleN	Specifies the sample size (default 100).
saveState	Specifies the table to store the generated aStore model.
seed	Specifies the seed for the random number generator.
table	Specifies the settings for the input table.
target	Specifies the target or response variable for training.
varImp	Specifies whether the variable importance information is generated.
varIntImp	Requests variable interaction importance and specifies the maximum degree of interaction.
vote	Specifies the vote strategy for classification ('MAJORITY' or 'PROB').
weight	Specifies a numeric variable that contains the weight of each observation.

Data Preparation

Load Data to CAS

Loads the sashelp.class dataset into the 'casuser' caslib for analysis.

/* Assumes an active CAS session; the libref maps the 'casuser' caslib */
LIBNAME casuser CAS caslib="casuser";

DATA casuser.class;
   SET sashelp.class;
RUN;

Examples

Simple Forest Classification

Trains a forest to predict 'Sex' using 'Height' and 'Weight' with default settings.

SAS® / CAS Code (awaiting community validation)

PROC CAS;
   loadactionset "decisionTree";  /* load the action set if it is not already loaded */
   decisionTree.forestTrain / table={name="class", caslib="casuser"} target="Sex" inputs={"Height", "Weight"};
RUN;
QUIT;

Result:
Generates a forest model that classifies Sex.

Advanced Forest Regression with Model Saving

Trains a forest on 'Weight' (numeric target) using 'Height' and 'Age', requests variable importance and the out-of-bag error, sets the seed, creates 100 trees, and saves the model as an analytic store.

SAS® / CAS Code (awaiting community validation)

PROC CAS;
   loadactionset "decisionTree";  /* load the action set if it is not already loaded */
   decisionTree.forestTrain / table={name="class", caslib="casuser"} target="Weight" inputs={"Height", "Age"} nTree=100 seed=12345 varImp=TRUE oob=TRUE saveState={name="forest_astore", caslib="casuser", replace=TRUE};
RUN;
QUIT;

Result:
Generates a regression forest, outputs variable importance and the OOB error, and saves the 'forest_astore' table.

Related actions

Action	Action set	Description
forestCode	decisionTree	The forestCode action generates SAS DATA step scoring code from a trained for...
forestScore	decisionTree	The forestScore action scores an input table using a previously trained fores...
gbtreeCode	decisionTree	Generates DATA step scoring code from a gradient boosting tree model.
gbtreeScore	decisionTree	Scores a table using a gradient boosting tree model.
