crossValidate - WeAreCAS

Q: What is the purpose of the crossValidate action?

The crossValidate action is used to perform cross-validation with specified machine learning actions.

Q: Which model types are supported by the crossValidate action?

The crossValidate action supports several model types, including 'BNET' (Bayesian Network Classifier), 'DECISIONTREE', 'FACTMAC' (Factorization Machine), 'FOREST', 'GRADBOOST' (Gradient Boosting Tree), 'NEURALNET', and 'SVM' (Support Vector Machine).

Q: What is the function of the 'kFolds' parameter?

The 'kFolds' parameter specifies the number of folds to use for the cross-validation process. The default value is 5, with a minimum required value of 2.

Q: How can I control the training process within cross-validation?

The required 'trainOptions' parameter is a dictionary that specifies the parameters for the model training action used during cross-validation.

Q: Is it possible to run the cross-validation folds in parallel?

Yes, by default, the folds are evaluated in parallel. You can control this behavior with the 'parallelFolds' parameter (defaulting to True) and specify the number of worker nodes for each subsession with the 'nSubsessionWorkers' parameter.

Q: How can I manage the randomness of fold sampling?

You can use the 'seed' parameter to specify a seed for fold sampling, which ensures reproducibility of the cross-validation process. The default value is 0.
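
As a rough illustration of why a seed makes fold sampling repeatable, the sketch below assigns rows to folds in plain Python. It is not the action's actual sampling algorithm; the function name `assign_folds` and its arguments are hypothetical, and the source does not say how the action treats the default seed of 0.

```python
import random

def assign_folds(n_rows, k_folds=5, seed=0):
    """Return a fold index (0..k_folds-1) for each of n_rows rows.

    Hypothetical helper for illustration only; the crossValidate action
    performs its own fold sampling on the server.
    """
    if k_folds < 2:
        raise ValueError("k_folds must be at least 2")
    # Balanced labels 0, 1, ..., k-1, 0, 1, ... so fold sizes differ by at most 1.
    labels = [i % k_folds for i in range(n_rows)]
    # A dedicated generator keeps the shuffle independent of global random state.
    random.Random(seed).shuffle(labels)
    return labels

first = assign_folds(10, k_folds=5, seed=42)
second = assign_folds(10, k_folds=5, seed=42)
print(first == second)  # True: the same seed gives the same assignment
```

Using a private `random.Random(seed)` instance rather than the module-level generator means other code that draws random numbers cannot change the assignment.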

Q: How can I adjust the amount of log information displayed?

The 'logLevel' parameter controls the verbosity of log messages. It ranges from 0 (no logs) to 3 (most detailed logs, including fold start and completion). The default is 3.

Description

The crossValidate action performs cross-validation with specified machine learning actions. It divides the input data into the specified number of folds. In standard k-fold cross-validation, each fold is held out in turn while a model is trained on the remaining folds and then assessed on the held-out fold. This gives an estimate of how well the model generalizes to unseen data and helps detect overfitting.
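
The general k-fold procedure can be sketched in plain Python as follows. This is a minimal illustration of the technique, not the action's implementation: the function `k_fold_accuracy`, the majority-class "model", the striped fold assignment, and the toy labels are all assumptions made for the example.

```python
from collections import Counter

def k_fold_accuracy(labels, k_folds=5):
    """Per-fold accuracy of a majority-class baseline under k-fold CV.

    Hypothetical helper for illustration; a real run would train the
    model type selected through modelType instead of this baseline.
    """
    n = len(labels)
    if k_folds < 2:
        raise ValueError("k_folds must be at least 2")
    if n < k_folds:
        raise ValueError("need at least one row per fold")
    # Deterministic striped assignment: row i goes to fold i % k_folds.
    fold_of = [i % k_folds for i in range(n)]
    scores = []
    for fold in range(k_folds):
        train = [labels[i] for i in range(n) if fold_of[i] != fold]
        held_out = [labels[i] for i in range(n) if fold_of[i] == fold]
        # "Training": the baseline remembers the most common training label.
        prediction = Counter(train).most_common(1)[0][0]
        # "Assessment": accuracy measured on the held-out fold only.
        correct = sum(1 for y in held_out if y == prediction)
        scores.append(correct / len(held_out))
    return scores

# Toy data: 8 customers who stay and 2 who churn.
labels = ["stay"] * 8 + ["churn"] * 2
scores = k_fold_accuracy(labels, k_folds=5)
print(scores)                     # [1.0, 1.0, 1.0, 0.5, 0.5]
print(sum(scores) / len(scores))  # 0.8
```

The spread between the per-fold scores is as informative as their mean: folds that score much lower than the others suggest the model's performance depends on which rows it happened to see.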

mlTools.crossValidate {
    casOut={caslib="string", compress=TRUE|FALSE, indexVars={"variable-name-1", ...}, label="string", lifetime=64-bit-integer, maxMemSize=64-bit-integer, memoryFormat="DVR"|"INHERIT"|"STANDARD", name="table-name", promote=TRUE|FALSE, replace=TRUE|FALSE, replication=integer, tableRedistUpPolicy="DEFER"|"NOREDIST"|"REBALANCE", threadBlockSize=64-bit-integer, timeStamp="string", where={"string-1", ...}},
    kFolds=integer,
    logLevel=integer,
    modelType="BNET"|"DECISIONTREE"|"FACTMAC"|"FOREST"|"GRADBOOST"|"NEURALNET"|"SVM",
    nSubsessionWorkers=integer,
    parallelFolds=TRUE|FALSE,
    seed=integer,
    targetEvent="string",
    trainOptions={<key-1>=<any-list-or-data-type-1>, ...}
};

Settings

Parameter	Description
casOut	Specifies the score output table name and details.
kFolds	Specifies the number of folds to use for cross validation. Default: 5. Minimum: 2.
logLevel	Specifies the level of log messages to be written: no logs (0), initialization and completion logs (1), setup summary logs added (2), fold begin and complete logs added (3). Default: 3.
modelType	Specifies the model type to which cross validation is applied. Supported types include BNET, DECISIONTREE, FACTMAC, FOREST, GRADBOOST, NEURALNET, and SVM. Default: DECISIONTREE.
nSubsessionWorkers	Specifies the number of worker nodes for each subsession to use for parallel fold evaluation. Default: 0.
parallelFolds	When set to True, evaluates folds in parallel. Default: TRUE.
seed	Specifies the seed to use for fold sampling for cross validation. Default: 0.
targetEvent	Specifies the name of the nominal target event to use for model assessment.
trainOptions	Specifies a list of parameters for the model training action to use in the cross validation process. This is a required parameter.

Examples
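
The page lists no worked example, so the following is a hedged sketch of how the documented parameters and defaults could be assembled and checked in Python before being sent to a CAS session. The helper `build_cross_validate_params` is hypothetical, the contents of `train_options` (table, target, and input names) are placeholder assumptions, and the actual submission to the server (for example through a CAS client library) is not shown.

```python
# Model types listed in the action's syntax.
MODEL_TYPES = {"BNET", "DECISIONTREE", "FACTMAC", "FOREST",
               "GRADBOOST", "NEURALNET", "SVM"}

def build_cross_validate_params(train_options, model_type="DECISIONTREE",
                                k_folds=5, log_level=3, parallel_folds=True,
                                n_subsession_workers=0, seed=0,
                                target_event=None):
    """Build a parameter dictionary using the documented names and defaults.

    Hypothetical helper: it only validates values locally against the
    constraints stated on this page and does not contact a server.
    """
    if not isinstance(train_options, dict) or not train_options:
        raise ValueError("trainOptions is required and must be a non-empty dict")
    model_type = model_type.upper()
    if model_type not in MODEL_TYPES:
        raise ValueError(f"unsupported modelType: {model_type}")
    if k_folds < 2:
        raise ValueError("kFolds must be at least 2")
    if log_level not in (0, 1, 2, 3):
        raise ValueError("logLevel must be between 0 and 3")
    params = {
        "modelType": model_type,
        "kFolds": k_folds,
        "logLevel": log_level,
        "parallelFolds": parallel_folds,
        "nSubsessionWorkers": n_subsession_workers,
        "seed": seed,
        "trainOptions": train_options,
    }
    # targetEvent is optional, so include it only when one is given.
    if target_event is not None:
        params["targetEvent"] = target_event
    return params

# Placeholder training options; the names below are illustrative assumptions.
train_options = {
    "table": {"name": "CHURN"},
    "target": "churned",
    "inputs": ["tenure", "balance"],
}
params = build_cross_validate_params(train_options, model_type="gradboost",
                                     seed=12345, target_event="1")
print(params["modelType"], params["kFolds"], params["seed"])  # GRADBOOST 5 12345
```

Validating locally catches simple mistakes, such as a fold count below 2 or a misspelled model type, before any work is submitted.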

Associated Scenarios

Use Case

Standard Churn Prediction with Gradient Boosting

A retail bank wants to estimate the generalization error of a Gradient Boosting model designed to predict customer churn. They need to ensure the model performs consistently acr...

Use Case

High-Volume Fraud Detection with Parallel Forest Training

A credit card processor needs to validate a Random Forest model for fraud detection on a large dataset. Due to the data volume and the complexity of the forest, they want to uti...

Use Case

Rare Disease Diagnosis with SVM and Specific Target Event

A medical research facility is testing a Support Vector Machine (SVM) classifier for a rare disease. They need to validate the model using a low number of folds due to small sam...
