The mbcFit action performs model-based clustering on a given data set using the Expectation-Maximization (EM) algorithm. This technique fits a mixture of Gaussian distributions to the data, where each distribution represents a distinct cluster. It is a powerful method for unsupervised classification, allowing for flexible cluster shapes and sizes based on the specified covariance structure.
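To make the EM idea concrete, here is a minimal Python sketch that fits a two-component, one-dimensional Gaussian mixture with EM. This is an illustration of the general technique only, not the mbcFit implementation; the data, initial values, and iteration count are arbitrary choices for the example.

```python
import numpy as np

# Illustrative EM for a two-component 1-D Gaussian mixture (not mbcFit).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(6.0, 1.0, 300)])

# Initial guesses for mixing weights, means, and variances.
w = np.array([0.5, 0.5])
mu = np.array([x.min(), x.max()])
var = np.array([1.0, 1.0])

for _ in range(100):
    # E-step: posterior probability (responsibility) of each component
    # for each observation, from the current parameter estimates.
    dens = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: re-estimate parameters from the responsibilities.
    nk = resp.sum(axis=0)
    w = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
```

With well-separated components like these, the estimated means should converge near the true values of 0 and 6.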
| Parameter | Description |
|---|---|
| table | Specifies the input data table for clustering. |
| model | Specifies the model for the analysis. This includes the effects, which are the input variables for clustering. |
| nClusters | Specifies the number of Gaussian clusters to fit. You can provide a single integer or a list of integers to test multiple cluster counts. |
| covStruct | Specifies the covariance model(s) to be used for the clusters. Different structures allow for different shapes, volumes, and orientations of clusters. You can provide a single structure or a list to test multiple models. |
| criterion | Specifies the model selection criterion (e.g., BIC, AIC) to determine the best model when multiple cluster counts or covariance structures are tested. |
| noise | Specifies whether to include a noise cluster in the model to capture observations that do not belong to any primary cluster. |
| seed | Specifies the random seed for initialization, ensuring reproducibility of the results. |
| output | Specifies the creation of an output table containing observation-wise results, such as cluster membership probabilities. |
| store | Specifies a location to save the fitted model as a binary object for later use with the `mbcScore` action. |
| maxIter | Sets the maximum number of iterations for the EM algorithm. |
| technique | Specifies the algorithm variant to use, either standard Expectation-Maximization (EM) or Classification EM (CEM). |
| initMethod | Defines the method for generating the initial cluster assignments, which can be random or based on k-means. |
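To show how a criterion such as BIC selects among candidate models when multiple cluster counts are tested, here is a hedged Python sketch. The log-likelihoods and parameter counts below are made-up illustrative numbers, not output from mbcFit; BIC is computed as k·ln(n) − 2·lnL, and the model with the lowest value wins.

```python
import numpy as np

# Hypothetical candidates: cluster count -> (log-likelihood, n_params).
# These numbers are invented for illustration only.
n = 200
candidates = {2: (-512.3, 11), 3: (-498.7, 17), 4: (-495.1, 23)}

# BIC = n_params * ln(n) - 2 * logL; smaller is better.
bic = {k: p * np.log(n) - 2 * ll for k, (ll, p) in candidates.items()}
best = min(bic, key=bic.get)
```

Here the extra parameters of the larger models outweigh their likelihood gains, so the two-cluster model has the lowest BIC.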
This example demonstrates how to create a CAS table named `my_cas_table` with synthetic data. The table contains four numeric variables generated from different distributions, making it suitable for a clustering task.
```sas
data my_cas_table;
   call streaminit(123);
   do i = 1 to 200;
      x1 = rand('NORMAL', 10, 2);
      x2 = rand('NORMAL', 20, 5);
      x3 = rand('NORMAL', 5, 1);
      x4 = rand('UNIFORM') * 15;
      output;
   end;
run;

proc casutil;
   load data=my_cas_table outcaslib="casuser" casout="my_cas_table" replace;
run;
```
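For readers working outside SAS, a rough NumPy analogue of the DATA step above follows. The distributions match, but NumPy's generator differs from SAS's streaminit, so the individual draws will not be identical.

```python
import numpy as np

# NumPy analogue of the SAS DATA step: 200 rows, four numeric variables.
rng = np.random.default_rng(123)
n = 200
data = np.column_stack([
    rng.normal(10, 2, n),   # x1 ~ N(10, 2)
    rng.normal(20, 5, n),   # x2 ~ N(20, 5)
    rng.normal(5, 1, n),    # x3 ~ N(5, 1)
    rng.uniform(0, 15, n),  # x4 ~ Uniform(0, 15)
])
```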
This example performs a simple model-based clustering on the `my_cas_table`. It analyzes the variables `x1` and `x2` and searches for the optimal number of clusters between 2 and 4 using the default BIC criterion and EEE covariance structure.
```sas
proc cas;
   mbc.mbcFit /
      table={name='my_cas_table'},
      model={effects={{vars={'x1', 'x2'}}}},
      nClusters={2, 3, 4},
      seed=12345;
run;
```
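A hedged Python analogue of this cluster-count search, using scikit-learn's GaussianMixture rather than the mbc action set: `covariance_type='tied'` plays a role loosely similar to the EEE structure (one shared covariance matrix across clusters), and BIC picks the winning k. The synthetic data here is invented for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Three well-separated 2-D clusters (illustrative data, not my_cas_table).
rng = np.random.default_rng(12345)
X = np.vstack([
    rng.normal([0, 0], 1, (150, 2)),
    rng.normal([8, 8], 1, (150, 2)),
    rng.normal([0, 8], 1, (150, 2)),
])

# Fit k = 2, 3, 4 and keep the BIC for each; lowest BIC wins.
bics = {k: GaussianMixture(n_components=k, covariance_type='tied',
                           random_state=0).fit(X).bic(X)
        for k in (2, 3, 4)}
best_k = min(bics, key=bics.get)
```

With clusters this well separated, BIC should select k = 3.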
This advanced example demonstrates fitting models with a range of cluster counts (2 to 5) and multiple covariance structures (EEE, EEV, VVV). It includes a noise component, uses the AICC for model selection, and saves both the fitted model to a store and the scored data to an output table.
```sas
proc cas;
   mbc.mbcFit /
      table={name='my_cas_table'},
      model={effects={{vars={'x1', 'x2', 'x3', 'x4'}}}},
      nClusters={2, 3, 4, 5},
      covStruct={'EEE', 'EEV', 'VVV'},
      noise='Y',
      criterion='AICC',
      store={name='mbc_model_store', replace=true},
      output={casOut={name='mbc_output_table', replace=true}, copyVars={'i'}},
      seed=54321;
run;
```
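The output table produced above holds observation-wise cluster membership probabilities. A hedged scikit-learn analogue of that output (again not the mbc action set; the data is invented for the example):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two illustrative 2-D clusters.
rng = np.random.default_rng(54321)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gm.predict_proba(X)  # per-observation membership probabilities
labels = gm.predict(X)       # hard assignment: most probable cluster
```

Each row of `probs` sums to 1, mirroring the membership-probability columns of the output table.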