mbc

mbcFit

Description

The mbcFit action performs model-based clustering on a given data set using the Expectation-Maximization (EM) algorithm. This technique fits a mixture of Gaussian distributions to the data, where each distribution represents a distinct cluster. It is a powerful method for unsupervised classification, allowing for flexible cluster shapes and sizes based on the specified covariance structure.
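To make the EM idea above concrete, here is a minimal sketch in plain Python that fits a two-component, one-dimensional Gaussian mixture. It is an illustration under stated assumptions, not the mbcFit implementation: the function names, the half-split initialization, the fixed iteration count, and the synthetic sample are all hypothetical choices made for this example.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_two_gaussians(data, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture with plain EM.

    Returns (weights, means, variances, log-likelihood history).
    """
    # Assumed initialization for this sketch: split the sorted data in half.
    ordered = sorted(data)
    half = len(ordered) // 2
    means = [sum(ordered[:half]) / half,
             sum(ordered[half:]) / (len(ordered) - half)]
    overall_mean = sum(data) / len(data)
    overall_var = sum((x - overall_mean) ** 2 for x in data) / len(data)
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    history = []

    for _ in range(n_iter):
        # E-step: posterior probability of each cluster for each observation.
        resp = []
        loglik = 0.0
        for x in data:
            dens = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(dens)
            loglik += math.log(total)
            resp.append([d / total for d in dens])
        history.append(loglik)

        # M-step: weighted updates of mixing proportions, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var_k, 1e-6)  # guard against a collapsing cluster

    return weights, means, variances, history


# Synthetic data: two well-separated groups centered at 0 and 6.
random.seed(0)
sample = ([random.gauss(0.0, 1.0) for _ in range(300)]
          + [random.gauss(6.0, 1.0) for _ in range(300)])
weights, means, variances, history = em_two_gaussians(sample)
print(sorted(means))
```

The log-likelihood stored in `history` should never decrease from one iteration to the next, which is the property that convergence tests on the log-likelihood rely on.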

mbc.mbcFit <result=results> <status=rc> /
    attributes={{...}},
    convergenceTest="AITKEN" | "LOGL",
    covStruct="ALL" | "ALLGMIX" | "ALLPGMIX" | "CCC" | "CCU" | "CUC" | "CUU" | "EEE" | "EEI" | "EEV" | "EII" | "EVI" | "EVV" | "UCC" | "UCU" | "UUC" | "UUU" | "VII" | "VVI" | "VVV"
        | {"ALL", "ALLGMIX", "ALLPGMIX", "CCC", "CCU", "CUC", "CUU", "EEE", "EEI", "EEV", "EII", "EVI", "EVV", "UCC", "UCU", "UUC", "UUU", "VII", "VVI", "VVV"},
    criterion="AIC" | "AICC" | "BIC" | "LOGL" | "NONE",
    display={...},
    emEpsilon=double,
    factorDetails=TRUE | FALSE,
    groupByLimit=64-bit-integer,
    initMethod="KMEANS" | "RANDOM",
    itHist="DETAILS" | "NONE" | "SUMMARY",
    maxIter=integer,
    model={...},
    nClusters=64-bit-integer | {64-bit-integer-1 <, 64-bit-integer-2, ...>},
    nFactors=64-bit-integer | {64-bit-integer-1 <, 64-bit-integer-2, ...>},
    noise="N" | "Y" | {"N", "Y"},
    output={...},
    outputTables={...},
    parameterEpsilon=double,
    seed=integer,
    singularEpsilon=double,
    store={...},
    table={...},
    technique="CEM" | "EM",
    topModels=64-bit-integer;

Settings
Parameter: Description
table: Specifies the input data table for clustering.
model: Specifies the variables to use for the analysis. This includes the 'effects', which are the input variables for clustering.
nClusters: Specifies the number of Gaussian clusters to fit. You can provide a single integer or a list of integers to test multiple cluster counts.
covStruct: Specifies the covariance model(s) to be used for the clusters. Different structures allow for different shapes, volumes, and orientations of clusters. You can provide a single structure or a list to test multiple models.
criterion: Specifies the model selection criterion (e.g., BIC, AIC) to determine the best model when multiple cluster counts or covariance structures are tested.
noise: Specifies whether to include a noise cluster in the model to capture observations that do not belong to any primary cluster.
seed: Specifies the random seed for initialization, ensuring reproducibility of the results.
output: Specifies the creation of an output table containing observation-wise results, such as cluster membership probabilities.
store: Specifies a location to save the fitted model as a binary object for later use with the `mbcScore` action.
maxIter: Sets the maximum number of iterations for the EM algorithm.
technique: Specifies the algorithm variant to use, either standard Expectation-Maximization (EM) or Classification EM (CEM).
initMethod: Defines the method for generating the initial cluster assignments, which can be random or based on k-means.
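The difference between the two `technique` values can be sketched as follows. In the textbook formulation of Classification EM, a classification step converts each observation's posterior probabilities into a hard assignment before the parameters are updated, whereas standard EM keeps the soft probabilities. The Python snippet below illustrates only that step; the function name and the sample posteriors are hypothetical, and it makes no claim about how the action implements CEM internally.

```python
def classify(posteriors):
    """C-step of textbook CEM: replace each row of posteriors by a 0/1 assignment."""
    hard = []
    for row in posteriors:
        # Index of the largest posterior; ties go to the first cluster.
        winner = max(range(len(row)), key=row.__getitem__)
        hard.append([1.0 if k == winner else 0.0 for k in range(len(row))])
    return hard


# Hypothetical E-step output for three observations and two clusters.
soft = [[0.7, 0.3], [0.4, 0.6], [0.5, 0.5]]
hard = classify(soft)
print(hard)
```

Feeding `hard` instead of `soft` into the parameter updates is what turns an EM iteration into a CEM iteration in this formulation.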
Data Preparation
Creating Sample Data for Clustering

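This example creates a SAS data set with synthetic data and loads it into CAS as the table `my_cas_table`. The table contains an index variable `i` and four numeric variables: `x1`, `x2`, and `x3` are drawn from normal distributions with different means and standard deviations, and `x4` is uniform on the interval from 0 to 15. Each variable comes from a single distribution, so the data have no built-in cluster structure, but they are sufficient to exercise the action's syntax.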

/* Generate 200 observations in a SAS data set */
DATA my_cas_table;
   call streaminit(123);            /* fix the random number stream */
   DO i = 1 to 200;
      x1 = rand('NORMAL', 10, 2);
      x2 = rand('NORMAL', 20, 5);
      x3 = rand('NORMAL', 5, 1);
      x4 = rand('UNIFORM') * 15;
      OUTPUT;
   END;
RUN;

/* Load the data set into the casuser caslib (requires an active CAS session) */
PROC CASUTIL;
   load DATA=my_cas_table outcaslib="casuser" casout="my_cas_table" replace;
QUIT;

Examples

This example performs a simple model-based clustering on `my_cas_table`. It analyzes the variables `x1` and `x2`, fits models with 2, 3, and 4 clusters, and compares them using the default BIC criterion and EEE covariance structure.

SAS® / CAS Code
PROC CAS;
   /* Fit 2-, 3-, and 4-cluster models on x1 and x2 */
   mbc.mbcFit /
      TABLE={name='my_cas_table'},
      model={effects={{vars={'x1', 'x2'}}}},
      nClusters={2, 3, 4},
      seed=12345;
RUN;
QUIT;
Result:
The action returns several tables. The 'Model Information' table describes the setup, and the 'Fit Statistics' table lists the BIC for each model (2, 3, and 4 clusters). The model with the lowest BIC is considered the best. Parameter estimates for the best model are also displayed.
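The "lowest BIC wins" rule can be sketched in a few lines of Python. The snippet uses the textbook smaller-is-better form, BIC = -2 log L + p log n, together with the standard free-parameter counts for the EEE (one shared covariance matrix) and VVV (one covariance matrix per cluster) structures. The log-likelihood values are made up for illustration and are not mbcFit output, and the action's own parameter counting may differ in detail.

```python
import math


def n_free_parameters(n_clusters, n_vars, cov_struct):
    """Textbook free-parameter count of a Gaussian mixture (EEE or VVV only)."""
    mixing = n_clusters - 1                 # mixing proportions sum to 1
    means = n_clusters * n_vars             # one mean vector per cluster
    cov_block = n_vars * (n_vars + 1) // 2  # one symmetric covariance matrix
    if cov_struct == "EEE":                 # covariance shared by all clusters
        return mixing + means + cov_block
    if cov_struct == "VVV":                 # separate covariance per cluster
        return mixing + means + n_clusters * cov_block
    raise ValueError(f"structure not covered by this sketch: {cov_struct}")


def bic(loglik, n_params, n_obs):
    """BIC in the smaller-is-better form: -2 log L + p log n."""
    return -2.0 * loglik + n_params * math.log(n_obs)


# Hypothetical log-likelihoods for 2, 3, and 4 clusters (NOT mbcFit output).
hypothetical_loglik = {2: -1010.0, 3: -1004.0, 4: -1001.0}
scores = {
    g: bic(ll, n_free_parameters(g, n_vars=2, cov_struct="EEE"), n_obs=200)
    for g, ll in hypothetical_loglik.items()
}
best = min(scores, key=scores.get)
print(best, scores)
```

With these invented numbers the small gains in log-likelihood from adding clusters do not offset the penalty term, so the two-cluster model has the lowest BIC.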

This advanced example demonstrates fitting models with a range of cluster counts (2 to 5) and multiple covariance structures (EEE, EEV, VVV). It includes a noise component, uses the AICC for model selection, and saves both the fitted model to a store and the scored data to an output table.

SAS® / CAS Code
PROC CAS;
   /* Compare 4 cluster counts x 3 covariance structures, with a noise component */
   mbc.mbcFit /
      TABLE='my_cas_table',
      model={effects={{vars={'x1', 'x2', 'x3', 'x4'}}}},
      nClusters={2, 3, 4, 5},
      covStruct={'EEE', 'EEV', 'VVV'},
      noise='Y',
      criterion='AICC',
      /* Save the fitted model and the observation-level results */
      store={name='mbc_model_store', replace=true},
      OUTPUT={casOut={name='mbc_output_table', replace=true}, copyVars={'i'}},
      seed=54321;
RUN;
QUIT;
Result:
The action evaluates 12 candidate models (4 cluster counts x 3 covariance structures), each fitted with a noise component. The 'Fit Statistics' table shows the AICC for each model, and the best model is identified and its parameters are detailed. Because the code names no caslib, the binary model store `mbc_model_store` is created in the active caslib (assumed here to be `casuser`). A new CAS table, `mbc_output_table`, is also generated; it contains the original variable `i` plus new columns for the posterior probabilities and the final cluster assignment of each observation.
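The count of 12 models comes from crossing the two lists, which the following Python sketch reproduces. It also includes the standard small-sample corrected AIC formula, AICc = -2 log L + 2p + 2p(p + 1) / (n - p - 1), as an assumption about what the AICC criterion computes; the helper name `aicc` is hypothetical and the action's exact definition should be checked against the SAS documentation.

```python
from itertools import product

n_clusters = [2, 3, 4, 5]
cov_structs = ["EEE", "EEV", "VVV"]

# Every (cluster count, covariance structure) pair is one candidate model.
candidates = list(product(n_clusters, cov_structs))
print(len(candidates))


def aicc(loglik, n_params, n_obs):
    """Standard corrected AIC; smaller values indicate a better trade-off."""
    correction = 2.0 * n_params * (n_params + 1) / (n_obs - n_params - 1)
    return -2.0 * loglik + 2.0 * n_params + correction
```

The correction term grows as the number of parameters approaches the number of observations, so AICc penalizes the richer structures such as VVV more heavily than plain AIC does on a table of only 200 rows.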
