The mbcFit action performs model-based clustering on a given data set using the Expectation-Maximization (EM) algorithm. This technique fits a mixture of Gaussian distributions to the data, where each distribution represents a distinct cluster. It is a powerful method for unsupervised classification, allowing for flexible cluster shapes and sizes based on the specified covariance structure.
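To make the EM idea concrete, here is a minimal Python sketch that fits a two-component, one-dimensional Gaussian mixture with EM. This is an illustration of the general technique only, not the mbcFit implementation; the data, initial values, and iteration count are arbitrary choices for the example.

```python
import numpy as np

# Illustrative EM for a two-component 1-D Gaussian mixture (not mbcFit).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(6.0, 1.0, 300)])

# Initial guesses for mixing weights, means, and variances.
w = np.array([0.5, 0.5])
mu = np.array([x.min(), x.max()])
var = np.array([1.0, 1.0])

for _ in range(100):
    # E-step: posterior probability (responsibility) of each component
    # for each observation, from the current parameter estimates.
    dens = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: re-estimate parameters from the responsibilities.
    nk = resp.sum(axis=0)
    w = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
```

With well-separated components like these, the estimated means should converge near the true values of 0 and 6.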
| Parameter | Description |
|---|---|
| table | Specifies the input data table for clustering. |
| model | Specifies the model for the analysis. This includes the effects, which are the input variables for clustering. |
| nClusters | Specifies the number of Gaussian clusters to fit. You can provide a single integer or a list of integers to test multiple cluster counts. |
| covStruct | Specifies the covariance model(s) to be used for the clusters. Different structures allow for different shapes, volumes, and orientations of clusters. You can provide a single structure or a list to test multiple models. |
| criterion | Specifies the model selection criterion (e.g., BIC, AIC) to determine the best model when multiple cluster counts or covariance structures are tested. |
| noise | Specifies whether to include a noise cluster in the model to capture observations that do not belong to any primary cluster. |
| seed | Specifies the random seed for initialization, ensuring reproducibility of the results. |
| output | Specifies the creation of an output table containing observation-wise results, such as cluster membership probabilities. |
| store | Specifies a location to save the fitted model as a binary object for later use with the `mbcScore` action. |
| maxIter | Sets the maximum number of iterations for the EM algorithm. |
| technique | Specifies the algorithm variant to use, either standard Expectation-Maximization (EM) or Classification EM (CEM). |
| initMethod | Defines the method for generating the initial cluster assignments, which can be random or based on k-means. |
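To show how a criterion such as BIC selects among candidate models when multiple cluster counts are tested, here is a hedged Python sketch. The log-likelihoods and parameter counts below are made-up illustrative numbers, not output from mbcFit; BIC is computed as k·ln(n) − 2·lnL, and the model with the lowest value wins.

```python
import numpy as np

# Hypothetical candidates: cluster count -> (log-likelihood, n_params).
# These numbers are invented for illustration only.
n = 200
candidates = {2: (-512.3, 11), 3: (-498.7, 17), 4: (-495.1, 23)}

# BIC = n_params * ln(n) - 2 * logL; smaller is better.
bic = {k: p * np.log(n) - 2 * ll for k, (ll, p) in candidates.items()}
best = min(bic, key=bic.get)
```

Here the extra parameters of the larger models outweigh their likelihood gains, so the two-cluster model has the lowest BIC.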
This example demonstrates how to create a CAS table named `my_cas_table` with synthetic data. The table contains four numeric variables generated from different distributions, making it suitable for a clustering task.
```sas
data my_cas_table;
   call streaminit(123);
   do i = 1 to 200;
      x1 = rand('NORMAL', 10, 2);
      x2 = rand('NORMAL', 20, 5);
      x3 = rand('NORMAL', 5, 1);
      x4 = rand('UNIFORM') * 15;
      output;
   end;
run;

proc casutil;
   load data=my_cas_table outcaslib="casuser" casout="my_cas_table" replace;
run;
```
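For readers working outside SAS, a rough NumPy analogue of the DATA step above follows. The distributions match, but NumPy's generator differs from SAS's streaminit, so the individual draws will not be identical.

```python
import numpy as np

# NumPy analogue of the SAS DATA step: 200 rows, four numeric variables.
rng = np.random.default_rng(123)
n = 200
data = np.column_stack([
    rng.normal(10, 2, n),   # x1 ~ N(10, 2)
    rng.normal(20, 5, n),   # x2 ~ N(20, 5)
    rng.normal(5, 1, n),    # x3 ~ N(5, 1)
    rng.uniform(0, 15, n),  # x4 ~ Uniform(0, 15)
])
```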
This example performs a simple model-based clustering on the `my_cas_table`. It analyzes the variables `x1` and `x2` and searches for the optimal number of clusters between 2 and 4 using the default BIC criterion and EEE covariance structure.
```sas
proc cas;
   mbc.mbcFit /
      table={name='my_cas_table'},
      model={effects={{vars={'x1', 'x2'}}}},
      nClusters={2, 3, 4},
      seed=12345;
run;
```
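A hedged Python analogue of this cluster-count search, using scikit-learn's GaussianMixture rather than the mbc action set: `covariance_type='tied'` plays a role loosely similar to the EEE structure (one shared covariance matrix across clusters), and BIC picks the winning k. The synthetic data here is invented for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Three well-separated 2-D clusters (illustrative data, not my_cas_table).
rng = np.random.default_rng(12345)
X = np.vstack([
    rng.normal([0, 0], 1, (150, 2)),
    rng.normal([8, 8], 1, (150, 2)),
    rng.normal([0, 8], 1, (150, 2)),
])

# Fit k = 2, 3, 4 and keep the BIC for each; lowest BIC wins.
bics = {k: GaussianMixture(n_components=k, covariance_type='tied',
                           random_state=0).fit(X).bic(X)
        for k in (2, 3, 4)}
best_k = min(bics, key=bics.get)
```

With clusters this well separated, BIC should select k = 3.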
This advanced example demonstrates fitting models with a range of cluster counts (2 to 5) and multiple covariance structures (EEE, EEV, VVV). It includes a noise component, uses the AICC for model selection, and saves both the fitted model to a store and the scored data to an output table.
```sas
proc cas;
   mbc.mbcFit /
      table={name='my_cas_table'},
      model={effects={{vars={'x1', 'x2', 'x3', 'x4'}}}},
      nClusters={2, 3, 4, 5},
      covStruct={'EEE', 'EEV', 'VVV'},
      noise='Y',
      criterion='AICC',
      store={name='mbc_model_store', replace=true},
      output={casOut={name='mbc_output_table', replace=true}, copyVars={'i'}},
      seed=54321;
run;
```
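The output table produced above holds observation-wise cluster membership probabilities. A hedged scikit-learn analogue of that output (again not the mbc action set; the data is invented for the example):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two illustrative 2-D clusters.
rng = np.random.default_rng(54321)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gm.predict_proba(X)  # per-observation membership probabilities
labels = gm.predict(X)       # hard assignment: most probable cluster
```

Each row of `probs` sums to 1, mirroring the membership-probability columns of the output table.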