The exploreCorrelation action, part of the dataSciencePilot action set, explores linear and nonlinear correlations among the variables in a table. It computes correlation statistics for pairs of variables and supports both interval and nominal variables. It can handle missing values, and you can configure the distinct count limit and whether the Misra-Gries estimation algorithm is used.

| Parameter | Description |
|---|---|
| binMissing | When set to True, includes missing values in the analysis. Default is False. |
| casOut | Specifies the CAS table to store the analysis results. Includes parameters for caslib, name, replace, promote, and memory management. |
| distinctCountLimit | Specifies the distinct count limit. When the limit is exceeded and misraGries is True, the Misra-Gries algorithm is used. Default is 10000. |
| ecdfTolerance | Specifies the tolerance value for the empirical cumulative distribution function used by the quantile sketch algorithm. Default is 0.001. |
| event | Specifies the target variable level to model. Used to recast a multilevel classification problem as a binary classification problem. |
| freq | Specifies the variable to use for frequency counts. |
| inputs | Specifies the variables to use for the analysis. You can specify a subset of the variables from the input table. |
| misraGries | When set to True, uses the Misra-Gries algorithm to estimate the frequency distribution when the distinct count limit is exceeded. Default is True. |
| nominals | Specifies the list of nominal variables. |
| stats | Specifies the correlation probes (statistics) to use for the comprehensive correlation analysis. Options include PEARSON, MI (Mutual Information), ENTROPY, etc. |
| table | Specifies the input table name, caslib, and other common parameters like where-clauses. |
| target | Specifies the target variable for the correlation analysis. |
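
For orientation, the probes named in the `stats` description have standard textbook definitions, sketched below. That the action uses exactly these forms, and that SU stands for symmetric uncertainty, are assumptions; how the action estimates the probabilities (for example, how interval variables are binned before MI is computed) is not described here.

$$
\begin{aligned}
r_{XY} &= \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i (x_i-\bar{x})^2}\,\sqrt{\sum_i (y_i-\bar{y})^2}} \\
H(X) &= -\sum_{x} p(x)\log p(x) \\
I(X;Y) &= \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)} = H(X)+H(Y)-H(X,Y) \\
SU(X,Y) &= \frac{2\,I(X;Y)}{H(X)+H(Y)}
\end{aligned}
$$

Pearson's $r_{XY}$ measures linear association only. Mutual information $I(X;Y)$ is zero exactly when the two variables are independent, so it also reflects nonlinear dependence, and symmetric uncertainty rescales it to the range 0 to 1.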

Generates a sample table with correlated interval and nominal variables for analysis. About 10% of the values of x are set to missing.

```sas
PROC CAS;
   /* Run a DATA step in CAS to build casuser.correlation_data:    */
   /*   x is standard normal, y is linear in x plus noise,          */
   /*   c is the sign of x, and x is then set to missing at random. */
   dataStep.runCode /
      code="data casuser.correlation_data;
               call streaminit(123);
               do i=1 to 1000;
                  x = rand('normal');
                  y = 2*x + rand('normal', 0, 0.5);
                  if x > 0 then c = 'Positive'; else c = 'Negative';
                  if rand('uniform') > 0.9 then call missing(x);
                  output;
               end;
            run;";
RUN;
```
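
Before tuning distinctCountLimit, it can help to see how many distinct values each column actually has. The sketch below uses the `simple.distinct` action on the sample table created above. It is only an illustration and is not part of exploreCorrelation itself, and it assumes the simple action set is available in the session.

```sas
proc cas;
   /* Report the number of distinct and missing values per column. */
   /* Assumption: casuser.correlation_data was created as shown.   */
   simple.distinct /
      table={name="correlation_data", caslib="casuser"};
run;
```

With only 1,000 rows, no column in this table can exceed the default limit of 10000, so the Misra-Gries setting should not matter for the sample data.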

Runs exploreCorrelation on the sample data, specifying only the input table and the target variable.

```sas
PROC CAS;
   dataSciencePilot.exploreCorrelation /
      table={name="correlation_data", caslib="casuser"},
      target="c";
RUN;
```
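
The sample DATA step does not drop its loop counter i, so the table contains that column as well, and a run without the inputs parameter would presumably analyze it too. A minimal sketch that restricts the analysis to x and y follows. It assumes that inputs accepts a plain list of column names, as CAS variable lists generally do.

```sas
proc cas;
   dataSciencePilot.exploreCorrelation /
      table={name="correlation_data", caslib="casuser"},
      target="c",
      /* Assumption: a simple list of names is accepted here. */
      inputs={"x", "y"};
run;
```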

Runs a more comprehensive analysis that includes missing values, requests specific statistics (Pearson, MI, and SU), and saves the results to a CAS table.

```sas
PROC CAS;
   dataSciencePilot.exploreCorrelation /
      table={name="correlation_data", caslib="casuser"},
      target="c",
      binMissing=true,     /* include missing values in the analysis */
      misraGries=false,    /* do not use Misra-Gries estimation */
      stats={
         intervalInterval={"PEARSON", "MI"},
         nominalInterval={"MI", "SU"}
      },
      /* Save the results to casuser.explore_results, replacing any existing table. */
      casOut={name="explore_results", caslib="casuser", replace=true};
RUN;
```
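
Once the results are saved through casOut, they can be inspected like any other CAS table. The sketch below uses the general-purpose `table.columnInfo` and `table.fetch` actions. It does not assume any particular column layout for the output table, since that layout is not described here.

```sas
proc cas;
   /* List the columns of the saved results table. */
   table.columnInfo /
      table={name="explore_results", caslib="casuser"};

   /* Display the first 20 rows of the results. */
   table.fetch /
      table={name="explore_results", caslib="casuser"},
      to=20;
run;
```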