dataSciencePilot

exploreCorrelation

Description

The exploreCorrelation action explores linear and nonlinear correlations among variables in a table. It computes various correlation statistics for pairs of variables, supporting both interval and nominal types. It can handle missing values and allows for configuring distinct count limits and estimation algorithms like Misra-Gries.

Settings
Parameter: Description
binMissing: When set to TRUE, missing values are included in the analysis. Default is FALSE.
casOut: Specifies the CAS table to store the analysis results. Includes parameters for caslib, name, replace, promote, and memory management.
distinctCountLimit: Specifies the limit on the number of distinct values. If the limit is exceeded and misraGries is enabled, the Misra-Gries algorithm is used. Default is 10000.
ecdfTolerance: Specifies the tolerance value for the empirical cumulative distribution function used by the quantile sketch algorithm. Default is 0.001.
event: Specifies the target variable level to model. Used for casting multilevel classification into binary classification.
freq: Specifies the variable to use for frequency counts.
inputs: Specifies the variables to use for the analysis. You can specify a subset of the variables from the input table.
misraGries: When set to TRUE, uses the Misra-Gries algorithm for frequency distribution estimation if the distinct count limit is exceeded. Default is TRUE.
nominals: Specifies the list of nominal variables.
stats: Specifies the correlation probes (statistics) to use for the comprehensive correlation analysis. Options include PEARSON, MI (Mutual Information), ENTROPY, etc.
table: Specifies the input table name, caslib, and other common parameters such as WHERE clauses.
target: Specifies the target variable for the correlation analysis.
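
As a rough illustration of how several of these settings combine, the sketch below restricts the analysis to two inputs and adjusts the distinct-count handling. It assumes the correlation_data sample table from the Data Preparation section, assumes that inputs accepts a plain list of variable names, and uses illustrative values rather than recommended ones.

```sas
PROC CAS;
   /* Sketch only: parameter values are illustrative assumptions */
   dataSciencePilot.exploreCorrelation /
      table={name="correlation_data", caslib="casuser"}
      target="c"
      inputs={"x", "y"}          /* analyze only x and y */
      distinctCountLimit=5000    /* lower than the default of 10000 */
      misraGries=true            /* estimate frequencies if the limit is exceeded */
      ecdfTolerance=0.01;        /* looser than the default of 0.001 */
RUN;
```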
Data Preparation
Create Sample Data

Generates a sample dataset with correlated interval and nominal variables for analysis.

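PROC CAS;
   /* Build casuser.correlation_data (1,000 rows):
      x - standard normal, set to missing in about 10% of rows
      y - 2*x plus normal noise (standard deviation 0.5)
      c - 'Positive' when x > 0, otherwise 'Negative' */
   dataStep.runCode /
      code="data casuser.correlation_data;
               call streaminit(123);
               do i=1 to 1000;
                  x = rand('normal');
                  y = 2*x + rand('normal', 0, 0.5);
                  if x > 0 then c = 'Positive'; else c = 'Negative';
                  if rand('uniform') > 0.9 then call missing(x);
                  output;
               end;
            run;";
RUN;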
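
Before running the analysis, it can help to confirm that the sample table looks as intended. The sketch below is one possible check, not part of the original page: it uses the simple.summary, simple.freq, and simple.correlation actions to report the missing count for x, the level counts for c, and the Pearson correlation between x and y. Under the generating model above, the population correlation is 2/sqrt(4.25), about 0.97, so the sample value should land near that figure.

```sas
PROC CAS;
   /* Count, missing count, mean, and standard deviation for x and y */
   simple.summary /
      table={name="correlation_data", caslib="casuser"}
      inputs={"x", "y"}
      subSet={"N", "NMISS", "MEAN", "STD"};

   /* Frequency of each level of the nominal variable c */
   simple.freq /
      table={name="correlation_data", caslib="casuser"}
      inputs={"c"};

   /* Pearson correlation between x and y */
   simple.correlation /
      table={name="correlation_data", caslib="casuser"}
      inputs={"x", "y"};
RUN;
```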

Examples

Runs exploreCorrelation on the sample data specifying only the table and the target variable.

SAS® / CAS Code (awaiting community validation)
PROC CAS;
   dataSciencePilot.exploreCorrelation /
      table={name="correlation_data", caslib="casuser"}
      target="c";
RUN;
Result:
Generates a correlation report ranking variables based on their correlation with the target 'c'.

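Runs a more detailed analysis that bins missing values, turns off Misra-Gries estimation, requests specific statistics (Pearson and MI for interval-interval pairs, MI and SU for nominal-interval pairs), and saves the results to a table.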

SAS® / CAS Code (awaiting community validation)
PROC CAS;
   dataSciencePilot.exploreCorrelation /
      table={name="correlation_data", caslib="casuser"}
      target="c"
      binMissing=true
      misraGries=false
      stats={
         intervalInterval={"PEARSON", "MI"},
         nominalInterval={"MI", "SU"}
      }
      casOut={name="explore_results", caslib="casuser", replace=true};
RUN;
Result:
Produces detailed correlation statistics using Pearson and Mutual Information, treating missing values as a distinct bin, and saves the output to 'explore_results'.
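
Once the results are saved, they can be inspected like any other CAS table. The sketch below assumes the previous example has run and that explore_results exists in casuser. It lists the columns and fetches the first rows without assuming any particular column names, since the layout of the output table is not described here.

```sas
PROC CAS;
   /* List the columns of the saved results table */
   table.columnInfo /
      table={name="explore_results", caslib="casuser"};

   /* Fetch up to the first 20 rows */
   table.fetch /
      table={name="explore_results", caslib="casuser"}
      to=20;
RUN;
```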

FAQ

What is the purpose of the exploreCorrelation action?
What does the binMissing parameter do?
How can I specify the output table for the analysis results?
What is the purpose of the distinctCountLimit parameter?
What is the misraGries parameter?
Which statistics can be used for interval-interval correlation?
Which statistics can be used for nominal-nominal correlation?
What does the singlePass parameter do?
What is the target parameter used for?