exploreCorrelation

Q: What is the purpose of the exploreCorrelation action?

The exploreCorrelation action is used to explore linear and nonlinear correlation among the variables.

Q: What does the binMissing parameter do?

When set to True, the binMissing parameter includes missing values in the analysis by treating them as a distinct bin. The default value is False.

Q: How can I specify the output table for the analysis results?

You can use the casOut parameter to specify the CAS table where the analysis results will be stored.
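As a minimal sketch of how casOut might be used, the call below writes the results to a table and then prints a few rows. It assumes the sample table casuser.correlation_data from the Data Preparation section exists; the output name explore_results and the 20-row limit are arbitrary illustrative choices.

```sas
proc cas;
   /* Load the action set in case the session has not done so already */
   loadactionset "dataSciencePilot";

   /* Store the correlation results in casuser.explore_results */
   dataSciencePilot.exploreCorrelation /
      table={name="correlation_data", caslib="casuser"}
      target="c"
      casOut={name="explore_results", caslib="casuser", replace=true};

   /* Print the first 20 rows of the stored results */
   table.fetch /
      table={name="explore_results", caslib="casuser"}
      to=20;
run;
quit;
```

Setting replace=true overwrites an existing table of the same name, which is convenient when rerunning the analysis.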

Q: What is the purpose of the distinctCountLimit parameter?

It caps the number of distinct values that are tracked for a variable. If this limit is exceeded and the misraGries parameter is set to True, the Misra-Gries frequency sketch algorithm is used to estimate the frequency distribution. The default value is 10000.

Q: What is the misraGries parameter?

When set to True (which is the default), the misraGries parameter enables the use of the Misra-Gries algorithm for frequency distribution estimation if the distinct count limit is exceeded.
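The interaction between distinctCountLimit and misraGries can be sketched as follows. This is an illustration under assumptions: it uses the sample table from the Data Preparation section, and the limit of 500 is an arbitrary value chosen to be lower than the number of distinct values in the continuous columns. The simple.distinct call is only there to show how many distinct values each column has before running the analysis.

```sas
proc cas;
   loadactionset "dataSciencePilot";

   /* Report the number of distinct and missing values per column */
   simple.distinct /
      table={name="correlation_data", caslib="casuser"};

   /* Lower the limit (500 is an arbitrary illustrative value) and keep */
   /* the Misra-Gries estimate enabled for columns that exceed it       */
   dataSciencePilot.exploreCorrelation /
      table={name="correlation_data", caslib="casuser"}
      target="c"
      distinctCountLimit=500
      misraGries=true;
run;
quit;
```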

Q: Which statistics can be used for interval-interval correlation?

The available statistics are ENTROPY, MI (Mutual Information, default), NORMMI (Normalized Mutual Information), PEARSON, and SU (Symmetric Uncertainty).

Q: Which statistics can be used for nominal-nominal correlation?

The available statistics are CHISQ, CRAMERSV, ENTROPY, G2, IV (Information Value), MI (default), NORMMI, and SU.
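For orientation, the textbook definitions behind several of the statistics named in the two answers above are given here. These are the standard forms, not a statement of how the action computes them: this page does not say how interval variables are binned, which logarithm base is used, or how NORMMI is normalized, so those details are left out.

```latex
% Entropy of a discrete variable X
H(X) = -\sum_{x} p(x)\,\log p(x)

% Mutual information (MI) between X and Y
I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}
       = H(X) + H(Y) - H(X,Y)

% Symmetric uncertainty (SU), which lies in [0, 1]
SU(X,Y) = \frac{2\,I(X;Y)}{H(X) + H(Y)}

% Cramer's V for an r-by-c contingency table with n observations
V = \sqrt{\frac{\chi^{2}}{n\,\min(r-1,\;c-1)}}
```

MI is zero when the two variables are independent, and SU rescales it by the two entropies so that values are comparable across variable pairs.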

Q: What does the singlePass parameter do?

When set to True, the singlePass parameter prevents the creation of a transient table on the server. This can be more efficient, but the row order of the data is not guaranteed to be stable across repeated runs. The default is False.

Q: What is the target parameter used for?

The target parameter specifies the target variable for the analysis.

Description

The exploreCorrelation action explores linear and nonlinear correlations among variables in a table. It computes various correlation statistics for pairs of variables, supporting both interval and nominal types. It can handle missing values and allows for configuring distinct count limits and estimation algorithms like Misra-Gries.

Settings

Parameter	Description
binMissing	When set to True, missing values are included in the analysis. Default is FALSE.
casOut	Specifies the CAS table to store the analysis results. Includes parameters for caslib, name, replace, promote, and memory management.
distinctCountLimit	Specifies the distinct count limit. If the limit is exceeded and misraGries is True, the Misra-Gries algorithm is used to estimate the frequency distribution. Default is 10000.
ecdfTolerance	Specifies the tolerance value for the empirical cumulative distribution function used by the quantile sketch algorithm. Default is 0.001.
event	Specifies the target variable level to model. Used for casting multilevel classification into binary classification.
freq	Specifies the variable to use for frequency counts.
inputs	Specifies the variables to use for the analysis. You can specify a subset of the variables from the input table.
misraGries	When set to True, uses the Misra-Gries algorithm for frequency distribution estimation if the distinct count limit is exceeded. Default is TRUE.
nominals	Specifies the list of nominal variables.
singlePass	When set to True, prevents the creation of a transient table on the server. Default is FALSE.
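stats	Specifies the correlation statistics to compute, grouped by variable-pair type (for example, intervalInterval and nominalInterval). Interval-interval options are ENTROPY, MI (Mutual Information), NORMMI, PEARSON, and SU; nominal-nominal options are CHISQ, CRAMERSV, ENTROPY, G2, IV, MI, NORMMI, and SU.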
table	Specifies the input table name, caslib, and other common parameters like where-clauses.
target	Specifies the target variable for the correlation analysis.
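Several of the settings above can be combined in one call. The sketch below assumes the sample table from the Data Preparation section and restricts the analysis to the columns x and y, declares c as nominal, and enables singlePass. The variable choices are illustrative only.

```sas
proc cas;
   loadactionset "dataSciencePilot";

   dataSciencePilot.exploreCorrelation /
      table={name="correlation_data", caslib="casuser"}
      target="c"
      inputs={"x", "y"}       /* analyze only these input columns         */
      nominals={"c"}          /* treat the target as a nominal variable   */
      singlePass=true;        /* skip the transient table on the server   */
run;
quit;
```

Listing inputs explicitly keeps helper columns, such as the loop counter i in the sample data, out of the analysis.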

Data Preparation

Create Sample Data

Generates a sample dataset with correlated interval and nominal variables for analysis.

PROC CAS;
   /* Run a DATA step in CAS that builds casuser.correlation_data */
   dataStep.runCode /
      code="data casuser.correlation_data;
               call streaminit(123);
               do i=1 to 1000;
                  x = rand('normal');
                  y = 2*x + rand('normal', 0, 0.5);
                  if x > 0 then c = 'Positive'; else c = 'Negative';
                  if rand('uniform') > 0.9 then call missing(x);
                  output;
               end;
            run;";
RUN;
QUIT;
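Before running the analysis, it can help to confirm that the table was created as expected. The following sketch uses two general-purpose CAS actions, table.tableInfo and simple.freq, to show the row count and the level counts of c. It assumes the DATA step above ran successfully.

```sas
proc cas;
   /* Show the row and column counts of the new table */
   table.tableInfo /
      caslib="casuser"
      name="correlation_data";

   /* Count how many rows fall into each level of c */
   simple.freq /
      table={name="correlation_data", caslib="casuser"}
      inputs={"c"};
run;
quit;
```

Because c is assigned before x is set to missing, every row has a value for c even when x is missing.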

Examples

Simple Correlation Exploration

Runs exploreCorrelation on the sample data, specifying only the table and the target variable.

SAS® / CAS Code (awaiting community validation)

PROC CAS;
   /* Load the action set if it is not already available in the session */
   loadActionSet "dataSciencePilot";
   dataSciencePilot.exploreCorrelation /
      table={name="correlation_data", caslib="casuser"}
      target="c";
RUN;
QUIT;

Result:
Generates a correlation report ranking variables based on their correlation with the target 'c'.

Advanced Correlation Analysis

Runs a comprehensive analysis that includes missing values, disables the Misra-Gries estimate, requests specific statistics (Pearson and MI for interval-interval pairs, MI and SU for nominal-interval pairs), and saves the results to a table.

SAS® / CAS Code (awaiting community validation)

PROC CAS;
   /* Load the action set if it is not already available in the session */
   loadActionSet "dataSciencePilot";
   dataSciencePilot.exploreCorrelation /
      table={name="correlation_data", caslib="casuser"}
      target="c"
      binMissing=true
      misraGries=false
      stats={
         intervalInterval={"PEARSON", "MI"},
         nominalInterval={"MI", "SU"}
      }
      casOut={name="explore_results", caslib="casuser", replace=true};
RUN;
QUIT;

Result:
Computes Pearson and Mutual Information for interval-interval pairs and Mutual Information and Symmetric Uncertainty for nominal-interval pairs, treats missing values as a distinct bin, and saves the output to 'explore_results'.

Related Actions

dataSciencePilot.analyzeMissingPatterns
The analyzeMissingPatterns action performs a missing pattern analysis. It is ...

dataSciencePilot.exploreData
The exploreData action performs data exploration, automatic variable analysis...

dataSciencePilot.featureMachine
The featureMachine action is an automated feature transformation and generati...

dataSciencePilot.generateShadowFeatures
Generate shadow features.