dataSciencePilot

exploreData

Description

The exploreData action performs data exploration, automatic variable analysis, and grouping using comprehensive statistical profiling of variables. It calculates various statistics such as cardinality, entropy, kurtosis, missing values, and skewness to profile the data. This action is essential for understanding data structure and quality before proceeding with advanced modeling or feature engineering steps.

Settings
Parameter            Description
casOut               Specifies the CAS table to store the analysis results.
distinctCountLimit   Specifies the distinct count limit. Default is 10000. If exceeded, the Misra-Gries algorithm may be used.
ecdfTolerance        Specifies the tolerance value for the empirical cumulative distribution function (ECDF).
event                Specifies the target variable level that you want to model for classification problems.
explorationPolicy    Specifies the automatic variable analysis and grouping (AVAPT) policy settings for cardinality, entropy, outliers, etc.
freq                 Specifies the variable used for frequency counts.
inputs               Specifies the variables to use for the analysis.
misraGries           Specifies whether to use the Misra-Gries algorithm for frequency estimation if the distinct count limit is exceeded. Default is TRUE.
table                Specifies the input table name, caslib, and other data access options.
target               Specifies the target variable for the analysis.
weight               Specifies the variable to use for weighting observations.
Data Preparation
Load Sample Data

Load the SASHelp 'Cars' dataset into the active CAS session for analysis.

/* Copy SASHELP.CARS into an in-memory CAS table named "cars".      */
/* Assumes a CAS session named casauto is already running.          */
PROC CASUTIL SESSREF=casauto;
   LOAD DATA=sashelp.cars CASOUT="cars" REPLACE;
QUIT;
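
Before running exploreData, it can help to confirm that the table actually landed in the session. The following is a minimal sketch, assuming the table was loaded as "cars" into the active caslib; it relies only on the standard table action set and is not part of the exploreData action itself.

PROC CAS;
   /* Confirm the in-memory table exists and report its size */
   table.tableInfo / name="cars";
   /* List column names and types so that target and inputs can be chosen */
   table.columnInfo / table={name="cars"};
RUN;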

Examples

Performs a basic exploration of the 'cars' table to profile all variables and stores the result in 'explore_results'.

SAS® / CAS Code (awaiting community validation)
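PROC CAS;
   /* Load the action set in case it is not already available in the session */
   loadactionset "dataSciencePilot";
   /* Profile every variable in 'cars' and write the result to 'explore_results' */
   dataSciencePilot.exploreData /
      TABLE={name="cars"}
      casOut={name="explore_results", replace=true};
RUN;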
Result:
A CAS table 'explore_results' containing statistical profiles for all variables in the 'cars' dataset.
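
To look at what the action wrote, the output table can be inspected like any other CAS table. This is a sketch that assumes the 'explore_results' table from the example above exists in the active caslib. The layout of the output is not described here, so the sketch lists the columns first rather than assuming any column names; the row limit of 20 is an arbitrary choice.

PROC CAS;
   /* Show which columns (statistics) the output table contains */
   table.columnInfo / table={name="explore_results"};
   /* Print the first 20 rows of the variable profiles */
   table.fetch / table={name="explore_results"} to=20;
RUN;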

Explores the 'cars' table with 'Origin' as the target variable. It customizes the exploration policy for skewness and missing values.

SAS® / CAS Code (awaiting community validation)
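PROC CAS;
   /* Load the action set in case it is not already available in the session */
   loadactionset "dataSciencePilot";
   dataSciencePilot.exploreData /
      TABLE={name="cars"}
      target="Origin"
      /* Override the default cutoffs used to grade skewness and missingness */
      explorationPolicy={
         skewness={momentMediumHighCutoff=5},
         missing={lowMediumCutoff=10}
      }
      casOut={name="detailed_explore", replace=true};
RUN;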
Result:
A 'detailed_explore' table with variable profiles, where evaluation considers the 'Origin' target and uses custom thresholds for skewness and missingness.
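
The inputs, event, distinctCountLimit, and misraGries settings listed under Settings can be combined in the same way. The following is an unvalidated sketch rather than a confirmed recipe: the list-of-names form used for inputs, the limit of 5000, and the output table name 'subset_explore' are assumptions chosen for illustration. 'Asia' is one of the levels of Origin in SASHELP.CARS.

PROC CAS;
   loadactionset "dataSciencePilot";
   dataSciencePilot.exploreData /
      TABLE={name="cars"}
      target="Origin"
      /* Assumption: treat 'Asia' as the event level of the target */
      event="Asia"
      /* Assumption: restrict profiling to a handful of variables */
      inputs={{name="Make"}, {name="Type"}, {name="MSRP"}, {name="Horsepower"}}
      /* Lower the distinct count limit from its default of 10000 */
      distinctCountLimit=5000
      /* Keep Misra-Gries frequency estimation on (the default) */
      misraGries=true
      casOut={name="subset_explore", replace=true};
RUN;

Lowering distinctCountLimit mainly matters for high-cardinality variables; in a small table such as 'cars' the limit is unlikely to be reached, so the setting is shown here only to illustrate where it goes.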

FAQ

What is the purpose of the exploreData action?
What does the casOut parameter specify?
What is the function of the distinctCountLimit parameter?
What does the ecdfTolerance parameter control?
How is the event parameter used?
What is defined by the explorationPolicy parameter?
What does the freq parameter specify?
What is the inputs parameter used for?
What does the misraGries parameter do?
What information is provided by the table parameter?
What does the target parameter specify?
What is the purpose of the weight parameter?