dataSciencePilot

analyzeMissingPatterns

Description

The analyzeMissingPatterns action performs a missing pattern analysis. It is a part of the Data Science Pilot action set, designed to automate and enhance data science workflows. This action is particularly useful in the exploratory data analysis phase to understand the extent and nature of missing data, which is crucial for subsequent modeling steps. It can identify different patterns of missingness across variables and analyze their relationship with a target variable, helping to decide on an appropriate imputation strategy.
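To make the idea of a "missing pattern" concrete, here is a minimal conceptual sketch in plain Python. It is not the action's implementation and does not call CAS. The rows, the `missing_pattern` helper, and the `'0'`/`'1'` string encoding are all assumptions made for illustration: each row is reduced to a string that marks which variables are missing, and rows are then counted per pattern.

```python
from collections import Counter

# Hypothetical illustration data; None marks a missing value.
# Column order: var1, var2, var3, var4.
ROWS = [
    (1, 10, "A", 100),
    (2, None, "B", 200),
    (3, 30, None, 300),
    (4, 40, "C", None),
    (5, None, "D", 500),
    (6, 60, None, 600),
    (7, 70, "E", None),
]


def missing_pattern(row):
    """Return a string such as '0100': '1' where a value is missing, '0' otherwise."""
    return "".join("1" if value is None else "0" for value in row)


def count_patterns(rows):
    """Count how many rows share each missingness pattern."""
    return Counter(missing_pattern(row) for row in rows)


patterns = count_patterns(ROWS)
for pattern, count in sorted(patterns.items()):
    print(pattern, count)
```

In this toy data only one row is complete (`0000`), and each of the other three patterns has exactly one variable missing. A pattern table like this is what guides the choice between dropping rows, imputing per variable, or modeling missingness explicitly.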

dataSciencePilot.analyzeMissingPatterns / table={...} casOut={...} <inputs={{...}, ...}> <nominals={"variable-name-1", ...}> <target="variable-name"> <freq="variable-name"> <distinctCountLimit=integer> <ecdfTolerance=double> <misraGries=TRUE | FALSE>;
Settings
Parameter: Description
table: Specifies the input table for the analysis. This table should contain the data for which you want to analyze missing value patterns.
casOut: Specifies the output table to store the results of the missing pattern analysis. This is a required parameter.
inputs: Specifies the list of variables to be included in the analysis. If not specified, all numeric and character variables from the input table are used.
nominals: Specifies which of the input variables should be treated as nominal (categorical). This affects how statistics are calculated for these variables.
target: Specifies a target variable. When a target is provided, the action analyzes the relationship between missing value patterns and the target variable's distribution.
freq: Specifies a frequency variable. Each observation in the input table is treated as if it appears n times, where n is the value of the frequency variable for that observation.
distinctCountLimit: Sets a limit on the number of distinct values for frequency counting. If this limit is exceeded, the action may switch to an estimation algorithm (Misra-Gries) or abort.
ecdfTolerance: Specifies the tolerance for the empirical cumulative distribution function (ECDF), used by the quantile sketch algorithm for robust statistics.
misraGries: When set to TRUE, enables the use of the Misra-Gries algorithm for frequency estimation if the distinct count limit is surpassed.
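
The Misra-Gries algorithm named in the distinctCountLimit and misraGries settings can be sketched as follows. This is the textbook version of the algorithm in plain Python, offered as an illustration only. How the action implements it internally, and how distinctCountLimit relates to the counter budget `k` used here, is not stated on this page, so treat `k` and the function name `misra_gries` as assumptions.

```python
def misra_gries(stream, k):
    """Approximate frequent-item counts using at most k - 1 counters.

    Each returned count is a lower bound on the true count and
    undercounts it by at most len(stream) / k.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop the ones that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


stream = ["a", "b", "a", "c", "a", "d", "a", "b", "a"]
summary = misra_gries(stream, k=3)
print(summary)
```

The point of the design is bounded memory: the number of counters never exceeds `k - 1` no matter how many distinct values the column holds, at the cost of approximate counts for the values that survive.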
Data Preparation
Creating a Sample Dataset with Missing Values

This SAS code creates a sample dataset named 'sample_data_missing' in the 'casuser' caslib. The dataset includes several variables with intentionally placed missing values (.) to demonstrate how the analyzeMissingPatterns action works.

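/* Assumes the casuser libref is already assigned to a CAS engine library. */
DATA casuser.sample_data_missing;
   /* var3 is character ($); all other variables are numeric. */
   INPUT var1 var2 var3 $ var4 target;
   CARDS;
1 10 A 100 1
2 . B 200 0
3 30 . 300 1
4 40 C . 0
5 . D 500 1
6 60 . 600 0
7 70 E . 1
;
RUN;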

Examples

This example performs a fundamental missing pattern analysis on the 'sample_data_missing' table. The results, which include tables detailing missing counts and patterns, are saved to a CAS table named 'missing_patterns_summary'.

SAS® / CAS Code
PROC CAS;
   /* Load the action set before calling its actions. */
   loadactionset 'dataSciencePilot';
   dataSciencePilot.analyzeMissingPatterns /
      table={name='sample_data_missing'},
      casOut={name='missing_patterns_summary', replace=true};
RUN;
QUIT;
Result:
The action generates several result tables. The 'MissingPatterns' table outlines the different combinations of missing values found. The 'MissingCounts' table provides the count and percentage of missing values for each variable. 'NumVarInfo' and 'CharVarInfo' provide descriptive statistics for numeric and character variables, respectively.
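
As a rough illustration of the per-variable count and percentage of missing values described above, the following plain-Python sketch computes both for a small hypothetical dataset. It does not reproduce the action's output tables. The optional `weights` argument is an assumption that mimics the idea behind the freq parameter, where a row with weight n counts as n observations.

```python
# Hypothetical illustration data; None marks a missing value.
COLUMNS = ["var1", "var2", "var3", "var4", "target"]
ROWS = [
    (1, 10, "A", 100, 1),
    (2, None, "B", 200, 0),
    (3, 30, None, 300, 1),
    (4, 40, "C", None, 0),
    (5, None, "D", 500, 1),
    (6, 60, None, 600, 0),
    (7, 70, "E", None, 1),
]


def missing_summary(rows, columns, weights=None):
    """Return {column: (missing_count, missing_percent)}; each row weighs 1 by default."""
    if weights is None:
        weights = [1] * len(rows)
    total = sum(weights)
    summary = {}
    for index, name in enumerate(columns):
        missing = sum(w for row, w in zip(rows, weights) if row[index] is None)
        summary[name] = (missing, 100.0 * missing / total)
    return summary


summary = missing_summary(ROWS, COLUMNS)
for name, (count, percent) in summary.items():
    print(f"{name}: {count} missing ({percent:.1f}%)")

# Same data, but the second row is treated as if it appeared three times.
weighted = missing_summary(ROWS, COLUMNS, weights=[1, 3, 1, 1, 1, 1, 1])
```

With the weights applied, the second row contributes three observations to both the missing count for var2 and the overall total, which is the behavior the freq description implies.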

This detailed example analyzes missing value patterns in relation to a specific target variable ('target'). It explicitly declares 'var3' as a nominal variable and restricts the analysis to a specified list of input variables. This approach helps show whether the missingness of the data is associated with the outcome variable.

SAS® / CAS Code
PROC CAS;
   /* Load the action set before calling its actions. */
   loadactionset 'dataSciencePilot';
   dataSciencePilot.analyzeMissingPatterns /
      table={name='sample_data_missing'},
      inputs={{name='var1'}, {name='var2'}, {name='var3'}, {name='var4'}},
      nominals={'var3'},
      target='target',
      casOut={name='missing_patterns_details', replace=true};
RUN;
QUIT;
Result:
The output includes several analytical tables. Notably, the 'TargetCounts' table shows the frequency distribution of the target variable for each identified missing value pattern. The 'TargetMeans' table provides the mean of the target variable for each pattern. These results help assess whether missingness is associated with the target, which is evidence that the data are not Missing Completely At Random (MCAR). They cannot by themselves establish Missing Not At Random (MNAR), because MNAR depends on the unobserved values.
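
The per-pattern target mean described above can be sketched in plain Python as follows. This is a conceptual illustration on hypothetical rows, not the action's implementation or its output format. The helper name `target_mean_by_pattern` and the `'0'`/`'1'` pattern encoding are assumptions.

```python
from collections import defaultdict

# Hypothetical illustration data: ((var1, var2, var3, var4), target); None marks a missing value.
ROWS = [
    ((1, 10, "A", 100), 1),
    ((2, None, "B", 200), 0),
    ((3, 30, None, 300), 1),
    ((4, 40, "C", None), 0),
    ((5, None, "D", 500), 1),
    ((6, 60, None, 600), 0),
    ((7, 70, "E", None), 1),
]


def target_mean_by_pattern(rows):
    """Return {pattern: (row_count, mean_target)} for each missingness pattern."""
    totals = defaultdict(lambda: [0, 0.0])  # pattern -> [row count, target sum]
    for inputs, target in rows:
        pattern = "".join("1" if value is None else "0" for value in inputs)
        totals[pattern][0] += 1
        totals[pattern][1] += target
    return {p: (n, s / n) for p, (n, s) in totals.items()}


result = target_mean_by_pattern(ROWS)
for pattern, (count, mean) in sorted(result.items()):
    print(pattern, count, mean)
```

In this toy data every pattern with a missing value has the same target mean, and with seven rows no difference would be meaningful anyway. On real data, a pattern whose target mean departs clearly from the complete-case mean is a signal that missingness carries information about the outcome.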

FAQ

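What is the purpose of the analyzeMissingPatterns action?
It performs a missing pattern analysis: it identifies patterns of missingness across variables and, when a target is given, relates them to the target variable.

What are the required parameters for the analyzeMissingPatterns action?
The table and casOut parameters. All other parameters shown in the syntax are optional.

How can I specify which variables to analyze for missing patterns?
Use the inputs parameter. If it is omitted, all numeric and character variables from the input table are used.

What is the role of the 'target' parameter?
It names a target variable so that the action analyzes the relationship between missing value patterns and the target's distribution.

How does the action handle variables with a high number of unique values?
The distinctCountLimit parameter caps the number of distinct values used for frequency counting. If the limit is exceeded, the action may switch to Misra-Gries estimation (enabled with misraGries=TRUE) or abort.

Can I incorporate observation frequencies into the analysis?
Yes. Use the freq parameter; each observation is treated as if it appears n times, where n is the value of the frequency variable.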

Associated Scenarios

Use Case
Standard Case: Customer Churn Prediction - Impact of Missing Demographic Data

A telecom company wants to understand why customers are churning. They suspect that missing information in their CRM (like income or age) is not random and might be correlated w...

Use Case
Performance Case: Analyzing Missing Sensor Readings in High-Volume IoT Data

A manufacturing plant uses thousands of IoT sensors to monitor equipment. Some sensors intermittently fail to report data. The goal is to analyze these missing data patterns at ...

Use Case
Edge Case: Handling Dropouts and Sparse Data in Clinical Trial Analysis

In a clinical trial, patients may drop out, leading to missing data for all subsequent visits. One specific biomarker is experimental and fails to record for most patients. This...