dataSciencePilot

analyzeMissingPatterns

At a glance
The analyzeMissingPatterns action is a diagnostic tool for checking data quality before building machine learning pipelines. Beyond counting nulls, it shows which variables tend to be missing together, revealing dependencies and distinct clusters of sparsity across the dataset. Analysts can use these patterns to investigate the likely cause of data loss and to choose a suitable imputation or feature engineering technique. The FAQ section on this page covers how to configure the analysis parameters and interpret the resulting pattern tables.

Description

The analyzeMissingPatterns action performs a missing pattern analysis. It is a part of the Data Science Pilot action set, designed to automate and enhance data science workflows. This action is particularly useful in the exploratory data analysis phase to understand the extent and nature of missing data, which is crucial for subsequent modeling steps. It can identify different patterns of missingness across variables and analyze their relationship with a target variable, helping to decide on an appropriate imputation strategy.

dataSciencePilot.analyzeMissingPatterns /
   table={...}
   casOut={...}
   <inputs={{...}, ...}>
   <nominals={"variable-name-1", ...}>
   <target="variable-name">
   <freq="variable-name">
   <distinctCountLimit=integer>
   <ecdfTolerance=double>
   <misraGries=TRUE | FALSE>;

Settings
Parameter: Description
table: Specifies the input table for the analysis. This table should contain the data for which you want to analyze missing value patterns.
casOut: Specifies the output table to store the results of the missing pattern analysis. This is a required parameter.
inputs: Specifies the list of variables to be included in the analysis. If not specified, all numeric and character variables from the input table are used.
nominals: Specifies which of the input variables should be treated as nominal (categorical). This affects how statistics are calculated for these variables.
target: Specifies a target variable. When a target is provided, the action analyzes the relationship between missing value patterns and the target variable's distribution.
freq: Specifies a frequency variable. Each observation in the input table is treated as if it appears n times, where n is the value of the frequency variable for that observation.
distinctCountLimit: Sets a limit on the number of distinct values for frequency counting. If this limit is exceeded, the action may switch to an estimation algorithm (Misra-Gries) or abort.
ecdfTolerance: Specifies the tolerance for the empirical cumulative distribution function (ECDF), used by the quantile sketch algorithm for robust statistics.
misraGries: When set to TRUE, enables the use of the Misra-Gries algorithm for frequency estimation if the distinct count limit is surpassed.
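
The optional parameters for weighted and high-cardinality data can be combined in one call. The following is a minimal sketch, not a validated recipe: the table name 'events', the frequency column 'n_obs', the output table name, and the limit of 10000 are all hypothetical placeholders chosen for illustration, and the limit is not a documented default.

```sas
proc cas;
   /* Hypothetical input: a table 'events' with a count column 'n_obs' */
   dataSciencePilot.analyzeMissingPatterns /
      table={name='events'},
      freq='n_obs',               /* each row is treated as n_obs observations */
      distinctCountLimit=10000,   /* illustrative value only */
      misraGries=true,            /* estimate frequencies if the limit is exceeded */
      casOut={name='events_missing_patterns', replace=true};
run;
quit;
```

With misraGries enabled, the parameter descriptions above suggest the action estimates frequencies rather than aborting when a variable has more distinct values than the limit allows.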
Data Preparation
Creating a Sample Dataset with Missing Values

This SAS code creates a sample dataset named 'sample_data_missing' in the 'casuser' caslib. The dataset includes several variables with intentionally placed missing values (.) to demonstrate how the analyzeMissingPatterns action works.
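
The DATA step that follows writes through a libref named casuser. If your environment has not already started a CAS session and assigned that libref, a setup along these lines is typically needed first. This is a sketch under that assumption; the session name mysess is an arbitrary placeholder.

```sas
/* Start a CAS session; the name 'mysess' is arbitrary */
cas mysess;

/* Bind the libref 'casuser' to the Casuser caslib so that DATA steps can write to it */
libname casuser cas caslib=casuser;
```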

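/* Requires a libref named casuser that points to a CAS caslib */
DATA casuser.sample_data_missing;
   /* var3 is the character variable; all other variables are numeric */
   INPUT var1 var2 var3 $ var4 target;
   CARDS;
1 10 A 100 1
2 . B 200 0
3 30 . 300 1
4 40 C . 0
5 . D 500 1
6 60 . 600 0
7 70 E . 1
;
RUN;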

Examples

This example performs a fundamental missing pattern analysis on the 'sample_data_missing' table. The results, which include tables detailing missing counts and patterns, are saved to a CAS table named 'missing_patterns_summary'.

SAS® / CAS Code (awaiting community validation)
PROC CAS;
   /* Analyze all variables in the sample table and save the results */
   dataSciencePilot.analyzeMissingPatterns /
      table={name='sample_data_missing', caslib='casuser'},
      casOut={name='missing_patterns_summary', replace=true};
RUN;
QUIT;
Result:
The action generates several result tables. The 'MissingPatterns' table outlines the different combinations of missing values found. The 'MissingCounts' table provides the count and percentage of missing values for each variable. 'NumVarInfo' and 'CharVarInfo' provide descriptive statistics for numeric and character variables, respectively.
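
To look at what the action actually wrote, the output table can be inspected with general-purpose actions from the table action set. This is a minimal sketch, assuming it runs in the same CAS session and that the output table landed in the active caslib; the row cap of 20 is arbitrary.

```sas
proc cas;
   /* List the columns of the output table written by casOut */
   table.columnInfo / table={name='missing_patterns_summary'};

   /* Print the first rows; 'to' caps the number of rows fetched */
   table.fetch / table={name='missing_patterns_summary'}, to=20;
run;
quit;
```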

This detailed example analyzes missing value patterns in relation to a specific target variable ('target'). It explicitly defines 'var3' as a nominal variable and focuses the analysis on a specified list of input variables. This approach helps in understanding if the missingness of data is correlated with the outcome variable.

SAS® / CAS Code (awaiting community validation)
PROC CAS;
   /* Restrict the analysis to four inputs and relate patterns to the target */
   dataSciencePilot.analyzeMissingPatterns /
      table={name='sample_data_missing', caslib='casuser'},
      inputs={{name='var1'}, {name='var2'}, {name='var3'}, {name='var4'}},
      nominals={'var3'},
      target='target',
      casOut={name='missing_patterns_details', replace=true};
RUN;
QUIT;
Result:
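The output includes several analytical tables. Notably, the 'TargetCounts' table shows the frequency distribution of the target variable for each identified missing value pattern, and the 'TargetMeans' table provides the mean of the target variable for each pattern. If the target distribution differs noticeably between patterns, the missingness is informative with respect to the target, so the data are unlikely to be Missing Completely At Random (MCAR). Distinguishing Missing At Random (MAR) from Missing Not At Random (MNAR) additionally requires domain knowledge, because it depends on the unobserved values themselves.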
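
As an independent cross-check of the pattern-level target statistics, the same idea can be reproduced by hand with Base SAS. This sketch is not how the action computes its results. It encodes each row's missingness as a string of 0/1 flags (one per input variable, in the order var1 to var4) and summarizes the target within each pattern. The work table and the pattern column are names introduced here for illustration.

```sas
/* Encode the missingness of var1-var4 as a 4-character pattern, e.g. 0100 */
data work.pattern_target;
   set casuser.sample_data_missing;
   length pattern $4;
   /* missing() returns 1 for a missing value and 0 otherwise, for numeric and character variables */
   pattern = cats(missing(var1), missing(var2), missing(var3), missing(var4));
run;

/* Count of rows and mean of the target for each missingness pattern */
proc means data=work.pattern_target n mean;
   class pattern;
   var target;
run;
```

On a table this small the per-pattern means are only illustrative; on real data, large differences between patterns are the signal worth investigating.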

FAQ

What is the purpose of the analyzeMissingPatterns action?
What are the required parameters for the analyzeMissingPatterns action?
How can I specify which variables to analyze for missing patterns?
What is the role of the 'target' parameter?
How does the action handle variables with a high number of unique values?
Can I incorporate observation frequencies into the analysis?
