dataSciencePilot analyzeMissingPatterns

Edge Case: Handling Dropouts and Sparse Data in Clinical Trial Analysis

Test Scenario & Use Case

Business Context

In a clinical trial, patients may drop out, leading to missing data for all subsequent visits. One specific biomarker is experimental and fails to record for most patients. This scenario tests how the action handles systematic missingness (dropout patterns) and variables that are almost entirely empty.
About the Set: dataSciencePilot

Automated Machine Learning (AutoML) and pipeline generation.

Data Preparation

Create a patient dataset from a clinical trial. The variables 'visit2_score', 'visit3_score', and 'experimental_marker' contain many missing values due to patient dropout or measurement failure.

/* Six patients; a lone '.' in the data lines is a missing numeric value */
DATA casuser.clinical_trial_data;
  LENGTH patient_id $ 8 treatment_arm $ 1;
  INPUT patient_id $ treatment_arm $ visit1_score visit2_score visit3_score experimental_marker;
  CARDS;
P001 A 85 82 79 1.2
P002 B 76 70 . .
P003 A 91 . . .
P004 B 65 66 68 .
P005 A 88 85 . .
P006 B 72 . . .
;
RUN;
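
As a quick sanity check on the data just created, the per-variable missing counts can be tallied with PROC MEANS. This is a minimal sketch that is not part of the original test steps; it assumes the DATA step above ran successfully and that the casuser libref is assigned in the session.

```sas
/* Sketch: count present (N) and missing (NMISS) values for each numeric variable. */
/* Assumes the casuser libref from the DATA step above is available.               */
proc means data=casuser.clinical_trial_data n nmiss;
  var visit1_score visit2_score visit3_score experimental_marker;
run;
```

With the six rows above, experimental_marker should show N=1 and NMISS=5 (5 of 6, about 83.3%), visit3_score N=2 and NMISS=4, and visit2_score N=4 and NMISS=2.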

Implementation Steps

1. Load the clinical trial data into CAS.

PROC CASUTIL;
  load DATA=casuser.clinical_trial_data outcaslib='casuser' casout='clinical_trial_data' replace;
RUN;
QUIT;

2. Execute a basic analysis to identify all missing value patterns. No target is specified.

PROC CAS;
  dataSciencePilot.analyzeMissingPatterns /
    TABLE={name='clinical_trial_data', caslib='casuser'},
    casOut={name='clinical_missing_edge_case', caslib='casuser', replace=true};
RUN;
QUIT;
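
To look at what the action wrote to the casOut table, the standard table.columnInfo and table.fetch actions can be used. This is a hedged sketch: the source does not describe the column layout of the output table, so the code only lists the columns and prints the first rows without assuming any column names. The row limit of 20 is an arbitrary choice.

```sas
/* Sketch: inspect the output table produced in step 2.           */
/* No column names are assumed; columnInfo lists whatever exists. */
proc cas;
  table.columnInfo /
    table={name='clinical_missing_edge_case', caslib='casuser'};
  table.fetch /
    table={name='clinical_missing_edge_case', caslib='casuser'},
    to=20;
run;
quit;
```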

Expected Result


The 'MissingPatterns' output table should identify each distinct pattern of missingness. A key pattern to verify is the one where 'visit2_score', 'visit3_score', and 'experimental_marker' are all missing, which represents early dropout (2 occurrences: P003 and P006). A second pattern has only 'visit3_score' and 'experimental_marker' missing (2 occurrences: P002 and P005). The 'MissingCounts' table should report a very high percentage of missing values for 'experimental_marker' (5 of 6 rows, 83.3%), which tests how the action handles sparse variables.
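
The expected pattern counts can be cross-checked independently of the action with a DATA step and PROC FREQ. This is a minimal sketch, not the action's own method: the table name work.pattern_check and the variable pattern are hypothetical names introduced for illustration, and the encoding (one character per variable, 1 for missing and 0 for present) is an assumption of this sketch rather than the format the action uses.

```sas
/* Sketch: build a per-row missingness signature and tabulate it.        */
/* work.pattern_check and pattern are illustrative names, not from the   */
/* original example.                                                     */
data work.pattern_check;
  set casuser.clinical_trial_data;
  length pattern $ 4;
  /* MISSING() returns 1 when the value is missing, 0 otherwise.         */
  /* Character order: visit1, visit2, visit3, experimental_marker.       */
  pattern = cats(missing(visit1_score), missing(visit2_score),
                 missing(visit3_score), missing(experimental_marker));
run;

proc freq data=work.pattern_check;
  tables pattern / nocum;
run;
```

For the six patients in this dataset, the frequency table should list 0000 once (P001), 0001 once (P004), 0011 twice (P002, P005), and 0111 twice (P003, P006), which matches the two dropout patterns described above.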