Edge Case: Handling Missing Data in Clinical Trial Analysis
Test Scenario & Use Case
Business Context
A pharmaceutical research company is analyzing patient data from a clinical trial to predict patient recovery. The dataset is incomplete, with missing values for a key biomarker and patient-reported prior conditions. This scenario tests the model's robustness and the different strategies for handling missing data.
Create a patient dataset in which 'Biomarker_Level' and 'Prior_Conditions' have missing values introduced at random, in about 15% and 20% of rows respectively. The target is 'Recovered' (1/0).
data clinical_trial_data;
   call streaminit(42);
   do PatientID = 1 to 500;
      Age = 30 + floor(rand('Uniform') * 45);
      Biomarker_Level = rand('Normal', 100, 15);
      Prior_Conditions = floor(rand('Uniform') * 4);

      /* Simulate the outcome from the complete values first; computing it
         after the predictors are set to missing would make logit_p, p,
         and therefore Recovered missing as well */
      logit_p = -3 + (Age/25) + (Biomarker_Level/50) - (Prior_Conditions * 0.4);
      p = 1 / (1 + exp(-logit_p));
      Recovered = rand('Bernoulli', p);

      /* Introduce missing values in the predictors */
      if rand('Uniform') < 0.15 then call missing(Biomarker_Level);
      if rand('Uniform') < 0.20 then call missing(Prior_Conditions);
      output;
   end;
run;

/* Copy the table to CAS (assumes a casuser libref is already assigned) */
data casuser.clinical_trial_data;
   set clinical_trial_data;
run;
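Before running the model, it can help to confirm that the missing values landed where intended. The following is a minimal sketch, assuming the WORK copy of clinical_trial_data created by the data preparation step is still available. PROC MEANS reports the non-missing (N) and missing (NMISS) counts per variable, and the DATA _NULL_ step counts the rows that have no missing predictor. Recovered is included in the PROC MEANS call so that any missing target values show up too.

```sas
/* Per-variable counts of non-missing and missing values */
proc means data=clinical_trial_data n nmiss;
   var Age Biomarker_Level Prior_Conditions Recovered;
run;

/* Count rows with no missing predictor (the complete cases) */
data _null_;
   set clinical_trial_data end=last;
   /* The comparison yields 1 for a complete row and 0 otherwise */
   complete + (nmiss(Age, Biomarker_Level, Prior_Conditions) = 0);
   if last then put 'Complete-case rows: ' complete 'of ' _n_;
run;
```

The complete-case count written to the log is the number of rows a run that drops incomplete observations would be expected to keep, provided the target itself has no missing values.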
Run bart.bartProbit using the default missing value handling ('SEPARATE').
proc cas;
   loadActionSet 'bart';   /* Make the bart action set available in the session */
   bart.bartProbit /
      table={name='clinical_trial_data'},
      target='Recovered',
      inputs={'Age', 'Biomarker_Level', 'Prior_Conditions'},
      missing='SEPARATE',   /* Default, but explicit for clarity */
      minLeafSize=3,        /* Test with a small leaf size */
      nTree=50,
      nBI=100,
      nMC=500,
      seed=42,
      outputTables={names={NObs='nobs_separate'}};
run;
quit;
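The expected result compares this run with a second one that uses missing='NONE', which is not reproduced in this section. The following is a minimal sketch of that second step, assuming it differs from the first call only in the missing option and in the output table name (nobs_none, the name used in the expected result). The two table.fetch calls are an added convenience for viewing the observation counts, not part of the original scenario.

```sas
proc cas;
   /* Assumed second run: same settings, but rows with a missing
      predictor value are excluded from the analysis */
   bart.bartProbit /
      table={name='clinical_trial_data'},
      target='Recovered',
      inputs={'Age', 'Biomarker_Level', 'Prior_Conditions'},
      missing='NONE',
      minLeafSize=3,
      nTree=50,
      nBI=100,
      nMC=500,
      seed=42,
      outputTables={names={NObs='nobs_none'}};

   /* Display both observation-count tables for comparison */
   table.fetch / table={name='nobs_separate'};
   table.fetch / table={name='nobs_none'};
run;
quit;
```

Keeping every other parameter identical, including the seed, means that any difference in the number of observations used can be attributed to the missing option alone.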
Both actions should run successfully. The first run (missing='SEPARATE') should use all 500 observations, treating missingness as a separate analytical category. The second run (missing='NONE') should use fewer observations, because it discards every row with a missing value in the predictors. Since the two predictors are set to missing independently at rates of 15% and 20%, roughly 0.85 × 0.80 = 68% of rows (about 340 of 500) are expected to be complete. Comparing the 'nobs_separate' and 'nobs_none' tables should confirm this difference and validate the behavior of the 'missing' parameter.