Edge Case: Handling Missing Data in Clinical Trial Analysis
Test Scenario & Use Case
Business Context
A pharmaceutical research company is analyzing patient data from a clinical trial to predict patient recovery. The dataset is incomplete, with missing values for a key biomarker and patient-reported prior conditions. This scenario tests the model's robustness and the different strategies for handling missing data.
Create a patient dataset where 'Biomarker_Level' and 'Prior_Conditions' have systematically introduced missing values. The target is 'Recovered' (1/0).
data clinical_trial_data;
   call streaminit(42);
   do PatientID = 1 to 500;
      Age = 30 + floor(rand('Uniform') * 45);
      Biomarker_Level = rand('Normal', 100, 15);
      Prior_Conditions = floor(rand('Uniform') * 4);

      /* Simulate the outcome from the complete values first, so that
         missing predictors do not propagate into a missing target */
      logit_p = -3 + (Age/25) + (Biomarker_Level/50) - (Prior_Conditions * 0.4);
      p = 1 / (1 + exp(-logit_p));
      Recovered = rand('Bernoulli', p);

      /* Introduce missing values in the predictors */
      if rand('Uniform') < 0.15 then call missing(Biomarker_Level);
      if rand('Uniform') < 0.20 then call missing(Prior_Conditions);
      output;
   end;
run;

/* Copy the table into CAS (assumes a casuser libref is already assigned) */
data casuser.clinical_trial_data;
   set clinical_trial_data;
run;
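Before fitting the model, it can help to confirm that the missing values landed where intended. The following is a minimal sketch, assuming the WORK table clinical_trial_data created above. It reports the non-missing (N) and missing (NMISS) counts for each variable, so the observed rates can be compared with the intended 15% and 20%, and it shows whether the target itself contains any missing values.

```sas
/* Count non-missing (N) and missing (NMISS) values per variable */
proc means data=clinical_trial_data n nmiss;
   var Age Biomarker_Level Prior_Conditions Recovered;
run;
```

With 500 patients, the expected counts are about 75 missing values for Biomarker_Level and about 100 for Prior_Conditions. The exact numbers depend on the random stream.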
Run bart.bartProbit using the default missing value handling ('SEPARATE').
proc cas;
   bart.bartProbit /
      table={name='clinical_trial_data'},
      target='Recovered',
      inputs={'Age', 'Biomarker_Level', 'Prior_Conditions'},
      missing='SEPARATE', /* Default, but explicit for clarity */
      minLeafSize=3,      /* Test with a small leaf size */
      nTree=50,
      nBI=100,
      nMC=500,
      seed=42,
      outputTables={names={NObs='nobs_separate'}};
run;
quit;
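The expected result compares this run against a second one that uses missing='NONE'. A sketch of that second call could look like the following. It mirrors the first call parameter for parameter and changes only the missing-value handling and the output table name. The value 'NONE' and the table name 'nobs_none' are taken from the scenario text, not verified against the action's reference documentation.

```sas
proc cas;
   bart.bartProbit /
      table={name='clinical_trial_data'},
      target='Recovered',
      inputs={'Age', 'Biomarker_Level', 'Prior_Conditions'},
      missing='NONE',     /* Value as named in the scenario text */
      minLeafSize=3,
      nTree=50,
      nBI=100,
      nMC=500,
      seed=42,
      outputTables={names={NObs='nobs_none'}};
run;
quit;
```

Keeping every other parameter identical, including the seed, means any difference in the observation counts can be attributed to the missing setting alone.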
Both actions should run successfully. The first run (missing='SEPARATE') should use all 500 observations, treating missingness as a separate analytical category. The second run (missing='NONE') should use fewer observations, because it discards any row with a missing value in the predictors. Since the two predictors are set to missing independently with probabilities 0.15 and 0.20, about 68% of rows (0.85 × 0.80, roughly 340 of 500) are expected to be complete. Comparing the 'nobs_separate' and 'nobs_none' tables will confirm this difference, validating the behavior of the 'missing' parameter.
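One way to make that comparison is to print both output tables side by side. The sketch below uses the table.fetch action and assumes that 'nobs_separate' and 'nobs_none' both exist in the active caslib after the two runs. It makes no assumption about the column layout of the NObs tables and simply displays whatever each one contains.

```sas
proc cas;
   /* Observation counts from the missing='SEPARATE' run */
   table.fetch / table={name='nobs_separate'};
   /* Observation counts from the missing='NONE' run */
   table.fetch / table={name='nobs_none'};
run;
quit;
```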