bart bartProbit

Edge Case: Handling Missing Data in Clinical Trial Analysis

Test Scenario & Use Case

Business Context

A pharmaceutical research company is analyzing patient data from a clinical trial to predict patient recovery. The dataset is incomplete, with missing values for a key biomarker and patient-reported prior conditions. This scenario tests the model's robustness and the different strategies for handling missing data.
About the Set: bart

Bayesian Additive Regression Trees models.

Data Preparation

Create a patient dataset of 500 rows in which 'Biomarker_Level' and 'Prior_Conditions' have randomly introduced missing values (about 15% and 20% of rows, respectively). The target is 'Recovered' (1/0).

/* Assumes an active CAS session and a CAS engine libref named casuser. */
DATA clinical_trial_data;
    call streaminit(42);
    DO PatientID = 1 to 500;
        Age = 30 + floor(rand('Uniform') * 45);
        Biomarker_Level = rand('Normal', 100, 15);
        Prior_Conditions = floor(rand('Uniform') * 4);

        /* Compute the outcome from the complete values first. If a predictor
           were already missing here, logit_p, p, and Recovered would all be
           missing, and the row would lack a target. */
        logit_p = -3 + (Age/25) + (Biomarker_Level/50) - (Prior_Conditions * 0.4);
        p = 1 / (1 + exp(-logit_p));
        Recovered = rand('Bernoulli', p);

        /* Then introduce missing values in the predictors only */
        IF rand('Uniform') < 0.15 THEN call missing(Biomarker_Level);
        IF rand('Uniform') < 0.20 THEN call missing(Prior_Conditions);
        OUTPUT;
    END;
RUN;

/* Load the table into CAS */
DATA casuser.clinical_trial_data;
    SET clinical_trial_data;
RUN;
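
Before fitting, it can help to confirm that the missing values landed where intended. The following is a minimal sketch that assumes the WORK table clinical_trial_data created above. It uses PROC MEANS with the N and NMISS statistics to report non-missing and missing counts per variable.

```sas
/* Report non-missing (N) and missing (NMISS) counts for each variable */
proc means data=clinical_trial_data n nmiss;
    var Age Biomarker_Level Prior_Conditions Recovered;
run;
```

With 500 rows, roughly 75 missing values are expected for Biomarker_Level and roughly 100 for Prior_Conditions. Recovered should show no missing values; if it does, the target was computed from predictors that were already missing.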

Implementation Steps

1. Run bart.bartProbit using the default missing value handling ('SEPARATE').
PROC CAS;
    bart.bartProbit TABLE={name='clinical_trial_data'},
        target='Recovered',
        inputs={'Age', 'Biomarker_Level', 'Prior_Conditions'},
        missing='SEPARATE', /* Default, but explicit for clarity */
        minLeafSize=3,      /* Test with a small leaf size */
        nTree=50,
        nBI=100,
        nMC=500,
        seed=42,
        outputTables={names={NObs='nobs_separate'}};
RUN;
QUIT;
2. Run the same model but change missing value handling to 'NONE' to exclude rows with missing data.
PROC CAS;
    bart.bartProbit TABLE={name='clinical_trial_data'},
        target='Recovered',
        inputs={'Age', 'Biomarker_Level', 'Prior_Conditions'},
        missing='NONE',
        minLeafSize=3,
        nTree=50,
        nBI=100,
        nMC=500,
        seed=42,
        outputTables={names={NObs='nobs_none'}};
RUN;
QUIT;
3. Compare the number of observations used in both models.
PROC CAS;
    TABLE.fetch TABLE={name='nobs_separate'};
    TABLE.fetch TABLE={name='nobs_none'};
RUN;
QUIT;
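
As an independent cross-check on the count reported in 'nobs_none', the complete cases can be counted directly in the source data. This is a minimal sketch, assuming the WORK table clinical_trial_data from the data preparation step; the variable n_complete is a hypothetical name introduced here for illustration. Because the two predictors go missing independently with probabilities 0.15 and 0.20, about 500 × 0.85 × 0.80 = 340 complete rows are expected on average, though the exact count depends on the random draws.

```sas
/* Count rows with no missing value in the target or any predictor */
data _null_;
    set clinical_trial_data end=last;
    /* cmiss() returns the number of missing arguments */
    if cmiss(Age, Biomarker_Level, Prior_Conditions, Recovered) = 0 then
        n_complete + 1;   /* sum statement: starts at 0 and is retained */
    if last then put 'Complete cases: ' n_complete;
run;
```

The value written to the log should correspond to the number of observations used by the missing='NONE' run, assuming that run drops exactly the rows with a missing predictor or target.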

Expected Result


Both actions should run successfully. The first run (missing='SEPARATE') should use all 500 observations, treating missing values as a separate category. The second run (missing='NONE') should use fewer observations, because it discards any row with a missing value in a predictor. Comparing the 'nobs_separate' and 'nobs_none' tables confirms this difference and validates the behavior of the 'missing' parameter.