Edge Case: Handling Missing Data in Clinical Trial Analysis

Business Context

A pharmaceutical research company is analyzing patient data from a clinical trial to predict patient recovery. The dataset is incomplete, with missing values for a key biomarker and for patient-reported prior conditions. This scenario tests how robust the model is to incomplete inputs and compares the available strategies for handling missing data.

About the Action Set: bart

The bart action set provides Bayesian additive regression trees (BART) models.

Data Preparation

Create a patient dataset of 500 rows in which 'Biomarker_Level' and 'Prior_Conditions' have values set to missing at random (with probability 0.15 and 0.20, respectively). The target is 'Recovered' (1/0).

DATA clinical_trial_data;
    call streaminit(42);
    DO PatientID = 1 to 500;
        Age = 30 + floor(rand('Uniform') * 45);
        Biomarker_Level = rand('Normal', 100, 15);
        Prior_Conditions = floor(rand('Uniform') * 4);

        /* Simulate the outcome from the complete values first; computing it
           after the values are blanked would make Recovered missing too */
        logit_p = -3 + (Age/25) + (Biomarker_Level/50) - (Prior_Conditions * 0.4);
        p = 1 / (1 + exp(-logit_p));
        Recovered = rand('Bernoulli', p);

        /* Then introduce missing values in the predictors */
        IF rand('Uniform') < 0.15 THEN call missing(Biomarker_Level);
        IF rand('Uniform') < 0.20 THEN call missing(Prior_Conditions);
        OUTPUT;
    END;
RUN;

/* Load the table into CAS (assumes an active CAS session and a casuser libref) */
DATA casuser.clinical_trial_data;
    SET clinical_trial_data;
RUN;
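
Before fitting, it can help to confirm how many values were actually set to missing. The following is a minimal sketch, assuming the WORK copy of clinical_trial_data created in the first DATA step is still available; N and NMISS are standard PROC MEANS statistics.

```sas
/* Count non-missing (N) and missing (NMISS) values per variable */
PROC MEANS DATA=clinical_trial_data N NMISS;
    VAR Age Biomarker_Level Prior_Conditions Recovered;
RUN;
```

With missing-value rates of 15% and 20%, NMISS should land near 75 for Biomarker_Level and near 100 for Prior_Conditions, and be 0 for Age. A nonzero NMISS for Recovered would mean the target itself is missing on some rows, and those rows would likely be excluded from both runs regardless of the 'missing' setting.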

Implementation Steps

Run bart.bartProbit using the default missing value handling ('SEPARATE').

PROC CAS;
    loadactionset 'bart';  /* make sure the action set is loaded */
    bart.bartProbit /
        table={name='clinical_trial_data', caslib='casuser'},
        target='Recovered',
        inputs={'Age', 'Biomarker_Level', 'Prior_Conditions'},
        missing='SEPARATE',  /* Default, but explicit for clarity */
        minLeafSize=3,       /* Test with a small leaf size */
        nTree=50,            /* Number of trees in the ensemble */
        nBI=100,             /* Burn-in iterations */
        nMC=500,             /* MCMC iterations kept after burn-in */
        seed=42,
        outputTables={names={NObs='nobs_separate'}};
RUN;
QUIT;

Run the same model but change missing value handling to 'NONE' to exclude rows with missing data.

PROC CAS;
    bart.bartProbit /
        table={name='clinical_trial_data', caslib='casuser'},
        target='Recovered',
        inputs={'Age', 'Biomarker_Level', 'Prior_Conditions'},
        missing='NONE',  /* Exclude rows with missing predictor values */
        minLeafSize=3,
        nTree=50,
        nBI=100,
        nMC=500,
        seed=42,
        outputTables={names={NObs='nobs_none'}};
RUN;
QUIT;

Compare the number of observations used in both models.

PROC CAS;
    /* Print the observation counts saved by each run */
    table.fetch / table={name='nobs_separate'};
    table.fetch / table={name='nobs_none'};
RUN;
QUIT;
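
The counts reported by the two runs can be cross-checked against the source data. This is a minimal sketch, assuming the WORK copy of clinical_trial_data is still available; the column aliases n_total and n_complete are illustrative names, not part of the bart output.

```sas
/* CMISS returns the number of missing arguments, so a row is a
   complete case when CMISS(...) = 0; the comparison evaluates to 1 or 0 */
PROC SQL;
    SELECT COUNT(*) AS n_total,
           SUM(CMISS(Age, Biomarker_Level, Prior_Conditions) = 0) AS n_complete
    FROM clinical_trial_data;
QUIT;
```

If the 'missing' parameter behaves as described, n_complete should match the number of observations used with missing='NONE', and n_total should match the number used with missing='SEPARATE', provided the target has no missing values.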

Expected Result

Both actions should run successfully. The first run (missing='SEPARATE') should use all 500 observations, treating missingness as a separate category when splitting. The second run (missing='NONE') should use fewer observations, because it discards every row that has a missing value in any predictor. Comparing the 'nobs_separate' and 'nobs_none' tables confirms this difference and validates the behavior of the 'missing' parameter.
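
As a rough guide to what "fewer" means here, the expected size of the complete-case subset follows from the two missing-value rates in the data step. The calculation below assumes the two missingness draws are independent (as they are in the simulation) and that Age is never missing; the symbol N_complete is introduced only for this illustration.

```latex
\[
P(\text{row complete}) = (1 - 0.15)(1 - 0.20) = 0.68
\]
\[
E[N_{\text{complete}}] = 500 \times 0.68 = 340,
\qquad
\mathrm{SD}(N_{\text{complete}}) = \sqrt{500 \times 0.68 \times 0.32} \approx 10.4
\]
```

A count for the missing='NONE' run somewhere around 320 to 360 would therefore be unsurprising; the exact value depends on the random stream.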

See the technical documentation for bartProbit