percentile boxPlot

Data Quality Assessment in Clinical Trials with Missing Data and Extreme Outliers

Test Scenario & Use Case

Business Context

A clinical research organization is validating patient data. The dataset contains missing values for treatment groups and some erroneous, extreme blood pressure readings. The goal is to identify the exact outlier values for investigation and still get descriptive statistics for all cohorts, including the one with missing group information.

About the Set: percentile

Precise calculation of percentiles and quantiles.

Data Preparation

Creation of a patient dataset with missing treatment group assignments and deliberately inserted extreme outliers for systolic blood pressure (systolic_bp).

DATA casuser.clinical_trial;
   call streaminit(789);
   /* Normal Data */
   DO i = 1 to 200;
      IF mod(i, 2) = 0 THEN treatment_group = 'A';
      ELSE treatment_group = 'B';
      systolic_bp = rand('NORMAL', 120, 10);
      OUTPUT;
   END;
   /* Missing Group Data */
   DO i = 1 to 50;
      treatment_group = '';
      systolic_bp = rand('NORMAL', 125, 12);
      OUTPUT;
   END;
   /* Extreme Outliers */
   treatment_group = 'A'; systolic_bp = 250; OUTPUT;
   treatment_group = 'B'; systolic_bp = 40; OUTPUT;
   treatment_group = 'A'; systolic_bp = 245; OUTPUT;
RUN;
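As a cross-check outside SAS, the layout of this dataset can be mirrored in a few lines of Python. This is only a sketch: the function name `build_clinical_trial` is hypothetical, Python's random stream differs from the one SAS produces for the same seed, and `None` stands in for the blank treatment group. It reproduces the row counts per group, not the exact readings.

```python
import random
from collections import Counter


def build_clinical_trial(seed=789):
    """Mirror the layout of the SAS DATA step (not its random values)."""
    rng = random.Random(seed)
    rows = []
    # 200 regular readings: even i goes to group A, odd i to group B
    for i in range(1, 201):
        group = "A" if i % 2 == 0 else "B"
        rows.append((group, rng.gauss(120, 10)))
    # 50 readings with no treatment group (None models the blank value)
    for _ in range(50):
        rows.append((None, rng.gauss(125, 12)))
    # three deliberately extreme readings
    rows += [("A", 250.0), ("B", 40.0), ("A", 245.0)]
    return rows


rows = build_clinical_trial()
counts = Counter(group for group, _ in rows)
print(len(rows), dict(counts))
```

Under this layout the table holds 253 rows: 102 in group A, 101 in group B, and 50 with no group.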

Implementation Steps

1. Load the messy clinical data into CAS.
PROC CASUTIL;
   load DATA=casuser.clinical_trial outcaslib='casuser' casout='clinical_trial' replace;
QUIT;
2. Run boxPlot grouped by treatment_group to capture the two highest and two lowest outliers, set the whiskers at the 5th and 95th percentiles, and include the missing group in the analysis. The outliers are saved to a new table, bp_outliers.
PROC CAS;
   percentile.boxPlot /
      TABLE={name='clinical_trial', groupBy={'treatment_group'}},
      inputs={{name='systolic_bp'}},
      includeMissingGroup=true,
      outliers=true,
      nOutLimit=2,
      whiskerPercentile=5,
      casOut={name='bp_outliers', replace=true};
RUN;
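The whisker and outlier settings can be illustrated with a small Python sketch. It assumes a linear-interpolation percentile definition and assumes that the outlier limit applies separately to the low and the high side of a single set of values. The action's own percentile method and the way it applies `nOutLimit` across groups may differ, and the names `percentile` and `whisker_outliers` are hypothetical.

```python
def percentile(sorted_vals, pct):
    """Percentile by linear interpolation (one of several common definitions)."""
    if not sorted_vals:
        raise ValueError("percentile of an empty list is undefined")
    pos = (len(sorted_vals) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def whisker_outliers(values, whisker_pct=5, n_out_limit=2):
    """Return both whisker ends plus the most extreme values beyond them."""
    vals = sorted(values)
    low_whisker = percentile(vals, whisker_pct)
    high_whisker = percentile(vals, 100 - whisker_pct)
    # keep at most n_out_limit values on each side, most extreme first
    low = [v for v in vals if v < low_whisker][:n_out_limit]
    high = [v for v in reversed(vals) if v > high_whisker][:n_out_limit]
    return low_whisker, high_whisker, low, high


# 37 ordinary readings (100..136) plus four extreme ones: 41 values in total
readings = list(range(100, 137)) + [40, 50, 245, 250]
low_whisker, high_whisker, low, high = whisker_outliers(readings)
print(low_whisker, high_whisker, low, high)

# A tighter limit keeps only the single most extreme value on each side
_, _, low_one, high_one = whisker_outliers(readings, n_out_limit=1)
print(low_one, high_one)
```

With these 41 values the 5th and 95th percentiles fall on 100 and 136, so the sketch reports 40 and 50 as low outliers and 250 and 245 as high outliers.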

Expected Result


The action produces two results. First, a summary statistics table that includes a separate row for the cohort with the missing 'treatment_group' value. Second, a new CAS table named 'bp_outliers' is created, containing exactly four rows: the two patients with the highest systolic_bp (250, 245) and the two with the lowest (40 and another low value from the normal distribution). This validates the handling of missing keys and advanced outlier detection.
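The effect of keeping the missing group can be sketched in Python as well. The function `summarize_by_group` and its `include_missing_group` flag are hypothetical stand-ins, `None` models the blank treatment group, and the sketch assumes that rows with a missing key are simply left out of the summary when the flag is off; it does not reproduce the columns of the action's summary table.

```python
from collections import defaultdict
from statistics import median


def summarize_by_group(rows, include_missing_group=True):
    """Group (group, value) pairs and compute simple descriptive statistics."""
    groups = defaultdict(list)
    for group, value in rows:
        # None models a missing group key; skip it unless asked to keep it
        if group is None and not include_missing_group:
            continue
        groups[group].append(value)
    return {
        group: {
            "n": len(vals),
            "min": min(vals),
            "median": median(vals),
            "max": max(vals),
        }
        for group, vals in groups.items()
    }


demo_rows = [
    ("A", 118.0), ("A", 250.0),
    ("B", 40.0), ("B", 122.0),
    (None, 125.0), (None, 131.0),
]
with_missing = summarize_by_group(demo_rows, include_missing_group=True)
without_missing = summarize_by_group(demo_rows, include_missing_group=False)

for group, stats in with_missing.items():
    print(group, stats)
```

With the flag on, the summary has three entries (A, B, and the missing group); with it off, only A and B remain.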