bart bartScoreMargin

Edge Case: Scoring with Missing Values and Unseen Factor Levels

Scénario de test & Cas d'usage

Business Context

A utility company uses a model to predict power grid failures. The model is trained on a complete dataset. However, the real-time scoring data sometimes contains missing sensor readings. The test must verify how the `bartScoreMargin` action behaves when asked to evaluate a scenario (e.g., a specific machine type) using input data that contains missing values.
About the Set : bart

Bayesian Additive Regression Trees models.

Discover all actions of bart
Data Preparation

Creates two datasets: a 'clean' one for training with two machine types ('Gen1', 'Gen2'), and a 'dirty' one for scoring that includes missing values for a sensor and a new, unseen machine type ('Gen3'). A BART model is trained only on the clean data.

Copied!
1DATA mycas.grid_data_clean mycas.grid_data_with_issues;
2 call streaminit(111);
3 /* Training Data */
4 DO i = 1 to 1000;
5 sensor1 = rand('NORMAL', 100, 5);
6 sensor2 = rand('NORMAL', 50, 2);
7 IF rand('UNIFORM') < 0.5 THEN machine_type = 'Gen1'; ELSE machine_type = 'Gen2';
8 p = 1 / (1 + exp(-( -5 + (0.01*sensor1) + (0.02*sensor2) )));
9 failure = rand('BERNOULLI', p);
10 OUTPUT mycas.grid_data_clean;
11 END;
12 /* Scoring Data with Issues */
13 DO i = 1 to 200;
14 sensor1 = rand('NORMAL', 100, 5);
15 /* Introduce missing values */
16 IF rand('UNIFORM') < 0.2 THEN sensor2 = .; ELSE sensor2 = rand('NORMAL', 50, 2);
17 rand_type = rand('UNIFORM');
18 /* Introduce unseen factor level */
19 IF rand_type < 0.45 THEN machine_type = 'Gen1';
20 ELSE IF rand_type < 0.9 THEN machine_type = 'Gen2';
21 ELSE machine_type = 'Gen3';
22 p = 1 / (1 + exp(-( -5 + (0.01*sensor1) + (0.02*sensor2) )));
23 failure = rand('BERNOULLI', p);
24 OUTPUT mycas.grid_data_with_issues;
25 END;
26RUN;
27 
28PROC CAS;
29 bart.bartProbit TABLE={name='grid_data_clean'},
30 model={depVars={{name='failure', levelType='BINARY'}},
31 effects={{vars={'sensor1', 'sensor2', 'machine_type'}}}},
32 store={name='grid_failure_model', replace=true};
33QUIT;

Étapes de réalisation

1
Attempt to calculate the predictive margin for 'Gen1' machines using the 'dirty' dataset that contains missing values for 'sensor2' and unseen levels for 'machine_type'.
Copied!
1PROC CAS;
2 bart.bartScoreMargin
3 TABLE='grid_data_with_issues',
4 model='grid_failure_model',
5 seed=222,
6 margins={{
7 name='margin_gen1_failure',
8 at={{var='machine_type', value='Gen1'}}
9 }};
10QUIT;

Expected Result


The action is expected to handle the missing data gracefully. The BART algorithm inherently handles missing predictors by averaging predictions over the two branches of a split. The rows with missing 'sensor2' values should be included in the calculation, not dropped. The unseen level 'Gen3' will be ignored for the 'margin_gen1_failure' calculation. The action should complete and produce a valid 'Margins' table, demonstrating its robustness to imperfect real-world data.