Edge Case: Scoring with Missing Values and Unseen Factor Levels

Business Context

A utility company uses a model to predict power grid failures. The model is trained on a complete dataset. However, the real-time scoring data sometimes contains missing sensor readings. The test must verify how the `bartScoreMargin` action behaves when asked to evaluate a scenario (e.g., a specific machine type) using input data that contains missing values.

About the Set : bart

Bayesian Additive Regression Trees models.

Discover all actions of bart

Data Preparation

Creates two datasets: a 'clean' one for training with two machine types ('Gen1', 'Gen2'), and a 'dirty' one for scoring that includes missing values for a sensor and a new, unseen machine type ('Gen3'). A BART model is trained only on the clean data.

Copied!

1	DATA mycas.grid_data_clean mycas.grid_data_with_issues;
2	call streaminit(111);
3	/* Training Data */
4	DO i = 1 to 1000;
5	sensor1 = rand('NORMAL', 100, 5);
6	sensor2 = rand('NORMAL', 50, 2);
7	IF rand('UNIFORM') < 0.5 THEN machine_type = 'Gen1'; ELSE machine_type = 'Gen2';
8	p = 1 / (1 + exp(-( -5 + (0.01sensor1) + (0.02sensor2) )));
9	failure = rand('BERNOULLI', p);
10	OUTPUT mycas.grid_data_clean;
11	END;
12	/* Scoring Data with Issues */
13	DO i = 1 to 200;
14	sensor1 = rand('NORMAL', 100, 5);
15	/* Introduce missing values */
16	IF rand('UNIFORM') < 0.2 THEN sensor2 = .; ELSE sensor2 = rand('NORMAL', 50, 2);
17	rand_type = rand('UNIFORM');
18	/* Introduce unseen factor level */
19	IF rand_type < 0.45 THEN machine_type = 'Gen1';
20	ELSE IF rand_type < 0.9 THEN machine_type = 'Gen2';
21	ELSE machine_type = 'Gen3';
22	p = 1 / (1 + exp(-( -5 + (0.01sensor1) + (0.02sensor2) )));
23	failure = rand('BERNOULLI', p);
24	OUTPUT mycas.grid_data_with_issues;
25	END;
26	RUN;
27
28	PROC CAS;
29	bart.bartProbit TABLE={name='grid_data_clean'},
30	model={depVars={{name='failure', levelType='BINARY'}},
31	effects={{vars={'sensor1', 'sensor2', 'machine_type'}}}},
32	store={name='grid_failure_model', replace=true};
33	QUIT;

Étapes de réalisation

Attempt to calculate the predictive margin for 'Gen1' machines using the 'dirty' dataset that contains missing values for 'sensor2' and unseen levels for 'machine_type'.

Copied!

1	PROC CAS;
2	bart.bartScoreMargin
3	TABLE='grid_data_with_issues',
4	model='grid_failure_model',
5	seed=222,
6	margins={{
7	name='margin_gen1_failure',
8	at={{var='machine_type', value='Gen1'}}
9	}};
10	QUIT;

Expected Result

The action is expected to handle the missing data gracefully. The BART algorithm inherently handles missing predictors by averaging predictions over the two branches of a split. The rows with missing 'sensor2' values should be included in the calculation, not dropped. The unseen level 'Gen3' will be ignored for the 'margin_gen1_failure' calculation. The action should complete and produce a valid 'Margins' table, demonstrating its robustness to imperfect real-world data.

Voir la documentation technique de bartScoreMargin