Test Scenario & Use Case
Creates two datasets: a 'clean' one for training with two machine types ('Gen1', 'Gen2'), and a 'dirty' one for scoring that includes missing values for a sensor and a new, unseen machine type ('Gen3'). A BART model is trained only on the clean data.
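As a quick check on the simulated failure rate, the logistic formula that the DATA step uses for `p` can be evaluated at the sensor means (sensor1 = 100, sensor2 = 50). The formula and the means come from the code in this scenario; the worked value is plain arithmetic.

```latex
p = \frac{1}{1 + \exp\bigl(-(-5 + 0.01\,\mathrm{sensor1} + 0.02\,\mathrm{sensor2})\bigr)}

% At the sensor means:
-5 + 0.01(100) + 0.02(50) = -3,
\qquad
p = \frac{1}{1 + e^{3}} \approx 0.047
```

Failures are therefore rare in the simulated data, at close to 5% of rows. Note that `machine_type` does not enter the formula, so the simulated failure probability does not depend on the machine type.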
    DATA mycas.grid_data_clean mycas.grid_data_with_issues;
       call streaminit(111);
       /* Training data: two machine types, no missing values */
       DO i = 1 to 1000;
          sensor1 = rand('NORMAL', 100, 5);
          sensor2 = rand('NORMAL', 50, 2);
          IF rand('UNIFORM') < 0.5 THEN machine_type = 'Gen1'; ELSE machine_type = 'Gen2';
          p = 1 / (1 + exp(-( -5 + (0.01*sensor1) + (0.02*sensor2) )));
          failure = rand('BERNOULLI', p);
          OUTPUT mycas.grid_data_clean;
       END;
       /* Scoring data with issues */
       DO i = 1 to 200;
          sensor1 = rand('NORMAL', 100, 5);
          /* Introduce missing values in about 20% of rows */
          IF rand('UNIFORM') < 0.2 THEN sensor2 = .; ELSE sensor2 = rand('NORMAL', 50, 2);
          rand_type = rand('UNIFORM');
          /* Introduce unseen factor level 'Gen3' in about 10% of rows */
          IF rand_type < 0.45 THEN machine_type = 'Gen1';
          ELSE IF rand_type < 0.9 THEN machine_type = 'Gen2';
          ELSE machine_type = 'Gen3';
          /* p, and hence failure, is missing whenever sensor2 is missing; */
          /* the response is not needed for scoring */
          p = 1 / (1 + exp(-( -5 + (0.01*sensor1) + (0.02*sensor2) )));
          failure = rand('BERNOULLI', p);
          OUTPUT mycas.grid_data_with_issues;
       END;
    RUN;

    /* Train the BART probit model on the clean data only and store it */
    PROC CAS;
       bart.bartProbit / TABLE={name='grid_data_clean'},
          model={depVars={{name='failure', levelType='BINARY'}},
                 effects={{vars={'sensor1', 'sensor2', 'machine_type'}}}},
          store={name='grid_failure_model', replace=true};
    QUIT;
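Before scoring, it can help to confirm that the scoring table really contains the two issues this scenario relies on. The following is a minimal sketch, assuming the `mycas` libref used above is already assigned to an active CAS session; it only uses the table and variable names from the DATA step.

```sas
/* Level counts for machine_type: 'Gen3' should appear in roughly 10% of the 200 rows */
PROC FREQ DATA=mycas.grid_data_with_issues;
   TABLES machine_type;
RUN;

/* Non-missing and missing counts for sensor2: roughly 20% of rows should be missing */
PROC MEANS DATA=mycas.grid_data_with_issues N NMISS;
   VAR sensor2;
RUN;
```

With 200 rows and a fixed seed, the counts should land near 20 'Gen3' rows and 40 missing `sensor2` values, though the exact numbers depend on the random stream.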

    /* Score the table with issues and compute the margin at machine_type = 'Gen1' */
    PROC CAS;
       bart.bartScoreMargin /
          TABLE='grid_data_with_issues',
          model='grid_failure_model',
          seed=222,
          margins={{
             name='margin_gen1_failure',
             at={{var='machine_type', value='Gen1'}}
          }};
    QUIT;
The action is expected to handle the missing data gracefully. The BART algorithm inherently handles a missing predictor by averaging predictions over the two branches of a split on that predictor, so rows with missing 'sensor2' values should be included in the calculation, not dropped. The unseen level 'Gen3' should have no effect on the 'margin_gen1_failure' calculation, because the margin evaluates every row with machine_type set to 'Gen1' in place of its observed value. The action should complete and produce a valid 'Margins' table, demonstrating its robustness to imperfect real-world data.
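To make the 'Gen3' argument concrete, the margin can be written out under the usual definition of a predictive margin for a probit model. This is a sketch of that textbook definition, not a statement of the action's internal implementation; the symbols $f^{(k)}$ (the sum-of-trees function for posterior draw $k$), $n$ (the number of scored rows), and $\Phi$ (the standard normal CDF) are notation introduced here for illustration.

```latex
m^{(k)}_{\mathrm{Gen1}}
  = \frac{1}{n} \sum_{i=1}^{n}
    \Phi\!\Bigl( f^{(k)}\bigl(\mathrm{sensor1}_i,\ \mathrm{sensor2}_i,\ \mathrm{machine\_type} = \text{Gen1}\bigr) \Bigr)
```

The observed machine_type of row $i$ does not appear on the right-hand side, which is why a row originally labeled 'Gen3' contributes only through its sensor values.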